Databricks vs Snowflake: The race to build a one-stop shop for your data

The heated competition between enterprise data leaders Databricks and Snowflake continued today, after Snowflake doubled down on its core strength: industry partnerships.

Snowflake announced it is bringing Amazon.com’s sales channel data directly into customers’ Snowflake data warehouse instances, as part of its new data cloud for the retail industry. And this comes just days after Snowflake launched a data cloud for the health industry.

With enterprises large and small racing to build out their data infrastructure, one foundational piece these enterprise companies all need is an easy place to store their data.

Databricks and Snowflake have emerged as the best one-stop shops to address this need. They are locked in a duel, with different approaches and different cultures.

In one corner is Databricks, which innovated what is called a data lake, a place where you can dump all of your data, no matter the format. It was built and is still run by researchers and academics who dream of “changing the world,” says the company’s CEO Ali Ghodsi, who was an academic for seven years before founding Databricks. It’s tech-focused and engineering-led.

In the other corner is Snowflake, which innovated what is called the data warehouse, a place that, simply put, starts with more structure to allow easier analytics on the data. And it’s run not by researchers and academics but by CEO Frank Slootman, who has more than a decade of experience running large companies as CEO or president.

And while they come from different ends of the spectrum, they are now branching out into each other’s territory, with the goal of building the one-stop shop for all things enterprise data, or what many refer to as a ‘lakehouse’.

Their recent moves continue to show how different they are. “Snowflake’s innovation is its investment in its ecosystem and partnerships – and its PR and sales machines,” said Andrew Brust, founder of strategy and advisory firm Blue Badge Insights. “They are great sellers and are building a data marketplace that adds real value. On the other hand, Databricks is very focused on technological excellence, performance, features and high-end machine learning capabilities.”

The move to the lakehouse

Cloud data lakes and warehouses have become a critical element in answering enterprise data management needs. Organizations typically take enterprise data from various sources and operational processes, and first store it in a raw data lake. Then they can perform a round of ETL (extract, transform, load) procedures to shift critical parts of this data into a form that can be stored in a data warehouse. This is where business and other users can more easily generate useful business insights from the data.
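The lake-to-warehouse flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample records, the `orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real cloud warehouse.

```python
import json
import sqlite3

# Hypothetical raw "data lake" contents: lines of mixed shape, stored as-is.
RAW_LAKE = [
    '{"order_id": 1, "amount": "19.99", "region": "emea"}',
    '{"order_id": 2, "amount": "5.00", "region": "AMER", "note": "gift"}',
    'not valid json',
    '{"order_id": 3, "region": "apac"}',
]


def extract(lines):
    """Extract: parse raw lines, skipping anything that is not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(records):
    """Transform: keep records that fit the warehouse schema and normalize types."""
    for rec in records:
        if "order_id" in rec and "amount" in rec:
            yield (
                int(rec["order_id"]),
                float(rec["amount"]),
                rec.get("region", "").upper(),
            )


def load(rows, conn):
    """Load: write the structured rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite is only a stand-in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LAKE)), conn)

# Business users can now query the structured copy with plain SQL.
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total_amount = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
regions = [r[0] for r in conn.execute("SELECT region FROM orders ORDER BY order_id")]
```

Only two of the four raw lines reach the table. The malformed line and the record without an amount stay behind in the lake, which is one way the two stores drift apart.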

While the process has been useful, companies often find it difficult and costly to maintain consistency between their data lake and their data warehouse infrastructures. Their teams need to employ continuous data engineering tactics to ETL/ELT data between the two systems, which can affect the overall quality of the data. Plus, because the data is constantly changing (depending on the pipeline), the information stored in a warehouse may not be as current as that in a data lake.

That is why Databricks has been working hard to make its data lake more compatible with the features of a warehouse, and Snowflake has been adapting its warehouse to allow more features of a data lake. Both really now look like lakehouses.

The rise of Snowflake 

While the data industry has seen and continues to see many data platforms, including offerings from Amazon and Google and a bunch of startups, Databricks and Snowflake have left a particular mark.

To understand, we have to go back to Hadoop. Nearly two decades ago, the open source Java-based framework took the initial steps to solve the storage and processing layer for big data, but it failed to gain widespread adoption due to technical complexities.

Snowflake, founded in 2012 by former Oracle data architects Benoit Dageville and Thierry Cruanes, came onto the scene as a better, faster alternative to Hadoop. In no time, the company became the go-to choice for a cloud database that would give customers a single platform to store, access, analyze and share large amounts of structured data from anywhere (AWS, Azure, or any other source). It transformed the warehousing space by offering highly scalable and distributed computation capability. Today, Snowflake customers can easily connect business intelligence tools such as Tableau and conduct historical data analyses using SQL on their datasets.

The ease of use and scale of the platform have driven massive adoption of Snowflake over the years. It went public in 2020 and rocketed to a market value of $100 billion, as the pandemic pushed enterprises to invest more in their data infrastructure to allow for things like hybrid work. Its value has since come down to around $73 billion (as of March 29). The company’s revenue grew 106%, from $592 million in FY21 to $1,219 million in FY22, while the customer base has surged to over 5,900, including about two-fifths of Fortune 500 companies.
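The figures in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the helper name `pct_change` is an illustrative assumption.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Revenue in millions of dollars, as quoted in the article.
revenue_fy21 = 592
revenue_fy22 = 1219
revenue_growth = pct_change(revenue_fy21, revenue_fy22)

# Market value in billions of dollars: the peak and the March 29 figure.
peak_value = 100
recent_value = 73
value_change = pct_change(peak_value, recent_value)

print(f"Revenue growth FY21 to FY22: {revenue_growth:.1f}%")
print(f"Market value change from peak: {value_change:.0f}%")
```

The revenue growth works out to just under 106%, matching the rounded figure, and the market value is about 27% below its peak.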

The concurrent rise of Databricks

Databricks, meanwhile, was founded in 2013, although the groundwork for it was laid well before, in 2009, with the open source Apache Spark project, a multi-language engine for data engineering, data science, and machine learning. Spark drew widespread attention with its in-memory processing, which allowed faster and more efficient handling of workloads. So the team behind it, academics at UC Berkeley, commercialized the project by founding Databricks, which offered enterprises a cloud SaaS platform (data lake) primarily aimed at storing and processing large amounts of unstructured data for training AI/ML applications for predictive analytics.

Since then, the company has roped in over 5,000 enterprises as customers, including Condé Nast, H&M Group, and ABN AMRO. It has also raised significant capital, with the most recent round of $1.6 billion in August 2021 valuing the company at $38 billion. It is still not public but clearly continues to gain traction.

Product-focus vs customer-focus

Initially, Databricks and Snowflake steered clear of each other, focusing on growing in their respective markets: Snowflake was building the best data warehouse and Databricks was building the best data lake. Ali Ghodsi, the CEO of Databricks and an adjunct professor at UC Berkeley, worked with his fellow co-founders (who were also academics) and developed a largely engineering and product-driven culture for the data science community.

“We took three key bets: 100% cloud, open source and machine learning. I think we are today the largest commercial open source-only vendor that’s independent. That has a lot to do with because we were looking far into the future,” Ghodsi told VentureBeat.

“As academics and researchers, you think about how you can change the world while as a business person you look at how much money you can make this year or next year or what Wall Street will say in the next three years,” he said.

Snowflake, however, took its early steps in the market under the leadership of a strong product leader, Bob Muglia. A veteran from Microsoft and Juniper Networks, Muglia joined as CEO two years after Snowflake was founded. However, in 2019, he stepped down, and Frank Slootman, an experienced business executive who had led companies for a decade and a half, took over. With a master’s degree in economics and the experience of successfully leading three tech giants to IPO (including Snowflake), Slootman has promoted a sales-driven culture at the company with aggressive business execution through partnerships and marketing.

“He is a commercial pro that takes companies from zero to 60 in three seconds. Doesn’t really matter what the company is. For Ali, on the other hand, this is his life’s work. He’s a scientist,” said a senior executive at a company that does extensive business with both companies, but who requested anonymity to avoid offending either company.

Snowflake’s customer-focused approach was also reiterated by Christian Kleinerman, the company’s SVP of Product. According to him, it dates from the early days of the company under Muglia, who shaped its culture.

“His fingerprints are in many areas of what we do. We’ve been public in aspects like our business model, how we think about customers, the obsession with customers, which to this day is a key value. We’re honest about it, spend all of our time on the needs of our customers, not anything else, not intellectual stimulation of interesting problems or competitors. Now it’s all customers. Of course, our founders play a big role in that as well. But, Bob was a big part of that,” Kleinerman said.

Converging from different directions

Now, after building successful businesses in different corners of the data space and on the back of these very different cultures, Snowflake and Databricks are on a collision course. Databricks has been moving towards offering the capabilities and performance of a data warehouse with its core artificial intelligence (AI) offering, while Snowflake has been inching towards adding data science workloads (among other things).

Databricks is marketing its SQL analytics and business intelligence (BI) tool integrations (like Tableau or Microsoft Power BI) for structured data by using the term lakehouse more widely. Meanwhile, Snowflake has launched new data lake-like features, including support for unstructured data and the ability to build AI/ML projects. Instead of lakehouse, however, it is using the term “Data Cloud” to define its broader, more comprehensive offering.

“We think first about data structure and data governance… that’s where we start. And now, our vectors of expansion are how do you bring more capability into our platform without compromising the promise that we have for customers? For us, AI and [machine learning] ML is one such expansion,” Kleinerman told VentureBeat.

“We have plenty of customers leveraging us for data transformation and what you would hear is, I don’t want to have to copy my data out to another system and compromise my governance just to do machine learning training. So that’s what we’re bringing in,” he said.

In addition, both companies have debuted dedicated vertical-specific offerings to cater to retail, healthcare and other growing sectors.

Billion-dollar question: Who will win?

While Ghodsi cited a UC Berkeley study to note that adding machine learning algorithms on top of a data warehouse is basically a ‘large hack’ and not very feasible due to performance and support issues, Snowflake claims otherwise.

“Data is the most important part of any ML system, and Snowflake is continually innovating in how to mobilize the world’s data to best enable ML. Snowflake was designed from the ground up to be a single source of data truth, and to deliver the performance, speed, and elasticity needed to process the growing volume of data that powers ML workflows,” Tal Shaked, ML Architect at Snowflake, told VentureBeat.

The two companies have also been engaged in a PR battle, with Databricks claiming that its SQL lakehouse platform provides superior performance and price-performance over Snowflake, even on data warehousing workloads (TPC-DS), and Snowflake flatly disputing the claim.

On November 2, Databricks shared a third-party benchmark by the Barcelona Supercomputing Center to note that its SQL lakehouse performs 2.7 times faster than a similarly sized Snowflake setup (which took 8,397 seconds). However, about ten days later, Snowflake published a blog post saying that the claim lacks integrity and is wildly incongruent with its internal benchmarks and customer experiences. Instead, it claimed to have run the same benchmark in 3,760 seconds. The company even asked users to test for themselves.
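To see how far apart the two sets of numbers are, the quoted figures can be put side by side. This is a rough sketch, assuming "2.7 times faster" means a runtime ratio; the Databricks runtime computed below is derived from that assumption and is not a figure reported in the article.

```python
# Figures quoted in the article, in seconds.
snowflake_time_in_benchmark = 8397   # Snowflake runtime in the third-party benchmark
snowflake_time_self_reported = 3760  # Snowflake's own run of the same benchmark
claimed_speedup = 2.7                # Databricks' claimed advantage

# Assumption: "2.7 times faster" is a ratio of runtimes, so the implied
# Databricks runtime is the Snowflake benchmark runtime divided by 2.7.
implied_databricks_time = snowflake_time_in_benchmark / claimed_speedup

# How much faster Snowflake says it ran than the benchmark said it did.
snowflake_discrepancy = snowflake_time_in_benchmark / snowflake_time_self_reported

# Ratio between Snowflake's self-reported time and the implied Databricks time.
remaining_gap = snowflake_time_self_reported / implied_databricks_time

print(f"Implied Databricks runtime: {implied_databricks_time:.0f} s")
print(f"Snowflake self-reported vs benchmark: {snowflake_discrepancy:.2f}x")
print(f"Self-reported Snowflake vs implied Databricks: {remaining_gap:.2f}x")
```

On these numbers, Snowflake's own run is roughly 2.2 times faster than the result attributed to it, yet still about 1.2 times slower than the implied Databricks runtime.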

In response, Databricks suggested that the improved performance was the result of Snowflake’s pre-baked TPC-DS dataset, which had been created two days after the announcement of the results. With the official TPC-DS dataset, the performance was nowhere near what Snowflake claimed.