The cloud data warehouse (Snowflake, BigQuery, Databricks) sits at the heart of the modern data stack, and it is also its largest expenditure.
Using these warehouses in conjunction with DBT has empowered data teams to extend analytical tools to a growing number of users within their companies.
However, this democratization has given rise to a significant issue: soaring credit costs.
A significant selling point of these warehouses is scalability. While this is undoubtedly true, it raises a critical question:
is an infinitely scalable solution necessary for the majority of your queries?
In reality, most companies need the capabilities of platforms like Snowflake, Databricks, or BigQuery for only a very limited number of queries, as described here:
“Very often when a data warehousing customer moves from an environment where they didn’t have separation of storage and compute into one where they do have it, their storage usage grows tremendously, but their compute needs tend to not really change.”
Constantly burning Snowflake credits without actually leveraging its scalability amounts to a disproportionate expense.
In fact, a substantial part of your bill is likely subsidizing the sales teams at these companies rather than covering the computing resources you actually need.
This is hardly an efficient use of resources.
Alternative solutions are emerging as companies intensify their efforts to limit analytics spending (cf. the Instacart case) by leveraging the latest open-source advancements.
Multi-Compute Engine Data Stack
The separation of storage and compute (cf. this post) doesn't necessarily mean that the same vendor must provision both.
An emerging approach to building data stacks uses Apache Iceberg as a unified storage layer and combines it with the optimal compute engine for each workload:
Let’s have a look at each element of this stack.
Unified Storage: Apache Iceberg
Apache Iceberg is an open table format: an open-source standard for defining tables on top of data files.
As explained here:
Imagine you have a thousand Parquet files in a cloud storage bucket. If you have three different readers of that data, they all need to know which files correspond to a table.
Table formats like Iceberg are designed to solve this problem—they provide rich table metadata on top of files.
Iceberg adoption has grown significantly, and it is emerging as the standard table format for the data lake, since the same tables can be read by different compute engines.
This allows companies to craft future-proof data stacks by picking the compute engine best suited to each workload.
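To make this concrete, here is a minimal sketch of reading an Iceberg table from Python with PyIceberg; the catalog name and table name are placeholders, not an actual setup:

```python
# pip install pyiceberg  (plus pyarrow/pandas for .to_pandas())
from pyiceberg.catalog import load_catalog

# Load a catalog configured locally (e.g. a REST or Glue catalog); "default"
# and "analytics.orders" are hypothetical names used for illustration.
catalog = load_catalog("default")
table = catalog.load_table("analytics.orders")

# The same table could equally be read by Spark, Trino, DuckDB, or Snowflake;
# Iceberg's metadata lets each engine prune files before reading any data.
df = table.scan(
    row_filter="order_date >= '2023-01-01'",
    selected_fields=("order_id", "amount"),
).to_pandas()
```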
Light Compute Engine: DuckDB
A new compute engine has emerged over the last few months: DuckDB.
It’s an open-source embedded OLAP database that can run in-process rather than relying on the traditional client-server architecture.
As with SQLite, there’s no need to install a database server to get started. Installing DuckDB instantly turns your laptop into an OLAP engine.
This makes it the perfect candidate to handle small workloads at a much lower cost than Snowflake, Databricks, and BigQuery.
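As an illustration, here is a minimal sketch of running an analytical query on a folder of Parquet files straight from a laptop; the file path and column names are made up for the example:

```python
import duckdb  # pip install duckdb

con = duckdb.connect()  # in-process: no server to deploy, no credits to burn

# DuckDB can query Parquet files directly with plain SQL.
daily_revenue = con.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM 'data/orders/*.parquet'
    GROUP BY order_date
    ORDER BY order_date
""").df()

print(daily_revenue.head())
```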
New players are entering the market to facilitate better integration of DuckDB into the modern data stack, including initiatives like MotherDuck:
MotherDuck simplifies the ingestion of files located in a remote bucket and makes collaboration around DuckDB easier:
Support for Iceberg tables in DuckDB is still in its early days (via an extension) but will certainly improve a lot over the coming months.
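As a rough sketch of what this looks like today (the table path is a placeholder, and the extension's surface may still change):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")

# iceberg_scan reads an Iceberg table straight from its storage location;
# 'data/warehouse/orders_iceberg' is a placeholder for your own table path.
result = con.sql("""
    SELECT count(*)
    FROM iceberg_scan('data/warehouse/orders_iceberg', allow_moved_paths = true)
""").fetchone()
print(result)
```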
Scalable Engine for Large Workloads: Snowflake
Snowflake has recently announced a new feature that allows the creation of Iceberg tables directly within its platform.
Essentially, these Iceberg tables function as Snowflake tables but use open formats and cloud storage provided by the customer.
This development is intriguing because it opens the way to using Snowflake exclusively for compute-intensive workloads.
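Here is a minimal sketch of what this could look like with the Snowflake Python connector; the connection parameters, external volume, and table definition are assumptions for illustration, not from an actual setup:

```python
import os
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="HEAVY_WH",      # a large warehouse reserved for big workloads
    database="ANALYTICS",
    schema="PUBLIC",
)

# A Snowflake-managed Iceberg table stored on customer-provided cloud storage
# (registered beforehand as an external volume named 'my_iceberg_volume').
conn.cursor().execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS orders_iceberg (
        order_id BIGINT,
        order_date DATE,
        amount NUMBER(12, 2)
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'my_iceberg_volume'
    BASE_LOCATION = 'orders_iceberg/'
""")
```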
New engines for processing large workloads are also emerging to compete in this space:
These platforms promise significant improvements in price-performance ratios, claiming gains of 5 to 20 times when compared to existing data platforms.
DBT as a Unified Data Interface
Since all these engines offer an SQL API, you can use DBT as a unified interface for your data.
Each engine would live in its own DBT project and be orchestrated by a tool such as Dagster:
This would make the choice of engine transparent to the user and simplify the management of these different engines.
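One way to sketch this, assuming one DBT project per engine (the project directories and the split of models between them are hypothetical), is to drive each project from the orchestrator; with Dagster you would wrap these calls in assets or ops, but the plain CLI invocations below show the idea:

```python
import subprocess

# Hypothetical layout: one dbt project per engine, each with its own profile.
PROJECTS = {
    "duckdb": "dbt_light",     # small, frequent models on the cheap engine
    "snowflake": "dbt_heavy",  # compute-intensive models on the scalable engine
}

def run_project(engine: str) -> None:
    """Run every model of the dbt project bound to the given engine."""
    subprocess.run(
        ["dbt", "run", "--project-dir", PROJECTS[engine]],
        check=True,
    )

if __name__ == "__main__":
    run_project("duckdb")
    run_project("snowflake")
```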
In conclusion, the rise of open table formats like Apache Iceberg and lightweight compute engines such as DuckDB opens up a future where organizations are no longer bound to a single vendor for their data warehouse needs.
This opens the possibility of a multi-compute-engine ecosystem, where different engines can be leveraged for different tasks, drastically optimizing costs.
Thanks for reading,
-Ju
I would be grateful if you could help me improve this newsletter. Don't hesitate to share what you liked or disliked and the topics you would like me to tackle.
P.S. You can reply to this email; it will get to me.