Over the last ten years, the data stack has steadily decentralized: many tools have been created, each solving a niche problem.
While this has unlocked a lot of new features and possibilities, it has also resulted in:
Fragmented development experience
High setup costs (engineering)
High maintenance costs (engineering)
Many vendors have identified this problem and are trying to expand horizontally to solve it (centralization).
Let's dig into some examples in this blog post.
1- Rebundling via Orchestrators
Orchestrators are in the perfect position to offer a central control plane over the complete stack.
While their original purpose is to coordinate various jobs via DAGs, they are doing more and more.
Their infrastructure is usually split between:
a scheduler that plans the tasks,
ephemeral runners/workers that execute them.
This separation is a strong incentive to run more and more computation (data transformations) on the runners themselves, since they are capable, for example, of executing arbitrary Python code.
This has become a common trade-off for engineers: should I execute my code inside the orchestrator, or only trigger its execution from the orchestrator?
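To make the trade-off concrete, here is a minimal Dagster sketch (the asset names and the compute endpoint are made up for illustration): the first asset runs the transformation inside the orchestrator's own worker, while the second only triggers compute running elsewhere.

```python
# Hypothetical sketch of the two patterns, not a recommendation for either.
import requests
from dagster import asset


@asset
def orders_cleaned() -> list[dict]:
    # Pattern 1: the ephemeral worker executes the Python transformation itself.
    rows = [{"amount": 10}, {"amount": -3}]      # stand-in for extracted data
    return [r for r in rows if r["amount"] > 0]  # the actual "T" happens here


@asset
def external_job_triggered() -> None:
    # Pattern 2: the orchestrator is only a control plane; the heavy compute
    # runs elsewhere (a warehouse, a Spark cluster) behind this made-up API.
    resp = requests.post("https://compute.example.com/jobs", json={"job": "clean_orders"})
    resp.raise_for_status()
```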
Orchestrators are also expanding on the ELT side!
First, on the "E" (extract): they now offer libraries of connectors, enabling direct connections to data sources (a sketch of the idea follows the list below).
Dagster integrations
Kestra plugins
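To give a flavor of orchestrator-native extraction, here is a hedged sketch; the endpoint is made up, and real deployments would typically use one of the pre-built integrations listed above rather than hand-rolled code.

```python
# Illustrative only: a Dagster asset performing the "E" step itself by
# calling a fictional REST source, instead of delegating to an external EL tool.
import requests
from dagster import asset


@asset
def raw_orders() -> list[dict]:
    # The orchestrator's worker connects to the data source directly.
    resp = requests.get("https://api.example.com/orders", timeout=30)
    resp.raise_for_status()
    return resp.json()
```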
And for the "T" (transform) component, they are increasingly integrating with dbt.
For instance, Dagster can parse a dbt DAG and execute models independently, providing end-to-end lineage from the source to the distribution layer.
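As an illustration, here is roughly what that integration looks like with dagster-dbt (a hedged sketch; the manifest path and project directory are placeholders): each dbt model becomes an individual Dagster asset.

```python
# Sketch of Dagster's dbt integration: the compiled dbt manifest is parsed
# into one Dagster asset per dbt model, giving model-level runs and lineage.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets


@dbt_assets(manifest=Path("target/manifest.json"))  # placeholder manifest path
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Dagster shells out to the dbt CLI and streams events back as
    # asset materializations, one per model.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[my_dbt_models],
    resources={"dbt": DbtCliResource(project_dir=".")},  # placeholder project dir
)
```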
It is therefore easy to see why orchestrators are expanding horizontally: they can connect to every tool in the stack and provide a single interface to run them all.
2- Rebundling via Cloud Warehouses
Cloud warehouses are also expanding horizontally.
The direction they are pushing the most is toward orchestration.
Take Snowflake: it has released many features around Snowflake tasks.
With Snowflake, you can create tasks and orchestrate them via a basic DAG.
They also offer a declarative alternative with dynamic tables, which refresh automatically based on a freshness target the user specifies.
With these two features, users can build advanced workflows directly within Snowflake, with no need for an extra tool.
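Roughly what this looks like in practice (a hedged sketch using the snowflake-connector-python package; account details, warehouse, and table names are all placeholders):

```python
# Sketch: a two-task DAG plus a dynamic table, created from Python.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholder credentials
    warehouse="MY_WH", database="MY_DB", schema="MY_SCHEMA",
)

statements = [
    # Root task: runs on a cron schedule.
    """CREATE OR REPLACE TASK load_raw
         WAREHOUSE = MY_WH
         SCHEDULE = 'USING CRON 0 6 * * * UTC'
       AS COPY INTO raw_orders FROM @orders_stage""",
    # Child task: the AFTER clause forms a DAG edge.
    """CREATE OR REPLACE TASK clean_orders
         WAREHOUSE = MY_WH
         AFTER load_raw
       AS INSERT INTO orders_clean SELECT * FROM raw_orders WHERE amount > 0""",
    # Declarative alternative: a dynamic table with a freshness target.
    """CREATE OR REPLACE DYNAMIC TABLE daily_revenue
         TARGET_LAG = '1 hour'
         WAREHOUSE = MY_WH
       AS SELECT order_date, SUM(amount) AS revenue
          FROM orders_clean GROUP BY order_date""",
    # Tasks are created suspended; resume children before the root.
    "ALTER TASK clean_orders RESUME",
    "ALTER TASK load_raw RESUME",
]

with conn.cursor() as cur:
    for sql in statements:
        cur.execute(sql)
```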
On the ELT side, there seems to be less development. Instead, cloud warehouses tend to point directly to existing ELT/ETL tools in their documentation, highlighting those tools' connector libraries for integration purposes.
3- Rebundling via ELTs
ELTs are also expanding horizontally.
Take Airbyte and Meltano. Both started as data integration tools, providing an extensive list of connectors that let you connect to almost any data source.
However, they also offer interesting transformation features.
Both integrate with dbt and can execute a dbt DAG in the destination database.
From my perspective, this reduces the need for an orchestrator.
Each pipeline is scheduled (via cron) and automatically triggers downstream materializations when the destination tables are refreshed (a push model). No need for a separate control plane.
This setup works well for simple configurations with minimal transformations. However, it can become limiting for more advanced workflows (complex dependencies, Python transformations, ML).
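For illustration, here is a minimal sketch of this cron-plus-push pattern; the plugin names and project layout are placeholders, and real setups would configure this declaratively rather than in a script.

```python
# Illustrative sketch of the push model described above: the same scheduled
# process loads data and then immediately kicks off the dbt transformations.
import subprocess


def run_pipeline() -> None:
    # 1. Extract + load (placeholder Meltano plugins).
    subprocess.run(["meltano", "run", "tap-postgres", "target-snowflake"], check=True)
    # 2. Push-style "T": dbt materializes downstream models as soon as
    #    the destination tables are refreshed.
    subprocess.run(["dbt", "build", "--project-dir", "transform"], check=True)


if __name__ == "__main__":
    run_pipeline()  # in practice, triggered by a cron entry such as: 0 * * * *
```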
4- Rebundling via Public Cloud
Microsoft has also recognized this movement toward the aggregation of data stacks.
They have released Microsoft Fabric, which essentially bundles all their existing tools into one platform.
Users can do everything in one SaaS:
Ingest data and orchestrate pipelines with Data Factory.
Transform them with Synapse.
Expose them with Power BI.
The promise is indeed interesting.
However, I lack the hands-on experience with Fabric to form a clear opinion on its strengths and weaknesses.
The frontiers between orchestrators, ELTs, and cloud warehouses are evolving quickly.
All these tools have increasingly overlapping features, often with the same promise: do everything from one place.
In this cannibalization game, I see the warehouses winning in the long term, as they own the data.
Owning the data confers an unfair advantage: customers have to buy the warehouse first, before buying all the others…
Thanks for reading,
-Ju
I would be grateful if you could help me improve this newsletter. Don't hesitate to share what you liked/disliked and the topics you would like me to tackle.
P.S. You can reply to this email; it will get to me.