Bonjour!
I'm Julien, freelance data engineer based in Geneva π¨π.
Every week, I research and share ideas about the data engineering craft.
Not subscribed yet?
I wanted to enhance the newsletter experience by sharing the song Iβm listening to while writing this piece.
Please put on your headset, crank up the volume, and letβs go!
The compute category has evolved a lot these past years
We started with warehouses where compute and storage were tightly coupled.
Next, we transitioned to solutions with separate compute and storage within the same provider.
Now, we are adopting fully decoupled storage using Iceberg.
The most significant advantage of this transition is the ease of integrating single-node engines (like DuckDB, Polars, and others) with the rest of the stack.
Previously, taking parts of a pipeline outside the warehouse required a separate ETL to move data around.
Building and maintaining this ETL often killed the ROI of using cheaper compute.
With Iceberg as a bridge, we can now run workflows on different engines without copying data.
Iβve been considering how to integrate these engines into a pipeline over the past few weeks.
Letβs explore some potential approaches.
Option 1: one project = one worker
This is the most straightforward approach: run your entire dbt project in a single Docker container using DuckDB as the sole engine.
The workflow looks like this:
1. dbt sources are read via the DuckDB Iceberg extension
2. The dbt DAG is executed with DuckDB as the engine.
3. Target tables are materialized to Iceberg using PyIceberg.
Pros:
β’ Simplicity
β’ Efficient data passing across models
Cons:
β’ All models must fit in the instanceβs memory
β’ No materialization of intermediate models
Option 2: one model = one worker
In this model, each transformation is handled by a separate worker.
This worker loads the source table into memory, performs the query, and writes the results to a destination Iceberg table.
The code using DuckDB would look like this:
Instead of running a full materialization, we could switch to an incremental load.
However, PyIceberg isnβt yet optimized for MERGE INTO operations.
Currently, the only option is to use OVERWRITE with an overwrite_filter matching updated rows, but Iβve experienced performance degradation with this method.
Pros:
β’ All models are materialized to Iceberg.
β’ You can control the resources used by each model (instance size).
Cons:
β’ It only works as long as table fits in workerβs memory.
β’ Inefficient incremental models.
Option 3: one partition = one worker
One possible approach to counter the abovementioned limitations is to work at the partition level.
I discussed this approach in a previous article:
The idea is to divide a dataset into partitions that exist independently.
Each partition is immutable and consistently overwritten, avoiding the row-level limitations of PyIceberg discussed earlier.
Each worker is then sized to handle a single partition.
Add more workers to process additional partitions for large backfills or initial loadsβthereβs no need for a particular βlargeβ instance.
Limiting the number of concurrent workers can be beneficial in this setup, as too many can lead to excessive write conflicts and retries.
This can be achieved by adding a buffer/queue from which each worker pulls partitions to process.
To boost throughput, you can then allocate more memory to a worker and process multiple partitions in a single run (one Iceberg commit).
You might ask: what if I want to perform cross-partition aggregations?
In that case, if a single worker isnβt large enough, our favorite data warehouses are here to help :)
Pros:
β’ Can handle large tables.
β’ Offers better control over worker sizing as the tables grow.
β’ Capable of handling large backfills and continuous loads with the same setup.
Cons:
β’ Requires an orchestrator to track which partitions to process.
β’ Must control the number of concurrent writers.β
AWS Implementation
If your partition processing fits within the Lambda limitations of 10GB and 15 minutes, you can directly leverage the SQS + Lambda pattern.
But thatβs probably not the case, and youβll probably need to move to ECS Fargate.
ECS Fargate is an AWS service that allows you to run containers in a serverless manner.
You can get large runners (up to 120GB, 16vpcu) that run containers hosted in ECR.
This comes with some limitations:
β’ Tasks must run in a VPC (a distinct IP is assigned to each task, so ensure enough IPs are available).
β’ Has a longer cold start time than Lambda (~ 1 min).
β’ Does not support SQS triggers.
Since there is no direct SQS to Fargate trigger, we need to implement an additional SQS-watching service that will start new tasks based on the SQS queue depth:
Orchestration
Ideally, we would like to use a tool like dbt to handle model lineage, but dbt does not currently support remote execution in an ECS task or Iceberg (for now).
Therefore, we must rely on an orchestrator to manage the various models.
This orchestrator needs to perform the following tasks for each model:
1: Get the partitions to update
2: Trigger the processing of each partition (ECS task) while controlling concurrency.
3: Wait for the processing of all the partitions
This can be done with various orchestrators:
Step function with Map + ECS RunTask state
Airflow with EcsRunTaskOperator + Dynamic Task Mapping
Dagster via PipesECSClient + partitioned asset
Iβll likely share a detailed implementation with one of these orchestrators in a future post.
I will also include a benchmark comparing the price and performance of this setup to their warehouse counterparts.
Stay tuned!
Iβve been implementing Iceberg data lakes for clients over the last few months and really enjoying it.
Adopting Iceberg has reduced their costs by allowing them to select the optimal engine for each scenario.
Theyβve broken free from reliance on a singleβoften costlyβwarehouse vendor.
If youβre Iceberg-curious and want to explore if itβs a fit for your organizationβs data stack, drop me a DM!
Thanks for reading,
Ju
I would be grateful if you could help me to improve this newsletter. Donβt hesitate to share with me what you liked/disliked and the topic you would like to be tackled.a
P.S. You can reply to this email; it will get to me.