Building data pipelines is easy.
Maintaining them is not.
In my first article of 2024, I wrote a small manifesto detailing how I believe the development experience for data engineers could be improved.
Many of these points addressed the challenges that come with maintaining pipelines.
Pipelines always break.
There are just too many external factors that you cannot control: upstream bug fixes, dirty or incomplete data, human error, etc.
However, we often make the mistake of designing them assuming they will never fail.
This can lead to super stressful bug-fixing situations.
Let's explore how we can mitigate this risk.
This is what a smooth pipeline bug-fixing workflow looks like:
Step 1: Get the execution context
The first step is to revert your data to its state when the pipeline broke.
Nowadays, we have “time travel” capabilities across the entire stack:
In the warehouse, via table history.
In the data lake, via Iceberg snapshots.
Combined with your git history, you can get a 360-degree overview of your bug context (code + data) and iterate a fix from there.
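To make this concrete, here is a minimal sketch of what such time-travel queries look like. The table names are made up, and the exact syntax depends on your engine (Snowflake-style shown for the warehouse, Trino-style for Iceberg):

```python
from datetime import datetime, timezone

def warehouse_time_travel(table: str, ts: datetime) -> str:
    """Snowflake-style time travel: query the table as it was at `ts`."""
    return (
        f"SELECT * FROM {table} "
        f"AT (TIMESTAMP => '{ts.isoformat()}'::TIMESTAMP_TZ)"
    )

def iceberg_time_travel(table: str, ts: datetime) -> str:
    """Trino-style time travel on an Iceberg table: query the snapshot as of `ts`."""
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts.isoformat()}'"
    )

# The moment the pipeline broke, recovered from your orchestrator's logs.
failure_ts = datetime(2024, 3, 1, 6, 30, tzinfo=timezone.utc)
print(warehouse_time_travel("analytics.orders", failure_ts))
print(iceberg_time_travel("lake.orders", failure_ts))
```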
Step 2: Reproduce the bug
The next step is to reproduce the error to understand what happened.
You probably don't want to test queries directly in production, but rather in a safe, sandboxed environment.
Fortunately, many tools enable us to design virtual data environments:
In the warehouse, with zero-copy clones.
In the transformation framework, with pointer views (SQLMesh).
In the data lake, with multi-table Iceberg branches (Nessie).
In your cloud environment, via IaC stages.
All these features allow us to create virtual clones of the production objects, each with its own lifecycle.
The developer can then replay the failing code in this sandbox, understand what happened, and fix the issue.
And all of this without impacting the production flow.
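As an illustration, a sandbox built on zero-copy clones can be as simple as generating one CLONE statement per production table. The schema and table names below are hypothetical, and the syntax shown is Snowflake-style:

```python
def sandbox_statements(schema: str, tables: list[str], suffix: str) -> list[str]:
    """Zero-copy clone each production table into a dedicated sandbox schema."""
    stmts = [f"CREATE SCHEMA IF NOT EXISTS {schema}_{suffix}"]
    for t in tables:
        # A clone shares storage with the source table until either side diverges,
        # so creating it is cheap and does not impact production.
        stmts.append(f"CREATE TABLE {schema}_{suffix}.{t} CLONE {schema}.{t}")
    return stmts

for stmt in sandbox_statements("prod", ["orders", "customers"], "bugfix_42"):
    print(stmt)
```

When the fix is validated, dropping the sandbox schema cleans everything up, which is what gives each clone its own lifecycle.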
If you want to know more about virtual data environments, check this article:
Step 3: Backfilling
The bug has now been identified, and the code is corrected.
The final step is to backfill the data.
The backfilling process can become quite stressful and challenging if not thought out from the start.
Many times, backfilling is done via ad-hoc queries:
UPDATE source_table SET column_name = <new_value> WHERE date BETWEEN {start_date} AND {end_date}
This approach may work well for simple use cases but can become extremely stressful with complex tables and numerous downstream dependencies.
You'll have to make assumptions about what needs to be re-executed and assess the potential downstream impact.
This task is especially challenging in larger projects.
The recommended approach for backfilling is not to write ad-hoc SQL but to re-run the pipeline over a specified interval.
This is done by designing the pipeline around the following patterns:
1: Partition the dataset into units of work.
A unit of work is a table partition that exists independently of the others.
Each run treats this unit as immutable: it is always overwritten as a whole, never mutated in place.
Running the pipeline means executing one or several of these work units.
Just as we isolate specific logic to write unit tests, partitioning a dataset forces the developer to write reproducible code that can be tested and isolated.
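Here is a minimal sketch of such partitioning, assuming daily units of work over an inclusive date interval:

```python
from datetime import date, timedelta

def daily_partitions(start: date, end: date) -> list[date]:
    """Split an inclusive date interval into independent daily units of work."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

# Each date is one unit of work the pipeline can run (or re-run) in isolation.
units = daily_partitions(date(2024, 3, 1), date(2024, 3, 3))
print(units)  # [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 3)]
```

Running the full pipeline is then just a loop over these units, and backfilling is the same loop over a different interval.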
2: Write idempotent transformation code
The transformation code should be written to consistently deliver the same output regardless of when it is run.
If a pipeline is idempotent, you can safely replay a unit of work when backfilling because you control its output.
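A toy sketch of the idea, using an in-memory dict as a stand-in for a partitioned table: because the transformation fully overwrites its partition rather than appending, replaying it always yields the same state.

```python
# A toy "table" keyed by partition; a real pipeline would target a warehouse table.
table: dict = {}

def run_unit_of_work(partition: str, source_rows: list[dict]) -> None:
    """Transform one partition and fully overwrite it (never append)."""
    transformed = [{**r, "amount_usd": r["amount"] * 1.1} for r in source_rows]
    table[partition] = transformed  # overwrite: the output depends only on the input

rows = [{"id": 1, "amount": 100}]
run_unit_of_work("2024-03-01", rows)
first = table["2024-03-01"]
run_unit_of_work("2024-03-01", rows)  # replaying the unit changes nothing
assert table["2024-03-01"] == first
```

Contrast this with an INSERT-based load, where replaying the same unit would duplicate every row.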
Let’s take an example and have a closer look at incremental loads.
Incremental loads are typically performed by comparing source and destination records.
The objective is to identify records present in the source table but absent in the target.
This is traditionally achieved by comparing a timestamp column in the source and destination.
Such a model would typically look like this in dbt:
SELECT * FROM source_table
{% if is_incremental() %}
WHERE ts >= (SELECT MAX(ts) FROM {{ this }})
{% endif %}
This method works well until there is missing data in the source.
In such cases, if you want to backfill, your only option is to build an ad-hoc query.
As mentioned before, partitioning your dataset solves this issue.
Each load processes one partition of the dataset.
Your code is written and tested to transform a partition independently of the external context.
SELECT * FROM source_table
WHERE ts BETWEEN {START_DATE} AND {END_DATE}
Because the code is idempotent, you can replay it safely without reconstructing the history of events surrounding its original execution.
Backfilling then simply means selecting the partitions that need to be rerun:
airflow dags backfill \
--start-date START_DATE \
--end-date END_DATE \
dag_id
There's no need to create any additional logic.
Just press play and wait.
Here are some other examples of ways to keep a pipeline idempotent:
Systematically eliminate duplicates (e.g., with QUALIFY).
Never use INSERT alone; use MERGE or DELETE + INSERT.
Filter with WHERE clauses on bounded time intervals, never on CURRENT_TIMESTAMP().
Check that the source table is complete before running the transformation.
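For instance, the first rule (systematic deduplication) can be sketched in plain Python as the equivalent of QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) = 1; the column names are illustrative:

```python
def dedupe_latest(rows: list[dict], key: str, ts: str) -> list[dict]:
    """Keep only the most recent row per key, so replays never create duplicates."""
    latest: dict = {}
    for row in rows:
        k = row[key]
        # Keep the row with the greatest timestamp for each key.
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"id": 1, "ts": "2024-03-01", "v": "old"},
    {"id": 1, "ts": "2024-03-02", "v": "new"},
    {"id": 2, "ts": "2024-03-01", "v": "only"},
]
print(dedupe_latest(rows, "id", "ts"))
```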
While building the pipeline using this approach may require more time for initial implementation (one-time cost), the return on investment will quickly pay off by reducing the time spent on maintenance tasks (recurring cost).
-Ju
I would be grateful if you could help me improve this newsletter. Don't hesitate to share what you liked or disliked, and which topics you would like to see tackled.
P.S. You can reply to this email; it will get to me.