Welcome to the exciting world of data engineering!
Every week, you'll get a chance to learn from real-world engineering as you delve into the fascinating world of building and maintaining data platforms.
From the technical challenges of design and implementation to the business considerations of working with clients, you'll get a behind-the-scenes look at what it takes to be a successful data engineer. So sit back, relax, and get ready to be inspired as we explore the world of data together!
Change Data Capture (CDC)
Intro
Change Data Capture (CDC) is a crucial aspect of data engineering that has gained significant importance in recent years. With the growing volume of data being generated every day, it has become imperative for organizations to keep track of changes in their data split across different systems in real time.
To keep two systems in sync, you have the option of either fully replicating the tables at a set time interval or capturing and transmitting the changes to the downstream system. The latter approach is known as Change Data Capture (CDC).
How changes are captured?
There are several different methods for capturing Change Data Capture (CDC) data from databases, including:
Log-based CDC: In this method, changes made to the database are recorded in a log file, such as a transaction log or binary log. The log file contains a record of every change made to the database, including the type of change (insert, update, delete), the time the change was made, and the data that was affected.
Trigger-based CDC: In this method, database triggers are used to detect and record changes made to the database. A database trigger is a special type of stored procedure that is automatically executed when a specific event occurs within the database, such as the insertion, update, or deletion of a row.
Query-based CDC: In this method, database changes are detected by periodically querying the database for changes. The query is designed to return only the data that has changed since the last query was run, and the changed data is then stored in a separate location, such as a change log table or a change data capture table.
Data format of the CDC
The format of the data in the source system can vary based on whether changes to records over time are being tracked. For example, it may be necessary to maintain historical information of specific dimensions such as changes in a product's price or a customer's address.
The term "Slow Changing Dimension" (SCD) is used in database management to describe how historical information is handled. Depending on the format used for the SCD, the content of the updated data in the Change Data Capture may vary.
How to integrate CDC in the target system?
A new trend in data integration is shifting from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform), where the data is first loaded into the target system and then transformed. This approach provides greater flexibility with transformations, as it allows for the re-processing of past records without having to re-ingest them from the source system.
In an ELT design, the target system typically includes a landing layer that collects all of the records. The data is then loaded into a staging layer, where the records are transformed and the latest version of each record is made available. This design provides a high degree of flexibility.
Integration frequency of a CDC
The Change Data Capture (CDC) data can be transmitted to the destination system in two ways: batch processing or streaming.
Integrating sources with varying update frequencies can pose challenges in terms of architecture. For example, integrating a batch source and a streaming source means using different landing layers, as the underlying tools may be different (e.g. Kafka for streaming vs. file-based integration for a batch). This can lead to a complex architecture, with a lot of code duplication, known as a "lambda architecture."
A more modern solution to this problem is to convert all batch integrations to streaming integrations. This results in a simpler architecture and a unified serving layer.
However, it can also present difficulties in terms of cost and time when dealing with large data loads, such as historical loads.
In conclusion, Change Data Capture (CDC) is a crucial concept in data engineering that helps maintain data consistency and accuracy across multiple systems. Understanding the various CDC formats available is key to making informed decisions when designing your integration pattern to ensure that it is both generic and robust. The goal is to achieve a simple, flexible, and efficient integration architecture that meets the specific needs of each project.
thank you for reading.
-Ju
I would be grateful if you could help me to improve this newsletter. Don’t hesitate to share with me what you liked/disliked and the topic you would like to be tackled.
P.S. you can reply to this email; it will get to me.