Welcome to the exciting world of data engineering!
Every week, you'll get a chance to learn from real-world engineering as you delve into the fascinating world of building and maintaining data platforms.
From the technical challenges of design and implementation to the business considerations of working with clients, you'll get a behind-the-scenes look at what it takes to be a successful data engineer. So sit back, relax, and get ready to be inspired as we explore the world of data together!
During my conversations this week with data scientists interested in transitioning to data engineering, I realized that many important terms were not properly understood.
As a result, I decided to create a short glossary of terms that are essential for someone starting a career in data engineering to comprehend.
Please do not hesitate to respond to this email or contact me if I have missed any important concepts or terms that you feel should be included in the glossary.
What is a Modern data stack?
It’s a collection of (usually cloud-based) software commonly used together to manage and process data. It is usually composed of the following blocks:
1 ETL tool. Ex: Fivetran / Stitch / Airbyte (open source)
Goal: bring data from an external system into your warehouse
1 Data warehouse: Ex: Snowflake / Databricks / BigQuery (Google) / Redshift (AWS)
Goal: transform, combine and organize the data
1 BI tool: Tableau / Looker / Metabase / Superset (open source)
Goal: create dashboards to present data
1 scheduling tool: Airflow / Prefect / Dagster
Goal: manages and coordinates the execution of multiple data pipelines
Let's review some keywords and concepts that will help you understand each of these blocks.
ETL
The term ETL stands for Extract, Transform, Load, which refers to a process used for extracting data from various sources such as databases, APIs, or Excel files, transforming it through operations like renaming, aggregation, and filtering, and finally loading it into a target system.
ETL vs ELT
ETL: Extract Transform Load
ELT: Extract Load Transform
With ELT, you can store all of your historical loads directly in the target system, enabling you to easily reprocess old data in case of changes in your business logic, among other scenarios.
Change Data Capture
A technique used in data processing to capture changes made to a database.
Batch processing
Process and ingest data in bulk.
Stream Processing
Process and ingest data record by record.
Lambda architecture
Build 2 concurrent ingestion systems: one for streaming processing and one for batch processing.
Kappa architecture
Eliminate the batch processing system and process all the incoming records in a stream.
Incremental Load
When loading change data capture into your destination table, you have several options:
- insert/update: you calculate which records already exist in your DB, update them, and insert the new ones.
- delete/insert: you delete existing records and insert new ones.
- append: you add all the new records at the end of the table.
Data Warehouse
Materialization
A data warehouse typically consists of two main types of objects: views and tables. Views are computed each time they are invoked, whereas tables store results physically. Materialization is the process of calculating and storing tables at a particular point in time.
Landing vs Staging
The landing area in a data warehouse contains all of the raw data collected from various sources, while the staging area is where the raw data is cleaned, combined, and transformed before loading into the target system. The main reason for splitting the two is to allow for easy reprocessing of staging objects in case of any changes in the business logic.
Snapshots
A data snapshot refers to a version of data that was captured or taken at a specific point in time in the past.
Backfilling
Retroactively applying changes or updates to existing data. It is a way to update historical data with new information that was not available at the time the data was initially collected or processed.
Dimensional modeling: fact vs dimension
Defines the structure of your warehouse: data is organized into two types of tables: fact tables and dimension tables.
fact = measure of an event. Ex: an order in an e-commerce website
dimension = context of a fact. Ex: the user who placed the order, the products compositing the order
Data modeling plays a critical role in ensuring the efficiency of your data warehouse. It involves defining how your data is organized and stored in your database, including which tables contain which information and how the tables are linked together.
Different modeling strategies exist star schema / snowflake schema / data vault. More about it in one of the next newsletter releases.
Slow changing dimension (SCD 1/2/3)
Design the method used to capture changes made to a record.
SCD 1= only the latest version of the record is retained
SCD 2= all the historical version of a record is retained
SCD 3= captures the last and current version of a record
Orchestrator
Data orchestrator is a software tool that enables the management and automation of data pipelines and workflows.
It will be used for example:
trigger a regular call to an API
trigger a materialization in a warehouse
run some tests in your warehouse
DAG (Directed Acyclic Graph):
DAG is a term that is commonly used in the context of orchestrators, which coordinate tasks in workflows, specifying the order in which tasks should be executed. DAG, or Directed Acyclic Graph, is a structure that links tasks together, forming a directed graph without any loops or cycles.
I hope that this short glossary has helped you understand some of the basic concepts usually used in the context of a data stack (modern or not^^). By familiarizing yourself with these key terms and concepts, you'll be better equipped to communicate effectively with others in the field and navigate the various aspects of data engineering.
thank you for reading.
-Ju
I would be grateful if you could help me to improve this newsletter. Don’t hesitate to share with me what you liked/disliked and the topic you would like to be tackled.
P.S. you can reply to this email; it will get to me.