In today's post, I'll be stepping into the shoes of an organization starting to build a data platform from the ground up.
Having constructed several cloud data platforms for various companies, I have accumulated some insights from both successes and mistakes.
This article elaborates on how I would embark on a new data project from scratch, highlighting the design mistakes I would avoid.
Focus 1: Build trust
Just like launching a new startup, building a new platform involves the challenge of creating value within the constraints of limited resources from the start:
you are probably working alone
you get limited input from users (people don’t know what they need … for now…)
and with budget constraints
Hence, the primary focus in the initial stages should be on gaining the trust of your users by delivering maximum value with limited resources.
This is where drawing a parallel between the well-known entrepreneurial concept of a Minimal Viable Product (MVP) and a Minimal Viable Data Stack comes into play.
Let’s see how I imagine the construction of this minimal data stack.
Prioritized data catalog
Your company likely has a mature production database containing transactional data (e.g., users, products, purchases) and business users with read-only access. These users typically run ad-hoc queries to extract information for use in tools like Excel.
The company probably needs to operationalize the existing analytics workflow and establish a more effective method for analyzing this data.
Contrary to a common misunderstanding, developing a Minimal Viable X doesn't imply hurriedly creating a flawed product. Instead, it involves concentrating on a pivotal feature and building it well.
In this context, having a functional platform with 2 robust data flows can prove more valuable than integrating 10 data sources with multiple data quality issues.
I would begin by conducting interviews with stakeholders to create a comprehensive overview of requirements and a data roadmap:
Gather insights on the information they require.
Identify the location, format, and historical span of the necessary data.
Determine the data's update frequency.
This process will yield a prioritized list of data sources to focus on.
Minimal viable data stack
Once you have agreed on the minimal subset of data sources to load, you can start building your platform. Here's the design I would opt for:
Public cloud for the ingestion process
Before loading the data extracted from APIs or internal databases into the warehouse, I would suggest storing it in a storage bucket first. This strategy offers flexibility, enabling you to switch warehouse providers if necessary or to initiate a historical data load in case of any issues within the warehouse.
If you need to get data from a production database, I would trigger a Lambda function (or equivalent) at a regular interval to capture changes from the source DB (see my post on Change Data Capture). A minimal sketch of such a function follows after the notes below.
Important here from my experience:
perform as few transformations as possible at this stage: all the business logic should be implemented in the warehouse
Ensure your Lambda code is portable with Docker.
It's often challenging to predict whether you'll require short-term computing resources or long-running processes. By containerizing your code, you gain the flexibility to seamlessly transition between AWS Lambda compute and longer-running container environments as needed. See my post on when to use a lambda function or not.
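To make this concrete, here is a minimal sketch of such an ingestion function. It assumes a PostgreSQL source, a RAW_BUCKET environment variable, and an updated_at column to detect changes; the table, bucket, and watermark names are all illustrative, and the handler deliberately writes the raw records untouched.

```python
import json
import os
from datetime import datetime, timezone

import boto3      # AWS SDK
import psycopg2   # assumes a PostgreSQL source database

# Illustrative names -- adapt to your own setup.
RAW_BUCKET = os.environ["RAW_BUCKET"]       # e.g. "my-company-raw-landing"
SOURCE_TABLE = "public.orders"
CURSOR_COLUMN = "updated_at"                # column used to detect changed rows

s3 = boto3.client("s3")


def handler(event, context):
    """Extract rows changed since the last run and drop them, untouched, into S3."""
    since = event.get("since", "1970-01-01T00:00:00Z")  # watermark passed by the scheduler

    conn = psycopg2.connect(os.environ["SOURCE_DB_DSN"])
    with conn, conn.cursor() as cur:
        cur.execute(
            f"SELECT row_to_json(t) FROM {SOURCE_TABLE} t WHERE {CURSOR_COLUMN} > %s",
            (since,),
        )
        rows = [r[0] for r in cur.fetchall()]

    # No business logic here: raw records land in the bucket as JSON lines,
    # partitioned by extraction time so historical reloads stay possible.
    extracted_at = datetime.now(timezone.utc).isoformat()
    key = f"orders/extracted_at={extracted_at}/part-0000.json"
    s3.put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body="\n".join(json.dumps(r, default=str) for r in rows).encode("utf-8"),
    )
    return {"rows_extracted": len(rows), "s3_key": key}
```

Packaged in a Docker image, the same code can run on Lambda today and in a longer-running container environment later if extraction outgrows Lambda's limits.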
Cloud Data Warehouse + DBT
For data transformation, I recommend choosing a modern cloud data warehouse like Snowflake, BigQuery, or Databricks coupled with DBT.
DBT has become a standard tool for modern data engineering workflows. Its most important features are:
out-of-the-box incremental loads
out-of-the-box lineage between the tables/views
easy data testing (uniqueness, null columns, etc.)
I would organize your data flow into three separate schemas:
Landing Schema: This schema will maintain a historical record of all your data loads through an ELT process.
Staging Schema: Use this schema to store the latest version of each record (incremental tables; see the sketch at the end of this section) and implement your business logic.
Visualization Schema: Expose data to your users through a BI tool in this schema, making it accessible and understandable.
Important here from my experience:
Start with DBT: Incorporate DBT into your process from the outset.
Leverage DBT macros to create reusable components for your business logic.
Ensure Backward Compatibility: Keep tables backward-compatible. Avoid altering existing columns or using wildcard SELECT * queries. Instead, create new columns for changes and manage deprecations carefully.
Track Processing Timestamps: Store processing timestamps in your tables. Document when files arrived in the bucket, when they were loaded into the landing schema, and other relevant timestamps.
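Most of your models will probably be plain SQL, but to make the incremental staging idea concrete, here is a rough sketch written as a dbt Python model (available since dbt 1.3 on Snowflake, BigQuery, and Databricks). The model, column, and source names are made up, and the exact DataFrame API depends on your warehouse; this assumes Snowflake's Snowpark.

```python
# models/staging/stg_orders.py -- hypothetical incremental staging model
def model(dbt, session):
    dbt.config(
        materialized="incremental",
        unique_key="order_id",
    )

    # Landing table written by the ingestion process (name is illustrative)
    orders = dbt.ref("landing_orders")

    if dbt.is_incremental:
        # Only process records newer than what is already in the staging table
        max_loaded = session.sql(
            f"select max(loaded_at) from {dbt.this}"
        ).collect()[0][0]
        if max_loaded is not None:
            orders = orders.filter(orders["loaded_at"] > max_loaded)

    return orders
```

The same pattern in a SQL model relies on the is_incremental() macro; either way, DBT handles the merge on the unique key and keeps the lineage graph up to date.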
Scheduling
You need a scheduler to coordinate all the components of your stack: triggering API calls for data ingestion, running DBT materializations, and so on. A minimal DAG sketch follows at the end of this section.
There are several tools available for orchestrating data workflows:
Airflow (most commonly used) or its managed versions like MWAA in AWS or Astronomer.
Newcomers:
Prefect
Dagster
Important here from my experience:
no transformation or processing should occur within the scheduler
MWAA can present challenges: complex dependency management and version lag compared to the official Apache Airflow release.
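As mentioned above, the scheduler should only trigger work that runs elsewhere. Here is a minimal Airflow DAG sketching that idea; the Lambda function name, the dbt project path, and the daily schedule are assumptions for illustration.

```python
# dags/minimal_data_platform.py -- orchestration only, no data processing in Airflow.
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def trigger_ingestion_lambda(**context):
    # The heavy lifting happens in the Lambda function, not in the Airflow worker.
    boto3.client("lambda").invoke(
        FunctionName="extract-orders-to-s3",   # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"since": context["ds"]}).encode("utf-8"),
    )


with DAG(
    dag_id="minimal_data_platform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="trigger_ingestion_lambda",
        python_callable=trigger_ingestion_lambda,
    )

    # DBT runs inside the warehouse; Airflow only issues the command.
    transform = BashOperator(
        task_id="dbt_run_and_test",
        bash_command="cd /opt/dbt_project && dbt run && dbt test",
    )

    ingest >> transform
```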
Don’t forget Observability…
Being able to track and understand what has happened in the past, across the entire stack, and correlate it with the observed data is essential for troubleshooting and quickly fixing your pipelines.
Storing logs from different sources in a central location, like an S3 bucket, and then ingesting them into your data warehouse is, in my view, the best approach.
This enables you to query and analyze the logs using SQL, providing developers with a faster and more seamless way to diagnose issues and identify the context in which they occurred.
Linking logs to specific data records can enhance problem-solving and allow for more precise debugging.
Setting this up right from the beginning will allow you to quickly fix errors and earn the trust of your first users.
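As a rough sketch of what this can look like, the helper below writes structured JSON log records to a central bucket (the bucket name and fields are illustrative); a small load job can then push them into the warehouse so they are queryable with SQL next to the data they describe.

```python
# observability.py -- hypothetical helper for structured pipeline logs.
import json
import uuid
from datetime import datetime, timezone

import boto3

LOG_BUCKET = "my-company-pipeline-logs"   # illustrative bucket name


def log_event(pipeline: str, step: str, status: str, **details):
    """Write one structured log record; stable keys keep the SQL queries simple."""
    record = {
        "event_id": str(uuid.uuid4()),
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "pipeline": pipeline,
        "step": step,
        "status": status,       # e.g. "started", "succeeded", "failed"
        "details": details,     # free-form context (row counts, s3 keys, error text)
    }
    key = f"logs/{pipeline}/dt={record['logged_at'][:10]}/{record['event_id']}.json"
    boto3.client("s3").put_object(
        Bucket=LOG_BUCKET,
        Key=key,
        Body=json.dumps(record).encode("utf-8"),
    )


# Usage inside the ingestion Lambda, for example:
# log_event("orders", "extract", "succeeded", rows_extracted=1234, s3_key=key)
```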
… And alerting
Not being aware of pipeline failures can lead to serious problems. It's crucial to dedicate time early on to create a simple alerting system. This system will notify you whenever there's an error in your data processing pipeline.
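If you use Airflow, its failure callbacks are an easy place to start. The sketch below posts failures to a Slack incoming webhook; the webhook environment variable is an assumption, and the same idea works with SNS, e-mail, or Teams.

```python
# alerting.py -- minimal failure alerting sketch (webhook URL is an assumption).
import json
import os
import urllib.request


def notify_failure(context):
    """Airflow on_failure_callback: post the failing task to a Slack incoming webhook."""
    task = context["task_instance"]
    message = {
        "text": (
            f":red_circle: Pipeline failure\n"
            f"DAG: {task.dag_id} | Task: {task.task_id}\n"
            f"Logical date: {context['ds']}\n"
            f"Log URL: {task.log_url}"
        )
    }
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],          # hypothetical env var
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


# Attach it to the DAG (or to individual tasks) so every failure pings you:
# DAG(..., default_args={"on_failure_callback": notify_failure})
```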
To sum it up, when you start building a data platform, you need to balance business thinking with technical skills. You begin with limited resources, so focus on delivering real value to users to build trust.
Fortunately, there are powerful tools available that make it possible for even a one-person team to create a strong platform.
What’s important is to leverage and coordinate these tools (cloud, warehouse, scheduler) and make sure to invest time in essential building blocks: observability, alerting, and data testing.
Thanks for reading,
-Ju
I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked and the topics you would like me to tackle.
P.S. you can reply to this email; it will get to me.