SQLMesh does something similar by using views and then "pointing" the view to the underlying physical table you want to use for that environment.
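Roughly, the idea looks something like this (a toy sketch with DuckDB, not SQLMesh's actual internals; all schema and table names are made up):

import duckdb

con = duckdb.connect()
con.sql("CREATE SCHEMA physical_layer")
con.sql("CREATE SCHEMA prod")

# Two physical versions of the same model, e.g. built by two versions of the code.
con.sql("CREATE TABLE physical_layer.orders__v1 AS SELECT 1 AS id, 10 AS amount")
con.sql("CREATE TABLE physical_layer.orders__v2 AS SELECT 1 AS id, 12 AS amount")

# The environment only ever queries the view.
con.sql("CREATE VIEW prod.orders AS SELECT * FROM physical_layer.orders__v1")

# "Promoting" a change is just repointing the view to the other physical table.
con.sql("CREATE OR REPLACE VIEW prod.orders AS SELECT * FROM physical_layer.orders__v2")

print(con.sql("SELECT * FROM prod.orders").fetchall())  # [(1, 12)]

The nice part is that promotion is basically free: no data is rewritten, only the view definition changes.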
Another approach we used at my old company was having multiple versions of our pipeline run in parallel (production + pre-release), but based on the same input data source. Then we would have jobs compare the results of the two versions, and if things aligned with our expectations we'd push our code to production. We would only run the pre-release version when needed since it was an expensive process, but it worked well for our use case. It also helped that the input data source we received didn't change much and we rarely needed to make changes to that step of the pipeline.
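The comparison jobs don't need to be fancy; conceptually it was just a symmetric diff of the two outputs, something like this (again a toy DuckDB example with made-up table names):

import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE prod_output AS SELECT 1 AS id, 10 AS amount")
con.sql("CREATE TABLE prerelease_output AS SELECT 1 AS id, 10 AS amount")

# Rows present in one output but missing from the other; both empty means the versions agree.
only_in_prod = con.sql("SELECT * FROM prod_output EXCEPT SELECT * FROM prerelease_output").fetchall()
only_in_prerelease = con.sql("SELECT * FROM prerelease_output EXCEPT SELECT * FROM prod_output").fetchall()

assert not only_in_prod and not only_in_prerelease, "Pre-release output diverged from production"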
Interesting approach by SQLMesh.
If you run several models, do they build views for all of them, or do they process models one by one?
Yes, the approach you propose is a good way to solve this problem, especially if you want to ensure that you did not "break" anything.
I used to do it like that, but as you mentioned, you end up building models twice, which can get quite expensive.
What do you mean by "do they build views for all of them, or do they process models one by one?"
I don't have a ton of experience there, but they will create views on top of all models and rotate them around to the underlying physical tables as the schema and logic evolve. They also do a pretty nice job of parallelizing when it makes sense to do so.
Yes, it was indeed about parallelizing; you answered my point, thanks.
I definitely need to spend more time with SQLMesh :)
Thanks Julien for the explanation here! Definitely something to explore
Still, I'm wondering how complicated it is to implement in real life 🤔
As noted in the post you linked:
"And in fact, we cannot stress this enough, moving to a full Lakehouse architecture is a lot of work. Like a lot!"
Is this work really worth it when a lot of subjects like governance, GDPR, and data culture are still lagging behind, while data warehouse architectures and the tooling around them (dbt, SQLMesh, etc.) are quite mature by now?
Don't get me wrong: lakehouse architectures are the future, but it might be a bit early to move now. WDYT?
Thanks Benoit!
Indeed, the ecosystem is quite new.
Data usually lands in a bucket before being loaded into the warehouse, so I would start from there and simply build Iceberg tables on top of these files (rough sketch below).
Then, you have the possibility to gradually move warehouse transformations to the data lake as the ecosystem evolves.
It's too early to fully transition to a lakehouse architecture, but it's probably time to lay the foundations.
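To make the "Iceberg tables on top of these files" part concrete, here is a rough sketch with pyiceberg. The catalog name, bucket path and table name are all made up, and it assumes a catalog is already configured; registering the existing files in place instead of rewriting them is another option.

import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml).
catalog = load_catalog("default")
catalog.create_namespace("raw")

# Read one of the raw files that already landed in the bucket.
raw = pq.read_table("s3://my-bucket/raw/orders/2024-01-01.parquet")

# Create an Iceberg table with the same schema and load the file into it.
table = catalog.create_table("raw.orders", schema=raw.schema)
table.append(raw)

Once the raw layer is in Iceberg, both the warehouse and any lakehouse engine can read the same tables, which is what makes the gradual migration possible.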
Very interesting, thanks 🙏
Would you happen to have tested LakeFS for that pattern?
It seems to provide cross-table branching similar to Nessie's, so I'm not sure how the two overlap...
I'm not a dbt expert, but wouldn't it be clearer if the dbt clone command specified both the source and the target?
For example, in the clone from audit, the command doesn't have the word audit in it:
dbt clone --target prod --state manifest_prod.json --full-refresh