9 Comments
Apr 17 · edited Apr 17 · Liked by Julien Hurault

SQLMesh does something similar by using views and then "pointing" the view at the underlying physical table you want to use for that environment.
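
(For illustration, a rough sketch of that pattern in plain SQL; the schema, view, and table names below are hypothetical, not SQLMesh's actual naming scheme:)

CREATE OR REPLACE VIEW dev.orders AS      -- what the "dev" environment queries
SELECT * FROM physical.orders__v2;        -- the new version of the model

CREATE OR REPLACE VIEW prod.orders AS     -- what "prod" queries
SELECT * FROM physical.orders__v1;        -- still the previous version

-- Promoting the new version is then just repointing the prod view:
-- CREATE OR REPLACE VIEW prod.orders AS SELECT * FROM physical.orders__v2;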

Another approach we used at my old company was to run multiple versions of our pipeline in parallel (production + pre-release), both based on the same input data source. We would then have jobs compare the results of the two versions, and if things aligned with our expectations we'd push the new code to production. We only ran the pre-release version when needed since it was an expensive process, but it worked well for our use case. It also helped that the input data source we received didn't change much, so we rarely needed to modify that step of the pipeline.
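
(A minimal sketch of what such a comparison job could look like, assuming both versions of the pipeline write tables with the same schema; the table names are made up:)

-- Rows produced by the pre-release pipeline but missing from production:
SELECT * FROM prerelease.daily_summary
EXCEPT
SELECT * FROM production.daily_summary;

-- And the other direction; both queries returning zero rows means the two versions agree:
SELECT * FROM production.daily_summary
EXCEPT
SELECT * FROM prerelease.daily_summary;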

author

Interesting approach by SQLMesh.

If you run several models, do they build views for all of them, or do they process models one by one?

Yes, the approach you propose is a good way to solve this problem, especially if you want to ensure that you did not "break" anything.

I used to do it like that, but as you mentioned, you end up building models twice, which can get quite expensive.

Apr 18 · Liked by Julien Hurault

What do you mean by "do they build views for all of them, or do they process models one by one?"

I don't have a ton of experience there, but they create views on top of all models and rotate them to the underlying physical tables as the schema and logic evolve. They also do a pretty nice job of parallelizing when it makes sense to do so.

author

Yes, it was indeed about parallelizing; you answered my question, thanks.

I definitely need to spend more time with SQLMesh :)


Thanks Julien for the explanation here! Definitely something to explore

Still, I'm wondering how complicated it is to implement in real life 🤔

As pointed out in the post you referenced:

"And in fact, we cannot stress this enough, moving to a full Lakehouse architecture is a lot of work. Like a lot!"

Is this work really worth it when subjects like governance, GDPR, and data culture are still far behind what data warehouse architectures offer, and the tooling around warehouses (dbt, SQLMesh, etc.) is quite mature now?

Don't get me wrong: lakehouse architectures are the future, but it might be a bit early to move now. WDYT?

author

Thanks Benoit!

Indeed, the ecosystem is quite new.

Data usually lands in a bucket before being loaded into the warehouse, so I would start from there and simply build Iceberg tables on top of these files.

Then you can gradually move warehouse transformations to the data lake as the ecosystem evolves.

It's too early to fully transition to a lakehouse architecture, but it's probably time to lay the foundations.
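
(As a rough illustration of that first step, a sketch using Spark SQL with an Iceberg catalog; the catalog name and bucket path are placeholders:)

-- Register the Parquet files already sitting in the landing bucket as an Iceberg table
-- (one-off CTAS; subsequent loads would append to the table rather than re-create it)
CREATE TABLE lake.raw.events
USING iceberg
AS SELECT * FROM parquet.`s3://my-bucket/landing/events/`;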


Very interesting, thanks 🙏

Would you happen to have tested LakeFS for that pattern?

It seems to provide cross-table branching similar to Nessie's, so I'm not sure how the two overlap...


I'm not much of a dbt expert, but wouldn't it be clearer if the dbt clone command specified both the source and the target?


For example, when cloning from audit, the command doesn't contain the word audit anywhere:

dbt clone --target prod --state manifest_prod.json --full-refresh
