8 Comments

Great post! I love the idea behind DuckDB + Lambda, but I also want to see if anyone has used it for more complex cases. I'm thinking about joins of high-cardinality datasets and whether you can orchestrate that somehow. If you know the join key and the cardinality of the data, you can do a sort of split at a high level (split the dataset using something like "JOIN_ID % 10") and then kick off a set of parallel tasks, as in the sketch below.
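
For concreteness, a minimal sketch of that split with DuckDB: the two Parquet files (orders.parquet, users.parquet) and the integer join key user_id are hypothetical, and the buckets run in a loop here, though each one is an independent query that could be shipped to its own Lambda invocation.

```python
# Minimal sketch of the "JOIN_ID % 10" split, assuming two hypothetical
# Parquet datasets joined on an integer user_id.
import duckdb

N_BUCKETS = 10  # the "% 10" split from the comment

def bucket_query(i: int) -> str:
    # Filtering BOTH sides on the same bucket guarantees every matching
    # pair of rows lands in exactly one partition of the join.
    return f"""
        SELECT o.user_id, COUNT(*) AS n_orders
        FROM read_parquet('orders.parquet') o
        JOIN read_parquet('users.parquet') u
          ON o.user_id = u.user_id
        WHERE o.user_id % {N_BUCKETS} = {i}
          AND u.user_id % {N_BUCKETS} = {i}
        GROUP BY o.user_id
    """

con = duckdb.connect()
partials = [con.execute(bucket_query(i)).fetchall() for i in range(N_BUCKETS)]
```

Since the group key here equals the split key, the partial results can simply be concatenated; an aggregation that doesn't align with the split key would need a final reduce query over the partials.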

author

Yes, that's interesting.

I guess the first step would be to parse the query with SQLGlot. Do you know if there is an open-source framework that can map/reduce a query AST?
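
For illustration, a rough sketch of the parse-then-split step with SQLGlot. SQLGlot itself only parses and rewrites SQL, so the map/reduce orchestration around it is assumed, and the query, table names, and join key are all hypothetical.

```python
# Parse the query into an AST, then emit one rewritten copy per hash
# bucket of an assumed join key (user_id).
import sqlglot

sql = """
    SELECT o.user_id, COUNT(*) AS n
    FROM orders o JOIN users u ON o.user_id = u.user_id
    GROUP BY o.user_id
"""
N = 10

map_queries = []
for i in range(N):
    ast = sqlglot.parse_one(sql)
    # Select.where() ANDs the predicate onto any existing WHERE clause.
    map_queries.append(ast.where(f"o.user_id % {N} = {i}").sql())

# Each map_queries[i] is one "map" task; a "reduce" step would have to
# re-aggregate the partials (e.g. SUM the counts) if the grouping key
# did not align with the split key.
```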


I'm not aware of any that can do that, but intuitively you'd also need to know the table stats themselves to do it effectively. You'd need the cardinality/distribution/types of the different columns, and that's part of the DB/DW itself, which it uses for query planning. There's something interesting here though!

author

Yes, Iceberg would probably help with that: map/reduce on top of Iceberg metadata.
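
A sketch of that idea with pyiceberg, assuming a configured catalog and a hypothetical table. Iceberg manifests already carry per-file row counts and per-column min/max bounds, which is the kind of stats the previous comment mentions.

```python
# Use Iceberg's own metadata as the planner: one scan task per data file.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # assumes a configured catalog
table = catalog.load_table("analytics.events")  # hypothetical table

# plan_files() prunes data files using the manifest-level column bounds
# and returns one scan task per surviving file.
tasks = table.scan(row_filter="event_date >= '2024-01-01'").plan_files()

# "map": one Lambda running DuckDB per file (or group of files);
# "reduce": a final query over the partial results.
work_units = [(t.file.file_path, t.file.record_count) for t in tasks]
```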


Excellent post as always. Do you think it'll evolve into the multi-engine stack v1 or similar?

author

Thanks Julia! Yes, Iceberg in this context makes a lot of sense. Would be interesting to get insights into how it behaves at such a scale!


Jake mentioned the cost savings in the Data Council presentation: 80% of the cost was saved, equating to $400,000 a month.

author

Hi Nouras, oops, I missed it. Thanks for mentioning it!
