Great post! I love the idea behind DuckDB + Lambda, but I also want to see if anyone has used it for more complex cases. I'm thinking about joins of high-cardinality datasets and whether you can orchestrate that somehow. If you know the join key and the cardinality of the data, you can do a sort of split at the high level (split the dataset using something like "JOIN_ID % 10") and then kick off a set of parallel tasks, as sketched below.
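Something like this is what I have in mind; just a local sketch where a thread pool stands in for the Lambda invocations, and the file paths, the join_id column and the shard count are all made up:

```python
# Rough sketch: shard a join by hashing the join key, run each shard as its own
# DuckDB query in parallel. Paths, column names and shard count are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import duckdb

SHARDS = 10  # hypothetical fan-out factor

def run_shard(shard: int):
    con = duckdb.connect()  # one in-memory connection per worker
    return con.sql(f"""
        SELECT o.join_id, SUM(o.amount) AS total
        FROM 'orders/*.parquet'    AS o
        JOIN 'customers/*.parquet' AS c ON o.join_id = c.join_id
        WHERE o.join_id % {SHARDS} = {shard}
          AND c.join_id % {SHARDS} = {shard}   -- same predicate on both sides
        GROUP BY o.join_id
    """).fetchall()

# In the Lambda version each shard would be one invocation; here it's a thread pool.
with ThreadPoolExecutor(max_workers=SHARDS) as pool:
    partials = list(pool.map(run_shard, range(SHARDS)))

# "Reduce" step: the shards partition the key space, so results can just be concatenated.
results = [row for part in partials for row in part]
```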
I guess the first step would be to parse the query with SQLGlot. Do you know if there is an open-source framework that is able to map/reduce a query's AST?
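For illustration, a minimal sketch of the "map" half with SQLGlot; the join column and shard count here are placeholders I chose, not something SQLGlot would infer on its own:

```python
# Parse the query once, then emit one rewritten query per shard by appending
# a modulo predicate on the (assumed) join key.
import sqlglot

SQL = "SELECT o.join_id, SUM(o.amount) FROM orders o JOIN customers c ON o.join_id = c.join_id GROUP BY o.join_id"
SHARDS = 10  # hypothetical

parsed = sqlglot.parse_one(SQL, read="duckdb")

shard_queries = [
    parsed.copy().where(f"o.join_id % {SHARDS} = {i}").sql(dialect="duckdb")
    for i in range(SHARDS)
]

for q in shard_queries[:2]:
    print(q)
```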
I'm not aware of any that can do that, but intuitively you'd also need the table stats themselves to do it effectively. You'd need to know the cardinality, distribution, and types of the different columns, and that's exactly the kind of metadata the DB/DW keeps internally for query planning. There's something interesting here though!
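Short of a full planner, you could at least probe those stats yourself with DuckDB before picking a shard count; a rough sketch, with the file path and column made up:

```python
# Probe the join key's cardinality and range straight from the Parquet files,
# as a cheap stand-in for warehouse catalog statistics.
import duckdb

con = duckdb.connect()
stats = con.sql("""
    SELECT
        approx_count_distinct(join_id) AS approx_ndv,  -- approximate cardinality of the join key
        min(join_id)                   AS min_id,
        max(join_id)                   AS max_id
    FROM 'orders/*.parquet'
""").fetchone()
print(stats)
```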
Yes, that's interesting.
Yes, Iceberg would probably help with that: map/reduce on top of Iceberg metadata.
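Roughly something like this with pyiceberg (the catalog and table names are placeholders); the scan plan comes from metadata only, and each data file becomes one unit of parallel work:

```python
# Plan the scan from Iceberg manifests, then treat each data file as one task.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # assumes a configured catalog
table = catalog.load_table("analytics.orders")  # hypothetical table

# plan_files() reads manifests, not data, so this stays cheap.
tasks = list(table.scan().plan_files())
file_paths = [t.file.file_path for t in tasks]

# Each path (or batch of paths) could then become one Lambda/DuckDB invocation,
# with the per-file results concatenated or re-aggregated in a reduce step.
print(len(file_paths), file_paths[:3])
```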
Excellent post as always. Do you think it'll evolve into the multi-engine stack v1 or similar?
Thanks Julia, yes, Iceberg in this context makes a lot of sense. It would be interesting to get insights on how it behaves at such scale!
Jake mentioned the cost savings in the Data Council presentation: 80% of the cost was saved, equating to $400,000 a month.
Hi Nouras, oops, I missed it, thanks for mentioning it!