Great post! I love the idea behind DuckDB + Lambda, but I also want to see if anyone has used it for more complex cases. I'm thinking about joins of high-cardinality datasets and whether you can orchestrate that somehow. If you know the join key and the cardinality of the data, you can do a sort of split at the high level (split the dataset using something like "JOIN_ID % 10") and then kick off a set of parallel tasks, as sketched below.
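Something like this is what I have in mind; just a local sketch where a thread pool stands in for the Lambda invocations, and the file paths, the join_id column and the shard count are all made up:

```python
# Rough sketch: shard a join by hashing the join key, run each shard as its own
# DuckDB query in parallel. Paths, column names and shard count are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import duckdb

SHARDS = 10  # hypothetical fan-out factor

def run_shard(shard: int):
    con = duckdb.connect()  # one in-memory connection per worker
    return con.sql(f"""
        SELECT o.join_id, SUM(o.amount) AS total
        FROM 'orders/*.parquet'    AS o
        JOIN 'customers/*.parquet' AS c ON o.join_id = c.join_id
        WHERE o.join_id % {SHARDS} = {shard}
          AND c.join_id % {SHARDS} = {shard}   -- same predicate on both sides
        GROUP BY o.join_id
    """).fetchall()

# In the Lambda version each shard would be one invocation; here it's a thread pool.
with ThreadPoolExecutor(max_workers=SHARDS) as pool:
    partials = list(pool.map(run_shard, range(SHARDS)))

# "Reduce" step: the shards partition the key space, so results can just be concatenated.
results = [row for part in partials for row in part]
```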
I guess the first step would be to parse the query with SQLGlot. Do you know if there is an open-source framework that is able to map/reduce a query's AST?
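For illustration, a minimal sketch of the "map" half with SQLGlot; the join column and shard count here are placeholders I chose, not something SQLGlot would infer on its own:

```python
# Parse the query once, then emit one rewritten query per shard by appending
# a modulo predicate on the (assumed) join key.
import sqlglot

SQL = "SELECT o.join_id, SUM(o.amount) FROM orders o JOIN customers c ON o.join_id = c.join_id GROUP BY o.join_id"
SHARDS = 10  # hypothetical

parsed = sqlglot.parse_one(SQL, read="duckdb")

shard_queries = [
    parsed.copy().where(f"o.join_id % {SHARDS} = {i}").sql(dialect="duckdb")
    for i in range(SHARDS)
]

for q in shard_queries[:2]:
    print(q)
```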
I'm not aware of any that can do that, but intuitively you'd also need the table stats themselves to do it effectively. You'd need to know the cardinality, distribution, and types of the different columns, and that's exactly the kind of metadata the DB/DW keeps internally for query planning. There's something interesting here though!
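Short of a full planner, you could at least probe those stats yourself with DuckDB before picking a shard count; a rough sketch, with the file path and column made up:

```python
# Probe the join key's cardinality and range straight from the Parquet files,
# as a cheap stand-in for warehouse catalog statistics.
import duckdb

con = duckdb.connect()
stats = con.sql("""
    SELECT
        approx_count_distinct(join_id) AS approx_ndv,  -- approximate cardinality of the join key
        min(join_id)                   AS min_id,
        max(join_id)                   AS max_id
    FROM 'orders/*.parquet'
""").fetchone()
print(stats)
```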
Yes, that's interesting.
Yes, Iceberg would probably help with that: map/reduce on top of Iceberg metadata.
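Roughly something like this with pyiceberg (the catalog and table names are placeholders); the scan plan comes from metadata only, and each data file becomes one unit of parallel work:

```python
# Plan the scan from Iceberg manifests, then treat each data file as one task.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # assumes a configured catalog
table = catalog.load_table("analytics.orders")  # hypothetical table

# plan_files() reads manifests, not data, so this stays cheap.
tasks = list(table.scan().plan_files())
file_paths = [t.file.file_path for t in tasks]

# Each path (or batch of paths) could then become one Lambda/DuckDB invocation,
# with the per-file results concatenated or re-aggregated in a reduce step.
print(len(file_paths), file_paths[:3])
```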
Excellent post as always. Do you think it'll evolve into the multi-engine stack v1 or similar?
Thanks Julia, yes, Iceberg in this context makes a lot of sense. It would be interesting to get insights on how it behaves at such scale!
Jake mentioned the cost savings in the Data Council presentation: 80% of the cost was saved, equating to $400,000 a month.
Hi Nouras, oops, I missed it, thanks for mentioning it!