4 Comments

Great project! It introduces me to a couple of technologies I don't get to try out in my day job.

Because of the lack of predicate pushdown, would you consider connecting to the Iceberg table with a Spark Docker container? I think you can specify a directory as your 'catalog' (see the sketch below). I feel like the overhead of spinning up a Spark session is worth it given how much less data needs to be read on each subsequent query.

On the other hand, if data transfer is free, you're not paying extra to scan more rows. It still requires more compute, though.
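
A minimal sketch of that directory-backed catalog idea, assuming PySpark with the Iceberg Spark runtime available; the warehouse path `/data/warehouse`, the table name `local.db.events`, and the package versions are placeholders, not anything from the original post:

```python
# Minimal sketch: querying an Iceberg table through Spark with a directory-based
# ("hadoop") catalog, so Iceberg metadata can prune files before they are read.
# Assumptions: pyspark is installed, the iceberg-spark-runtime package matches
# your Spark version, and /data/warehouse already holds an Iceberg table at
# db.events (all placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-local-catalog")
    # Pull in the Iceberg runtime for Spark 3.5 / Scala 2.12 (adjust to your versions).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "local" backed by a plain directory on disk.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/data/warehouse")
    .getOrCreate()
)

# The filter below is pushed down to Iceberg, which uses partition values and
# column statistics in the table metadata to skip data files entirely.
df = spark.sql("""
    SELECT event_type, count(*) AS n
    FROM local.db.events
    WHERE event_date = DATE '2024-06-01'
    GROUP BY event_type
""")
df.show()
```

The same `spark.sql.catalog.*` settings can be supplied as `--conf` flags to `spark-submit` or `spark-sql` inside a Spark Docker container, so the "catalog" really is just a directory the container mounts.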


Very cool, love the example. In my mind the complexity still lies with the catalog and managing it across the different compute engines. It seems we're ending up in a world where maybe there's a single "write" catalog and then a variety of "read" catalogs that are copies of the write catalog.


Yes, indeed, the current pattern we observe with Snowflake and GCP involves two types of tables: external and internal, each with various read/write possibilities.

So, we will likely need to mix catalogs together, unfortunately… OR I’m sure someone will invent a meta-catalog with cross-catalog synchronization!


The challenge is that every catalog wants to be THE catalog and doesn't want this cross-catalog synchronization. And now it's even more complicated with the S3 Iceberg Tables!
