Bonjour!
I'm Julien, freelance data engineer based in Geneva 🇨🇭.
Every week, I research and share ideas about the data engineering craft.
Snowflake Polaris, Databricks Unity: Big providers are pushing hard for their catalogs.
Indeed, controlling the catalog is a way for them to become the main gateway to your data.
But let’s take a step back.
Why do we actually need a catalog and an extra system on top of Iceberg files?
In this post, we’ll explore Iceberg's consistency model to understand its importance and how it relates to the recent S3 conditional writes feature.
Need a refresh ❄️ on Iceberg?
If you’re not familiar with the structure of the Iceberg format, check out this introduction I wrote a couple of months ago in Ben’s newsletter.
Now that you understand manifests, manifest lists, and metadata files, let’s look closely at Iceberg’s write process.
Iceberg write process
Iceberg operates with an optimistic concurrency model.
This means a writer adds new data and updates metadata without considering other concurrent writers.
The process at a high level is relatively straightforward:
read/write data files
read/write manifest files
commit new snapshot
Iceberg adds two additional mechanisms represented by the two red arrows in the diagram below:
step 4: data conflict check
step 6: atomic metadata file commit
These additional conflict checks are fundamental to Iceberg’s ACID consistency, ensuring that the data remains consistent even with concurrent writers.
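The whole loop can be sketched in a few lines. This is a minimal, self-contained illustration of the optimistic model described above; the class and method names are my own, not Iceberg's actual API, and the integer `version` stands in for the current metadata.json pointer.

```python
class CommitConflict(Exception):
    """Raised when another writer committed since we read the base metadata."""

class Table:
    def __init__(self):
        self.version = 0  # stand-in for the current metadata.json pointer

    def commit(self, base_version: int) -> None:
        # Step 6: atomic compare-and-swap — only succeed if nobody
        # committed between our initial read and now.
        if base_version != self.version:
            raise CommitConflict
        self.version += 1

def optimistic_write(table: Table, write_fn, max_retries: int = 3) -> int:
    """Write optimistically, then retry the commit on conflict."""
    for _ in range(max_retries):
        base = table.version   # read the current table state
        write_fn(base)         # steps 1-3: write data and manifest files
        try:
            # Step 4 (data conflict check) would run here against `base`.
            table.commit(base)
            return table.version
        except CommitConflict:
            continue           # resynchronize metadata and retry
    raise RuntimeError("gave up after too many concurrent commits")
```

Note that on conflict the writer does not redo the data files from scratch; it refreshes the metadata, re-validates, and retries only the commit.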
Data conflict check
The data conflict checks verify the consistency of the manifest files.
Example of checks:
In the case of an UPDATE/DELETE/MERGE, a check will ensure that two writers are not updating (logically) the same data files/rows.
In the case of compaction, it will ensure that another process does not delete the same compacted files simultaneously.
These checks are performed based on the writer's operation and on a subset of the manifest files.
Depending on the isolation level chosen by the user (snapshot or serializable), the same failed check may or may not be treated as a data conflict.
For example, in the case of two concurrent writers where:
• Writer 1: performs an UPDATE
• Writer 2: INSERTs records that potentially match the UPDATE condition
This scenario fails under the serializable isolation level but succeeds under snapshot isolation.
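A rough sketch of why the two levels diverge on this example (the function and field names here are illustrative assumptions, not Iceberg's validation API):

```python
class Conflict(Exception):
    pass

def validate_update(isolation, concurrent_ops, predicate):
    """Check an UPDATE's base snapshot against concurrently committed ops."""
    for op in concurrent_ops:
        # Both isolation levels reject concurrent changes to the same files.
        if op["type"] in ("update", "delete") and op["same_files"]:
            raise Conflict("concurrent change to the same data files")
        # Serializable additionally rejects new rows matching the condition.
        if isolation == "serializable" and op["type"] == "insert" \
                and any(predicate(r) for r in op["rows"]):
            raise Conflict("concurrent insert matches the UPDATE condition")

# Writer 2 inserted a row that Writer 1's UPDATE condition would have matched:
ops = [{"type": "insert", "same_files": False, "rows": [{"status": "open"}]}]
predicate = lambda r: r["status"] == "open"

validate_update("snapshot", ops, predicate)  # passes: no conflict raised
# validate_update("serializable", ops, predicate) would raise Conflict
```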
Atomic metadata file commit
After validating that the new manifest files do not conflict with another writer, the writer should publish a new snapshot.
To do this, it creates a new metadata.json file and sends it to the catalog along with the version of the metadata it used for its operations.
But why do we need a catalog to do that?
Simply because S3 implements strong consistency but only a last-writer-wins model for concurrent writes: until recently, it offered no built-in compare-and-swap primitive.
This is a significant problem for Iceberg.
Imagine two writers start their writing processes at the same time.
Both perform the same initial scan.
Writer 1 is faster and commits its changes before Writer 2 finishes.
In this case, Writer 2 will commit and “win” (last-writer-wins), even though its commit is based on metadata already updated by Writer 1, whose changes are silently lost.
Not good.
We need a locking mechanism to prevent a writer from committing changes based on an incorrect table state.
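The lost update is easy to demonstrate with a plain dict standing in for the object store (paths and snapshot names here are made up for the example):

```python
# A dict playing the role of an object store with last-writer-wins PUTs.
store = {}

def plain_put(key, value):
    store[key] = value  # no coordination: the last writer always wins

# Initial table state: version 0, no snapshots.
plain_put("metadata.json", {"version": 0, "snapshots": []})

# Both writers read the same base metadata...
base = store["metadata.json"]

# ...and each builds its new metadata on top of that same base.
writer1 = {"version": 1, "snapshots": base["snapshots"] + ["s1"]}
writer2 = {"version": 1, "snapshots": base["snapshots"] + ["s2"]}

plain_put("metadata.json", writer1)  # Writer 1 commits first
plain_put("metadata.json", writer2)  # Writer 2 silently overwrites it

# Writer 1's snapshot s1 has vanished from the table history:
assert store["metadata.json"]["snapshots"] == ["s2"]
```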
Iceberg catalog & retries
Iceberg solves this by adding an external transactional system via the catalog.
The goal of the catalog is to ensure an atomic metadata file commit.
When a writer tries to publish a new version, it performs a compare-and-swap (CAS) operation on the metadata.json file.
The process looks like this:
The commit is rejected if the writer's current metadata file location does not match the location known by the catalog.
In that case, the writer resynchronizes the metadata, checks for conflicts, and tries to commit to the catalog again.
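The catalog-side check boils down to a few lines. This is a hedged sketch with assumed names (most real catalogs expose something similar behind their commit endpoint):

```python
class CommitFailedException(Exception):
    """The writer's base metadata location is stale; it must retry."""

class Catalog:
    def __init__(self):
        # table name -> location of the current metadata.json
        self.current = {}

    def commit(self, table, expected_location, new_location):
        # Compare-and-swap: accept the new metadata.json only if the
        # writer's base location still matches the one on record.
        if self.current.get(table) != expected_location:
            # Another writer committed first: reject, so the writer
            # resynchronizes, re-checks conflicts, and retries.
            raise CommitFailedException(table)
        self.current[table] = new_location
```

For example, two writers both starting from `v1.metadata.json`: the first commit to land swaps the pointer to its new file, and the second is rejected because its expected location is now stale.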
New S3 conditional writes
Recently, AWS released a new feature for S3:
By adding the If-None-Match: * header to a PUT operation, S3 will return a 412 Precondition Failed error if the object already exists (the companion If-Match header lets you overwrite an object only if its ETag matches a known version).
This is precisely what we need for our Iceberg catalog.
Instead of asking the catalog for the current metadata version, the writer could perform a simple conditional PUT of the next metadata file.
If the PUT is rejected, it indicates that another writer has committed in the meantime.
The writer should then refresh its metadata and retry the commit.
Even though conditional writes are not equivalent to CAS in terms of isolation, this new S3 feature could potentially allow us to bypass the catalog—if writers adapt their writing mechanisms accordingly.
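A sketch of what a catalog-free commit loop could look like. `ConditionalStore` mimics my understanding of the S3 semantics (a PUT with If-None-Match: * fails with 412 if the key already exists); in real code the put would be roughly `boto3.client("s3").put_object(..., IfNoneMatch="*")`, and the retry would of course re-run the data conflict checks, which this sketch omits:

```python
class PreconditionFailed(Exception):
    """Stands in for S3's 412 Precondition Failed response."""

class ConditionalStore:
    """In-memory stand-in for a bucket supporting conditional PUTs."""
    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, value):
        if key in self.objects:
            raise PreconditionFailed(key)  # another writer got there first
        self.objects[key] = value

def commit_metadata(store, table, version, payload, max_retries=5):
    """Publish metadata v{version}; on 412, rebase on the next version."""
    for _ in range(max_retries):
        key = f"{table}/metadata/v{version}.metadata.json"
        try:
            store.put_if_absent(key, payload)
            return version
        except PreconditionFailed:
            # Someone committed v{version} in the meantime: refresh the
            # metadata, re-check conflicts (omitted here), and retry.
            version += 1
    raise RuntimeError("gave up after too many concurrent commits")
```

Because each metadata version maps to a distinct object key, "create if absent" is enough to guarantee that exactly one writer wins each version, which is the property the catalog CAS was providing.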
I’m not completely certain about this take, but given the potential implications, we should get an answer soon enough…
Sources and interesting additional reads:
Jack Vanlightly’s series: part 1, part 2
https://medium.com/@gmurro/concurrent-writes-on-iceberg-tables-using-pyspark-fd30651b2c97
Thanks for reading,
Ju
I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked and the topics you’d like me to tackle.
P.S. You can reply to this email; it will get to me.