Very very excited that you're building content on multi-engine data stacks. Keep it up! Gets me 🧃 up
Is this still possible now that Snowflake discontinued version-hint.txt?
First of all, I love your posts. Your contribution to the data community is much appreciated! Considering this stack, what would you introduce to source data from a database such as SQL Server?
Thanks, Kent!
You would need CDC. Depending on your needs, you can opt for open source (Debezium), AWS DMS, or any other vendor in that space. Changes would then be written to S3 (possibly with compaction) before downstream consumption.
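For illustration, a minimal sketch of the compaction step with DuckDB, assuming the CDC tool lands small Parquet change files under a placeholder S3 prefix and that AWS credentials are available to httpfs:

```python
import duckdb

# Placeholder bucket/paths; assumes Debezium/DMS writes Parquet change files
# with an order_id primary key and a commit_ts change timestamp.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Keep only the latest change per primary key, then write one compacted file.
con.execute("""
    COPY (
        SELECT * EXCLUDE (rn)
        FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY order_id
                                      ORDER BY commit_ts DESC) AS rn
            FROM read_parquet('s3://my-bucket/cdc/orders/*.parquet')
        )
        WHERE rn = 1
    )
    TO 's3://my-bucket/cdc/orders_compacted.parquet' (FORMAT PARQUET)
""")
```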
Embedding DuckDB in Lambdas as an API layer to serve Snowflake results is worth exploring. I wonder how well it scales compared to syncing data to Redis/PSQL or solutions like Airfold/Tinybird.
Hi Julian,
I conducted some experiments in this blog post: https://juhache.substack.com/p/exploring-duckdb-aws-lambda
The main drawback was the cold start of the Lambda function.
Indeed, benchmarking it with the tools you mentioned is a great idea.
Do you have any figures already at Airfold?
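For anyone curious, a minimal sketch of the pattern (not the exact code from the post; bucket and paths are placeholders):

```python
import json
import duckdb

# Assumes DuckDB is bundled with the Lambda package/layer. The connection is
# created outside the handler so warm invocations reuse it; only the first
# (cold) invocation pays the startup cost mentioned above.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

def handler(event, context):
    # Hypothetical API: aggregate a result set pre-exported from Snowflake to S3.
    rows = con.execute(
        "SELECT status, count(*) AS n "
        "FROM read_parquet('s3://my-bucket/exports/orders/*.parquet') "
        "GROUP BY status"
    ).fetchall()
    return {"statusCode": 200, "body": json.dumps(dict(rows))}
```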
> I have no idea how well PyIceberg can read/write large data but I will test it in a future post!
Pretty well! See iceberg-python/#428 and iceberg-python/#444
https://github.com/apache/iceberg-python/issues/428
https://github.com/apache/iceberg-python/pull/444
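A minimal sketch of that read/write path, assuming a configured catalog named "default" and a placeholder table:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder catalog/table names; catalog config comes from .pyiceberg.yaml
# or environment variables.
catalog = load_catalog("default")
table = catalog.load_table("demo.orders")

# Read: the scan pushes column selection down before materializing to Arrow.
arrow_df = table.scan(selected_fields=("order_id", "amount")).to_arrow()

# Write (the support added in apache/iceberg-python#444): append an Arrow
# table whose schema matches the Iceberg table's schema.
table.append(pa.table({"order_id": pa.array([1], pa.int64()),
                       "amount": pa.array([9.99], pa.float64())}))
```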
Thanks a lot, Kevin! Looks really good indeed!