Welcome to the exciting world of data engineering!
Every week, you'll learn from real-world engineering as we dig into what it takes to build and maintain data platforms.
From the technical challenges of design and implementation to the business considerations of working with clients, you'll get a behind-the-scenes look at what it takes to be a successful data engineer. So sit back, relax, and get ready to be inspired as we explore the world of data together!
Don't Miss the Train to Modern Cloud Data Warehousing
In recent years, there has been a growing trend towards modern cloud data warehouses, with many organizations opting for them over traditional on-premises solutions.
Today, I want to focus on Snowflake, which is one of the most well-known cloud data warehouses on the market.
Snowflake has been a game changer for me since I started using it two years ago, and I think it's a technology that all engineers should be familiar with and have in their toolboxes.
1. Compute and storage separation
One of Snowflake's most impactful innovations is its separation of data compute and storage.
In traditional warehouses, as your database grows, you have to scale your storage and computing power together. This means that as your data increases, you also need more processing power to handle it.
Snowflake breaks this paradigm by allowing users to choose their computing power independently of their data storage.
OK, got it… but what does this mean in practical terms?
1.1 Serverless: pay only when you need it
It's an incredible feeling when you know that you're not wasting money on computing resources that you're not using.
With Snowflake, when nobody is using the warehouse, the cost is almost negligible: you only pay for storage, which is cheap.
This is a stark contrast to traditional data warehouses where you have to pay for the computing power regardless of whether it's being used or not.
This makes switching to Snowflake a no-brainer for startups too, where budgets are limited and you need to adapt quickly to changing business needs.
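In practice, this behavior comes from warehouses that suspend themselves when idle and resume on the next query. Here's a minimal sketch; the warehouse name and size are made up for illustration:

```sql
-- Hypothetical warehouse name and size, for illustration only.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60     -- suspend after 60 idle seconds; compute billing stops
  AUTO_RESUME    = TRUE;  -- wake up automatically on the next query
```

While the warehouse is suspended, you're paying for storage only.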
1.2 Isolate processing workflows
In traditional data warehouses, heavy computing tasks can be a complete nightmare, requiring careful planning and execution during off-hours to avoid disrupting the system.
But with Snowflake, you can allocate a separate computing unit to handle recomputing and easily scale it as needed - need it to be faster? Just pay more, and you're good to go!
This feature is particularly exciting when it comes to applying the ELT paradigm we discussed in our last episode.
Simply load data from your provider into a large landing table, and if you need to recompute a field, just scale your warehouse accordingly and reprocess the complete history. It's a game-changing approach that can save you time, energy, and a lot of headaches.
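As a rough illustration of what this looks like in Snowflake SQL (the warehouse, schema, table, and column names below are all hypothetical, and the transformation itself is just a placeholder):

```sql
-- A dedicated warehouse for heavy ELT backfills, separate from the one analysts use.
CREATE WAREHOUSE IF NOT EXISTS elt_wh WAREHOUSE_SIZE = 'MEDIUM';

-- Need the backfill to go faster? Scale up just this warehouse.
ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'XXLARGE';

-- Reprocess the full history from the landing table.
USE WAREHOUSE elt_wh;
INSERT OVERWRITE INTO analytics.orders_clean
SELECT order_id, customer_id, TRY_TO_NUMBER(amount_raw) AS amount
FROM raw.orders_landing;

-- Scale back down once the recomputation is done.
ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```

The analysts' warehouse never notices the backfill, because the work runs on separate compute.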
1.3 Create temporary warehouses
A significant challenge with traditional warehouses is the risk of negatively impacting overall database performance when experimenting with new features.
For instance, queries that scan too much data can lock tables or block other users.
Snowflake addresses this challenge by offering the ability to replicate the entire database into a temporary database with a single command.
This is my favorite feature of Snowflake: it lets you experiment safely without impacting the production database!
It's worth noting that the copy is not kept in sync with the production database, so keep in mind that your new feature will be tested against data that is only as fresh as the moment you created it.
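The feature behind this is Snowflake's zero-copy cloning. A minimal sketch, with hypothetical database names:

```sql
-- Hypothetical database names. Cloning is zero-copy: it's fast and adds no
-- storage cost until the clone diverges from production.
CREATE DATABASE prod_db_sandbox CLONE prod_db;

-- Experiment freely against the sandbox, then drop it when you're done.
DROP DATABASE prod_db_sandbox;
```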
2. Marketplace
Snowflake also offers a unique advantage: a marketplace where users can purchase data directly from other providers.
This is possible because Snowflake separates storage costs, borne by data providers, from processing costs, borne by consumers.
For instance, if someone wants to obtain ESG data in a warehouse, the provider is responsible for hosting the data and paying the storage cost.
On the other hand, anyone who wants to query the data only pays for the credits their own warehouse consumes to run the query against the shared data.
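In practice, a shared dataset shows up as a read-only database in your account. Here's a hedged sketch; the provider account, share, and table names are made up:

```sql
-- Hypothetical provider account, share, and table names.
-- The shared data is mounted as a read-only database in the consumer's account.
CREATE DATABASE esg_data FROM SHARE provider_account.esg_share;

-- Queries run on your own warehouse, so you pay only the compute you use.
SELECT company_id, esg_score
FROM esg_data.public.scores
WHERE reporting_year = 2022;
```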
This marketplace concept has since been replicated by other providers.
Snowflake aims to take sharing to the next level by allowing not only data but also applications to be shared across accounts.
For a SaaS provider, this represents an incredible opportunity. Currently, when you sell a SaaS application to a corporate client, data needs to be extracted from the app, processed, and then re-injected into the client's systems, which presents several difficulties.
This process requires deploying the app in the client's environment, asking for permission to extract the data, and more. These challenges explain why many corporates build in-house tools.
However, Snowflake's future app store represents a complete paradigm shift in this regard. With the app store, you can deploy your SaaS app directly within the corporate warehouse, and it will coexist with the data in the Snowflake environment, eliminating any security hurdles.
3. The Cons
While Snowflake has many advantages, there are some potential downsides to consider.
Cost control can be a significant challenge when using Snowflake. Although it offers easy access to a vast amount of computing power, this can also lead to less incentive for developers to optimize the performance of their models. As a result, credit consumption can increase significantly.
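One way to rein this in is Snowflake's resource monitors, which cap credit consumption. The sketch below uses a hypothetical monitor name, quota, and warehouse:

```sql
-- Hypothetical monitor name, quota, and warehouse: caps monthly credit spend.
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY    -- warn before the budget is gone
    ON 100 PERCENT DO SUSPEND; -- suspend the warehouse once the quota is reached

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_cap;
```

A monitor like this won't make queries cheaper, but it does turn a runaway bill into a suspended warehouse.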
In Snowflake, access to database objects such as tables, views, and functions is managed through users and roles. Users are individual accounts that can be granted permissions to perform specific actions on specific objects. Roles, on the other hand, are collections of permissions that can be assigned to one or more users.
While Snowflake's security model is comprehensive and flexible, it only exposes low-level primitives to manage access control. This means that managing access to database objects requires manual configuration and setup, which can be time-consuming and error-prone.
Additionally, as organizations grow and the number of users and objects in the database increases, managing access control can become increasingly complex. This can lead to security vulnerabilities if access control is not carefully monitored and managed.
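To give a sense of how low-level these primitives are, here is a rough sketch of the grants needed just to let one analyst read one schema (all role, user, and object names are made up):

```sql
-- Hypothetical role, user, and object names.
CREATE ROLE analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public TO ROLE analyst;

-- Each user then has to be wired to the role by hand (auth settings omitted).
CREATE USER jane DEFAULT_ROLE = analyst;
GRANT ROLE analyst TO USER jane;
```

Multiply this by every team, schema, and environment, and you can see how the grant graph becomes hard to keep under control without some tooling on top.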
IaC: the Terraform provider for infrastructure as code (IaC) is relatively new and still maturing. It has some bugs that need to be addressed, such as a tendency to delete and recreate resources rather than modify them in place.
4. The disruptors of the disruptors
In the last two years, a new generation of data warehouses has emerged that compete with Snowflake in terms of performance.
One such example is Firebolt, which claims to be "182x faster than any alternative", resulting in a "10-100x lower total cost".
They offer:
More flexible cluster resizing options, allowing users greater control over their resources
Support for more concurrent queries per warehouse, allowing for more efficient usage of the data warehouse
More advanced indexing solutions, which can improve query performance
All of these features, combined with Firebolt's different pricing model, result in a significantly lower total cost compared to other data warehouses.
In conclusion, modern cloud data warehouses like Snowflake are revolutionizing the way we store and process data. With its unique separation of compute and storage, instant scalability, and secure data-sharing capabilities, Snowflake is quickly becoming a go-to solution for businesses of all sizes.
Thank you for reading.
-Ju
I would be grateful if you could help me improve this newsletter. Don't hesitate to share what you liked or disliked and the topics you would like me to tackle.
P.S. You can reply to this email; it will get to me.