Pandas, Scala, PySpark, Dask...
Forget about it!
Stay with SQL!
I have been working in the data industry for 8 years. Despite the emergence of countless tools and frameworks for data preparation and transformation, I am still fastest when I write 99% of my data transformations in SQL.
Let's have a look at why.
Data Transformation
Data scientists and engineers often spend a significant amount of time manipulating data and performing tasks such as changing data types, aggregating data, and joining datasets.
While the Python package Pandas has gained popularity in data science for these tasks, I prefer using SQL instead.
Complex API: I'm tired of browsing the Pandas documentation to perform a simple join. SQL syntax is well-established and widely known, which makes it simpler to write and easier to maintain (see the sketch after this list).
Pandas column type inference can be a nightmare: dealing with inconsistent or silently inferred data types in DataFrames is frustrating, whereas a SQL engine enforces an explicit schema.
Poor DataFrame performance: Pandas processes everything in memory, largely on a single core, so it struggles once datasets grow large.
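To make the first point concrete, here is a toy comparison (the DataFrames and column names are invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [30.0, 45.0, 10.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["FR", "DE"]})

# Pandas: you have to remember merge(), its `on`/`how` arguments,
# and how suffixes behave when column names collide.
joined = orders.merge(customers, on="customer_id", how="left")

# The SQL equivalent reads almost like plain English:
# SELECT o.*, c.country
# FROM orders AS o
# LEFT JOIN customers AS c ON o.customer_id = c.customer_id
print(joined)
```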
SQL, after decades of existence, remains the fastest way to write transformations. It comes as no surprise that dbt (data build tool), one of the most popular data transformation tools today, is built on top of SQL.
In many data science use cases, the data may not be stored in a traditional database where you can directly run SQL queries. Data scientists often work with diverse data sources, such as CSV files, JSON files, APIs, or in-memory data structures like Pandas DataFrames. In these scenarios, using SQL directly may not be feasible.
This is where DuckDB comes into play.
DuckDB is an in-process analytical database that lets you query Pandas DataFrames directly with SQL.
Simply install and import the duckdb package, and you can start writing SQL queries against your DataFrames.
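A minimal sketch of what that looks like (the DataFrame and column names are made up for the example; duckdb.sql() can reference a local DataFrame by its Python variable name):

```python
import duckdb
import pandas as pd

sales = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [30.0, 45.0, 10.0],
})

# DuckDB scans the in-memory DataFrame directly -- no database, no loading step.
result = duckdb.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""").df()  # convert the result back to a Pandas DataFrame

print(result)
```

A pip install duckdb is the only setup required.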
Beyond data transformation: Declarative Data Pipelines
In addition to data transformation, you can now build entire data pipelines in SQL.
SQL is a declarative language, which means you define what you want to achieve rather than explicitly specifying how to achieve it.
Cloud data warehouses like Snowflake are embracing this paradigm and starting to offer declarative building blocks called Dynamic Tables.
These building blocks are abstractions on top of database objects that can be combined together to create complex data pipelines.
and guess what ... you define them in SQL :)
Dynamic Tables handle the automatic refresh of a SQL object for you (and abstract away the streams and tasks you would otherwise have to build yourself), as in the example below:
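Here is a minimal sketch of a Dynamic Table definition (the table, source, and warehouse names are placeholders):

```sql
-- Snowflake keeps this table refreshed from raw_orders,
-- targeting at most 1 hour of lag, using the given warehouse.
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
SELECT
  order_date,
  SUM(amount) AS revenue
FROM raw_orders
GROUP BY order_date;
```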
My take is that cloud data warehouses are going to capture more and more of the data ecosystem, including:
data transformation
data pipelining
task scheduling
ML/AI
visualization
They will offer many abstractions built natively with SQL as the API at their core.
That's why I believe SQL will play a significant role in the upcoming years as the number one language to interact with these abstractions and as the modern way to build data platforms.
Thanks for reading,
-Ju
I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked, and the topics you would like me to tackle.
P.S. you can reply to this email; it will get to me.