6 Comments
Daniel Beach

Just curious: in a real-life scenario, would you try to convert the JSON to some other file format before querying, or would you do it like this? I wonder how easily both DuckDB and Polars could convert the JSON to, say, Parquet.

Julien Hurault

For one-shot ad-hoc analysis, just like this.

For multiple downstream usages, I’d likely build a silver layer in Parquet, filtering only the attributes needed downstream using strategies mentioned in the article.

DuckDB can write Parquet with COPY TO, and Polars with write_parquet().
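For reference, a minimal sketch of both options (the file name and the selected attributes are made-up placeholders):

```python
import duckdb
import polars as pl

# DuckDB: read the JSON, keep only the attributes needed downstream,
# and write the result to Parquet with COPY ... TO
duckdb.sql("""
    COPY (
        SELECT id, created_at, payload
        FROM read_json_auto('events.json')
    ) TO 'events.parquet' (FORMAT parquet)
""")

# Polars: same idea with read_json + write_parquet
df = pl.read_json("events.json").select(["id", "created_at", "payload"])
df.write_parquet("events_polars.parquet")
```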

Mihai

I’d add that if the first step of a workflow is to extract data by calling a stateless API that returns JSON, I’d consider keeping the raw responses in case they’re needed in the future for backfilling, should the processing logic change. This assumes there’s no way to get historical data from said API source.
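As a rough sketch of that idea (the endpoint and folder layout are hypothetical), the raw body can be landed as-is, keyed by extraction time, before any parsing happens:

```python
from datetime import datetime, timezone
from pathlib import Path

import requests

raw_dir = Path("landing/raw_api")
raw_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical stateless endpoint that only returns the current state
resp = requests.get("https://api.example.com/v1/events")
resp.raise_for_status()

# Store the untouched response so the transformation logic can be re-run
# later (backfill) even if it changes
ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
(raw_dir / f"events_{ts}.json").write_text(resp.text)
```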

mehdio

Agree with Julien here. I think it depends on:

- Size of the data (an intermediate Parquet file may be required if you don't want to blow up your memory while parsing JSON; see the sketch after this list)

- Usage frequency of the raw data
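On the memory point, one way to sidestep it is to stay lazy and stream straight to Parquet instead of materialising the parsed JSON, e.g. with Polars (assuming newline-delimited JSON; file and column names are placeholders):

```python
import polars as pl

(
    pl.scan_ndjson("events.ndjson")    # lazy scan, nothing loaded yet
    .select(["id", "created_at"])      # keep only the attributes you need
    .sink_parquet("events.parquet")    # streamed write to Parquet
)
```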

Vincent gautier

Great article, thanks for sharing. In my experience, as long as I don't have to use the unnest function on GB-scale data, it's all good. I'm curious about your approach when mixing geospatial functions with JSON functions in DuckDB; it's one of the only things that makes my MacBook fans make noise.

Mihai

To be fair, working with geospatial functions on similar deeply nested JSON structures was a particular inspiration for this post, but we thought it’d be too extensive a topic to cover in such a short format, hence we raised the level of abstraction, so to speak. I definitely consider it worth a shot for exploration, though I’m fairly sure some fan noise will result from it :)
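For anyone who wants to experiment, here is a tiny, purely illustrative example of handing a GeoJSON geometry string to a spatial function in DuckDB (the geometry literal is made up):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial")
con.sql("LOAD spatial")

# Parse a GeoJSON geometry from a JSON string and feed it to a spatial function
con.sql("""
    SELECT ST_Area(
        ST_GeomFromGeoJSON('{"type": "Polygon",
                             "coordinates": [[[0,0],[0,1],[1,1],[1,0],[0,0]]]}')
    ) AS area
""").show()
```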
