This week's issue covers a variety of resources I've found interesting, focusing not just on GPT but on other topics as well.
Data Pipeline As A Service
I recently discovered an announcement about Cybersyn, a young company that has just raised a $60 million Series A funding round. Their vision is quite interesting; they aim to provide open-source data within Snowflake, a leading cloud data platform.
The company is very young: for the moment, it offers only 15 free datasets in the Snowflake Marketplace.
I find this approach quite interesting, as many companies consume the same data and build the same pipelines independently.
Indeed, I think that recent advancements in AI, particularly with language models like ChatGPT, have the potential to significantly reduce the cost of building data pipelines for well-equipped teams.
However, corporate adoption of these cutting-edge tools may take several years due to factors such as organizational inertia, resistance to change, and concerns over security and compliance.
That could be the chance for startups to gain a competitive edge: leveraging AI to build data pipelines at low cost and selling them “as a service” to corporations.
On top of that, the Snowflake Marketplace significantly simplifies the integration of pipelines for corporate accounts. Offering a one-click process removes many of the barriers to collaboration between startups and larger organizations.
However, I still have the following concerns:
Trust: will corporations trust startups to maintain these critical data sources?
Dependence on Snowflake: this model is tightly coupled to Snowflake. What if Snowflake starts charging a significant fee per query, as Apple does with the App Store?
Language Models (LLMs) and Data Catalogs: A Powerful Duo for Data Discovery
I read an interesting analogy about how LLMs could be applied to data catalogs (unfortunately, I cannot find the source anymore :( ).
The evolution of language models (LLMs) has the potential to revolutionize data cataloging, much like how Google disrupted Yahoo's manual approach to web indexing.
The current process of building data catalogs in companies is similar to Yahoo's early efforts—central teams tediously list and document all existing data, often with limited incentives for consistency.
Just as Google revolutionized and scaled search, LLMs excel at understanding documentation, paving the way for new opportunities in decentralized data catalogs.
Instead of a centralized, manual approach, data providers could maintain their own data product documentation. A central LLM could then scrape all available documentation and make it accessible through a unified chat interface.
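To make the idea concrete, here is a minimal sketch of that decentralized model: each producer maintains its own documentation, and a central service only aggregates and searches it. All names here (`ProducerDoc`, `catalog_search`) are illustrative, and a real system would use an LLM with embeddings rather than the crude keyword matching below.

```python
from dataclasses import dataclass

@dataclass
class ProducerDoc:
    team: str          # the data producer who owns and maintains this doc
    dataset: str
    description: str

# Documentation lives with each producer; the catalog only aggregates it.
docs = [
    ProducerDoc("payments", "transactions",
                "Daily card transactions, one row per charge."),
    ProducerDoc("growth", "signups",
                "New user signups with acquisition channel."),
]

def catalog_search(query: str, docs: list[ProducerDoc]) -> list[str]:
    """Return datasets whose documentation mentions any query term."""
    terms = query.lower().split()
    return [d.dataset for d in docs
            if any(t in d.description.lower() for t in terms)]

print(catalog_search("card charge", docs))  # ['transactions']
```

The point of the sketch is the ownership model, not the search: because each `ProducerDoc` stays with its team, the central layer never needs to document anything itself.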
I think this model could fit perfectly in the data-mesh trend: in a data-mesh organization, data producers are responsible for creating, maintaining, and documenting their data products. By keeping documentation close to the data producer, it becomes easier to maintain accuracy and relevance.
Language Models (LLMs) in production
After so many cool demos across LinkedIn and Twitter, the next step for the LLM industry is to move these models from demonstration to production.
And that's where data engineers come into the game!
I found an excellent article that does a great job of explaining what it takes to get LLMs into production:
The authors discuss the following three challenges when it comes to implementing language models in production:
Prompt management: Currently, there is no framework for evaluating, versioning, and optimizing prompts. As prompting becomes a new way of programming, we will need management systems for prompts similar to what we have today for code. Look at what a prompt looks like nowadays!
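As a thought experiment, a prompt "version control" could look like the sketch below. The names (`PromptStore`, `register`, `get`) are hypothetical; no such standard framework exists yet, which is exactly the gap being pointed out.

```python
import hashlib

class PromptStore:
    """A toy versioned store for prompt templates, by analogy with code VCS."""

    def __init__(self):
        self._versions: dict[str, list[str]] = {}

    def register(self, name: str, template: str) -> str:
        """Store a new version of a prompt and return a short content hash."""
        self._versions.setdefault(name, []).append(template)
        return hashlib.sha256(template.encode()).hexdigest()[:8]

    def get(self, name: str, version: int = -1) -> str:
        """Fetch a specific version (default: latest) for reproducible runs."""
        return self._versions[name][version]

store = PromptStore()
store.register("summarize", "Summarize the following text:\n{text}")
store.register("summarize", "Summarize in 3 bullet points:\n{text}")

print(store.get("summarize", version=0))  # first version
print(store.get("summarize"))             # latest version
```

Pinning a prediction to a prompt hash is what would let you A/B-test and roll back prompts the way we already do with code deployments.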
Cost: Estimating the cost of running models in production can be difficult, as the OpenAI API charges for both input and output tokens. Depending on the application the token flow may be too costly: “If you use GPT-3.5-turbo with 4k tokens for both input and output, it’ll be $0.004 / prediction. As a thought exercise, in 2021, DoorDash ML models made 10 billion predictions a day. If each prediction costs $0.004, that’d be $40 million a day!”
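The quoted back-of-the-envelope estimate is easy to reproduce; both figures below ($0.004 per prediction and 10 billion predictions per day) come straight from the passage above.

```python
# Reproducing the article's back-of-the-envelope cost estimate.
cost_per_prediction = 0.004           # USD, input + output tokens combined
predictions_per_day = 10_000_000_000  # DoorDash-scale, per the quote

daily_cost = cost_per_prediction * predictions_per_day
print(f"${daily_cost:,.0f} per day")  # $40,000,000 per day
```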
Prompting vs. fine-tuning dilemma: The challenge here is to choose the approach that gives the right balance between cost and quality/control for your use case.
Prompting: For each sample, you explicitly tell your model how it should respond. This approach can be more flexible but may require a lot of trial and error to find the right prompt.
Fine-tuning: You train a model on how to respond, so you don't have to specify that in your prompt. This approach can lead to more consistent responses but may require additional resources for training and maintaining the model.
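One concrete way to see the cost side of this trade-off: a few-shot prompt pays for its examples on every single call, while a fine-tuned model pays a one-time training cost and then sends only the input. The token counting below is a crude word-count proxy (an assumption for illustration; a real estimate would use an actual tokenizer).

```python
def words(text: str) -> int:
    # Crude token proxy, for illustration only.
    return len(text.split())

# Few-shot prompting: the examples travel with every request.
few_shot_examples = (
    "Review: great food -> positive\n"
    "Review: cold and late -> negative\n"
)
user_input = "Review: friendly staff ->"

prompt_tokens_few_shot = words(few_shot_examples) + words(user_input)
# Fine-tuning: the examples are baked into the weights, so only the
# input is sent per call.
prompt_tokens_fine_tuned = words(user_input)

print(prompt_tokens_few_shot, prompt_tokens_fine_tuned)  # 15 4
```

At billions of predictions per day, that per-call overhead is exactly what can tip the balance toward fine-tuning despite its upfront training cost.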
If you're interested in the field of language models, this article provides a comprehensive overview of the current state.
The field is relatively new, and there's still much to be defined. It's likely that we'll see a similar evolution as what occurred in deep learning from 2015 to 2020. New tools and best practices will emerge to enable the cost-effective, safe, and secure operation of these models.
How Google’s perception has changed
On Monday, Geoffrey Hinton, VP and Engineering Fellow at Google, announced his departure.
This event made me think of how quickly things can change in the corporate world.
Just a few months ago, Google was seen as a leading tech company. That perception has shifted, with many people now viewing the company as having been caught off guard by the latest AI innovations.
Here's an interesting article that explains the changing landscape of the tech industry and how it might affect Google's position:
Thanks for reading,
-Ju
I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked and the topics you would like me to tackle.
P.S. you can reply to this email; it will get to me.