Welcome to the exciting world of data engineering!
Every week, you'll learn from real-world engineering as we delve into building and maintaining data platforms.
From the technical challenges of design and implementation to the business considerations of working with clients, you'll get a behind-the-scenes look at what it takes to be a successful data engineer. So sit back, relax, and get ready to be inspired as we explore the world of data together!
Article of the week
A very interesting article explaining the distinction between an app deployed:
on Kubernetes
on AWS with containerized applications
in AWS with managed services
The author takes the example of a simple voting app coded in Ruby.
He writes:
“What stood out over the years from experience is that the amount of code that I am responsible for (e.g. the base container image, the ruby runtime and my own code) is incredibly disproportional when compared to the amount of business logic I need. In fact, if you think about that, the business logic for a voting application is very much A=A+1. That’s all the logic I need, really.”
Indeed, he has to manage 224 MB of code liability (the size of the Docker image) for business logic that boils down to A=A+1.
He explains in detail how he migrated from the Ruby app to an AWS Step Functions workflow, reducing the code he owns from 224 MB to a simple JSON file that configures the workflow.
Quite impressive!
AWS Step Functions is a serverless orchestrator that lets you coordinate AWS services.
Instead of coding a full application, you configure and coordinate different services managed by AWS: an event queue, a Lambda function, an S3 bucket, a write/update operation to DynamoDB, etc.
Example for the “write to DynamoDB” operation in the voting application:
{
  "StartAt": "restaurantdbupdate",
  "States": {
    "restaurantdbupdate": {
      "End": true,
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:updateItem",
      "Parameters": {
        "TableName": "StepFunctionsStack-yelbddbrestaurants70424F48-1NVUR508DKKL0",
        "Key": {
          "name": {
            "S.$": "$.restaurant_name"
          }
        },
        "UpdateExpression": "SET restaurantcount = restaurantcount + :incr",
        "ExpressionAttributeValues": {
          ":incr": {
            "N": "1"
          }
        }
      }
    }
  }
}
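As a side note, here is a minimal sketch of how you could kick off an execution of such a state machine from code using boto3; the state machine ARN and the input payload are placeholders of mine, not values from the article.

# Sketch: starting an execution of the voting state machine with boto3.
# The ARN and the restaurant name are placeholders, not values from the article.
import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:voting-app",
    input=json.dumps({"restaurant_name": "ihop"}),
)
print(response["executionArn"])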
As a developer, it saves you the pain of handling a massive backend framework that would be complete overkill for such a simple application.
You also get scalability for “free”, as AWS handles everything for you in the background.
The author introduces the MTBU (Mean Time Between Updates) metric: a larger MTBU means less time spent maintaining the code, and is therefore better.
With a containerized version of the code, the container image needs to be curated regularly, which leads to an MTBU measured in months.
With a managed AWS service, where AWS handles the curation, the MTBU is, by contrast, almost infinite.
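As a rough illustration (my own back-of-the-envelope sketch, not a formula from the article), you can estimate MTBU as the average gap between the updates you had to ship:

# Back-of-the-envelope MTBU estimate: average gap between shipped updates.
# The update dates below are made up for illustration.
from datetime import date

updates = [date(2022, 1, 10), date(2022, 3, 2), date(2022, 5, 20), date(2022, 8, 1)]

intervals = [(b - a).days for a, b in zip(updates, updates[1:])]
mtbu_days = sum(intervals) / len(intervals)
print(f"MTBU ≈ {mtbu_days:.0f} days")  # ≈ 68 days, i.e. an MTBU measured in months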
On the downside, one could argue that you are more tightly coupled to AWS, which could lead to vendor lock-in.
With a Docker container, on the other hand, you can move to another cloud provider relatively easily.
It then becomes a question of priority: do you prioritize speed of building over mitigating migration risk?
Tool of the week: Airbyte
Airbyte is an open-source data integration platform that allows users to easily transfer data between different sources and destinations.
It supports a wide range of data sources and destinations and provides a simple and flexible way to configure and schedule data transfers.
The key differentiator between ETL tools has always been (and will always be) the number of connectors.
Airbyte leverages the power of open-source to maintain and develop a collection of pre-built connectors:
300 sources: Google Analytics, Salesforce, Hubspot, Notion, etc.
30 destinations: MySQL, Redshift, Snowflake, S3, etc.
Maintaining 5-10 pipelines is fine, but managing more than 50 of them (the “long tail”) quickly becomes a headache for any data engineering team.
Airbyte addresses this issue by decentralizing maintenance efforts among all users of these connectors.
This is the benefit of using open-source products: if a connector breaks for one team, it breaks for everyone. And if one team fixes it, it is fixed for everyone.
Quite interesting in terms of risk mitigation…
On top of these connectors, you can even write data transformations with DBT directly within Airbyte and trigger typical data-pipeline scenarios, such as full loads and incremental loads.
As a developer, you can use the CLI or GUI provided by Airbyte to build and configure your pipeline.
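For instance, a self-hosted Airbyte instance also exposes an HTTP API, so a sync can be triggered programmatically. The sketch below rests on my own assumptions (a local instance on port 8000 and a placeholder connection ID); check the endpoint and payload against your Airbyte version’s API docs.

# Sketch: triggering a manual sync of an existing Airbyte connection over its HTTP API.
# Assumes a self-hosted instance reachable on localhost:8000; the connection ID is a
# placeholder, and the endpoint/response shape should be verified for your version.
import requests

AIRBYTE_URL = "http://localhost:8000/api/v1"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

resp = requests.post(
    f"{AIRBYTE_URL}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job"]["status"])  # e.g. "running"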
Like most open-source projects, they offer two plans:
a free self-hosted version with all the connectors available
a cloud-hosted version (credit system)
Their vision is interesting: they aim to be the leading player in the ETL industry as well as in the emerging field of reverse ETL. To further this goal, they acquired Grouparoo last year to facilitate the movement of data from the analytics layer back to production systems.
Airbyte is a promising and helpful tool if you're starting from scratch to build data pipelines and integrate classical data sources, as listed in the connector marketplace.
However, if you already have specific integration patterns in place in your platform and/or you are not already using DBT, migrating to Airbyte may not be worth it.
Thank you for reading.
-Ju
I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked and the topics you would like me to tackle.
P.S. You can reply to this email; it will get to me.