Welcome to the exciting world of data engineering!
Every week, you'll get a chance to learn from real-world engineering as you delve into the fascinating world of building and maintaining data platforms.
From the technical challenges of design and implementation to the business considerations of working with clients, you'll get a behind-the-scenes look at what it takes to be a successful data engineer. So sit back, relax, and get ready to be inspired as we explore the world of data together!
Design Pattern of the week: idempotency
What is idempotency?
In mathematics and computer science, an operation or function is "idempotent" if applying it multiple times has the same effect as applying it once: f(f(x)) = f(x). In practice, this means that calling an idempotent function repeatedly with the same input always leaves the system in the same state as a single call.
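A quick illustration in Python (the function and field names are made up for the example):

```python
def mark_as_paid(order: dict) -> dict:
    # Idempotent: applying it once or ten times leaves the order in the same state.
    return {**order, "status": "PAID"}

def record_payment(ledger: list, order_id: str) -> list:
    # Not idempotent: every call appends a new entry, so retries change the result.
    return ledger + [{"order_id": order_id, "type": "payment"}]

order = {"id": "42", "status": "PENDING"}
assert mark_as_paid(mark_as_paid(order)) == mark_as_paid(order)  # f(f(x)) == f(x)
assert record_payment(record_payment([], "42"), "42") != record_payment([], "42")
```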
Why should your Lambda function be idempotent?
It’s a common design pattern in data engineering to use SQS queues in combination with Lambda functions. This can provide several benefits, such as decoupling different parts of your system, increasing scalability and reliability, and simplifying the management of data processing tasks.
However, SQS guarantees "at least once" delivery, which means that messages may be delivered multiple times. If a Lambda function processes these messages, it is important to make sure that the function is idempotent, so that processing the same message several times has the same effect as processing it once.
On top of that, your application may generate the same event several times in some edge cases: a user clicks the pay button multiple times, network problems slow down your Lambda and trigger retries, etc.
Being able to handle these use cases makes your platform more robust and reliable.
Design best practices to ensure idempotency
You can ensure idempotency by taking the following measures:
Lambda does not generate an error if it processes a duplicate event
It seems pretty obvious, but if that's not the case, the SQS queue will keep retrying the message until it lands in the dead-letter queue. You will end up with unnecessary retries and with messages in the dead-letter queue that should not be there (the first processing was correct). A simple approach is to detect the duplicate and return successfully, as in the sketch below.
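Here is a minimal sketch of that idea. It assumes each message body carries a hypothetical event_id field; the in-memory set is only a stand-in for the persistent store discussed further down:

```python
import json

PROCESSED_IDS = set()  # stand-in for a persistent store such as DynamoDB

def process(body: dict) -> None:
    """Hypothetical business logic."""
    print(f"Processing event {body['event_id']}")

def handler(event, context):
    for record in event["Records"]:
        body = json.loads(record["body"])
        event_id = body["event_id"]
        if event_id in PROCESSED_IDS:
            # Duplicate: log and skip instead of raising, so SQS will not retry
            # a message that was already handled correctly the first time.
            print(f"Skipping duplicate event {event_id}")
            continue
        process(body)
        PROCESSED_IDS.add(event_id)
    return {"statusCode": 200}
```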
Lambda does not reach timeout
If you process events in batches, you may reach the 15-minute Lambda timeout before the end of the batch. The whole batch is then marked as failed and all of its events are sent back to the SQS queue.
A best practice here is to keep track of how much processing time the batch has already consumed.
At the beginning of each iteration, estimate whether enough time is still available; if not, send the remaining events back to the queue (see my previous article about partial batch failure). A sketch follows below.
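Here is a minimal sketch, using the Lambda context's get_remaining_time_in_millis() and the partial batch response format. The 30-second safety margin is an arbitrary assumption, and ReportBatchItemFailures must be enabled on the event source mapping:

```python
import json

SAFETY_MARGIN_MS = 30_000  # assumed per-message budget; tune it to your workload

def process(body: dict) -> None:
    """Hypothetical business logic."""
    print(f"Processing {body}")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        # Not enough time left to safely process another message: hand the
        # remaining messages back to SQS via a partial batch response.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            failures.append({"itemIdentifier": record["messageId"]})
            continue
        process(json.loads(record["body"]))
    return {"batchItemFailures": failures}
```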
Use the SQS deduplication ID for SQS FIFO queues
A deduplication ID is a unique value that you choose and assign to a message when you send it to an SQS FIFO queue; it is used to detect and eliminate duplicates.
When SQS receives a message with a deduplication ID, it checks the queue to see if a message with the same deduplication ID has already been received (and possibly deleted). If a matching message is found, SQS considers the new message to be a duplicate and doesn't enqueue it. This can help ensure that your messages are delivered to the queue only once and prevent unwanted duplication.
The deduplication interval, i.e. the window during which SQS remembers previously seen deduplication IDs, is fixed at 5 minutes for FIFO queues: a message sent with a deduplication ID already seen during the last 5 minutes is accepted but not delivered again.
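Sending a message with an explicit deduplication ID looks roughly like this (the queue URL, group ID, and deduplication ID are made up for the example):

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical FIFO queue; FIFO queue names must end in ".fifo".
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/payments.fifo"

sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody='{"order_id": "42", "amount": 99.90}',
    MessageGroupId="order-42",                  # ordering scope, required on FIFO queues
    MessageDeduplicationId="payment-order-42",  # duplicates with the same ID are dropped
)
```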
Lambda knows the status of an event
Using a persistent storage layer like DynamoDB is a good practice for keeping track of the state of events processed by your Lambda function.
By keeping track of the processing status of events in DynamoDB, you can ensure that your Lambda function only performs work that is necessary and avoids duplicating work that has already been done.
For example, when a Lambda function processes an event, it can check the status of that event in DynamoDB before performing any work. If the event is already in the "IN PROCESSING" state, the Lambda function can simply return without performing the API call or any other work.
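One way to implement this check atomically is a DynamoDB conditional write: the first invocation to claim an event wins, and any duplicate sees a failed condition. A sketch, assuming a hypothetical processed-events table keyed by event_id:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("processed-events")  # hypothetical table, partition key: event_id

def claim_event(event_id: str) -> bool:
    """Mark the event as IN PROCESSING; return False if it was already claimed."""
    try:
        table.put_item(
            Item={"event_id": event_id, "status": "IN PROCESSING"},
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another invocation already handled (or is handling) this event
        raise
```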
Additionally, storing the state of the event in DynamoDB can also be useful for other purposes, such as tracking the progress of long-running tasks, auditing the execution of Lambda functions, and retrying failed tasks.
DynamoDB is a good fit for this kind of use case because it is highly scalable when the access pattern is predictable (which is the case here: you will query the event by timestamp/type).
[Illustration of an idempotent sequence, from the linked video]
Additionally, building this persistence layer is a good time investment when building your platform. It contributes to better observability, and you will thank yourself the day you need to understand and urgently fix a Lambda error in production.
To conclude, by taking these measures, you can ensure that your Lambda function produces the same result even if it receives the same message multiple times.
This will help ensure that your data processing tasks are accurate, reliable, and scalable. My experience in building large-scale data platforms has taught me that ignoring these patterns quickly leads to scalability problems, so they should be integrated as soon as possible.
Next week, we will have a look at the library aws-lambda-powertools and how it helps to implement these design patterns in a couple of lines.
Video/Article of the week
Excellent presentation from AWS developer Benjamin Smith about the latest evolutions of Step Functions and how you can use them to coordinate complex workflows.
Tool of the week
I came across Datafold this week.
This tool allows you to compare tables across different databases. It uses a bisection + checksum approach to identify differences: it divides the table into segments and computes a checksum for each. If a delta is found for a segment, that segment is split into sub-segments until the problematic rows are found.
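This is not Datafold's actual implementation, but a toy Python sketch of the bisection + checksum idea. It assumes both tables have the same row count and are ordered by primary key; in practice the checksums would be computed inside each database:

```python
import hashlib

def checksum(rows) -> str:
    """Order-sensitive checksum over a segment of rows."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def find_diffs(source_rows, target_rows, offset=0) -> list:
    """Recursively bisect segments whose checksums differ, down to single rows."""
    if checksum(source_rows) == checksum(target_rows):
        return []
    if len(source_rows) <= 1:
        return [offset]  # index of a mismatching row
    mid = len(source_rows) // 2
    return (find_diffs(source_rows[:mid], target_rows[:mid], offset)
            + find_diffs(source_rows[mid:], target_rows[mid:], offset + mid))

# Toy usage: in reality the rows come from the two databases, ordered by primary key.
src = [(1, "a"), (2, "b"), (3, "c")]
dst = [(1, "a"), (2, "x"), (3, "c")]
print(find_diffs(src, dst))  # -> [1]
```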
This tool can be really useful for migration projects where you want to compare the migrated data with the source table in the legacy system.
Thank you for reading.
-Ju
I would be grateful if you could help me improve this newsletter. Don't hesitate to share what you liked or disliked, and the topics you would like to see tackled.
P.S. You can reply to this email; it will get to me.