Event Notifications vs Object Lambda
Enterprises are struggling to collect, store and manage the huge data volumes required for data lakehouses, analytics and AI/ML. To make this process less cumbersome, we add features to MinIO on an almost-weekly basis. Two features we talk about often are MinIO’s Event Notifications and Object Lambda. A while back we wrote a blog post on how you can use Bucket Event Notifications with Kafka to automate ETL pipelines asynchronously, and more recently one on how you can scrub sensitive data from an object using Object Lambda for regulatory compliance.
As we were writing these blogs, we found ourselves asking why there are two different features that do almost the same thing. Or do they? What is the difference between the Greek Lambda and the Lightning Bolt? Give the two blogs above a read, as we will refer to them later to explain the rationale behind each feature.
Event Notifications
MinIO bucket notifications allow administrators to send notifications to external services such as Kafka or RabbitMQ for certain object or bucket events. MinIO supports bucket and object-level S3 events similar to the Amazon S3 Event Notifications.
These events can be operations such as:
- Adding an object to a bucket
- Accessing an object in a bucket
- Deleting and removing an object
- Creating and deleting buckets
There are several others as well. You can find the entire list of supported event types in the docs.
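To make the shape of an event concrete, here is a minimal sketch of how a consumer might pull the interesting fields out of an S3-style notification payload. The sample payload, bucket name and object key are made up for illustration, and only a few well-known fields are shown; check the docs for the full structure your target actually receives.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical sample payload in the S3-style notification shape.
# The bucket name and object key are invented for this sketch.
SAMPLE_EVENT = json.dumps({
    "Records": [
        {
            "eventName": "s3:ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "images"},
                "object": {"key": "raw/cat+photo.png", "size": 1024},
            },
        }
    ]
})


def summarize_event(payload: str) -> list[tuple[str, str, str]]:
    """Return (event name, bucket, decoded key) for each record."""
    records = json.loads(payload).get("Records", [])
    summary = []
    for record in records:
        s3 = record["s3"]
        # Object keys are typically URL-encoded in notifications,
        # so decode them before using them as object names.
        key = unquote_plus(s3["object"]["key"])
        summary.append((record["eventName"], s3["bucket"]["name"], key))
    return summary


print(summarize_event(SAMPLE_EVENT))
```

A consumer reading from Kafka or RabbitMQ would call something like `summarize_event` on each message body and then decide what to do based on the event name.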
Moreover, MinIO supports two modes of message delivery: asynchronous and synchronous.
With asynchronous delivery, MinIO fires the event at the configured remote and does not wait for a response before continuing to the next event. Asynchronous bucket notification prioritizes sending events with the risk of some events being lost if the remote target has a transient issue during transit or processing.
With synchronous delivery, MinIO fires the event at the configured remote and then waits for the remote to confirm a successful receipt before continuing to the next event. Synchronous bucket notification prioritizes delivery of events with the risk of a slower event-send rate and queue fill.
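The trade-off between the two modes can be illustrated with a toy model. This is not how MinIO implements delivery; `FlakyTarget`, `deliver_async` and `deliver_sync` are hypothetical names, and the "remote" is just an in-memory object that fails on chosen attempts.

```python
from collections import deque


class FlakyTarget:
    """Hypothetical remote target that fails on the given attempt numbers."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.attempts = 0
        self.received = []

    def send(self, event):
        self.attempts += 1
        if self.attempts in self.fail_on:
            return False  # simulated transient failure
        self.received.append(event)
        return True


def deliver_async(events, target):
    """Fire each event and move on without checking the outcome."""
    for event in events:
        target.send(event)  # result ignored: a failed send is a lost event
    return target.received


def deliver_sync(events, target, max_retries=3):
    """Wait for confirmation before moving on to the next event."""
    pending = deque(events)
    while pending:
        event = pending[0]
        for _ in range(max_retries):
            if target.send(event):
                pending.popleft()
                break
        else:
            raise RuntimeError(f"could not deliver {event!r}")
    return target.received


events = ["a", "b", "c"]
print(deliver_async(events, FlakyTarget(fail_on={2})))  # "b" is lost
print(deliver_sync(events, FlakyTarget(fail_on={2})))   # "b" is retried
```

In the asynchronous run the second event disappears, while the synchronous run delivers all three at the cost of an extra attempt and a queue that grows while the remote is struggling.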
Event Notifications are a great fit for the following scenarios:
- ETL jobs that need to wait on a large CSV file to start processing.
- Fanning out objects to be processed by multiple applications.
- Processing workloads at different times such as cron jobs.
One more thing worth knowing: you can generally use any programming language for the different services that process the data in different parts of the infrastructure. With Object Lambda, as we’ll see, you can only write in Python.
Object Lambda
Speaking of Object Lambda, let's define what a “Lambda” is. Forget about all the fancy jargon that you know about Lambda and go back to the basics. Think about what a Lambda is in Python. It's a small anonymous function that can be reused several times. It's as simple as that.
For example, open a Python REPL and run the following:
upper = lambda string: string.upper()
print(upper("dataone"))
This will print DATAONE. Now if you have 'datatwo', instead of redefining string.upper(), you can just use upper():
print(upper("datatwo"))
Can you guess what this would print?
Yup, you got it: DATATWO. So you have a reusable function without having to define it using the def keyword. But be careful: don’t make these too long or multi-line, as they are generally meant to be single-line expressions that are simple and easy to read. If you are doing anything more than a line, just use a regular function with def. The advantage of this approach is that if you want to change the output, instead of changing it in all your scripts you only have to update the place where the original lambda was defined.
Now let’s apply the same concept to objects. Let’s say you have sensitive data such as credit card numbers in a large CSV file. Before the object is returned to whoever requested it, you want to ensure the full credit card number is redacted and only the last 4 digits are shown. While you can use Event Notifications to achieve this, you need to think about a few things:
- Do you have multiple processes reusing the same code, and do you want a single source of truth?
- Do you want to set up entirely new infrastructure for the messaging server cluster and take on the maintenance overhead and tech debt that come with it? Think of this as using the `def` function in Python. Do you really need it?
If you are just modifying an existing object (called “Transform” in Object Storage parlance), you can simply write a function in Python. When your application requests a particular object, it only has to reference the name of the object and the name of the lambda function, and the object is transformed according to the Python snippet. Unlike Python lambdas, though, you can use multiple lines of Python code to define your Lambda, with the one caveat that the function should return only a single response.
So in the case of our credit card data, it makes more sense to simply use a few lines of Python to modify the data before returning it to the end user.
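As a rough idea of what those few lines could look like, here is a sketch of the transformation step on its own. It assumes the CSV has a header row and a column named `card_number`; both the column name and the sample rows are hypothetical. Wiring the function into an actual Object Lambda handler is left out here.

```python
import csv
import io


def redact_card_numbers(csv_text: str, column: str = "card_number") -> str:
    """Mask all but the last four digits of every value in `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Drop separators such as dashes or spaces, keep only digits.
        digits = "".join(ch for ch in row[column] if ch.isdigit())
        # Values with fewer than five digits are masked entirely.
        visible = digits[-4:] if len(digits) > 4 else ""
        row[column] = "*" * (len(digits) - len(visible)) + visible
        writer.writerow(row)
    return out.getvalue()


# Made-up sample data using well-known test card numbers.
sample = "name,card_number\nAda,4111-1111-1111-1111\nLin,5500 0000 0000 0004\n"
print(redact_card_numbers(sample))
```

Every application that requests the object through the Lambda gets the same masked output, which is the single source of truth described above.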
Which one to use?
Now that we’ve established the differences between the two features and the rationale behind them, could we have used Object Lambda for resizing images instead of Kafka, or could we have used Event Notifications with Kafka to redact sensitive data? Let’s think out loud here about the various use cases.
First, let's talk about the Image Resizer blog. If we were to do the image resizing in Object Lambda, we could reuse the same Python code inside the Lambda’s Python function. That works well if the goal is just to retrieve a thumbnail version of the larger image occasionally, say a one-off resize when we want a quick preview. But what if you are using a CDN and want to distribute these thumbnails across the globe so that they load a lot faster, accepting that the complete version will take an extra second or two when the user requests it later? In this case, if we were to use a Lambda function every time a thumbnail is requested, the end user on the website would have to wait the additional time it takes to resize the image before it is displayed. This totally defeats the purpose of a CDN. Instead, you would run an ETL job on a regular basis to resize images and make them available to the CDN so the user experience is near instantaneous.
Now let’s take a look at the blog where we scrub sensitive data. Doing this with Event Notifications is definitely possible, but you have to think about the following caveats:
- Do you really want to set up an entire new infrastructure just to modify some data in an existing object? Frankly as a practicing DevOps engineer that seems like overkill.
- If an important function like redacting data is mission critical, then you want it redacted as early in the request as possible.
- For instance, you can use ETL to sanitize the data and create an entirely new object that all applications have access to. But if the data is several terabytes, then you are using up valuable storage space for just one redaction. What about redacting other things like people's last names? Would we need a separate data set for that? As you can see, it can get unwieldy quite quickly.
- You can also sanitize this data at the application level after you’ve retrieved it but how can you guarantee that all applications are redacting the data appropriately? Each application can have its function written in its own way which differs slightly, and that can make debugging a nightmare.
After considering all options, the logical approach seems to be something in the middle, small, light and agile, that can quickly redact the sensitive data no matter which application is requesting it. Moreover, you can apply different lambda functions to redact just the right amount of data based on application needs and regulatory requirements. This kills two birds with one stone and provides the following benefits:
- First you ensure there is no duplication of the data taking up valuable space and causing needless confusion.
- Second you can ensure everyone is using the same standard way of transforming the data no matter the team, application or organization.
- Bonus: If you have to change logic in the codebase, you don’t need a project to “find the places where this code is being altered”. Rather, you only have to modify the code in a single place, and all the applications will get the newer version of the transformed data on their next call of the lambda function.
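The "different lambda functions for different needs" idea can be sketched as a small registry of named transforms. The transform names `card_last4` and `card_full`, the `TRANSFORMS` registry and the 16-digit pattern are all assumptions made for this illustration, not part of MinIO.

```python
import re
from typing import Callable, Dict


def mask_card_keep_last4(text: str) -> str:
    """Mask 16-digit runs but keep the last four digits visible."""
    return re.sub(r"\b\d{12}(\d{4})\b", r"************\1", text)


def mask_card_fully(text: str) -> str:
    """Mask 16-digit runs completely."""
    return re.sub(r"\b\d{16}\b", "*" * 16, text)


# Single place where each redaction level is defined (hypothetical names).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "card_last4": mask_card_keep_last4,
    "card_full": mask_card_fully,
}


def apply_transform(name: str, text: str) -> str:
    """Apply the named transform; unknown names raise KeyError."""
    return TRANSFORMS[name](text)


record = "card=4111111111111111"
print(apply_transform("card_last4", record))
print(apply_transform("card_full", record))
```

Each application picks the transform that matches its requirements, and a fix to one function reaches every caller the next time it runs.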
As you can see, while there are subtle differences and some overlap between the two, it is important to plan out the design for your specific use case(s) and application(s). Our engineers on SUBNET help our customers make architecture decisions exactly like these every day! If you want your storage infrastructure supercharged or have any questions about either Event Notifications or Object Lambda, be sure to reach out to us on Slack!