Using Splunk to Monitor MinIO - A Tutorial

Overview

MinIO and Splunk have a symbiotic relationship when it comes to enterprise data. Splunk uses MinIO in its Data Stream Processor, and MinIO serves as a Splunk SmartStore endpoint.

In this post we explain how to use Splunk's advanced log analytics to understand the performance of the MinIO object storage suite and the data under management. First, a quick level-set for those new to MinIO and/or Splunk.

MinIO is a high performance, Amazon S3 compatible, distributed object storage system. By following the methods and design philosophy of hyperscale computing providers, MinIO delivers high performance and scalability to a wide variety of workloads in the private cloud. Because MinIO is purpose-built to serve only objects, a single-layer architecture achieves all of the necessary functionality without compromise. The advantage of this design is an object server that is simultaneously performant and lightweight.

Splunk is a high performance event processing platform for enterprise computing environments that provides critical and timely insight into IT operations, including data from IoT, firewalls, web servers and more. Splunk recently introduced a feature called SmartStore that enables Splunk to offload petabytes of data to an external Amazon S3 compatible object store. Disaggregating compute and storage frees Splunk nodes to focus on indexing and search, while the object storage is free to focus on the management, resilience and security of the data. While this post focuses on using Splunk to manage MinIO notifications and logs, you can check out our detailed SmartStore post here.

MinIO Notification 101

MinIO allows administrators to configure various types of notifications, including audit logs, which give details about any API activity that happens within the cluster (creating new buckets, adding or deleting objects, ListBucket calls, etc.), and MinIO server logs, which give details about errors that happen on the server.

At MinIO, we believe that simplicity scales, and this philosophy extends to our logging. We only log critical errors. Info- and warn-level logging only floods the logs with white noise that is rarely, if ever, helpful. Given this approach, we know that if we see an error in MinIO, it is something that needs to be addressed.

Splunk can help us understand and optimize MinIO's performance. Using Splunk's powerful log analysis tools, we have insight into what is happening within our MinIO cluster, and what is happening to the data that lives there.

Setting up MinIO

For this demonstration, we will need a MinIO server (or cluster) running, and we will do administrative actions against that server with the MinIO client (mc). To configure MinIO server, see the quickstart guide.

https://docs.min.io/docs/minio-quickstart-guide.html

For setting up mc, see the mc quickstart guide

https://docs.min.io/docs/minio-client-quickstart-guide.html
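
With both installed, point mc at your server under an alias. We use the alias myminio throughout this post; the endpoint and credentials below are placeholders for your own deployment (on older mc releases the equivalent command is mc config host add):

mc alias set myminio http://localhost:9000 <access-key> <secret-key>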

Setting up the Splunk HTTP Event Collector

Splunk allows the collection of events directly at an HTTP(S) endpoint, which it refers to as HEC, the HTTP Event Collector (documented here).

This requires a few simple steps: configure the HEC global settings, create a token, and configure the index you want the data to reside in.

From the Splunk Web UI, go to Settings -> Data inputs -> HTTP event collector

Under Global Settings, set All Tokens to "Enabled" and change the Default Source Type to _json.

Other settings can be left at their defaults. Since this is a test environment without TLS enabled, we will uncheck the "Enable SSL" box before saving and continuing.

After this is complete, click "New Token" and follow the wizard.

On the next page of the wizard, "Input settings", click "Create new index"

For the name, choose something like minio_audit.

Change the default index to the index you just created, and confirm it appears in the "Selected items" box on the right-hand side so that your events go to the newly created index.

Click Review, then Submit.

You will see the newly created token

Copy this value to use in the next step. You can view the token again later by going to Settings -> Data Inputs -> HTTP Event Collector.

Splunk configuration is now complete. We can test and make sure things are working using curl before we move on to the MinIO side of things.

curl http://localhost:8088/services/collector/raw -H "Authorization: Splunk 2d30fe74-f448-4fc4-9522-6679e5377658" -d '{"event": "hello world"}'

{"text":"Success","code":0}

Note the “Success” status at the end. Now we can see the event by searching for index="minio_audit".

Now we know that we are using the correct token, and sending data to the correct index.

Configuring MinIO Audit Notifications

After verifying that the HEC endpoint is properly configured, we can use mc to take a look at the existing configuration for our audit notifications.

First, let’s take a look at all the possible configurations using mc.
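
Assuming the myminio alias we set up earlier, the following prints every configuration subsystem along with its current values:

mc admin config get myminio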

There are a lot of configuration options, but for now we are just concerned with audit_webhook. We will use this to tell MinIO to send its events to Splunk.

First, let’s take a look at what is already configured:
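
We can narrow the same command to just the audit_webhook subsystem:

mc admin config get myminio audit_webhook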

Here we see audit_webhook is not enabled, and doesn’t have an endpoint or auth_token defined. Let’s set those now using the values we got from Splunk.
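
A sketch of that command, with the Splunk hostname as a placeholder and the auth_token mirroring the Authorization header format from the curl test above (substitute your own HEC token):

mc admin config set myminio audit_webhook endpoint="http://splunk-host:8088/services/collector/raw" auth_token="Splunk 2d30fe74-f448-4fc4-9522-6679e5377658"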

As the message states, we need to restart the MinIO cluster, which can be done via the mc tool. When the server comes back up, we can verify the settings took hold.
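
Both steps can look something like this:

mc admin service restart myminio

mc admin config get myminio audit_webhook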

Everything looks correct so far. First, let’s take a look at the buckets we have in our setup. If you don’t have a bucket configured yet, you can create one with mc mb myminio/mybucket. Here, we have a simple test setup:
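
Listing the buckets with mc (your output will of course differ):

mc ls myminio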

Just a single bucket called “testevents”

Now, let’s upload an object to that bucket
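
For example, copying a local file named mytestobject (the object we refer to later in this post) into the bucket:

mc cp mytestobject myminio/testevents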

Going back to the Splunk search interface and searching for this object shows us the audit notification:

As with any audit logging, things can get pretty chatty. We can filter out a lot of that noise with a search that strips out the typical day-to-day activity, using this search in the Splunk Web UI:
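
One possible form of that search, assuming the default _json extraction where the API call name lands in the api.name field (adjust the excluded calls to taste):

index="minio_audit" NOT (api.name="ListBuckets" OR api.name="ListObjectsV1" OR api.name="ListObjectsV2" OR api.name="GetBucketLocation" OR api.name="HeadBucket")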

Now we have filtered out all the day-to-day operations an admin might have been doing, like listing buckets. This will show us the things we care most about, like objects or buckets being created or deleted. Let's investigate the logs a little further.

Take this example, seen when I use the search index="minio_audit" NOT api.statusCode=200:

This is an event that is just a little bit different from a typical PutBucket. From the api.name field of this event, we know that someone was trying to create a new bucket. When we look at the status, we see Conflict and statusCode: 409. This means that someone tried to create a bucket, but it already existed. Maybe somebody accidentally reran a script or command to create that bucket. Not necessarily a big deal on its own. But if I see thousands of these in a day, I probably have some inefficient code running somewhere, and I can use this information to investigate further.

Here are some other interesting fields we can drill down on:

deploymentid - If I am running multiple MinIO clusters, I can use this to drill down into a specific cluster.

Server - Find the version of the MinIO server that handled the request.

TTFB/TTFR - Can be used to find requests that are taking longer than expected.

Take note that even the response headers are logged, so you can tell, for example, whether custom metadata you attached to the object is present.

Configuring MinIO Logger Notifications

Knowing what users and applications are doing in your cluster is important. But what about knowing what is happening to the cluster itself? For this, we can configure Splunk as an endpoint for the MinIO logger_webhook. If you remember from our previous mc admin config get myminio output, there was also a subsystem called logger_webhook, which sends MinIO server logs to an endpoint. First, we take a look at what is configured:
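
As before, we query the subsystem with mc:

mc admin config get myminio logger_webhook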

Now, we set the configuration the same as we did with audit_webhook. Although you can use the same HEC token and index, you might want to separate them out by creating a new HEC token and index to put the data into. I prefer this approach, so you will notice that the token changes but the rest of the command remains the same. Since I created a new index, searches for these log events will be against index="minio_logs".
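
A sketch of the command, with the new HEC token (whose default index is minio_logs) left as a placeholder:

mc admin config set myminio logger_webhook endpoint="http://splunk-host:8088/services/collector/raw" auth_token="Splunk <new-hec-token>"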

Don’t forget to restart afterwards for the configuration to take effect.

Remember from before, MinIO doesn’t log anything that doesn’t need action, so we will need to create some sort of failure in order to see anything logged. To begin with, let’s take a look at how we are running the server for this test. The command I used to start the server is as follows:

minio server /tmp/splunk/{1...4}

This simulates a multi-disk setup for the server, which means erasure coding and bitrot protection are turned on. By default, MinIO erasure coding can tolerate the loss of up to n/2 disks while still providing read access to the data.

If we take a look at the backend filesystem, we see something like this:
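
For example, with the tree utility (the exact file names inside each directory will vary by MinIO version):

tree /tmp/splunk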

In short, there are four folders (“disks”) each containing some amount of the data or parity for the mytestobject we uploaded earlier. Let’s start by “removing” two of the “disks” (remember in this test environment, these aren’t real disks, just folder paths, but removing them simulates yanking the disks out of a running server).

rm -rf /tmp/splunk/{1..2}

Now our tree command shows things are in bad shape.

Half of the “disks” in our setup are empty. Let’s see what got logged in Splunk.

MinIO reports that it sees some empty disks with the field message: unformatted disk found. In a typical production setup, losing a few disks (or even a few servers) is not a big deal, as MinIO is built to handle large scale failures. But, in our tiny little test setup, we have lost half our disks, and if we lose any more we won’t have access to the data. Now that we know what we are looking for, we can configure Splunk to visually call out such an event. In Settings -> Event Types (under the “Knowledge” heading), create a New Event Type. For our test, we will set it up like this:
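
Roughly speaking (the exact names and values are up to you), an event type along these lines does the job:

Name: minio_unformatted_disk

Search string: index="minio_logs" "unformatted disk found"

Priority: 1 (Highest)

Color: red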

Now as we browse events in the minio_logs index, we will see a giant red banner letting us know that the unformatted disk event we configured is here and needs attention. In a large-scale deployment, MinIO will automatically correct the situation by launching a heal process, and I could simply wait, but here I don’t want to take the risk, so I will manually launch a recursive heal on the MinIO server.
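
The recursive heal can be launched with mc:

mc admin heal -r myminio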

Now that the heal is complete, we verify the backend data is intact.

Everything is back in order, and we can continue on with our day.

Conclusion

Splunk is increasingly ubiquitous in the enterprise. With the rise of machine data, its deployments have proliferated and it counts tens of thousands of customers at this point. Given MinIO's similar growth trajectory, it is likely that both solutions co-exist in the enterprise. This creates opportunities.

One is to use MinIO as a SmartStore endpoint, reducing Splunk costs while enabling additional scale, all without sacrificing performance. Another is to leverage Splunk's industry-leading log analysis capabilities to optimize the performance of your MinIO deployment.

Both offer the ability to make the most of your data by securing it, learning from it and extracting value from it. Give it a try on your own. If you don't have MinIO already, you can download it here. If you need a little help, check out our documentation. You can also join our public Slack channel.