A Closer Look: MinIO Observability

Observability is all about gathering information (traces, logs, metrics) with the goal of improving performance, reliability, and availability. Seldom does just one of these pinpoint the root cause of an event. More often than not, it is only when we correlate this information into a narrative that we get a clear picture of what happened.

From the start, MinIO has focused not only on performance and scalability but also on observability. MinIO has a built-in endpoint, /minio/v2/metrics/cluster, that Prometheus can scrape to gather metrics. You can also publish events to Kafka to trigger alerts and other processes that depend on actions performed within MinIO.
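
To make the scrape step concrete, here is a minimal Python sketch of reading the text format that a Prometheus-style endpoint such as /minio/v2/metrics/cluster returns. It parses an inline sample rather than calling a live cluster. The sample text, its metric names and values, and the parse_metrics helper are illustrative assumptions, not output captured from MinIO, so check your own deployment for the exact metric names.

```python
import re

# Illustrative sample in the Prometheus text exposition format.
# Metric names and values are assumptions, not captured MinIO output.
SAMPLE = """\
# HELP minio_cluster_nodes_online_total Total number of MinIO nodes online.
# TYPE minio_cluster_nodes_online_total gauge
minio_cluster_nodes_online_total 4
# TYPE minio_s3_requests_total counter
minio_s3_requests_total{api="getobject",server="node1:9000"} 1200
minio_s3_requests_errors_total{api="getobject",server="node1:9000"} 25
"""

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# key="value" pairs; escape sequences inside values are left as-is
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple sketch cannot parse
        name, label_blob, value = match.groups()
        labels = dict(LABEL_RE.findall(label_blob or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In a real deployment, Prometheus does this parsing for you on every scrape. A helper like this is mostly useful for a quick ad hoc look at what the endpoint exposes.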

In the last blog post, we introduced observability from a ten-thousand-foot view. In this post, we’ll dive deeper into each of the observability features and see how to use them to get production-grade monitoring that is ready to roll out of the box.

Overview

View the cluster’s overall state, including total disk used, erasure code settings, and drive settings among others.

Data

Drill down into specific disk pools to see drives that could be in the healing process.

System

View overall cluster metrics for CPU, memory, disks, and network.

API

A large number of S3 calls are made against the cluster. It is prudent to monitor them for failures and elevated latency, since either could point to a larger issue elsewhere in the system.
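
One way to put this into practice is to turn raw request and error counters into an error ratio per API call. The sketch below does that for two consecutive snapshots. The snapshot layout, the API names, and the error_ratios helper are assumptions made for illustration; in a Prometheus setup you would normally express the same idea as a query over the scraped counters.

```python
def error_ratios(prev, curr):
    """Compute the error ratio per API between two counter snapshots.

    prev and curr map an API name to (requests_total, errors_total).
    This layout is hypothetical and chosen only for the example.
    """
    ratios = {}
    for api, (req_now, err_now) in curr.items():
        req_before, err_before = prev.get(api, (0, 0))
        delta_req = req_now - req_before
        delta_err = err_now - err_before
        if delta_req < 0 or delta_err < 0:
            # Counters went backwards, e.g. after a server restart:
            # treat the current values as the whole delta.
            delta_req, delta_err = req_now, err_now
        ratios[api] = delta_err / delta_req if delta_req > 0 else 0.0
    return ratios


previous = {"GetObject": (1000, 5), "PutObject": (400, 0)}
current = {"GetObject": (1200, 25), "PutObject": (400, 0)}

for api, ratio in error_ratios(previous, current).items():
    print(f"{api}: {ratio:.1%} of requests failed in this interval")
```

Working with deltas rather than lifetime totals matters here: a counter that has accumulated millions of successful requests would otherwise hide a recent burst of failures.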

Replication

When replication is enabled, you can track replication-related statuses such as the number of objects remaining to replicate and the replication speed.

ILM

In the past, we’ve talked about the different tiers MinIO can be used for with Integrated Lifecycle Management (ILM). Now you can monitor its progress in detail.

Healing Metrics

If any disk has a failure or data is corrupted, MinIO automatically starts the healing process. This can be monitored in detail.

Scanner

As objects are scanned for various operations, those metrics are displayed here.

Monitoring is Key

Observability is multifaceted; you will often have to examine a combination of traces, metrics, and logs to determine the root cause. You can use chaos engineering tools such as Gremlin and Chaos Monkey, along with our very own MinIO Warp and the like, to stress or break your system and observe the patterns in the metrics.

For example, perhaps you are collecting HTTP response statuses and generally see 200s, but suddenly you see a spike of 500s. You take a look at the logs and notice that a deployment took place recently or a database went down for maintenance. Or, if you are monitoring object storage performance metrics, you can correlate any server-side issues with this data. It's these types of events that often cause the most pain, and having visibility in these cases is paramount.
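
The "spike of 500s" case can be sketched as a simple comparison against a recent baseline. The Python example below flags intervals where the 5xx count jumps well above the average of the preceding intervals. The find_spikes helper, its default thresholds, and the sample counts are all hypothetical; a production setup would more likely rely on alerting rules in the monitoring stack.

```python
from statistics import mean


def find_spikes(error_counts, window=5, factor=3.0, min_errors=10):
    """Return indexes of intervals whose 5xx count spikes above baseline.

    error_counts: number of 5xx responses per interval, oldest first.
    An interval is flagged when it has at least min_errors errors and
    exceeds factor times the mean of the previous `window` intervals.
    """
    spikes = []
    for i in range(window, len(error_counts)):
        baseline = mean(error_counts[i - window:i])
        current = error_counts[i]
        # max(baseline, 1.0) avoids flagging tiny counts on a zero baseline
        if current >= min_errors and current > factor * max(baseline, 1.0):
            spikes.append(i)
    return spikes


# Hypothetical per-minute 5xx counts around a bad deployment
counts = [0, 1, 0, 2, 1, 0, 1, 40, 38, 2]
print(find_spikes(counts))  # → [7, 8]
```

The flagged indexes give you the time window to look up in the logs and traces, which is where the correlation described above happens.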

If you have any questions about AIStor, be sure to reach out to us by sending an email to hello@min.io.