Supportability as Software with MinIO SUBNET

TL;DR: MinIO applies engineering focus and minimalist thinking to supportability, which means finding and fixing issues before they become problems in your installation.

Going Beyond: Supportability as Software in MinIO

In the same way that DevOps and Cloud Native require a rethinking of old paradigms of development, MinIO’s approach to supportability is driving new thinking on support.

This approach is fundamentally about respect for the customer. Customers today, especially in the software world, are technically sophisticated, and generally know quite clearly what their problem is, if not directly how to fix it. It’s disrespectful and frustrating for everyone involved to send them to a Tier 1 tech support person who is working from a script.

Credibility and Complexity

The data store is the foundation of the modern enterprise. If it goes down, there is not just economic loss; there is reputation damage as well. While the underlying software is the building block, the supportability of that software is what enables the business to run efficiently. Furthermore, in the developer-led world, supportability is a core component of customer respect.

As MinIO continues to grow in adoption, our user base widens, and that means more complexity and variation in the environments in which MinIO is run. Often, the problems we find are not within the MinIO software, but within the context of the installation itself. Increasing supportability functionality means that no matter the complexity of your ecosystem, the same feature set is available whenever something in the flow needs to be examined, enabling either you or the SUBNET team to solve issues in minutes rather than days.

The supportability functionality through SUBNET means that a simple health check can keep your enterprise on the side of maintenance, before trouble arises.

SUBNET

MinIO’s supportability helps you scale, running potentially complex infrastructure within the guardrails of direct-to-engineer support. It’s a simple, but entirely profound, change in the way supportability is handled in the enterprise.

For MinIO’s commercial customers, SUBNET is the base of it all. According to Eco Willson, who leads the support function at MinIO:

“SUBNET combines a commercial license (important for the AGPLv3 obligations) with a unique support model that delivers 24/7/365 direct-to-engineer support through a MinIO-built portal that blends the best of Slack and Zendesk into an issue-resolving machine. There are a number of other features including security and architecture reviews, access to the Panic Button and indemnification, but the core function is to transmit our expertise to our clients for large scale data infrastructure solutions.

SUBNET is so disruptive because of our obsession with simplicity. Simplicity in our software. Simplicity in our approach (we only do one thing, object storage). These conspire to create a product that is simple to support.”

Disaster Recovery vs. Troubleshooting

Engineering for supportability rather than disaster recovery is critical within an enterprise environment, where downtime is disaster. Everyone treats backup and restore as a solution, but we should think of it only as disaster recovery. If you need to recover from a backup, you are in really bad shape. Once you are storing petabytes of data, it can take days to recover that data and confirm it is back up and running. At that point you have lost business, lost revenue, and lost credibility. It’s really important to make sure your production system is available and can be restored quickly. While it remains prudent to back up, disaster avoidance is the better strategy.

Troubleshooting

Finding the specific issue is always the central problem, especially within a dynamic enterprise-level production environment. Within that environment, things are mission critical, and the business depends on it. Supportability is not just a BI tool. If a tool goes down, it’s inconvenient. If your environment goes down, it’s a huge problem.

MinIO is a pure software solution; no hardware involved. Our supportability tools were built to help you identify where in the flow you are having the issues. Network? Drive? Server? How are you using it? How do you work with apps that don’t behave well?

Our main focus is not on simply making sure you are just running; it is on making sure you are running optimally.

Support

If it made MinIO simpler, more scalable, or more enterprise-ready, we did it and did it right. These new features make it easier to pinpoint the source of any difficulty, and therefore to resolve it more rapidly.

Some in the industry will claim that our model isn’t scalable, that you need a massive support organization as you go. Those companies don’t relentlessly pursue simplicity like we do. This scales because everything is about minimalism (simplicity) and everything is an engineering problem. As we grow, we will grow engineering, not support.

This ethos means that supportability features improve the product itself, not just our responses, and are part of MinIO’s laser focus on engineering-based improvements.

The Functions

Note: The mc support commands were designed for MinIO deployments registered with MinIO SUBNET to ensure optimal outcome of diagnostics and performance testing. MinIO does not guarantee any functionality if used against non-MinIO deployments or if used independently of MinIO engineering and support.

Accessing SUBNET

For a full description of SUBNET, its features and functionality, and how it is accessed and navigated, please refer to this piece on our website.

Portal

Home screen of SUBNET, showing 500 TiB, 20 Clusters, 118 Servers, 925 Drives, 196 Buckets, and ~24 million objects

Supportability Functions

The supportability and health check left nav of SUBNET

Health check diagnostics

Results screen from diagnostic test in SUBNET

Trace

Trace is supremely important in order to find out about traffic, and is the primary tool for reporting support issues. In general, getting a trace is always useful for better debugging. Read more in our documentation and in our GitHub repo.

Logs show

Logs show is mostly redundant to systemd/k8s/docker logs. With Logs show, however, we can ship the logs, centralize them, and then take action on them. In SUBNET, they are placed under support so that everything is in the same spot. When error logs are generated as the server runs, Logs show prints out actionable logs. Read more in our documentation.

import/export — IAM/Bucket metadata

This is brand new within the health check, and useful for backup and restore. It is targeted at disaster recovery, where you must have a copy of all bucket metadata as well as IAM user and group information. These tools fill a gap in disaster recovery: recovery covers the object data from the cluster, but all policies and groups must be extracted as well. Read more in our gist.

Profile

Profile is critically important for when the development team gets called in. It provides low-level analysis of things like CPU and memory time for what the MinIO binary is doing. Profile info is mostly useful when you know the code base, but it will typically be one of the first things we ask for when you have a performance question. CPU info says which part of the system is utilizing CPU and memory, and what things are taking memory. Goroutines checks for what is running. Scanner info provides insight into the operations of the scanner when the customer has questions regarding usage calculations, ILM operations, and healing. Profile allows us to see the general speed and how long each operation takes.

scanner info

Scanner info is useful for big clusters (PB of data in small files). With scanner info, you can determine how long it takes to scan the data, which allows the engineers to find bottlenecks and slowdowns.

ping

The mc ping command checks the liveness of the server. Ping is useful for showing what the latency is to the instances. If it is into seconds or microseconds, that shows that the issue is within the environment, rather than with MinIO itself.
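To make the latency idea concrete, here is a minimal Python sketch of a ping-style measurement. It is not how mc ping is implemented. It assumes a hypothetical deployment at localhost:9000 and uses MinIO's unauthenticated liveness endpoint, /minio/health/live; the function names are ours, chosen for illustration.

```python
import time
import urllib.request
from typing import Callable, Dict, List

# Assumption: a hypothetical deployment URL; substitute your own endpoint.
LIVE_URL = "http://localhost:9000/minio/health/live"


def http_probe(url: str = LIVE_URL, timeout: float = 5.0) -> None:
    """Issue one GET against the liveness endpoint; raises on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def measure_latency(
    probe: Callable[[], None],
    count: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run the probe `count` times; return each round trip in milliseconds."""
    samples = []
    for _ in range(count):
        start = clock()
        probe()
        samples.append((clock() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce the samples to the min / avg / max figures a ping tool reports."""
    return {
        "min": min(samples),
        "avg": sum(samples) / len(samples),
        "max": max(samples),
    }
```

Calling summarize(measure_latency(http_probe)) against a live endpoint gives a quick baseline. Consistently high numbers from a client that sits close to the servers point at the network or host rather than at MinIO.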

ready

Ready is a command that checks the cluster's health status, so you know whether the cluster is ready for incoming traffic. It hits the MinIO server's health check endpoints to find out whether the cluster has enough quorum (the consensus needed to successfully write and read data from the disks) to serve S3 API requests.

mc ready has a special flag, mc ready <alias> --cluster-read, which checks to see if the cluster has enough READ quorum to serve HTTP GETs.
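The difference between read and write quorum can be sketched with a few lines of Python. The rule encoded here is our reading of MinIO's erasure coding documentation, and should be treated as an assumption rather than a specification: reads need as many drives as there are data shards, and writes need the same number, plus one when parity is exactly half of the erasure set. The function names are hypothetical.

```python
from typing import Tuple


def quorum(drives: int, parity: int) -> Tuple[int, int]:
    """Return (read_quorum, write_quorum) for one erasure set.

    Assumed rule: read quorum equals the number of data shards; write quorum
    is the same, plus one when data and parity counts are equal.
    """
    if drives <= 0 or not 0 <= parity <= drives // 2:
        raise ValueError("parity must be between 0 and half the drive count")
    data = drives - parity
    read_q = data
    write_q = data + 1 if data == parity else data
    return read_q, write_q


def can_serve(online: int, drives: int, parity: int) -> Tuple[bool, bool]:
    """Return (can_read, can_write) given how many drives are online."""
    read_q, write_q = quorum(drives, parity)
    return online >= read_q, online >= write_q
```

For a 16-drive set with parity 8, can_serve(8, 16, 8) returns (True, False): the set can still answer GETs but not accept writes, which is the situation the --cluster-read flag is meant to detect.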

If the cluster is not ready, mc ready will check again every five seconds until the cluster is ready, and then return the result.

For example,

➜  mc git:(master) ./mc ready myminio 
The cluster is not ready
The cluster is not ready
The cluster is not ready
The cluster is not ready
The cluster is not ready
The cluster is not ready
The cluster is not ready
The cluster is ready

This command is useful for checking whether the cluster is ready and can serve S3 requests. If the cluster reports "not ready", there may be network or disk failures that have caused quorum to be lost.
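The polling behavior described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the mc ready source: it assumes a hypothetical base URL supplied by the caller and uses MinIO's cluster health endpoints, /minio/health/cluster and /minio/health/cluster/read, treating an HTTP 200 as ready and anything else as not ready. The function names and the max_attempts parameter are ours.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def cluster_health_url(base: str, read_only: bool = False) -> str:
    """Build the health URL; read_only selects the read-quorum check."""
    path = "/minio/health/cluster/read" if read_only else "/minio/health/cluster"
    return base.rstrip("/") + path


def http_check(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers 200; errors and non-2xx count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    check: Callable[[], bool],
    interval: float = 5.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    report: Callable[[str], None] = print,
) -> bool:
    """Poll `check` every `interval` seconds until it succeeds.

    max_attempts is an addition for scripting use; None polls indefinitely.
    """
    attempts = 0
    while True:
        attempts += 1
        if check():
            report("The cluster is ready")
            return True
        report("The cluster is not ready")
        if max_attempts is not None and attempts >= max_attempts:
            return False
        sleep(interval)
```

A deployment script could call wait_until_ready(lambda: http_check(cluster_health_url("http://localhost:9000"))) before starting any workload that writes to the cluster.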

Immediate purge

Immediate purge is an enhancement to force delete objects immediately. Before this feature, object deletions would move the entries to .minio.sys/tmp/trash and then the trash would be emptied every five minutes through a background thread.

This enhancement purges the objects immediately without moving them to any temporary trash directory. The feature applies only to the DeleteBucket and DeleteObject APIs, not to DeleteObjects. When the x-minio-force-delete header is passed to the DeleteBucket or DeleteObject API, the object(s) are purged immediately instead of waiting for a background clean-up thread to kick in and clear the trash. This immediate purge means that the drive space is reclaimed immediately.
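The effect on drive space is easiest to see in a toy model. The Python sketch below is purely illustrative: ToyDrive is a hypothetical in-memory stand-in, not MinIO's on-disk layout, and the header value "true" is an assumption on our part. It contrasts the trash-then-sweep path with the immediate purge path.

```python
from typing import Dict, Optional

FORCE_DELETE_HEADER = "x-minio-force-delete"


class ToyDrive:
    """In-memory stand-in for a drive, with a trash area swept in the background."""

    def __init__(self) -> None:
        self.objects: Dict[str, int] = {}  # key -> size in bytes
        self.trash: Dict[str, int] = {}    # deleted but not yet reclaimed

    def put(self, key: str, size: int) -> None:
        self.objects[key] = size

    def used_bytes(self) -> int:
        """Space counts as used until the trash is actually emptied."""
        return sum(self.objects.values()) + sum(self.trash.values())

    def delete(self, key: str, headers: Optional[Dict[str, str]] = None) -> None:
        size = self.objects.pop(key)
        lowered = {k.lower(): v for k, v in (headers or {}).items()}
        # Assumption: the header carries the value "true" to request a purge.
        if lowered.get(FORCE_DELETE_HEADER) == "true":
            return  # purged immediately; space is reclaimed now
        self.trash[key] = size  # default path: wait for the background sweep

    def empty_trash(self) -> None:
        """Stands in for the periodic background clean-up."""
        self.trash.clear()
```

In this model, a regular delete leaves used_bytes() unchanged until empty_trash() runs, while a delete carrying the header lowers it right away.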

Inspect

On the backend, each file is stored as erasure-coded pieces. If you wanted to reconstruct those, you could not unless you were a MinIO developer. Inspect lets you get info about the files from all the servers. Read more in our documentation.

Top

Top is being extended, but is only used occasionally. The scanner does the accounting processes for MinIO. Top determines if it needs another tier, or if it has been healed recently. It is not like Linux top, which gives you system info. Read more in our documentation.

Perf

Perf is one of the most important commands when deploying a cluster. If you’re having issues in the ecosystem before you deploy, the deployment won’t go well. Perf is the tool that enables you to technically assess the environment you deploy into. For example, here are some realistic numbers on an 8-server, 80-drive setup:

Support(ed)

Greater supportability not only creates a better environment in which to operationally support our commercial customers, it is the result of a strong commitment to respecting customer time and resources.

To go deeper, download MinIO and see for yourself or spin up a marketplace instance on any public cloud. Do you have questions? Ask away on Slack or via hello@min.io.