Why Kubernetes-Managed Object Storage Matters
We were talking with a well-respected industry analyst the other day and he challenged us to articulate why Kubernetes is so important to Object Storage. It got us thinking that this was a topic worthy of our time, and yours.
At the most basic level, the value of Kubernetes lies in its ability to treat infrastructure as code, delivering full scale automation to both stateful and stateless components of the software stack.
To derive the maximum amount of value requires treating the maximum number of components as code and orchestrating those. That means you put EVERYTHING into the container, including applications, infrastructure and data.
In the modern world, applications are stateless and containerized. Still, their state has to be held somewhere. That somewhere is object storage (not legacy block and file) and that object storage needs to run IN the container. When done this way, Kubernetes can manage the automation of the infrastructure - both stateful and stateless.
If the object store is left to bare metal or public cloud storage services, the benefits of Kubernetes-based infrastructure orchestration are considerably diminished.
Another way to think about it is through a VMware analogy. VMware created the concept of the software-defined data center (SDDC). This was a predecessor to Kubernetes (which is why they claim it as their birthright). To get the true value of SDDC, you have to virtualize the entire data center. If some of the applications are left behind to run on bare metal, the SDDC benefits are lost.
The same is true for Kubernetes. If you only use Kubernetes for the applications, you are only tapping a fractional amount of the value. Let’s explore this a little deeper.
First off, in the modern model, CPU, network and storage are physical layers to be abstracted by Kubernetes. They have to be abstracted so that applications and data stores can run as containers anywhere. In particular, the data stores include all persistent services (databases, message queues, object stores and so on).
From the Kubernetes perspective, object stores are not different from any other key value stores or databases. The storage layer is reduced to physical or virtual drives underneath. The need to run persistent data stores as containers arises from hybrid cloud portability. Leaving essential services to external physical appliances or the public cloud takes away the benefits of Kubernetes automation.
This VMware post announcing the reason they built the Data Persistence platform (DPp) is an excellent resource. DPp is the answer to the question “how can we allow modern applications to do what they do best, but still provide the ease of use and transparent operations of the VMware platform to admins and developers?”
Modern applications, in particular, those built to run on Kubernetes, are designed to take care of availability, replication, scaling and encryption within themselves to become completely independent of the infrastructure. In turn storage needs to run IN the container in order to deliver Observability, Data Placement, Maintenance Operations, and Failure Handling.
This was not always the case. Traditionally, applications relied on databases to store and work with structured data, and storage, such as local drives or distributed file systems, to house all of their unstructured and even semi-structured data. However, the rapid rise in unstructured data challenged this model. As developers quickly learned, POSIX was too chatty, had too much overhead to allow the application to perform at scale and was confined to the data center as it was never meant to provide access across regions and continents.
This led them to object storage, which is designed for RESTful APIs (as pioneered by AWS S3). Now applications were free of any burden to handle local storage, making them effectively stateless (as the state is with the remote storage system).
Modern applications are built ground up with this expectation. Well-designed modern applications that deal with some kind of data (logs, metadata, blobs, etc), conform to the cloud-native (RESTful API) design principle by saving the state to a relevant storage system.
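To make the "save the state to a storage system" pattern concrete, here is a minimal sketch. It assumes a hypothetical in-memory class, InMemoryObjectStore, standing in for an S3-compatible object store, and its put_object/get_object method names are illustrative rather than any particular SDK's API. The worker function keeps nothing between calls, so any replica could handle the next event.

```python
import json


class InMemoryObjectStore:
    """Hypothetical stand-in for an S3-compatible object store (PUT/GET by key)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, bucket, key, data):
        # Store an immutable copy of the payload under (bucket, key).
        self._objects[(bucket, key)] = bytes(data)

    def get_object(self, bucket, key):
        # Raises KeyError when the object does not exist.
        return self._objects[(bucket, key)]


def handle_event(store, bucket, event):
    """Process one event without keeping any state in the worker itself."""
    key = f"counters/{event['user']}.json"
    try:
        state = json.loads(store.get_object(bucket, key))
    except KeyError:
        state = {"count": 0}  # first event seen for this user
    state["count"] += 1
    store.put_object(bucket, key, json.dumps(state).encode("utf-8"))
    return state["count"]


store = InMemoryObjectStore()
# Two separate "worker" calls share nothing except the object store.
first = handle_event(store, "app-state", {"user": "alice"})
second = handle_event(store, "app-state", {"user": "alice"})
print(first, second)
```

In a real deployment the stand-in would be replaced by a client pointed at the object store's endpoint. The read-modify-write shown here is deliberately naive and ignores concurrent writers.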
As a quick side note, REST APIs address only application-to-storage communication, such as PUT and GET (that is, writing and reading data) and tracking metadata and version data. They do not address container orchestration and automation. That requires Kubernetes.
SAN and NAS can also make application containers stateless, but POSIX-based file and block storage is hopelessly inflexible in a containerized environment, where application workers must grow and shrink based on inbound load, move to a new node as soon as the current node goes down, and so on. This is why object storage has replaced them as the primary storage class, as evidenced by the public cloud’s reliance on object storage (and its pricing of block and file).
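For a sense of what "grow and shrink based on inbound load" means in practice, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name, the min/max bounds and the sample numbers are illustrative assumptions, and the real autoscaler's tolerance and stabilization behavior is omitted.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the HPA-style ratio rule, clamped to assumed bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured floor and ceiling (hypothetical defaults).
    return max(min_replicas, min(max_replicas, raw))


# Load per worker doubles relative to the target: scale out.
scale_out = desired_replicas(4, current_metric=200, target_metric=100)
# Load per worker halves relative to the target: scale in.
scale_in = desired_replicas(4, current_metric=50, target_metric=100)
print(scale_out, scale_in)
```

Workers can only be added and removed this freely when they hold no local state, which is the point of pushing state to the object store.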
This is not to say that storage applications, e.g. databases, object stores, key value stores, must be stateless. On the contrary, they need to be stateful - they just shouldn’t have the effect of making the application stateful in the process.
Kubernetes-native storage applications (like MinIO) are designed to leverage the flexibility containers bring. Agile and DevOps best practices dictate that applications and CI/CD processes be simple and straightforward, independent of the underlying infrastructure and consistent in how they access it. Simply put, containers need to run the same way everywhere in order to be portable across development, test, and production. Combine that with variable hardware infrastructures, and it makes sense for Kubernetes to be the point of contact between all the disaggregated infrastructures, applications and data stores.
Therefore, storage applications cannot make assumptions about the environment in which they are deployed. For example, MinIO uses an internal erasure coding mechanism to ensure there is adequate redundancy in the system, across varying hardware and cloud infrastructures, to allow up to half of the drives to fail. MinIO also manages the data integrity and security using its own hashing and server side encryption.
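The arithmetic behind "up to half of the drives" can be sketched as follows. This assumes a generic erasure-coded layout in which each object is split across data and parity shards, one per drive, and any set of drives equal to the data shard count is enough to reconstruct it. The function names and the 16-drive, 4 TB example are hypothetical, and SHA-256 is used only as a stand-in for whichever hash the storage system applies internally.

```python
import hashlib


def erasure_profile(total_drives, parity_drives, drive_capacity_tb):
    """Failure tolerance and usable capacity for a data+parity shard layout."""
    if not 0 < parity_drives <= total_drives // 2:
        raise ValueError("parity must be between 1 and half the drives")
    data_drives = total_drives - parity_drives
    return {
        "tolerated_drive_failures": parity_drives,
        "usable_capacity_tb": data_drives * drive_capacity_tb,
        "storage_efficiency": data_drives / total_drives,
    }


def verify_object(data, expected_digest):
    """Detect silent corruption by comparing against a stored digest."""
    return hashlib.sha256(data).hexdigest() == expected_digest


# Maximum parity: half of the drives can fail, at the cost of half the raw capacity.
profile = erasure_profile(total_drives=16, parity_drives=8, drive_capacity_tb=4)
print(profile)

payload = b"object payload"
digest = hashlib.sha256(payload).hexdigest()
print(verify_object(payload, digest), verify_object(payload + b"!", digest))
```

Lower parity settings trade failure tolerance for capacity: with 4 parity drives out of 16, the same function reports 4 tolerated failures and 75% efficiency.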
No application should have to do any of that for itself anymore.
In the Kubernetes world, functions are simplified and abstracted: applications do application things and storage does storage things. The application doesn’t have to think about it - it just happens, all inside a container that can be expanded, moved or wiped out.
This is the cloud-native way.
There are certainly non-cloud-native ways. For example, you could solve this problem with the Container Storage Interface (CSI), but sophisticated architects and developers don’t because it adds needless complexity and scalability challenges. This is because CSI-based PVs bring their own management and redundancy layers, which generally compete with the stateful application’s design.
Take the following example of how cloud-native platforms work with storage and state. Apache Spark, in the cloud-native world, runs in a stateless manner on Kubernetes and ships state to other systems while Spark containers themselves are running completely stateless. Other major enterprise players in the big data analytics space like Vertica, Teradata, Greenplum are also moving to a disaggregated model of compute and storage.
Similarly, the other major analytics platforms, from Presto and TensorFlow to R and Jupyter notebooks, follow the same pattern. Offloading state to remote cloud storage systems makes your application much easier to scale and manage. Additionally, it helps keep the application portable to different environments.
MinIO has always thought of storage in this context. A majority of our workloads (523M Docker pulls as of this morning) run in containers (64%) and almost half are managed by Kubernetes (42%). That is why VMware picked us as a design partner for the launch of their Data Persistence platform (DPp). We are the standard for this type of deployment.
We continue to refine our approach. For example, our widely adopted Helm chart approach was not enough to cross the chasm from our DevOps audience to the mainstream IT administrator audience. Our previous implementation effectively dealt with a single tenant. Multi-tenancy and other DevOps tasks like provisioning, scaling, upgrades/updates, monitoring and encryption services required custom code.
Our new Kubernetes Operator helps our clients cross the chasm. Previously, building a multi-tenant, self-service object storage infrastructure on top of MinIO required significant skill and custom code development.
With the introduction of the Operator, such tasks are automated and API / Web driven. Now MinIO is a full-blown, multi-tenant, self-service cloud storage service on top of Kubernetes. The Operator and Console put the power of Kubernetes-native, object-storage-as-a-service into the hands of IT - without requiring CLI or scripting skills.
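As a rough illustration of what "API driven" multi-tenancy looks like, the sketch below builds one declarative Tenant resource per team and serializes it to JSON, which Kubernetes accepts alongside YAML. The apiVersion and the field names under spec (pools, servers, volumesPerServer, volumeClaimTemplate) are assumptions based on the Operator's Tenant custom resource as we understand it, and should be checked against the CRD version actually installed. The tenant names, namespaces and sizes are hypothetical.

```python
import json


def tenant_manifest(name, namespace, servers, volumes_per_server, volume_size):
    """Build a Tenant resource as a plain dict (field names are assumptions)."""
    return {
        "apiVersion": "minio.min.io/v2",  # assumed API group/version
        "kind": "Tenant",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pools": [{
                "servers": servers,
                "volumesPerServer": volumes_per_server,
                "volumeClaimTemplate": {
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": volume_size}},
                    }
                },
            }]
        },
    }


# One isolated tenant per team, each in its own namespace (hypothetical names).
manifests = [
    tenant_manifest("analytics", "team-analytics", 4, 4, "1Ti"),
    tenant_manifest("backups", "team-backups", 4, 2, "500Gi"),
]
rendered = [json.dumps(m, indent=2) for m in manifests]
print(rendered[0])
```

Each rendered document could then be submitted to the cluster with the usual tooling. The Operator, not the administrator, is responsible for turning that declaration into running pods and volumes.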
When we started talking about the concept of #minioeverywhere it was to illustrate our integrations with the cloud-native elite. Now, however, #minioeverywhere speaks to the fact that MinIO, in conjunction with Kubernetes, runs everywhere.
This can be lost on some given its nuance. Because of key economic and technical hurdles among the public cloud providers, it is increasingly attractive to use MinIO/Kubernetes across all infrastructures.
For example, public clouds are not interchangeable. AWS S3 is not the same as Azure Blob Storage, and it is certainly not the same as Google Cloud Storage (which is only marginally S3 compatible). Also, in the public cloud, bandwidth is more expensive than storage and latency is high. Smoothing over these differences is a very expensive proposition.
Enterprises are adopting MinIO as a core part of their software stack (applications AND storage) because they can roll it out anywhere. AWS, GCP, Azure, Tanzu, OpenShift - the list goes on. Because MinIO is Kubernetes native and runs IN the container, it works out of the box in any Kubernetes environment - from a car or 5G POP to the public cloud. That is why you find 7.7M IPs running MinIO in AWS, GCP and Azure.
All Together Now
There is a lot here so let’s summarize quickly. Kubernetes' value lies in its ability to treat infrastructure as code, delivering full scale automation to both stateful and stateless components of the software stack.
The value of Kubernetes is only achieved if you can get the maximum number of components inside the container. This includes storage/persistent data.
MinIO is built for this - it easily fits in containers (~45MB), it is designed for RESTful APIs and continues to evolve its approach (see MinIO Operator) to deliver the most native Kubernetes experience when it comes to storage.
When you are native to Kubernetes you can run anywhere it does - and today, that is everywhere you care about running - public cloud, private cloud, Kubernetes distribution and edge.
Don’t take our word for it. See for yourself. You can pull the MinIO Operator for Kubernetes code from GitHub. Questions? Join the conversation on our Slack channel, or hit the Ask an Expert button and get started today.