This is the first year since the beginning of the pandemic in which conferences have really kicked off again. DeveloperWeek Cloud 2022 came back this year with a bang in Austin, TX. The two major tracks were DevOps and DevLead.
A number of the talks at the conference delved deep into the sub-themes of observability, security and reliability. They gave me a lot of insight into how developers could incorporate MinIO at different levels of the DevOps stack to support various applications and, in the process, ensure we’ve deployed a reliable and manageable MinIO cluster.
In order for developers to improve the reliability of any system, they need to observe it first, in a way that lets them see trends over longer periods of time. Observability is the path towards reliability.
There are several methods available to observe your MinIO cluster:
- MinIO has a metrics endpoint, /minio/v2/metrics/cluster, that can be scraped by Prometheus. This scraped data can then be visualized through Grafana and displayed on radiator dashboards.
- There are several bucket notification targets that MinIO’s events can be sent to. In a past blog post we’ve shown how to send bucket notification events to Kafka, but you can also send these events to Elasticsearch so you can query or graph them in Kibana.
- You can take MinIO systemd logs and send them to Elasticsearch, which can help you correlate various log streams in a single-pane-of-glass view.
- And of course, you can always store those metrics and logs in MinIO as well, using it as a reliable backend.
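To make the first option more concrete, here is a minimal sketch of what a consumer of the metrics endpoint does with the scraped text. It assumes the endpoint returns the standard Prometheus text exposition format; the sample payload, its values, and the helper names `parse_metrics` and `offline_nodes` are made up for illustration, and in practice Prometheus does this parsing for you.

```python
from typing import Dict

# Illustrative sample only: the values are invented, and a real scrape of
# /minio/v2/metrics/cluster returns many more series than shown here.
SAMPLE_METRICS = """\
# HELP minio_cluster_nodes_online_total Total number of MinIO nodes online.
# TYPE minio_cluster_nodes_online_total gauge
minio_cluster_nodes_online_total{server="127.0.0.1:9000"} 4
# HELP minio_cluster_nodes_offline_total Total number of MinIO nodes offline.
# TYPE minio_cluster_nodes_offline_total gauge
minio_cluster_nodes_offline_total{server="127.0.0.1:9000"} 1
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each metric name (labels dropped) to its sample value.

    Simplification: assumes one sample per metric name and no trailing
    timestamps, which is enough for this sketch.
    """
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def offline_nodes(metrics: Dict[str, float]) -> int:
    """Return how many nodes are reported offline (0 if the series is absent)."""
    return int(metrics.get("minio_cluster_nodes_offline_total", 0))


metrics = parse_metrics(SAMPLE_METRICS)
if offline_nodes(metrics) > 0:
    print(f"{offline_nodes(metrics)} node(s) offline")
```

A check like `offline_nodes` is the kind of condition you would normally express as a Prometheus alerting rule rather than in application code.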
Once you can observe the cluster, define SLA/SLO/SLI targets that you can realistically hit:
- SLA: It’s the agreement you make with your end-users/customers/vendors.
- SLO: Objectives your team must meet.
- SLI: The actual numbers that you use to measure your performance.
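As a rough illustration of how the three fit together, the sketch below computes an availability SLI from request counts and compares it with an SLO. The function names, the 99.9% objective, and the request counts are all assumptions chosen for the example, not values from any MinIO deployment.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the failure rate the SLO allows: 1 - slo.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure_rate = 1.0 - slo
    observed_failure_rate = 1.0 - sli
    return 1.0 - observed_failure_rate / allowed_failure_rate


# Hypothetical numbers: 1,000,000 requests, 500 of them failed, 99.9% SLO.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

The SLA would then be the externally promised number, usually looser than the SLO so that the team has room to react before the agreement is breached.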
To integrate your app more tightly with events in MinIO, you can use OpenTelemetry to send traces, metrics, and logs and observe the interactions between them. MinIO provides SDKs for Python, Go, and Java, and so does OpenTelemetry, so you can get started quickly in the language you are familiar with.
So not only should developers observe and monitor their applications, but it's paramount that DevOps monitor the infrastructure the app runs on and correlate these events in a meaningful way.
Managing and rotating passwords is one of those painstaking tasks DevOps engineers have to deal with, and it would be nice to have a password-less experience but still have all the benefits of having authentication. In MinIO, instead of using static pre-determined credentials, your application can request temporary credentials using the Security Token Service. This way, your application does not need to store any credentials to access MinIO resources. It can simply request the credentials for a particular amount of time when it needs to access the bucket, and afterwards they expire.
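As a sketch of what such a request looks like, the snippet below builds the URL for an STS AssumeRoleWithWebIdentity call, one of the STS flows MinIO supports, where the application presents a token from an identity provider instead of a stored password. It only constructs the request and makes no network call; the endpoint, the token value, and the helper name `build_sts_request_url` are placeholders for illustration.

```python
from urllib.parse import urlencode


def build_sts_request_url(endpoint: str, web_identity_token: str,
                          duration_seconds: int = 3600) -> str:
    """Build the URL for an STS AssumeRoleWithWebIdentity request.

    The application would POST to this URL and receive a temporary access
    key, secret key, and session token that expire after duration_seconds.
    """
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "WebIdentityToken": web_identity_token,
        "DurationSeconds": str(duration_seconds),
    }
    return f"{endpoint.rstrip('/')}/?{urlencode(params)}"


# Hypothetical endpoint and token; a real token comes from your OpenID provider.
url = build_sts_request_url("https://minio.example.com:9000", "fake-jwt-token", 900)
print(url)
```

In a real application you would more likely let an SDK credential provider make this call and refresh the credentials for you, rather than building the request by hand.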
Authentication is just one piece of the puzzle; the other is authorization. Authentication allows developers to verify if the user is valid, but once the user is validated they need to determine the access rights to various components. This is where authorization comes into play.
For example, once the user is logged in, let’s say we want the team manager and anyone who reports to the manager to have access to a particular bucket. In this scenario let’s assume Alice is the manager and Bob reports to her but Joe reports to Alice’s boss. We want this bucket to be accessible by Alice and Bob but not Joe. Keep in mind that this logic can change as Alice has more reports under her or if Bob is no longer reporting to Alice and we don’t want Bob to access the bucket.
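Written directly in application code, that rule might look something like the sketch below. The `REPORTS_TO` mapping, the name Carol for Alice’s boss, and the function `can_access_bucket` are assumptions made up for this example; this is plain Python, not an OPA policy.

```python
from typing import Dict

# Hypothetical org chart: employee -> direct manager.
# "carol" stands in for Alice's boss, who is unnamed in the scenario.
REPORTS_TO: Dict[str, str] = {
    "alice": "carol",
    "bob": "alice",
    "joe": "carol",
}


def can_access_bucket(user: str, bucket_manager: str,
                      reports_to: Dict[str, str]) -> bool:
    """Allow the bucket's manager and anyone who reports directly to them."""
    return user == bucket_manager or reports_to.get(user) == bucket_manager


print(can_access_bucket("alice", "alice", REPORTS_TO))  # manager
print(can_access_bucket("bob", "alice", REPORTS_TO))    # direct report
print(can_access_bucket("joe", "alice", REPORTS_TO))    # reports elsewhere

# If Bob moves to another team, only the org data changes, not the rule.
after_move = {**REPORTS_TO, "bob": "carol"}
print(can_access_bucket("bob", "alice", after_move))
```

Even in this tiny version, the decision depends on data (the org chart) that changes independently of the code, which is what makes hard-coding such rules in every application painful.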
You can write complex logic in your application to handle these scenarios, but as the rules multiply they become unwieldy to manage. This is where Open Policy Agent (OPA) can come in handy. It has a purpose-built language called Rego that lets you define policy that is queried for each request. Rego has built-in functions that help you write policy succinctly, and OPA aims to keep decision time as close to zero as possible, meaning the time from when the authorization query is made until the result is sent back should be well under a second.
A systems engineer 20 years ago might have been dealing with one server, one monolithic app and one log file. These days a typical DevOps engineer deals with hundreds, if not thousands, of servers/VMs, 100+ microapps (or microservices) and 1,000 log streams with thousands of log lines per second. At this scale and velocity, a DevOps engineer today not only manages the infrastructure underneath but is usually also responsible for the performance of applications such as databases. For example, these days you rarely see Database Administrator roles advertised, because this function has been more or less taken over by far more capable SREs and DevOps engineers who have a direct interest in ensuring these systems are optimized.
When dealing with multiple apps and multiple clouds, developers should start thinking about microservices in a granular and reusable way. If you are using two different cloud providers and you have to use two different storage systems, that is tech debt and a maintenance burden on the engineers:
- The application developers will need to handle the different logic based on the cloud provider from within the codebase.
- DevOps engineers will have to build and maintain two different solutions with the two cloud providers, causing more friction and a learning curve for new engineers.
- Even if a system is *aaS (PaaS, DBaaS, IaaS, etc.), engineers still need to worry about slowdowns, outages, security and most importantly…costs. It is a shared responsibility to architect, design, configure, tune and optimize these systems.
This is where cloud-agnostic systems such as MinIO come in very handy. MinIO can run in a distributed setup anywhere: in the cloud, on site (physical hardware), IoT/Edge, containers and more. With a common storage backend, developers need only concentrate on writing code for S3 no matter where they are, and your DevOps engineers can use the same upgrade and maintenance policies across all the clouds. You also save costs by controlling the hardware, provider, and bandwidth based on your needs.
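One way to picture this: with an S3-compatible backend everywhere, the only thing that varies between sites is the endpoint. The sketch below assumes three made-up deployments and a hypothetical helper, `s3_client_settings`; the returned settings use keyword names accepted by `boto3.client`, but the snippet itself has no dependency on boto3.

```python
from typing import Dict

# Hypothetical MinIO endpoints, one per environment.
MINIO_ENDPOINTS: Dict[str, str] = {
    "cloud-a": "https://minio.cloud-a.example.com",
    "cloud-b": "https://minio.cloud-b.example.com",
    "on-prem": "https://minio.dc1.example.internal",
}


def s3_client_settings(site: str) -> Dict[str, str]:
    """Return S3 client settings for a site; only the endpoint differs."""
    try:
        endpoint = MINIO_ENDPOINTS[site]
    except KeyError:
        raise ValueError(f"unknown site: {site}") from None
    return {"service_name": "s3", "endpoint_url": endpoint}


# Assuming boto3 is installed, the same application code then works everywhere:
#   s3 = boto3.client(**s3_client_settings("on-prem"))
#   s3.get_object(Bucket="my-bucket", Key="my-object")
for site in MINIO_ENDPOINTS:
    print(site, s3_client_settings(site)["endpoint_url"])
```

The application logic never branches on the provider, which is the property that removes the per-cloud code paths described above.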
One of the challenges of managing disparate systems is the inability to correlate. For example, you might have your application logs going to Elasticsearch. If you are unable to fetch an object from the bucket, you might see an error about the GET request failing due to a server issue. Generally these messages are vague, and you either need someone who knows the app really well and has tribal knowledge or, like most of us, you need more information as to why you are unable to fetch the object. This is because logs are written for human troubleshooting; engineers find the root cause in logs by:
1. Detecting a problem: through some sort of stack trace or metric alert.
2. Noticing rare events: events that have never been seen before.
3. Noticing problems and rare events abnormally close in time: stack, deployment, stream dependent.
4. Constructing a narrative from problems and rare events.
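The four steps above can be approximated in code. The following is a simplified sketch, assuming log lines have already been parsed into (timestamp, level, message) tuples; the sample events, the thresholds, and the function name `rare_events_near_problems` are invented for illustration and are not how any particular log analysis tool works.

```python
from collections import Counter
from typing import List, Tuple

LogEvent = Tuple[float, str, str]  # (timestamp in seconds, level, message)


def rare_events_near_problems(events: List[LogEvent],
                              max_count: int = 1,
                              window_seconds: float = 60.0) -> List[LogEvent]:
    """Return rare non-error events that occur close in time to an error."""
    counts = Counter(message for _, _, message in events)
    # Step 1: detect problems (here, simply any ERROR-level event).
    problem_times = [ts for ts, level, _ in events if level == "ERROR"]
    narrative = []
    for ts, level, message in events:
        if level == "ERROR":
            continue
        is_rare = counts[message] <= max_count  # step 2: rarely seen message
        is_near = any(abs(ts - p) <= window_seconds for p in problem_times)  # step 3
        if is_rare and is_near:
            narrative.append((ts, level, message))
    return sorted(narrative)  # step 4: time-ordered candidates for the narrative


# Hypothetical parsed log stream.
events: List[LogEvent] = [
    (0.0, "INFO", "request served"),
    (10.0, "INFO", "request served"),
    (100.0, "INFO", "config reloaded"),
    (120.0, "ERROR", "GET object failed"),
    (130.0, "INFO", "request served"),
    (900.0, "DEBUG", "cache warmed"),
]
print(rare_events_near_problems(events))
```

Here the one-off "config reloaded" line 20 seconds before the failed GET is surfaced, while the equally rare "cache warmed" line is ignored because it is far from any problem.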
The root cause will not always show up as an error; it can sometimes be in debug and info logs as well, since the vast majority of logs are based on rare events. There are tools for “testing” these rare events, known as “Chaos Engineering” tools. One of the most common is Chaos Monkey from Netflix, but there are others, such as Gremlin, Chaos Mesh and Litmus.
All in all I’m glad I was able to attend this conference this year. I especially like the fact that most of the talks focused on basics like observability and security, which lay the foundation for much of the infrastructure and applications developers build.
I’m sure everyone has heard the term “shift left”, meaning you start thinking about a specific practice in the earlier stages of the development lifecycle rather than later. You can apply the same principle to observability. Start monitoring, observing, and tracing your applications early in the development lifecycle as you build them. This way you can ensure all the systems are secure from the get-go, and scale and reliability are built in rather than being an afterthought.
If you have any questions about monitoring, scaling or securing MinIO, reach out to us on our Slack!