AIStor Best Practices for Updates and Restarts

AJ on Best Practices

In the modern world, keeping your systems running isn’t just table stakes - it is non-negotiable. 

When it comes to software updates and what that means for your systems - well that is more complicated. On the one hand, security is the primary driver of updates today and that too is non-negotiable. Patches need to be implemented as soon as possible and across all systems to maintain the strongest security. 

The same holds true for software updates that carry important bug fixes, performance improvements and enhanced functionality. They too should be implemented in a timely fashion to leverage the improvements within. 

But what about downtime? Depending on your environment and operational processes, when security patches and software updates are applied, you may incur minor downtime due to service restarts, and you may introduce buggy code that causes major downtime until it is rolled back or updated. The puzzle of how to update and patch in a timely fashion without downtime is a tough one, and you need to make the right architectural decisions or your environment doesn’t stand a chance of being highly available.

The solution is to rely on software that guarantees sub-second restart times, even across hundreds of nodes. This is an area where we have invested heavily and, as a result, upgrading AIStor, even at scale, is non-disruptive. In this post we will outline the philosophy behind our approach and show you how to perform the non-disruptive upgrades that your high availability object storage requires.

AIStor is focused on non-disruptive restarts because we know how important your data is to you. Data is the blood of the modern enterprise. That means data storage is the heart. If it stops, the blood stops flowing - and, well, bad things happen. Since everything from applications, ETL, workflows, databases, AI/ML – even CDNs – depends on data, the data storage system must always be available – even as it consumes security and release updates. Failure to achieve non-disruptive upgrades accumulates technical debt and creates negative incentives for upgrading, thereby reducing the utility of the software and increasing risks to data privacy and ongoing business operations.

General Principles

With AIStor, like with time, stability moves in one direction. The latest AIStor release will always be the most stable. Modern CI/CD oriented developers will understand this, while old school architects may not – but we assure you it is true. AIStor releases frequently, but intentionally, and each release is carefully planned, developed and tested. We always encourage our customers and community to be on the most recent release. 

Bugs that are found are fixed and merged upstream, which is where the latest releases are made from. If a specific customer requires it, a fix can be backported on request as a patch applied to an older version. Weekly releases are always made from upstream, so our customers and community can rest assured that the latest version contains the fixes needed for the version they are currently running, regardless of how old that version is.

While everyone should be on the most recent release, we also recognize that our release cadence and your deployment cadence, coupled with the necessary internal processes, may not coincide.  In a world where everyone runs a multitude of software packages and open source frameworks, it is rarely the case that all release and deployment cadences overlap.

Your cadence shouldn’t impact your downtime – and with AIStor it does not. 

So how does AIStor handle all of this? First and foremost, the latest AIStor binary is backward compatible with the oldest version possible. If the data is written in an older format, for example using the previous metadata structure, the new version has the logic to read it and perform an in-place upgrade of that data to the new format. The migration logic from each version is baked into the code itself, so you don’t need to perform extra steps to update.

In Linux, the package manager manages the config change. But this can get cumbersome because we need to teach the package manager about every single change. It's a chain of all the changes someone has to remember and scripts to apply them. 

AIStor’s upgrade is fast, and the entire knowledge of compatibility with any old version is handled by the new version. For example, a customer was recently on version 2.0.9 and wondered whether they could update to version 4.5.2 directly, or whether they would have to do a complete reinstallation or update one version at a time. We suggested they upgrade to 4.5.2 directly and they were able to get up and running in no time. AIStor always ensures there is backward compatibility and we will never leave you high and dry.

As we’ll show in the tutorial, you simply need to download the newer version, place it in the location of the old binary and restart your service. You can jump from version B to version X seamlessly. In effect, AIStor ensures compatibility through a dependency chain: the new binary takes the previous version into account and, more importantly, applies the migrations internally one version at a time, like B->C->D….X. From the user's perspective, and perhaps more importantly from the admin’s perspective, the connected infrastructure just works. AIStor performs this migration and handles the intricacies of compatibility. The result is that there is no requirement to manually upgrade through multiple versions. That is an outdated, legacy way of approaching upgrades that harkens back to the days of running monolithic client-server applications, and it is simply not feasible when you are several versions behind, or have several hundred nodes, or both, in today’s cloud-native microservices world.
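
The idea of walking a chain of format versions one step at a time can be sketched in a few lines of shell. This is only an illustration of the concept, not AIStor’s actual migration code: the integer format versions, LATEST_FORMAT, migrate_step and migrate_to_latest are all hypothetical names made up for this example.

```shell
#!/bin/sh
# Hypothetical sketch of chained migrations (B -> C -> D ... -> X).
# Nothing here is AIStor code; versions and function names are illustrative.

LATEST_FORMAT=5   # assumed newest on-disk format known to the new binary

# migrate_step N: upgrade data from format N to format N+1.
# A real step would rewrite metadata; this one only reports what it would do.
migrate_step() {
    echo "migrating format v$1 -> v$(( $1 + 1 ))"
}

# migrate_to_latest CURRENT: apply every step between CURRENT and
# LATEST_FORMAT in order, so any old version can jump straight to the newest.
migrate_to_latest() {
    current=$1
    while [ "$current" -lt "$LATEST_FORMAT" ]; do
        migrate_step "$current"
        current=$(( current + 1 ))
    done
    echo "format is at v$current"
}
```

Calling migrate_to_latest 2 walks v2 -> v3 -> v4 -> v5 in a single run, which is the point: the operator asks for one upgrade and the chain is followed internally.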

If the requirement were to sequentially upgrade it would be like the adage (false by the way)  that it takes two years to paint the Golden Gate Bridge, but you have to paint it every two years. The point is that if you needed to sequentially upgrade version by version, you would spend every resource on upgrades and be left with nothing for improvement, optimization and customization. 

This is, in fact, what happens in older legacy systems. They stall, and then they collapse under their own weight. 

Most of us who have managed systems know the drill. Supporting and dependent systems are taken offline. Then, in an effort to minimize total downtime, each node is taken down, the update or security patch applied, the node is restarted and then the entire cycle is repeated on the next node. 

If you have 2-3 nodes this is not a big deal, but if there are 50 nodes in the cluster this could take all day (or night). Not only the number of nodes but also the amount of data on them can make the process a lot longer. Again, practitioners know that this process is notoriously difficult to track, and without accurate tracking, they run the risk of not knowing which state each node is in. The consequences of incomplete upgrades can be serious. The servers still running the older binary version would receive requests that include new functionality, yet would not understand them. If there is a cluster load balancer, then each request is randomly sent to a node, so it's difficult to control where the requests go.

The longer the update takes, the longer your ETL jobs get backed up and the longer it will take them to recover. This is the accumulated tech debt that I referred to earlier. As a former DevOps engineer, I know this first-hand. Sometimes it would take a full 24-hour cycle for the entire system to recover and stabilize. When systems restart, there will inevitably be a configuration mismatch or a settings change, caused by the nature of the update and the time needed to apply it, that causes the process to break. AIStor alleviates these issues because between the time the binary is downloaded and the service is restarted there are not many moving parts that could cause a drift.

First, let’s talk about exactly how AIStor allows you to achieve uninterrupted updates and rapid restarts. Later we’ll show you how to do this with a tutorial on upgrading and restarting AIStor across your environments. This AIStor upgrade procedure can be expanded to a 50-node cluster, with each node running an AIStor binary listening on port 9000.

Seamless, Simple, Instantaneous

When a new version of the AIStor binary is released, the actual upgrade itself is pretty straightforward. We’ll show you how to upgrade the binary as well as the Kubernetes Operator. This procedure upgrades all 50 nodes simultaneously. This saves time and incentivizes your DevOps team to upgrade the binary more frequently because the task can be easily automated. Architecturally, everything is in one binary that runs as a single process.

After the binary is in place, we will restart the AIStor service on all nodes simultaneously. Again, you do not need to do this one node at a time because they all use a single binary and data format in the backend. When the new process starts, it knows how to read the drives that old versions were managing because the migration logic is also encoded in the newer version. The restart happens very fast, sub-second in fact, because AIStor is efficient. AIStor barely uses any CPU resources, and the entire codebase is bundled into a single binary, keeping it streamlined and straightforward and eliminating the need to manage a lot of moving parts.

We recommend that you update all the nodes in parallel, at the same time. Traditionally, non-disruptive upgrades could only be achieved via rolling updates because every node takes time to update and the nodes are dependent on one another, so they must be updated one at a time. When updating an AIStor cluster, as each AIStor node is updated, the cluster continues to operate using the old version until all the nodes in the cluster are updated. This alleviates the burden of worrying about different requests seeing different versions, as all requests go to the same version until a complete changeover takes place.

By performing updates in parallel, we are doing it atomically; either all of the nodes are running the newer version of the binary or all of them are running the old version of the binary. AIStor ensures that you will not encounter the potentially devastating situation where some nodes are running the old version and some nodes are running the new version.  
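
If you want to confirm this for yourself after an upgrade, a small check along the following lines can help. It is a sketch under assumptions: versions_uniform is a hypothetical helper written for this post, and the jq filter in the comment assumes that mc admin info ALIAS --json exposes per-server endpoint and version fields under .info.servers, which you should verify against the output of your own mc release.

```shell
#!/bin/sh
# versions_uniform: read "endpoint version" pairs on stdin (one per line)
# and succeed only if every node reports the same version.
# Hypothetical helper for illustration; not part of mc or AIStor.
versions_uniform() {
    # Count the distinct values in the second column.
    distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Possible way to feed it (field names are an assumption, check your output):
#   mc admin info ALIAS --json \
#     | jq -r '.info.servers[] | "\(.endpoint) \(.version)"' \
#     | versions_uniform && echo "all nodes on one version"
```

The helper only compares whatever versions it is given, so it works equally well on a hand-written list of nodes while you are experimenting.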

The best part is that the applications using AIStor are unaware of this upgrade process because the HTTP/API calls do not even know that the server is restarting. We’re not talking about legacy applications or file shares that require nodes to inform other nodes and clients that they are restarting. AIStor maintains consistency by still taking the requests on the designated port, and, when the service restarts, it will route requests to the newer version that has come up.
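
One way to observe this from the client side is to probe the service repeatedly while the restart happens and count how many probes fail. The sketch below assumes the liveness endpoint is /minio/health/live on port 9000 and uses a placeholder hostname; count_probe_failures is a hypothetical helper, not an AIStor tool.

```shell
#!/bin/sh
# count_probe_failures N CMD...: run CMD N times and print how many runs
# exited non-zero. Hypothetical helper for illustration only.
count_probe_failures() {
    attempts=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard the probe's output; only its exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$(( failures + 1 ))
        i=$(( i + 1 ))
    done
    echo "$failures"
}

# Example probe (hostname and health path are assumptions):
#   count_probe_failures 200 \
#     curl -fsS --max-time 1 http://aistor.example.net:9000/minio/health/live
```

Start the probe loop in one terminal, trigger the upgrade in another, and compare the failure count with what you would expect from a node-by-node restart.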

You can use this method of upgrade and restart whether you run AIStor on physical servers in a datacenter or on pods running in Kubernetes. In the next part, we’ll show you a simple tutorial that teaches you how you can achieve these concepts in a real-world running infrastructure in real time.

Performing the Upgrades

Performing the actual upgrade itself is rather simple. We’ll show you two ways of upgrading: with the binary and with the Kubernetes Operator.

Binary

It is important to ensure all the nodes running in your cluster are AIStor deployed nodes and not other S3-compatible services.

To update AIStor on all the nodes run the following command

mc admin update ALIAS

This command updates all the servers in the AIStor deployment. Once updated, it also restarts the AIStor service on all the AIStor deployed nodes simultaneously. AIStor operations are atomic and strictly consistent and as such the restart process is non-disruptive to applications.
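
In practice it can be useful to capture the cluster state before and after the update. The following is a minimal sketch of such a wrapper, assuming mc is installed and ALIAS is already configured; run and upgrade_cluster are hypothetical helper names, and DRY_RUN is a convention of this example rather than an mc feature.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# Hypothetical helper so the sequence can be rehearsed without touching
# a real cluster.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upgrade_cluster ALIAS: show cluster info, update every node, show info again.
upgrade_cluster() {
    alias_name=$1
    run mc admin info "$alias_name"     # state before the update
    run mc admin update "$alias_name"   # update and restart all nodes
    run mc admin info "$alias_name"     # state after the update
}

# Rehearse first, then run for real:
#   DRY_RUN=1 upgrade_cluster ALIAS
#   upgrade_cluster ALIAS
```

With DRY_RUN=1 the function only prints the three commands it would execute, which makes it easy to review the sequence in a change request before running it against production.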

This avoids “rolling” restart of AIStor nodes and conforms to the updates-and-restart ethos of the AIStor way.

Kubernetes Operator

If you are running the Kubernetes Operator you can upgrade as follows.

Before starting the upgrade verify the status of the resources in the operator

kubectl get all -n minio-operator

Verify the version of the operator as well

kubectl get pod -l 'name=minio-operator' -n minio-operator -o json | jq '.items[0].spec.containers'

Once verified, you can either use the Krew plugin to upgrade or do it manually, which is what we’ll show here.

Download the AIStor Kubernetes plugin and use it to replace the existing one in the system path

curl -L https://github.com/minio/operator/releases/download/v4.5.8/kubectl-minio_4.5.8_linux_amd64 -o kubectl-minio

chmod +x kubectl-minio

mv kubectl-minio /usr/local/bin/

Verify the plugin has been installed in the right path

kubectl minio version

The following command is the one that actually upgrades the operator

kubectl minio init

Run the jq command again from earlier to verify the version of the newly installed operator

kubectl get pod -l 'name=minio-operator' -n minio-operator -o json | jq '.items[0].spec.containers'
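
The jq output lists the full container spec, so the version has to be read off the image field by eye. If you prefer a single value to compare, something like the sketch below could work. image_tag is a hypothetical helper written for this post; the kubectl JSONPath query in the comment reuses the label and namespace from the commands above.

```shell
#!/bin/sh
# image_tag IMAGE: print the tag of a container image reference, e.g.
# "minio/operator:v4.5.8" -> "v4.5.8". Returns non-zero if there is no tag.
# Hypothetical helper; digest references (image@sha256:...) are not handled.
image_tag() {
    # Drop everything up to the last "/" so a registry port such as
    # "registry:5000/..." is not mistaken for a tag.
    name=${1##*/}
    case $name in
        *:*) echo "${name##*:}" ;;
        *)   return 1 ;;
    esac
}

# Possible usage against the running operator pod:
#   image=$(kubectl get pod -l name=minio-operator -n minio-operator \
#             -o jsonpath='{.items[0].spec.containers[0].image}')
#   image_tag "$image"
```

Running the same two lines before and after kubectl minio init gives you a simple before/after pair to record.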

This can also be verified by logging into the Operator Console

kubectl minio proxy

Final Thoughts on Non-Disruptive Upgrades

The modern enterprise is always on and never down. Granted, in an interconnected world, the enterprise doesn’t have end to end control - just ask Amazon themselves. Having said that, self-inflicted downtime needs to be engineered out - and with AIStor’s suite of capabilities, that becomes possible. 

Don’t take our word for it, however; see for yourself. Create an AIStor cluster, then download AIStor (50 nodes might be a little high for a test :)) and execute an upgrade - you will see for yourself the power of parallelism, resilience and a binary that is small enough to reload in a second but powerful enough to power some of the biggest enterprises on the planet.

Failure to achieve a non-disruptive upgrade immediately results in the accumulation of technical debt. This has a ripple effect that adds tech debt across your entire engineering and DevOps teams. After maintenance, bringing up these systems is a whole ‘nother ball game. More often than not they have to be brought up in the proper sequence and require extensive testing to make sure they operate the same way they did before being taken offline.

You can be confident that each AIStor release operates flawlessly in your environment whether it is Development, QA, Staging or Production. We do everything we can to make sure that each new release is an improvement, and we give you the tools to quickly verify that in your environment.

In a future blog post we’ll go into more detail on our recommendations for how to set up these different environments. For instance, if you don’t have enough nodes for the different environments, then you can run multiple AIStor binaries on the same nodes but on different ports; this way you can repurpose data for production and test it against the staging environment port to ensure things are working as expected before upgrading the binary running on the production port. If you have multiple AIStor tenants, then you could upgrade the free tier first, and as you build confidence you can roll it out to other regions one at a time.

If you have any questions feel free to reach out to us on Slack!