AI/ML’s Sous-Chef: Why your Second Hire should be a DevOps Engineer

Hire the Right Expertise, in the Right Order. That is how we left off the previous post in this series, Hiring for AI Success: Why Your First Hire Should Be a Data Engineer, which is an excellent read and worth your time before this one.

AI/ML engineers want to focus on building and fine-tuning models without getting bogged down in Infrastructure as Code, monitoring infrastructure, development environments, or anything DevOps in general. That makes sense: DevOps work is mundane yet essential, and given the choice, who wouldn’t rather be finding a cure for cancer with AI? But in your quest to fix the world, you need a solid foundation to build on. You need a right-hand person for the endeavor, and that is why a DevOps engineer should be your second hire.

Your DevOps Engineer’s customer is your AI/ML engineering team. The DevOps Engineer is there to remove infrastructure friction so the AI/ML folks can focus on the task at hand. Any infrastructure issue is the DevOps Engineer’s responsibility, and the infrastructure should be kept in shipshape proactively, before the AI/ML team ever complains that it is too slow. A good DevOps Engineer can predict these bottlenecks weeks or months in advance and implement the appropriate solution so the AI/ML team can keep making advancements.

In this post we’ll walk through some of the reasons why a DevOps Engineer should be your second hire, right after your Data Engineer.

Monitoring

When you have your application and ETL pipelines running against a model, it’s paramount that you track various aspects of the application, specifically:

  • Runtime of jobs
  • Code regressions
  • Application logs
  • Health endpoints of services
  • Load, for scaling purposes

among various other metrics. As you deploy new services and features, there is always a chance the existing codebase regresses in ways you didn’t anticipate, which is why every aspect of the application needs to be monitored.

There are several tools out there, but the basics are CPU, memory, and disk, complemented by Application Performance Monitoring (APM). APM is more granular: it can tell you exactly which portion of your code is experiencing the issue.
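
As a starting point, the host-level basics are straightforward to collect yourself. Below is a minimal sketch using the psutil and requests libraries (in production a proper agent such as a Prometheus exporter does this better); the health endpoint URL is a hypothetical example:

```python
import psutil    # pip install psutil
import requests  # pip install requests

HEALTH_URL = "http://localhost:8000/health"  # hypothetical service endpoint

def sample_host_metrics() -> dict:
    """Collect the basic host metrics: CPU, memory, and disk."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

def check_health(url: str) -> bool:
    """Hit the service's health endpoint; anything but a 200 counts as unhealthy."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print(sample_host_metrics())
    print("healthy:", check_health(HEALTH_URL))
```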

Scaling and Disaster Recovery

Something that works on a single node for a single user might not work on thousands of nodes with thousands of users. When designing for scalability you need to account for the limitations of infrastructure: scaling up and down takes time, it’s not instantaneous, so how do you manage the load? You monitor it, as mentioned above, and scale ahead of time, before you hit a bottleneck. Often an application scales fairly well vertically by beefing up a single instance, but eventually that instance hits the limits of the node’s resources and you have to scale horizontally by adding more nodes. That changes the dynamics of how the application is accessed. For example, where are sessions stored? What about backends such as AIStor and databases, do those need to scale as well?
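
Sessions are a classic example: once a request can land on any node, session state has to move out of process memory into a shared store. Here is a minimal sketch assuming a Redis instance at localhost:6379; the key naming and TTL are illustrative assumptions:

```python
import json
import uuid

import redis  # pip install redis

# Shared session store; with this in place, any node can serve any request.
store = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 3600  # illustrative: sessions expire after an hour

def create_session(user_id: str) -> str:
    """Persist session state in Redis instead of local process memory."""
    session_id = str(uuid.uuid4())
    store.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
                json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str) -> dict | None:
    data = store.get(f"session:{session_id}")
    return json.loads(data) if data else None
```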

With monitoring in place we can establish a baseline of how resources are being used: did the disk fill up in a week, or over a few months? That understanding lets us build infrastructure that scales with actual needs without paying for resources that sit idle.
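
That baseline feeds directly into capacity planning. The arithmetic is simple; here is a sketch assuming two disk-usage samples pulled from your monitoring system (the values are made up for illustration):

```python
from datetime import datetime

# Two disk-usage samples from monitoring (illustrative values).
sample_old = (datetime(2024, 1, 1), 2.0)   # (timestamp, TiB used)
sample_new = (datetime(2024, 1, 29), 3.4)  # four weeks later

capacity_tib = 10.0  # total disk capacity

def weeks_until_full(old, new, capacity):
    """Project when the disk fills, assuming roughly linear growth."""
    (t0, used0), (t1, used1) = old, new
    weeks = (t1 - t0).days / 7
    growth_per_week = (used1 - used0) / weeks
    return (capacity - used1) / growth_per_week

print(f"~{weeks_until_full(sample_old, sample_new, capacity_tib):.1f} weeks of headroom")
```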

CI/CD Pipelines

Once infrastructure is set up, any update or change to it needs to be tested first. For a DevOps engineer, even the development infrastructure is considered production: if the dev infra is down for any reason, the AI/ML engineers cannot test, build confidence, and proceed to the production deployment stage. It’s all about building confidence through a series of steps performed in the development environment by a CI/CD pipeline, so that when the same codebase goes to production we know exactly how it will behave. This pipeline cannot be run manually; you need to automate the testing so that each time new code is committed the entire application is tested on the provided infrastructure.
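
The exact setup varies by CI system, but the gate itself can be as simple as a script the pipeline runs on every commit. A minimal sketch, where the test command and the dev health URL are assumptions for illustration:

```python
import subprocess
import sys

import requests  # pip install requests

HEALTH_URL = "http://dev.internal:8000/health"  # hypothetical dev deployment

def run_gate() -> int:
    """Fail the pipeline if the test suite fails or the dev deploy is unhealthy."""
    # 1. Run the test suite; a non-zero exit code fails the commit.
    result = subprocess.run(["pytest", "-q"])
    if result.returncode != 0:
        return result.returncode
    # 2. Smoke-check the freshly deployed dev environment.
    if requests.get(HEALTH_URL, timeout=10).status_code != 200:
        print("dev deployment unhealthy", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(run_gate())
```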

We can also set things up so the worker nodes that run CI/CD jobs scale with the load. With only a few jobs or deployments, a few static nodes are fine, but as the CI/CD infrastructure grows it doesn’t make sense to run all the nodes 24/7. During off-peak hours they can be shut down or terminated so that only the minimum needed to run a few jobs stays up.
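
On AWS, for example, this can be a small scheduled script against an Auto Scaling group of runners. The group name, capacities, and peak hours below are assumptions, and most CI systems also ship native autoscaling runners, which are usually the better choice:

```python
from datetime import datetime, timezone

import boto3  # pip install boto3; assumes AWS credentials are configured

ASG_NAME = "ci-runner-asg"  # hypothetical Auto Scaling group of CI workers
PEAK_CAPACITY, OFF_PEAK_CAPACITY = 10, 2
PEAK_HOURS_UTC = range(13, 23)  # illustrative working hours

def resize_runner_pool() -> None:
    """Keep a full runner pool during working hours, a skeleton crew otherwise."""
    hour = datetime.now(timezone.utc).hour
    desired = PEAK_CAPACITY if hour in PEAK_HOURS_UTC else OFF_PEAK_CAPACITY
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )

if __name__ == "__main__":
    resize_runner_pool()
```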

Development Environments

Environments need to mimic each other as closely as possible to get the best results. Before deploying to production, test the code in development by rolling it out across multiple nodes a couple at a time, canary style, while keeping an eye on the monitoring systems to ensure the deployment is going smoothly. As soon as the graphs show something out of the ordinary, stop, assess, and roll back to the last known-good version.
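
A minimal sketch of that canary loop, with the deploy and rollback steps stubbed out as hypothetical shell commands (./deploy.sh and ./rollback.sh stand in for whatever your tooling actually is, and the node list is illustrative):

```python
import subprocess
import time

import requests  # pip install requests

NODES = ["node1.dev", "node2.dev", "node3.dev", "node4.dev"]  # illustrative
BATCH_SIZE = 2  # a couple of nodes at a time

def node_healthy(node: str) -> bool:
    """Stand-in for the signal you'd otherwise read off the monitoring graphs."""
    try:
        return requests.get(f"http://{node}:8000/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def canary_rollout(version: str) -> bool:
    deployed = []
    for i in range(0, len(NODES), BATCH_SIZE):
        batch = NODES[i:i + BATCH_SIZE]
        subprocess.run(["./deploy.sh", version, *batch], check=True)  # placeholder
        deployed.extend(batch)
        time.sleep(60)  # let metrics settle before judging the batch
        if not all(node_healthy(n) for n in deployed):
            # Something out of the ordinary: stop, then roll everything back.
            subprocess.run(["./rollback.sh", *deployed], check=True)  # placeholder
            return False
    return True
```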

In addition to the environments where production-ready code is deployed, developers need instances where they can test CPU-heavy workloads of their own. Running those processes alongside the shared dev codebase is not ideal: results come back skewed because other users testing their own code are competing for the same resources. The proper approach is separate environments with dedicated resources, so developers can test their own code without causing any “noisy neighbor” issues. This can be as simple as running MinIO locally in Vagrant/VirtualBox on your laptop and scaling to on-prem or EC2 from there.
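
For instance, pointing dev code at a local MinIO keeps object-storage experiments entirely on your own machine. A minimal sketch using the MinIO Python SDK, assuming a local server started with the default development credentials (the credentials and bucket name are illustrative):

```python
import io

from minio import Minio  # pip install minio

# Local MinIO started with, e.g.: minio server /tmp/data
client = Minio(
    "localhost:9000",
    access_key="minioadmin",   # default dev credentials; never use in prod
    secret_key="minioadmin",
    secure=False,              # plain HTTP is fine on a laptop
)

BUCKET = "dev-scratch"  # illustrative bucket name

if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)

payload = b"hello from the dev environment"
client.put_object(BUCKET, "smoke-test.txt", io.BytesIO(payload), len(payload))
print(client.get_object(BUCKET, "smoke-test.txt").read())
```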

Infrastructure as Code

Even if you are only running a few servers, the automation that launches the infrastructure should be codified and versioned. Gone are the days of setting things up by hand; manual setup is not only cumbersome but unrepeatable. We want to write and test infrastructure the way we write and test code, building confidence that we can deploy it to different environments.

Moreover, it’s not possible for an AI/ML engineer, or even a new DevOps team member, to understand the state of the infrastructure by looking at a UI or CLI console. Those are only helpful up to a point, after which you end up SSHing into nodes to figure out the specifics. With all parts of the infrastructure codified, we can easily onboard new team members and track how the infrastructure has changed over time. Six months from now you are not going to remember the settings you used to set up a particular piece of infrastructure.
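
Tools like Terraform or Pulumi make this concrete. Here is a minimal sketch in Pulumi’s Python SDK, run via pulumi up; the AMI ID, instance size, and tags are illustrative assumptions, and the file itself becomes the versioned record of what was deployed:

```python
import pulumi
import pulumi_aws as aws  # pip install pulumi pulumi-aws

# Everything about this node lives in version control, not in someone's head.
dev_node = aws.ec2.Instance(
    "ml-dev-node",
    ami="ami-0123456789abcdef0",   # illustrative AMI ID
    instance_type="t3.large",      # illustrative size
    tags={"team": "ai-ml", "env": "dev"},
)

pulumi.export("dev_node_ip", dev_node.public_ip)
```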

The right order of expertise

A Data Engineer should be your first hire, no doubt about it. But if you want to keep that Data Engineer around and focused on models, then your second hire should be a DevOps Engineer who can take care of everything that comes with running first-class infrastructure.

As you saw above, there is a lot involved in managing, maintaining, and scaling infrastructure, and this is just a sample. There are many more things to handle, such as rotating logs, updating OS versions, and keeping packages compatible across updates, and we haven’t even talked about networking and air-gapped deployments, which are crucial to keeping your data safe.

If you have any questions or would like to talk further, be sure to reach out to us at hello@min.io.