Machine Learning (ML) initiatives can push compute and storage infrastructures to their limits. Many DataOps teams rely on a Kubernetes-based hybrid cloud architecture to satisfy compute and object storage requirements for scalability, efficiency, reliability, multi-tenancy, and support for RESTful APIs. DataOps teams have standardized on tools that rely on high-performance S3 API-compatible object storage for their pipelines, training and inference needs.
Kubeflow is the standard machine learning toolkit for Kubernetes, and it requires S3 API-compatible object storage. Kubeflow is widely used throughout the data science community, but the requirement for S3 API-compatible object storage limits deployment options. How would you run Kubeflow on Azure or GCP when their object storage offerings lack S3 API support?
MinIO Kubernetes-native object storage is S3 API compatible, so you can run your preferred data science tools on any managed Kubernetes service (Azure Kubernetes Service, Google Kubernetes Engine, Amazon Elastic Kubernetes Service) and on any Kubernetes distribution (VMware Tanzu, Red Hat OpenShift, even Minikube).
In this post, we are going to set up a Kubeflow cluster on Azure Kubernetes Service (AKS) with MinIO as the underlying storage for the whole setup. To test it end to end, we are going to deploy a pipeline that accesses its data on MinIO and stores the resulting model there as well. The problem we are going to tackle is the traditional MNIST challenge, a classic optical character recognition (OCR) task.
Setting up the Kubernetes Cluster
Let's start by setting up the AKS cluster called KubeFlowMinIO with four nodes within a resource group called MinIOKubeFlow.
This process will take a few minutes, and after that you'll have a working Kubernetes cluster ready to go. You just need to configure your local kubectl with the access for this cluster.
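The cluster creation and kubectl configuration can be sketched with the Azure CLI as follows; the region and VM size are assumptions, so adjust them to your needs:

```shell
# Create the resource group (region is an assumption -- pick one near you)
az group create --name MinIOKubeFlow --location eastus

# Create a four-node AKS cluster; the VM size is an assumption
az aks create \
  --resource-group MinIOKubeFlow \
  --name KubeFlowMinIO \
  --node-count 4 \
  --node-vm-size Standard_DS3_v2 \
  --generate-ssh-keys

# Point your local kubectl at the new cluster
az aks get-credentials --resource-group MinIOKubeFlow --name KubeFlowMinIO
```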
Setting up MinIO
The next step is to set up the MinIO Operator to manage our object storage on Azure. We've simplified the management of MinIO on Kubernetes considerably, so there are multiple ways to install the MinIO Operator, and you can choose the one that best matches your workflow. For this post we'll use MinIO's krew plugin to set up the MinIO Operator and our object storage.
Download the MinIO Krew plugin.
Then initialize the operator.
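Both steps can be sketched with krew, assuming you already have the krew plugin manager installed for kubectl:

```shell
# Refresh the krew index and install the MinIO Operator plugin
kubectl krew update
kubectl krew install minio

# Deploy the MinIO Operator into the cluster
kubectl minio init
```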
Now, let's go into the MinIO Operator UI to create our first Tenant. Enter the following command to receive a locally accessible endpoint and a token to log in.
The expected output is:
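A sketch of this step, assuming the Operator was installed into its default `minio-operator` namespace:

```shell
# Forward the Operator Console to localhost:9090 and print a login token (JWT)
kubectl minio proxy -n minio-operator
```

The output includes the local Console URL and the JWT token you'll use to log in on the next step.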
Now let's go into http://localhost:9090 and log in using the suggested token.
After logging in, we'll be greeted with an empty list of Tenants. Let's create one by clicking + Create Tenant in the top right.
To keep things simple during this setup, we are going to create a tenant called machine-learning-cluster in the default namespace of our cluster. Of course, you can change this to any namespace that suits your needs. Then we will choose a storage class; since we are aiming for a high-performance data repository, we will use Azure's Managed Premium Storage to get the best performance for our Kubeflow pipelines. After completing these fields, select Advanced. Here is where you can configure advanced features such as custom Docker registries, identity providers, encryption, and pod placement. For now, we are going to click Next until we reach the Security step, where we will turn off TLS so we can complete this guide without needing to set up a domain and an external TLS certificate.
Turn off TLS for this tenant.
Now, we will tell the MinIO Operator how large we want our tenant to be. I'm going to go with 4 nodes to match our current setup and 1 TB of capacity, but you can adjust this to whatever fits your needs.
The last step is a review of what's going to happen. Simply click Create and MinIO does the rest!
Write down the auto-generated credentials for your object storage; we will use these to access the underlying storage.
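If you dismiss the credentials dialog by accident, they can usually be recovered from the Kubernetes secret the Operator creates alongside the tenant. The secret name below is an assumption derived from the tenant name, so list the secrets first to confirm:

```shell
# Find the secret the Operator created for the tenant
kubectl get secrets -n default

# Decode the credentials (secret name is an assumption -- adjust to what you find)
kubectl get secret machine-learning-cluster-user-1 -n default \
  -o go-template='{{range $k,$v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'
```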
That's it! You've provisioned high-performance object storage, and it took just a few minutes. After another few minutes you'll see the tenant marked Initialized, and it's ready to go.
The Tenant details page is where you can update and expand your object storage. We can also see that there are public IPs for consuming and for managing the object storage. We are not going to use those in this guide, but they are what you would use to consume the object storage from outside this cluster.
We are ready to go on the object storage front: we've set up a high-performance cluster, and now we need to leverage it within our Kubeflow pipelines.
Setting up Kubeflow
To set up Kubeflow on AKS we are going to use the command line utility kfctl, which can be downloaded from the kfctl release page. There are binaries for Mac and Linux, but if you are on Windows, you'll have to compile the binary from source. Just make sure the kfctl binary is in your PATH.
Run the following commands, taken from the Kubeflow on Azure installation documentation.
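The exact commands depend on the Kubeflow version you deploy; a sketch assuming the v1.2 kfctl_k8s_istio manifest (the config URI and directory names are assumptions, so check the current docs):

```shell
# Deployment name and working directory (names are illustrative)
export KF_NAME=kubeflow-aks
export BASE_DIR=${HOME}/kf
export KF_DIR=${BASE_DIR}/${KF_NAME}

# Manifest for a vanilla Kubernetes + Istio deployment; version may differ
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml"

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
```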
This process will take about eight minutes as configured, so grab a cup of coffee and monitor the completion with the following command.
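You can watch the rollout until every pod in the kubeflow namespace reports Running:

```shell
# Watch pod status in the kubeflow namespace (Ctrl-C to stop)
kubectl get pods -n kubeflow -w
```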
Once all pods are running, we are ready to move forward with building a Kubeflow pipeline that leverages MinIO.
Open the Kubeflow dashboard by running the following port-forward command and going to http://localhost:8080.
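With a standard kfctl_k8s_istio deployment, the dashboard sits behind the Istio ingress gateway, so the port-forward looks roughly like this:

```shell
# Expose the Istio ingress gateway locally on port 8080
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```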
Then complete the Kubeflow setup by creating a machine-learning namespace.
The Kubeflow dashboard opens after we configure a namespace.
Let's set up a Jupyter notebook server and configure it from there. Using the Tensorflow 1.15 image, create a notebook called setup-pipeline.
Once the server is ready, connect to it, and then create a Python 3 notebook called Setup Pipeline.
The final step is to configure your Docker account. Kubeflow will push to Docker Hub every new model image you build throughout your pipeline, and you may hit the anonymous limit of 100 pulls per six hours pretty quickly. When you use a Docker account, the limit is raised to 200 pulls per six hours.
Running a Kubeflow Pipeline
Now back to our notebook. From here on, we will follow the excellent example for vanilla Kubernetes that the Kubeflow team provides. We'll learn how to submit models to Kubeflow for distributed training, as well as how to deploy and serve them.
You are going to need a few files for this notebook to work, mainly model.py, k8s_util.py, notebook_setup.py, requirements.txt and Dockerfile.model to build your model, submit it to Kubeflow and then deploy it. Let's start with the following snippet to download those files into our notebook.
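Fetching the files can be sketched with curl; the base URL is an assumption that the files still live under the mnist directory of the kubeflow/examples repository:

```shell
# Base URL is an assumption -- adjust to the current kubeflow/examples layout
BASE=https://raw.githubusercontent.com/kubeflow/examples/master/mnist
for f in model.py k8s_util.py notebook_setup.py requirements.txt Dockerfile.model; do
  curl -sSLO "${BASE}/${f}"
done
```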
Now, let's prepare the namespace and configure our MinIO credentials. For our endpoint we are going to use the internal Kubernetes service name minio.default.svc.cluster.local and for the DOCKER_REGISTRY we will enter our Docker username.
Next we'll prepare the local notebook by installing dependencies and downloading the required data. All of this could be done in a single block, but I used the same separate blocks as the example notebook to make it easier for you to follow.
At this point, you can go to your personal Docker registry and confirm a new Docker image for the MNIST model was created.
The next step is to create a MinIO Bucket.
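The example notebook creates the bucket from Python, but you can also do it with the mc client. The alias name and placeholder credentials below are illustrative, and the bucket name matches the one browsed later in this guide; note that the internal service name only resolves in-cluster, so from your workstation port-forward the tenant's minio service first:

```shell
# Register the tenant with mc (substitute the credentials generated earlier)
mc alias set myminio http://minio.default.svc.cluster.local <ACCESS_KEY> <SECRET_KEY>

# Create the bucket the pipeline will use
mc mb myminio/miniodev-mnist
```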
Next we build a TFJob and the Deployments needed to train our model, inspect it using TensorBoard, and finally serve it, with all the intermediate steps stored on your MinIO tenant.
Let's start walking through the blocks, keeping in mind that these are reproduced verbatim from the Kubeflow vanilla Kubernetes example.
Next we submit the job via the Kubernetes Python SDK.
Then we get the logs of the job.
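Outside the notebook, you can also tail the training logs with kubectl; the namespace and label selector below are assumptions based on this example, so adjust them to match the TFJob you submitted:

```shell
# Follow logs from the TFJob's pods (label and job name are assumptions)
kubectl logs -n machine-learning -l tf-job-name=mnist-train --follow
```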
We’re ready to check the model on MinIO. We can do this via our notebook or through the MinIO Console. First, I’m showing how to do this through a notebook.
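As an alternative to the notebook, a quick mc listing shows the same artifacts; the alias and bucket name below are the illustrative ones used earlier:

```shell
# Recursively list checkpoints and the exported model
mc ls -r myminio/miniodev-mnist
```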
Now I'll show how to do this using the Operator Console. Go into the Tenant Details in the Operator Console and click on the Console URL.
From here, log in to the MinIO Console, go into the Object Browser and explore the miniodev-mnist bucket where we can see the checkpoints and the model itself.
Let's explore how the training went. Using TensorBoard, we will create a deployment.
Now let's explore the TensorBoard by visiting http://localhost:8080/mnist/machine-learning/tensorboard/
As you can see, the training was short and uneventful, but you learned how to read training data straight from MinIO.
Lastly, let's deploy this model and play with it a little.
That's it, but the model is not yet being served. Let's deploy a sample UI so we can poke at it.
Now visit http://localhost:8080/mnist/machine-learning/ui/?ns=machine-learning and you’ll see the nice UI that interacts with the model that is being served straight from MinIO.
MinIO Enables Data Science and DataOps Everywhere
Alright! We've reached the end of this lengthy guide, which explained how to set up MinIO on Azure Kubernetes Service and then deploy Kubeflow to work with MinIO out of the box. The easiest parts were setting up the building blocks of AKS, MinIO and Kubeflow, thanks to their high degree of automation. This frees you to focus on more important tasks, such as building machine learning pipelines that run smoothly on Kubeflow, leveraging large datasets straight from MinIO, and storing and deploying your models straight from object storage as well.