Setting up a Development Machine with Kubeflow Pipelines 2.0 and MinIO

Keith Pijanowski on AI/ML, 2 June 2023

Engineers like to play and learn locally. It does not matter which tool is under investigation: a high-end storage solution, a workflow orchestration engine, or the latest thing in distributed computing. The best way to learn a new technology is to find a way to cram it all on a single machine so that you can put your hands on everything.

Kubeflow Pipelines is a core component of the full distribution of Kubeflow. You can install the full distribution of Kubeflow or the standalone installation containing just Kubeflow Pipelines. In this post, I’ll show how to set up a development machine with the standalone installation of Kubeflow Pipelines (KFP) and a standalone installation of MinIO. KFP and MinIO are better together. Using KFP, you can build and run pipelines for acquiring data and training models. As you build pipelines for your data and train models, you will need a storage solution. This is where MinIO can help.

MinIO is a great way to store your ML data and models. Using MinIO, you can save training sets, validation sets, test sets, and models without worrying about scale or performance. Also, someday AI will be regulated; when this day comes, you will need MinIO’s enterprise features (object locking, versioning, encryption, and legal holds) to secure your data at rest and to make sure you do not accidentally delete something that a regulatory agency may request. We could have tried to use KFP’s own instance of MinIO; however, this is not the best design for an ML data pipeline. You will want a storage solution that is totally under your control. Below is a diagram of our Kubeflow and MinIO deployments that illustrates the purpose of each MinIO instance.

What We Will Install

Below is a list of everything that needs to be installed. This list includes core components (MinIO and KFP), as well as dependencies and SDKs. It is my hope that this post serves as a recipe that can be followed exactly to configure a KFP Pipeline development machine. If any of these instructions do not work, then please let us know.

Docker Desktop
Kubernetes
kubectl (the Kubernetes command line tool)
Kubeflow Pipeline Resources
KFP SDK
MinIO
MinIO Access Key and Secret Key
MinIO SDK

Docker Desktop

You can find the appropriate installation for your operating system on Docker’s site, located here. If you are installing Docker Desktop on a Mac, then you need to know the chip your Mac is using — Apple or Intel. You can determine this by clicking the Apple icon in the upper left corner of your Mac and clicking the “About This Mac” menu option.

Kubernetes

Kubeflow runs on Kubernetes – consequently, you will need a running Kubernetes cluster. Also, you must be familiar with the Kubernetes command line tool to install and manage Kubeflow. The fastest way to get both Kubernetes and its command line tool is to enable the Kubernetes capabilities that come with Docker Desktop. To do this, start the Docker Desktop application, and in the upper right corner, click on the “Settings” icon.

This will take you to the Docker Desktop’s settings page, as shown below.

Click on the Kubernetes tab on the left, and you should see the Kubernetes setup page.

Click the Enable Kubernetes check box to start a Kubernetes cluster on your machine. Once you click this check box, it will take a few minutes for Docker Desktop to get a cluster ready for you - so go and get a cup of coffee if you wish.

If at any point you wish to remove all the deployments you have installed in your Kubernetes cluster, then click the “Reset Kubernetes Cluster” button. This will remove all resources and give you a brand-new cluster. You will do this often when you are experimenting with prerelease software.

kubectl

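Enabling Kubernetes also installs the Kubernetes command line tool (`kubectl`) for you. Type the following command in a terminal window to make sure `kubectl` is working. (Recent `kubectl` releases have removed the `--short` flag; if yours rejects it, plain `kubectl version` should print the same information.)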

kubectl version --short

You should see output similar to what is shown below.

Client Version: v1.25.9
Kustomize Version: v4.5.7
Server Version: v1.25.9

Once Kubernetes is installed and the `kubectl` command line tool works, you can install Kubeflow Pipelines.

Kubeflow Pipelines

Setting up Kubeflow Pipelines takes four simple steps. First, we need to specify the version of KFP we would like to install. We will set the environment variable below, which will be used by the subsequent `kubectl apply` commands. These instructions are for Kubeflow Pipelines 2.0.0. You can check the latest version here.

export PIPELINE_VERSION=2.0.0

KFP prefers cluster-scoped resources to be installed separately from namespace-scoped resources. Depending on the environment, cluster-scoped resources may need the admin role. Namespace-scoped resources can be deployed by individual teams managing a namespace. The command below installs the cluster-scoped resources.

kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"

You should see output indicating the creation of various resources. It is omitted here for brevity, but you should scan it and ensure no errors occurred.

The next command waits until the `applications.app.k8s.io` custom resource definition created by the previous command is established, or until the 60-second timeout expires.

kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io

Keep running the wait command until you get a message indicating success, as shown below.

customresourcedefinition.apiextensions.k8s.io/applications.app.k8s.io condition met
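
If you would rather not re-run the wait command by hand, the retry can be scripted. The sketch below is one way to do it in Python, assuming `kubectl` is on your PATH. The function name `wait_for_crd`, the attempt count, and the pause between attempts are illustrative choices and not part of KFP.

```python
import subprocess
import time

# The same wait command shown above, split into an argument list.
WAIT_CMD = [
    "kubectl", "wait", "--for", "condition=established",
    "--timeout=60s", "crd/applications.app.k8s.io",
]


def wait_for_crd(attempts=5, pause_seconds=5.0, run=subprocess.run):
    """Re-run the wait command until it succeeds or the attempts run out.

    `run` is injectable so the loop can be exercised without a cluster.
    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        result = run(WAIT_CMD, capture_output=True, text=True)
        if result.returncode == 0:
            print(result.stdout.strip())
            return True
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        if attempt < attempts:
            time.sleep(pause_seconds)
    return False
```

Calling `wait_for_crd()` returns `True` as soon as `kubectl` reports that the condition is met, so a setup script can stop early if it returns `False`.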

The apply command for namespace-scoped resources is below. It will also show output as resources are created in your cluster. Make sure there are no errors.

kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"

Check all the pods that our two `kubectl` apply commands created by running the command below.

kubectl get pods --all-namespaces

NAMESPACE     NAME                                               READY   STATUS    RESTARTS       AGE
kube-system   coredns-565d847f94-df8xr                           1/1     Running   0              24h
kube-system   coredns-565d847f94-zrvmm                           1/1     Running   0              24h
kube-system   etcd-docker-desktop                                1/1     Running   1              24h
kube-system   kube-apiserver-docker-desktop                      1/1     Running   1              24h
kube-system   kube-controller-manager-docker-desktop             1/1     Running   1              24h
kube-system   kube-proxy-c6bmz                                   1/1     Running   0              24h
kube-system   kube-scheduler-docker-desktop                      1/1     Running   1              24h
kube-system   storage-provisioner                                1/1     Running   0              24h
kube-system   vpnkit-controller                                  1/1     Running   36 (12m ago)   24h
kubeflow      cache-deployer-deployment-8667bd7cc4-pq5nx         1/1     Running   0              24h
kubeflow      cache-server-69558cdf5b-mjbd9                      1/1     Running   0              24h
kubeflow      controller-manager-86bf69dc54-j6qzw                0/1     Running   1              24h
kubeflow      metadata-envoy-deployment-596cbdf475-z224k         1/1     Running   0              24h
kubeflow      metadata-grpc-deployment-784b8b5fb4-5bhcl          1/1     Running   1 (23h ago)    24h
kubeflow      metadata-writer-84474967b6-67c2k                   1/1     Running   0              24h
kubeflow      minio-65dff76b66-n9l54                             1/1     Running   0              24h
kubeflow      ml-pipeline-7d8868b6b5-jqt2s                       1/1     Running   0              24h
kubeflow      ml-pipeline-persistenceagent-55f55fc7bc-5j2sd      1/1     Running   0              24h
kubeflow      ml-pipeline-scheduledworkflow-7469b7c4b7-4ccdl     1/1     Running   0              24h
kubeflow      ml-pipeline-ui-74cc4f9f89-zpb22                    1/1     Running   0              24h
kubeflow      ml-pipeline-viewer-crd-57fc94f5fd-b94nn            1/1     Running   0              24h
kubeflow      ml-pipeline-visualizationserver-7587fb49f8-tl6ht   1/1     Running   0              24h
kubeflow      mysql-c999c6c8-vvb7j                               1/1     Running   0              24h
kubeflow      proxy-agent-5c9b879c-7tg7z                         0/1     Running   3              24h
kubeflow      workflow-controller-6c85bc4f95-f9czz               1/1     Running   0              24h

If you check the pods right after installing KFP, you will notice that many are still starting. Wait until all pods are running before moving on to the next section. Once you start creating pipelines, run this `get pods` command while a pipeline runs. You will see how KFP creates pods based on the tasks in your pipeline.
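
Scanning a long pod listing by eye is error-prone, so here is a small Python sketch that filters `kubectl get pods --all-namespaces` output down to the pods that are not fully ready. The helper names are hypothetical, and the parsing assumes the default column layout shown above (NAMESPACE, NAME, READY, STATUS, ...). Note that it compares the READY counts as well as the STATUS column, so a pod listed as `0/1 Running` is reported too.

```python
import subprocess


def pods_not_ready(get_pods_output):
    """Return (namespace, name, ready, status) for pods that are not fully ready."""
    pending = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status = fields[:4]
        if status == "Completed":
            # Finished one-shot pods are not expected to be ready.
            continue
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            pending.append((namespace, name, ready, status))
    return pending


def current_pods_output():
    """Run the same command as above and return its text output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With a cluster running, `pods_not_ready(current_pods_output())` returns an empty list once every pod reports all of its containers ready.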

Starting the KFP UI

To use the KFP UI on a local machine, we must forward an unused port on our local machine to port 80 of KFP’s UI service. This is done using kubectl’s port-forward command. This command will not return. You need to keep it running until you are done using the KFP UI.

kubectl port-forward svc/ml-pipeline-ui -n kubeflow 8080:80

Navigate to localhost:8080. You should see the Kubeflow Pipelines home page.
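
Because the port-forward command blocks, scripts that need the UI or API for a short time may prefer to start and stop it programmatically. The following is a minimal sketch of that idea, assuming `kubectl` is on your PATH; the context manager name `kfp_ui_port_forward` and the `settle_seconds` delay are my own illustrative choices.

```python
import contextlib
import subprocess
import time


@contextlib.contextmanager
def kfp_ui_port_forward(local_port=8080, settle_seconds=2.0, popen=subprocess.Popen):
    """Run the port-forward in the background for the duration of a `with` block."""
    cmd = [
        "kubectl", "port-forward", "svc/ml-pipeline-ui",
        "-n", "kubeflow", f"{local_port}:80",
    ]
    process = popen(cmd)
    try:
        # Give kubectl a moment to bind the local port (a rough guess, not a guarantee).
        time.sleep(settle_seconds)
        yield process
    finally:
        # Stop forwarding when the block exits, even if it raised an error.
        process.terminate()
        process.wait()
```

Inside `with kfp_ui_port_forward():` the UI is reachable at localhost:8080, and the forward is shut down when the block ends.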

Take some time to explore all the tabs. If you are new to Kubeflow, then get familiar with Pipelines, Runs, and Experiments. A detailed description of these three concepts is beyond the scope of this post, but here is the short story:

Pipelines are the descriptions you create in code. Pipelines are analogous to classes in object-oriented programming.
A Run is an instance of a pipeline, much like an object is an instance of a class.
Experiments are a way to tag related runs so that you can see them grouped together in the KFP UI. For example, you may have multiple runs of a pipeline as you iron out the kinks. Tagging these runs with the same experiment name will group them together in the Experiments tab.
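
To make the three concepts concrete, here is a hedged sketch using the KFP SDK that is installed later in this post. It assumes the port forward above is running and that KFP is reachable at http://localhost:8080. The component, the pipeline name `hello-pipeline`, the experiment name `dev-machine-check`, and the helper functions are all hypothetical. One pipeline is defined, two runs are created from it, and both runs are tagged with the same experiment.

```python
def build_pipeline():
    """Define a one-step pipeline (the 'class' in the analogy above).

    kfp is imported here so this module can be loaded before the SDK is installed.
    """
    from kfp import dsl

    @dsl.component
    def say_hello(name: str) -> str:
        return f"Hello, {name}!"

    @dsl.pipeline(name="hello-pipeline")
    def hello_pipeline(recipient: str = "MinIO"):
        say_hello(name=recipient)

    return hello_pipeline


def submit_run(client, pipeline_func, experiment_name, recipient):
    """Create one run (an 'instance') and tag it with an experiment name."""
    return client.create_run_from_pipeline_func(
        pipeline_func,
        arguments={"recipient": recipient},
        experiment_name=experiment_name,
    )


def main():
    import kfp

    client = kfp.Client(host="http://localhost:8080")  # assumed port-forward address
    pipeline = build_pipeline()
    # Two runs of the same pipeline, grouped under one experiment.
    for recipient in ("MinIO", "Kubeflow"):
        result = submit_run(client, pipeline, "dev-machine-check", recipient)
        print(result.run_id)
```

Calling `main()` from a script file should leave two runs under a single experiment in the Experiments tab.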

Install MinIO

I like to use Docker Compose to install MinIO as the configuration is in a YAML file, and the command is simple. Below is the Docker Compose YAML. Name this file `docker-compose.yml`.

version: '3'
services:
  minio:
    image: quay.io/minio/minio
    volumes:
      - ./data:/data
    ports:
      - 9000:9000
      - 9001:9001
    environment:
      MINIO_ROOT_USER: 'minio_user'
      MINIO_ROOT_PASSWORD: 'minio_password'
      MINIO_ADDRESS: ':9000'
      MINIO_STORAGE_USE_HTTPS: 'False'
      MINIO_CONSOLE_ADDRESS: ':9001'
    command: minio server /data

Run the following command in the same directory as the `docker-compose.yml` file.

docker-compose up -d

This installs MinIO in a Docker container outside of the Kubernetes cluster. If you do not want to use Docker Compose to install MinIO, then this document will show you how to install MinIO using the Docker command line.
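
Before moving on, it can help to confirm that the container is actually serving requests. The sketch below polls MinIO's liveness endpoint (`/minio/health/live`) using only the Python standard library. It assumes the API port is 9000 without TLS, as in the compose file above; the function name `minio_is_live` is hypothetical.

```python
import urllib.error
import urllib.request


def minio_is_live(endpoint="http://localhost:9000", timeout=3.0,
                  urlopen=urllib.request.urlopen):
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    url = endpoint.rstrip("/") + "/minio/health/live"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

If `minio_is_live()` returns `False`, `docker-compose logs minio` is a reasonable place to start looking.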

MinIO Access Key and Secret Key

To use the MinIO SDK, you will need a new access key and secret key. You can get these keys in the MinIO UI. From your browser, go to localhost:9001. If you specified a different port for the MinIO console address in the docker-compose file, use that port instead.

Once you sign in, navigate to the Access Keys tab and click the Create access key button.

This will take you to the Create Access Key page.

Your access key and secret key are not saved until you click the Create button. So do not navigate away from this page until this is done. Don’t worry about copying the keys from this screen. Once you click the Create button, you will be given an option to download the keys to your file system (in a JSON file).
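
Rather than pasting keys into source code, you can read them from the downloaded JSON file. This is a minimal sketch under an assumption worth checking against your own download: that the file stores the keys under the field names `accessKey` and `secretKey`. The function name `load_minio_keys` is hypothetical.

```python
import json
from pathlib import Path


def load_minio_keys(path):
    """Read the access key and secret key from the downloaded credentials file.

    Assumes the JSON fields are named 'accessKey' and 'secretKey'.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in ("accessKey", "secretKey") if key not in data]
    if missing:
        raise ValueError(f"{path} is missing expected fields: {missing}")
    return data["accessKey"], data["secretKey"]
```

Keep the file out of version control, since anyone holding these keys can act on your MinIO instance.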

You are now ready to start using the MinIO SDK. In the next two sections we will install both the KFP SDK and the MinIO SDK.

Install the KFP Python package

The KFP Python package is a simple `pip` install. I recommend installing it in a Python virtual environment, especially when you are testing prerelease versions of KFP. You can check PyPI for the latest version of the KFP package. The command below installs the latest stable release.

pip install kfp

KFP 2.0 is now generally available, so the `--pre` switch is only needed if you want to try a prerelease version.

Double check the installation by listing out the KFP libraries.

pip list | grep kfp

You should see the three libraries below.

kfp 2.0.1
kfp-pipeline-spec 0.2.2
kfp-server-api 2.0.0
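
On systems without `grep`, the same check can be done from Python. The sketch below uses the standard library's `importlib.metadata` and should be run inside the environment where you installed KFP; the function names are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("kfp", "kfp-pipeline-spec", "kfp-server-api")):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}
```

Printing `report()` shows at a glance whether any of the three libraries is missing from the active environment.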

Install the MinIO Python Package

If you installed the KFP Python package in a virtual environment, install the MinIO Python package in the same environment.

pip install minio

Double check the installation.

pip list | grep minio

This will confirm that the MinIO library was installed and display the version you are using.

minio 7.1.15
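
As a quick end-to-end check of the SDK, the sketch below uploads a small text object to the MinIO instance started earlier. It assumes the server from the Docker Compose file is listening on localhost:9000 without TLS and that you substitute the access key and secret key you created. The bucket name `dev-machine-check`, the object name, and the helper names are hypothetical.

```python
import io


def make_client(access_key, secret_key, endpoint="localhost:9000"):
    """Create a MinIO client; the import is local so this module loads without the SDK."""
    from minio import Minio

    # secure=False because the local compose setup does not use TLS.
    return Minio(endpoint, access_key=access_key, secret_key=secret_key, secure=False)


def upload_text(client, bucket, object_name, text):
    """Create the bucket if needed, upload `text`, and return the byte count."""
    data = text.encode("utf-8")
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
    client.put_object(
        bucket, object_name, io.BytesIO(data),
        length=len(data), content_type="text/plain",
    )
    return len(data)


def main():
    # Placeholders: replace with the keys created in the MinIO console.
    client = make_client("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
    size = upload_text(client, "dev-machine-check", "hello.txt", "hello from KFP dev")
    print(f"uploaded {size} bytes")
```

After calling `main()`, the new bucket and object should be visible in the MinIO console at localhost:9001.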

Summary

This post provided an easy-to-follow recipe for creating a development machine with Kubeflow Pipelines 2.0 and MinIO. The goal was to save you the time and effort of researching Kubeflow dependencies, installation commands, and SDK setup for the new version.

Every effort was made to keep this recipe accurate. If you run into a problem, please let us know by dropping us a line at hello@min.io or joining the discussion on our general Slack channel.

You are ready to start coding and building pipelines with Kubeflow and MinIO. Check out Building an ML Data Pipeline with MinIO and Kubeflow v2.0 where we use Kubeflow and MinIO to build a data pipeline.