Setting up a Development Machine with MLflow and MinIO

Keith Pijanowski on AI/ML, 21 July 2023

About MLflow

MLflow is an open-source platform designed to manage the complete machine learning lifecycle. Databricks created it as an internal project to address challenges faced in their own machine learning development and deployment processes. MLflow was later released as an open-source project in June 2018.

As a tool for managing the complete lifecycle, MLflow contains the following components.

MLflow Tracking - An engineer will use this feature the most. It allows experiments to be recorded and queried. It also keeps track of the code, data, configuration and results for each experiment.
MLflow Projects - Allows experiments to be reproduced by packaging the code into a platform-agnostic format.
MLflow Models - Deploys machine learning models to an environment where they can be served.
MLflow Model Registry - Allows for the storage, annotation, discovery, and management of models in a central repository.

It is possible to install all these capabilities on a development machine so that engineers can experiment to their heart's content without worrying about messing up a production installation.

All the files used to install and set up MLflow can be found in our GitHub repository.

Installation Options

The MLflow documentation lists no fewer than six options for installing MLflow. This may seem like overkill, but these options accommodate different preferences for a database and varying levels of network complexity.

The option best suited to an organization with multiple teams that use large datasets and build models that may themselves get quite large is the one this post sets up. It requires three servers - a Tracking server, a PostgreSQL database, and an S3 object store, for which our implementation will use MinIO.

The Tracking server is a single entry point from an engineer’s development machine for accessing MLflow functionality. (Don’t be fooled by its name - it contains all the components listed above: Tracking, Projects, Models and the Model Registry.) The Tracking server uses PostgreSQL to store entities. Entities are runs, parameters, metrics, tags, notes, and metadata. (More on runs later.) The Tracking server in our implementation accesses MinIO to store artifacts. Examples of artifacts are models, datasets and configuration files.

What is nice about the modern tooling available to engineers these days is that you can emulate a production environment - including tooling choice and network connectivity - using containers. That is what I will show in this post. I will show how to use Docker Compose to install the servers described above as services running in containers. Additionally, the configuration of MLflow is set up such that you can use an existing instance of MinIO if you wish. In this post, I will show how to deploy a brand new instance of MinIO, but our GitHub repository also has a `docker-compose` file that shows how to connect to an existing instance of MinIO.

What We Will Install

Below is a list of everything that needs to be installed. This list includes servers that will become services in containers (MinIO, PostgreSQL, and MLflow), as well as the SDKs you will need (MinIO and MLflow).

Docker Desktop
MLflow Tracking Server via Docker Compose
MinIO Server via Docker Compose
PostgreSQL via Docker Compose
MLflow SDK via pip install
MinIO SDK via pip install

Let’s start with Docker Desktop, which will serve as the host for our Docker Compose services.

Docker Desktop

You can find the appropriate installation for your operating system on Docker’s site. If you are installing Docker Desktop on a Mac, then you need to know the chip your Mac is using — Apple or Intel. You can determine this by clicking the Apple icon in the upper left corner of your Mac and clicking the “About This Mac” menu option.

We are now ready to install our services.

MLflow Server, Postgres and MinIO

The MLflow Tracking Server, PostgreSQL and MinIO will be installed as services using the Docker Compose file shown below.

version: "3.3"

services:
  db:
    restart: always
    image: postgres
    container_name: mlflow_db
    expose:
      - "${PG_PORT}"
    networks:
      - backend
    environment:
      - POSTGRES_USER=${PG_USER}
      - POSTGRES_PASSWORD=${PG_PASSWORD}
      - POSTGRES_DB=${PG_DATABASE}
    volumes:
      - ./db_data:/var/lib/postgresql/data/
    healthcheck:
      test: ["CMD", "pg_isready", "-p", "${PG_PORT}", "-U", "${PG_USER}"]
      interval: 5s
      timeout: 5s
      retries: 3
  s3:
    restart: always
    image: minio/minio
    container_name: mlflow_minio
    volumes:
      - ./minio_data:/data
    ports:
      - "${MINIO_PORT}:9000"
      - "${MINIO_CONSOLE_PORT}:9001"
    networks:
      - frontend
      - backend
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD}
      - MINIO_ADDRESS=${MINIO_ADDRESS}
      - MINIO_PORT=${MINIO_PORT}
      - MINIO_STORAGE_USE_HTTPS=${MINIO_STORAGE_USE_HTTPS}
      - MINIO_CONSOLE_ADDRESS=${MINIO_CONSOLE_ADDRESS}
    command: server /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  tracking_server:
    restart: always
    build: ./mlflow
    image: mlflow_server
    container_name: mlflow_server
    depends_on:
      - db
    ports:
      - "${MLFLOW_PORT}:5000"
    networks:
      - frontend
      - backend
    environment:
      - AWS_ACCESS_KEY_ID=${MINIO_ACCESS_KEY}
      - AWS_SECRET_ACCESS_KEY=${MINIO_SECRET_ACCESS_KEY}
      - MLFLOW_S3_ENDPOINT_URL=http://s3:${MINIO_PORT}
      - MLFLOW_S3_IGNORE_TLS=true
    command: >
      mlflow server
      --backend-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:${PG_PORT}/${PG_DATABASE}
      --host 0.0.0.0
      --serve-artifacts
      --artifacts-destination s3://${MLFLOW_BUCKET_NAME}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:${MLFLOW_PORT}/"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  db_data:
  minio_data:

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

There are a few things worth noting that will help you troubleshoot problems should something go wrong. First, both MinIO and PostgreSQL are using the local file system to store data. For PostgreSQL, this is the `db_data` folder and for MinIO, it is the `minio_data` folder. If you ever want to start over with a clean installation, then delete these folders.

Next, this Docker Compose file is configuration driven. For example, instead of hard-coding the PostgreSQL database name as `mlflow`, the Docker Compose file uses the placeholder `${PG_DATABASE}`, and the value comes from the `config.env` file shown below.

# Postgres configuration
PG_USER=mlflow
PG_PASSWORD=mlflow
PG_DATABASE=mlflow
PG_PORT=5432

# MLflow configuration
MLFLOW_PORT=5000
MLFLOW_BUCKET_NAME=mlflow

# MinIO access keys - these are needed by MLflow
MINIO_ACCESS_KEY=XeAMQQjZY2pTcXWfxh4H
MINIO_SECRET_ACCESS_KEY=wyJ30G38aC2UcyaFjVj2dmXs1bITYkJBcx0FtljZ

# MinIO configuration
MINIO_ROOT_USER=minio_user
MINIO_ROOT_PASSWORD=minio_pwd
MINIO_ADDRESS=:9000
MINIO_STORAGE_USE_HTTPS=False
MINIO_CONSOLE_ADDRESS=:9001
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
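
To make the substitution concrete, here is a minimal Python sketch of how a `${VAR}` placeholder could be resolved from `KEY=VALUE` pairs. It only approximates what Docker Compose does (Compose supports more syntax than this), and the `parse_env` helper is a hypothetical name introduced for illustration.

```python
from string import Template


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# A subset of the config.env values used in this post.
CONFIG_ENV = """
# Postgres configuration
PG_USER=mlflow
PG_PASSWORD=mlflow
PG_DATABASE=mlflow
PG_PORT=5432
"""

# The same placeholder string that the tracking_server service passes to MLflow.
BACKEND_URI = Template(
    "postgresql://${PG_USER}:${PG_PASSWORD}@db:${PG_PORT}/${PG_DATABASE}"
)

resolved = BACKEND_URI.substitute(parse_env(CONFIG_ENV))
print(resolved)
```

Changing `PG_DATABASE` in the parsed values changes the resolved URI without touching the template, which is the property the Docker Compose file relies on.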

All environment variables that these services need are set in `config.env`. This file also contains the information the services need to talk to each other. Notice that environment variables are what tell the MLflow Tracking server how to access MinIO: the URL (including port number), the access key, the secret access key, and the bucket. This leads me to my final and most important point about using Docker Compose - the first time you bring up these services, the MLflow Tracking server will not work, because you first need to go into the MinIO UI, get your keys, and create the bucket that the Tracking server will use to store artifacts.

Let’s do this now. Start the services for the first time using the command below.

docker-compose --env-file config.env up -d --build

Make sure you run this command in the same directory where your Docker Compose file is located.
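
If you would like to confirm from Python that the containers are answering, a sketch along the following lines may help. It polls the same URLs that the health checks in the Docker Compose file use, and it assumes the default ports from `config.env` (9000 for MinIO, 5000 for MLflow). The `is_up`, `wait_for` and `report` helpers are hypothetical names, not part of either product.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_for(url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll a URL until it responds or the attempts run out."""
    for _ in range(attempts):
        if is_up(url):
            return True
        time.sleep(delay)
    return False


# Assumed defaults: MINIO_PORT=9000 and MLFLOW_PORT=5000 from config.env.
ENDPOINTS = {
    "MinIO": "http://localhost:9000/minio/health/live",
    "MLflow": "http://localhost:5000/",
}


def report() -> dict[str, bool]:
    """Check each service once and return a name -> reachable mapping."""
    return {name: is_up(url) for name, url in ENDPOINTS.items()}


for name, reachable in report().items():
    print(f"{name}: {'up' if reachable else 'not reachable yet'}")
```

Containers can take a few seconds to start, so `wait_for(ENDPOINTS["MinIO"])` may be a better fit than a single check right after `docker-compose up`.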

Now we can get our keys and create our bucket using the MinIO UI.

The MinIO Console

From your browser, go to `localhost:9001`. If you set a different `MINIO_CONSOLE_PORT` in `config.env`, use that port instead. Sign in using the root user and password specified in the `config.env` file.

Once you sign in, navigate to the Access Keys tab and click the Create access key button.

This will take you to the Create Access Key page.

Your access key and secret key are not saved until you click the Create button. Do not navigate away from this page until this is done. Don’t worry about copying the keys from this screen. Once you click the Create button, you will be given the option to download the keys to your file system (in a JSON file). If you want to use the MinIO SDK to manage raw data then create another access key and secret key while you are on this page.

Next, create a bucket named `mlflow`. This is straightforward: go into the Buckets tab and click the `Create Bucket` button.

Once you have your keys and you have created your bucket, you can finish setting up the services by stopping the containers, updating `MINIO_ACCESS_KEY` and `MINIO_SECRET_ACCESS_KEY` in `config.env` with the new keys, and then restarting the containers. The command below will stop and remove your containers.

docker-compose down

To restart:

docker-compose --env-file config.env up -d --build
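
The `config.env` update that happens between the two commands above can also be scripted. The sketch below assumes the JSON file downloaded from the MinIO Console contains `accessKey` and `secretKey` fields; check your downloaded file before relying on it. The `apply_keys` and `update_config` helpers are hypothetical names.

```python
import json
from pathlib import Path


def apply_keys(env_text: str, access_key: str, secret_key: str) -> str:
    """Return env_text with the two MinIO key lines replaced."""
    updates = {
        "MINIO_ACCESS_KEY": access_key,
        "MINIO_SECRET_ACCESS_KEY": secret_key,
    }
    lines = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in updates:
            line = f"{key}={updates.pop(key)}"
        lines.append(line)
    # Append any key that was not already present in the file.
    lines.extend(f"{key}={value}" for key, value in updates.items())
    return "\n".join(lines) + "\n"


def update_config(credentials_path: str, env_path: str = "config.env") -> None:
    """Copy the keys from the downloaded JSON file into config.env."""
    # Assumption: the JSON file has "accessKey" and "secretKey" fields.
    creds = json.loads(Path(credentials_path).read_text())
    env_file = Path(env_path)
    env_file.write_text(
        apply_keys(env_file.read_text(), creds["accessKey"], creds["secretKey"])
    )
```

Calling `update_config("credentials.json")` from the directory that holds `config.env` rewrites only the two key lines and leaves everything else untouched.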

Let’s start the MLflow UI next and make sure everything is working.

Starting the MLflow UI

Navigate to `localhost:5000`. You should see the MLflow UI.

Take some time to explore all the tabs. If you are new to MLflow, then get familiar with the concept of Runs and Experiments.

A Run is a pass through your code that usually results in a trained model.
Experiments are a way to tag related runs so that you can see them grouped together in the MLflow UI. For example, you may have trained several models using different parameters in an attempt to achieve the best accuracy (or performance). Tagging these runs with the same experiment name will group them together in the Experiments tab.

Install the MLflow Python Package

The MLflow Python package is a simple `pip` install. I recommend installing it in a Python virtual environment.

pip install mlflow

Double check the installation by listing the MLflow library.

pip list | grep mlflow

You should see the library below.

mlflow 2.5.0
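
With the package installed, one way to exercise the whole stack is to log a few runs under a single experiment. The sketch below is a smoke test only: the experiment name, the learning rates and the metric value are hypothetical placeholders, and the tracking URI assumes `MLFLOW_PORT=5000`. Parameters and metrics should land in PostgreSQL, while the small text artifact should land in the `mlflow` bucket in MinIO.

```python
# Hypothetical names and values, chosen only for this smoke test.
TRACKING_URI = "http://localhost:5000"  # assumes MLFLOW_PORT=5000
EXPERIMENT = "dev-machine-smoke-test"
LEARNING_RATES = [0.1, 0.01, 0.001]


def run_name(learning_rate: float) -> str:
    """Build a readable run name from the parameter being varied."""
    return f"lr-{learning_rate}"


def log_smoke_test_runs() -> list[str]:
    """Log one run per learning rate and return the run IDs."""
    import mlflow  # imported lazily; requires `pip install mlflow`

    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_experiment(EXPERIMENT)  # groups the runs below together

    run_ids = []
    for lr in LEARNING_RATES:
        with mlflow.start_run(run_name=run_name(lr)) as run:
            mlflow.log_param("learning_rate", lr)       # entity
            mlflow.log_metric("accuracy", 0.0)          # placeholder value
            mlflow.log_text("smoke test", "notes.txt")  # artifact
            run_ids.append(run.info.run_id)
    return run_ids
```

After calling `log_smoke_test_runs()`, the three runs should appear grouped under the experiment in the MLflow UI, and the `mlflow` bucket in the MinIO Console should contain a `notes.txt` object for each run.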

Install the MinIO Python Package

You do not need to access MinIO directly to take advantage of MLflow features - the MLflow SDK will interface with the instance of MinIO we set up above. However, you may want to interface with this instance of MinIO directly to manage data before it is given to MLflow. MinIO is a great way to store all sorts of unstructured data. MinIO makes full use of underlying hardware, so you can save all the raw data you need without worrying about scale or performance. MinIO includes bucket replication features to keep data in multiple locations synchronized. Also, someday AI will be regulated; when this day comes, you will need MinIO’s enterprise features (object locking, versioning, encryption, and legal holds) to secure your data at rest and to make sure you do not accidentally delete something that a regulatory agency may request.

If you installed the MLflow Python package in a virtual environment, install MinIO in the same environment.

pip install minio

Double check the installation.

pip list | grep minio

This will confirm that the MinIO library was installed and display the version you are using.

minio 7.1.15
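
As a sketch of managing raw data directly, the code below creates a bucket if it is missing and uploads a small text object with the MinIO SDK. The bucket name `raw-data`, the object name and the helper functions are hypothetical; the endpoint assumes `MINIO_PORT=9000`, and the keys are the second access key pair created in the MinIO Console.

```python
import io

ENDPOINT = "localhost:9000"  # assumes MINIO_PORT=9000
BUCKET = "raw-data"          # hypothetical bucket name


def ensure_bucket(client, name: str) -> bool:
    """Create the bucket if it is missing; return True if it was created."""
    if client.bucket_exists(name):
        return False
    client.make_bucket(name)
    return True


def upload_text(client, bucket: str, object_name: str, text: str) -> None:
    """Upload a string as a UTF-8 encoded object."""
    data = text.encode("utf-8")
    client.put_object(bucket, object_name, io.BytesIO(data), length=len(data))


def main(access_key: str, secret_key: str) -> None:
    """Connect to the local MinIO instance and store one sample object."""
    from minio import Minio  # imported lazily; requires `pip install minio`

    # secure=False because the development instance is served over HTTP.
    client = Minio(
        ENDPOINT, access_key=access_key, secret_key=secret_key, secure=False
    )
    ensure_bucket(client, BUCKET)
    upload_text(client, BUCKET, "hello.txt", "hello from the MinIO SDK")
```

Call `main("<your access key>", "<your secret key>")` with your own keys; the new bucket and object should then be visible in the Buckets tab of the MinIO Console.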

Summary

This post provided an easy-to-follow recipe for setting up MLflow and MinIO on a development machine. The goal was to save you the time and effort of researching the MLflow servers and Docker Compose configuration.

Every effort was made to keep this recipe accurate. If you run into a problem, please let us know by dropping us a line at hello@min.io or joining the discussion on our general Slack channel.

You are ready to start coding and training models with MLflow and MinIO.
