Setting Up a Development Machine with MLflow and MinIO
About MLflow
MLflow is an open-source platform designed to manage the complete machine learning lifecycle. Databricks created it as an internal project to address challenges faced in their own machine learning development and deployment processes. MLflow was later released as an open-source project in June 2018.
As a tool for managing the complete lifecycle, MLflow contains the following components.
- MLflow Tracking - An engineer will use this feature the most. It allows experiments to be recorded and queried. It also keeps track of the code, data, configuration and results for each experiment.
- MLflow Projects - Allows experiments to be reproduced by packaging the code into a platform-agnostic format.
- MLflow Models - Deploys machine learning models to an environment where they can be served.
- MLflow Model Registry - Allows for the storage, annotation, discovery, and management of models in a central repository.
It is possible to install all these capabilities on a development machine so that engineers can experiment to their heart's content without worrying about messing up a production installation.
All the files used to install and set up MLflow can be found in our GitHub repository.
Installation Options
The MLflow documentation lists no fewer than six options for installing MLflow. This may seem like overkill, but these options accommodate different preferences for a database and varying levels of network complexity.
The option best suited for an organization with multiple teams that use large datasets and build models that may themselves get quite large is shown below. This option requires the setup of three servers: a Tracking Server, a PostgreSQL database, and an S3 object store - our implementation will use MinIO.
The Tracking Server is a single entry point from an engineer’s development machine for accessing MLflow functionality. (Don’t be fooled by its name - it contains all the components listed above: Tracking, Projects, Models and the Model Registry.) The Tracking Server uses PostgreSQL to store entities. Entities are runs, parameters, metrics, tags, notes, and metadata. (More on runs later.) The Tracking Server in our implementation accesses MinIO to store artifacts. Examples of artifacts are models, datasets and configuration files.
What is nice about the modern tooling available to engineers these days is that you can emulate a production environment - including tooling choice and network connectivity - using containers. That is what I will show in this post: how to use Docker Compose to install the servers described above as services running in containers. Additionally, the configuration of MLflow is set up such that you can use an existing instance of MinIO if you wish. In this post I will deploy a brand-new instance of MinIO, but the files in our GitHub repository include a `docker-compose` file that shows how to connect to an existing instance of MinIO.
What We Will Install
Below is a list of everything that needs to be installed. This list includes servers that will become services in containers (MinIO, PostgreSQL, and MLflow), as well as the SDKs you will need (MinIO and MLflow).
- Docker Desktop
- MLflow Tracking Server via Docker Compose
- MinIO Server via Docker Compose
- PostgreSQL via Docker Compose
- MLflow SDK via pip install
- MinIO SDK via pip install
Let’s start with Docker Desktop, which will serve as the host for our Docker Compose services.
Docker Desktop
You can find the appropriate installation for your operating system on Docker’s site. If you are installing Docker Desktop on a Mac, then you need to know which chip your Mac is using - Apple silicon or Intel. You can determine this by clicking the Apple icon in the upper left corner of your screen and selecting the “About This Mac” menu option.
We are now ready to install our services.
MLflow Server, PostgreSQL and MinIO
The MLflow Tracking Server, PostgreSQL and MinIO will be installed as services using the Docker Compose file shown below.
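The file below is a minimal sketch of such a Compose file. The image tags, service names, and the build context for the MLflow image are illustrative - the complete, tested file lives in our GitHub repository.

```yaml
version: "3.8"

services:
  db:
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${PG_USER}
      POSTGRES_PASSWORD: ${PG_PASSWORD}
      POSTGRES_DB: ${PG_DATABASE}
    volumes:
      # PostgreSQL stores its data on the local file system.
      # Delete this folder to start over with a clean installation.
      - ./db_data:/var/lib/postgresql/data

  minio:
    image: minio/minio
    restart: unless-stopped
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    # The console address is the port you will use to sign in to the MinIO UI.
    command: server /data --console-address ":9001"
    volumes:
      # MinIO also stores its data on the local file system.
      - ./minio_data:/data
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # MinIO Console

  mlflow:
    # Assumes a small Dockerfile in ./mlflow that pip-installs
    # mlflow, psycopg2-binary and boto3 on top of a Python base image.
    build: ./mlflow
    restart: unless-stopped
    depends_on:
      - db
      - minio
    environment:
      # These variables are how the Tracking Server finds MinIO.
      MLFLOW_S3_ENDPOINT_URL: ${MLFLOW_S3_ENDPOINT_URL}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    command: >
      mlflow server
      --backend-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE}
      --default-artifact-root s3://${MLFLOW_BUCKET_NAME}/
      --host 0.0.0.0
      --port 5000
    ports:
      - "5000:5000"
```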
There are a few things worth noting that will help you troubleshoot problems should something go wrong. First, both MinIO and PostgreSQL are using the local file system to store data. For PostgreSQL, this is the `db_data` folder and for MinIO, it is the `minio_data` folder. If you ever want to start over with a clean installation, then delete these folders.
Next, this Docker Compose file is configuration driven. For example, instead of hard coding the PostgreSQL database name to `mlflow`, the name comes from the `config.env` file shown below using the following syntax in the Docker Compose file - `${PG_DATABASE}`.
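A `config.env` along these lines will work with the sketch above. The values, and every variable name other than `PG_DATABASE`, are illustrative placeholders:

```
# PostgreSQL
PG_USER=mlflow_user
PG_PASSWORD=mlflow_password
PG_DATABASE=mlflow

# MinIO root credentials - used to sign in to the MinIO Console.
MINIO_ROOT_USER=minio_admin
MINIO_ROOT_PASSWORD=minio_password

# How the MLflow Tracking Server reaches MinIO. Fill in the two
# access keys after creating them in the MinIO Console.
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
MLFLOW_BUCKET_NAME=mlflow
```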
All the environment variables these services need are set up in this file, which also contains the information the services need to talk to each other. Notice that it is these environment variables that tell the MLflow Tracking Server how to access MinIO - namely the URL (including port number), the access key, the secret access key, and the bucket. This leads to my final and most important point about using Docker Compose: the first time you bring up these services, the MLflow Tracking Server will not work, because you first need to go into the MinIO UI, get your keys, and create the bucket the Tracking Server will use to store artifacts.
Let’s do this now. Start the services for the first time using the command below.
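```bash
# The --env-file flag makes the variables in config.env available for
# substitution in the Compose file. On newer Docker installations the
# command is `docker compose` (no hyphen).
docker-compose --env-file config.env up -d
```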
Make sure you run this command in the same directory where your Docker Compose file is located.
Now we can get our keys and create our bucket using the MinIO UI.
The MinIO Console
From your browser, go to `localhost:9001`. If you specified a different port for the MinIO console address in the `docker-compose` file, use that port instead. Sign in using the root user and password specified in the `config.env` file.
Once you sign in, navigate to the Access Keys tab and click the Create access key button.
This will take you to the Create Access Key page.
Your access key and secret key are not saved until you click the Create button, so do not navigate away from this page until you have done so. Don’t worry about copying the keys from this screen - once you click the Create button, you will be given the option to download the keys to your file system (in a JSON file). If you want to use the MinIO SDK to manage raw data, create another access key and secret key while you are on this page.
Next, create a bucket named `mlflow`. This is straightforward: go to the Buckets tab and click the `Create Bucket` button.
Once you have your keys and have created your bucket, you can finish setting up the services by stopping the containers, updating `config.env` with your new access keys, and then restarting the containers. The command below will stop and remove your containers.
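```bash
docker-compose down
```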
To restart:
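```bash
docker-compose --env-file config.env up -d
```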
Let’s start the MLflow UI next and make sure everything is working.
Starting the MLflow UI
Navigate to `localhost:5000`. You should see the MLflow UI.
Take some time to explore all the tabs. If you are new to MLflow, then get familiar with the concept of Runs and Experiments.
- A Run is a pass through your code that usually results in a trained model.
- Experiments are a way to tag related runs so that you can see them grouped together in the MLflow UI. For example, you may have trained several models using different parameters in an attempt to achieve the best accuracy (or performance). Tagging these runs with the same experiment name will group them together in the Experiments tab.
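To make runs and experiments concrete, below is a minimal sketch that logs a single run. It assumes the Tracking Server from our Docker Compose setup is listening on `localhost:5000`; the experiment name, parameter and metric are made-up placeholders, and the MLflow SDK itself is installed in the next section.

```python
import mlflow

# Point the SDK at the Tracking Server started with Docker Compose.
mlflow.set_tracking_uri("http://localhost:5000")

# All runs logged under this name are grouped together
# in the Experiments tab of the MLflow UI.
mlflow.set_experiment("my-first-experiment")  # hypothetical name

with mlflow.start_run():
    # A run records the parameters you trained with...
    mlflow.log_param("learning_rate", 0.01)
    # ...and the metrics the resulting model achieved.
    mlflow.log_metric("accuracy", 0.95)
```

Run this twice with different parameter values and you will see two runs grouped under the same experiment in the UI.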
Install the MLflow Python Package
The MLflow Python package is a simple `pip` install. I recommend installing it in a Python virtual environment.
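```bash
python -m venv .venv          # the environment name .venv is arbitrary
source .venv/bin/activate
pip install mlflow
```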
Double-check the installation by listing the MLflow library.
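```bash
pip list | grep mlflow
```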
You should see the `mlflow` library listed, as shown below. (The version number will reflect whatever was installed.)
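```
mlflow              2.x.y
```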
Install the MinIO Python Package
You do not need to access MinIO directly to take advantage of MLflow features - the MLflow SDK will interface with the instance of MinIO we set up above. However, you may want to interface with this instance of MinIO directly to manage data before it is given to MLflow. MinIO is a great way to store all sorts of unstructured data. MinIO makes full use of the underlying hardware, so you can save all the raw data you need without worrying about scale or performance. MinIO includes bucket replication features to keep data in multiple locations synchronized. Also, someday AI will be regulated; when this day comes, you will need MinIO’s enterprise features (object locking, versioning, encryption, and legal holds) to secure your data at rest and to make sure you do not accidentally delete something that a regulatory agency may request.
If you installed the MLflow Python package in a virtual environment, install MinIO in the same environment.
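```bash
pip install minio
```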
Double-check the installation.
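```bash
pip list | grep minio
```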
This will confirm that the `minio` library was installed and display the version you are using.
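As an example of managing raw data directly, below is a minimal sketch that connects to the MinIO instance we deployed and uploads a file. The bucket and file names are placeholders, and the keys are the second access key pair you created in the MinIO Console.

```python
from minio import Minio

# Connect to the MinIO service started by Docker Compose.
client = Minio(
    "localhost:9000",
    access_key="<your-access-key>",
    secret_key="<your-secret-key>",
    secure=False,  # the local development setup is not using TLS
)

# Create a bucket for raw data if it does not already exist.
if not client.bucket_exists("raw-data"):
    client.make_bucket("raw-data")

# Upload a local file as an object.
client.fput_object("raw-data", "training-set.csv", "./training-set.csv")
```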
Summary
This post provided an easy-to-follow recipe for setting up MLflow and MinIO on a development machine. The goal was to save you the time and effort of researching the MLflow servers and the Docker Compose configuration.
Every effort was made to keep this recipe accurate. If you run into a problem, please let us know by dropping us a line at hello@min.io or joining the discussion on our general Slack channel.
You are ready to start coding and training models with MLflow and MinIO.