Over the past few months, I have written about a number of different technologies (Ray Data, Ray Train, and MLflow). I thought it would make sense to pull them all together and deliver an easy-to-understand recipe for distributed data preprocessing and distributed training, using a production-ready MLOps tool for tracking and model serving. This post integrates the code I presented in my Ray Train post, which distributes training across a cluster of workers, with a deployment of MLflow that uses MinIO under the hood for artifact storage and model checkpoints. While my code trains a model on the MNIST dataset, the code is mostly boilerplate: replace the MNIST model with your model, replace the MNIST data access and preprocessing with your own, and you are ready to start training. A fully functioning sample containing all the code presented in this post can be found here.
The diagram below visualizes how distributed training, distributed preprocessing, and MLflow fit together. It is the diagram I presented in my Ray Train post, with MLflow added. It represents a solid foundation for all your AI initiatives: MinIO for high-speed object storage, Ray for distributed training and data processing, and MLflow for MLOps.
Let’s start by revisiting the setup code I introduced for Ray Train and adding the MLflow setup to it.
Setting Up MLflow for Distributed Training
The code below is the setup for distributed training with the MLflow setup code added; I have highlighted the additions necessary for MLflow. At the top of the function, MLflow is configured and a run is started. (I’ll explain the additions to the training configuration parameter in the next section.) When a run is complete, you need to let MLflow know - this is done at the bottom of the function. If you are new to MLflow Tracking, check out my post on MLflow Tracking with MinIO. You may also want to read Setting Up a Development Machine with MLflow and MinIO if you want to install MLflow on your development machine.
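A minimal sketch of such a driver function is shown below. The Tracking Server URL, experiment name, and the `train_func_per_worker` placeholder are illustrative assumptions, not values from the original listing; substitute your own.

```python
def train_func_per_worker(config: dict):
    # Placeholder: your per-worker training loop goes here.
    ...


def distributed_training(training_parameters: dict,
                         num_workers: int, use_gpu: bool):
    # Imports are local so the sketch stays self-contained.
    import mlflow
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # MLflow setup: point the client at the Tracking Server and start a run.
    mlflow.set_tracking_uri('http://localhost:5001')  # illustrative URL
    mlflow.set_experiment('mnist-ray-train')          # illustrative name
    active_run = mlflow.start_run()

    # Give the workers what they need to log to this run
    # (explained in the next section).
    training_parameters['mlflow_url'] = mlflow.get_tracking_uri()
    training_parameters['run_id'] = active_run.info.run_id

    trainer = TorchTrainer(
        train_func_per_worker,
        train_loop_config=training_parameters,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    result = trainer.fit()

    # Let MLflow know the run is complete.
    mlflow.end_run()
    return result
```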
The Problem with Tracking Distributed Experiments
The problem with using the MLflow Python library with distributed training is that its functions rely on a run ID that is maintained internally - the run ID itself is not a parameter to functions like log_metric() or log_metrics(). So, when the Ray Train workers start, they will not have the run ID that was created when the controlling process started a run, since the workers run in different processes. This problem is easy to fix: simply pass the run ID to the worker processes as part of the training configuration. However, that alone does not solve the problem with the MLflow library. Fortunately, MLflow has a REST API that accepts a run ID as a parameter in every call; the calls also require the base URL of the MLflow Tracking Server. Below is a function that wraps the MLflow REST API for logging a metric. Check out the MLflow REST API samples for functions that wrap other MLflow features.
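One way such a wrapper could look, using only the Python standard library, is sketched below. The endpoint path and field names follow MLflow's documented `runs/log-metric` REST endpoint; the payload is built in a separate helper so it can be inspected independently of any network call.

```python
import json
import time
from urllib.request import Request, urlopen


def build_log_metric_payload(run_id: str, key: str, value: float,
                             step: int = 0) -> dict:
    """Build the JSON body for MLflow's runs/log-metric REST endpoint."""
    return {
        'run_id': run_id,
        'key': key,
        'value': value,
        'timestamp': int(time.time() * 1000),  # milliseconds since the epoch
        'step': step,
    }


def log_metric(mlflow_url: str, run_id: str, key: str, value: float,
               step: int = 0) -> int:
    """POST a metric to the MLflow Tracking Server; returns the HTTP status.

    Because the run ID is an explicit parameter, this can be called from
    any Ray Train worker process.
    """
    url = f'{mlflow_url}/api/2.0/mlflow/runs/log-metric'
    payload = build_log_metric_payload(run_id, key, value, step)
    request = Request(url,
                      data=json.dumps(payload).encode('utf-8'),
                      headers={'Content-Type': 'application/json'},
                      method='POST')
    with urlopen(request) as response:
        return response.status
```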
MLflow’s base URL and the run ID can be added to the training configuration variable using the snippet below. (The training configuration variable is a Python dictionary; it is the only parameter that can be passed to the worker functions.)
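A sketch of that snippet follows. The hyperparameter entries and the URL are illustrative; in practice the URL comes from your Tracking Server deployment and the run ID from the ActiveRun object returned by mlflow.start_run() (run.info.run_id).

```python
# Existing training configuration (hypothetical hyperparameters).
training_parameters = {
    'batch_size_per_worker': 32,
    'epochs': 5,
}

# Add MLflow's base URL and the run ID so every worker can log
# to the same run via the REST API.
training_parameters['mlflow_url'] = 'http://localhost:5001'
training_parameters['run_id'] = 'replace-with-run.info.run_id'
```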
We now have a way to send MLflow information to the distributed workers, and we have a function that can make RESTful calls to MLflow’s Tracking Server. The next step is to use this function from within the distributed workers' training loop.
Adding Experiment Tracking to Ray Train Workers
Adding tracking to the function that will run within the processes of the remote worker requires minimal code. The complete function is shown below with the added lines of code highlighted.
There are a few things to know about this code. First, I am only logging metrics from one of the workers. Each worker has its own copy of the model being trained, but the copies are synchronized, so it is unnecessary to log metrics from every worker - doing so would only produce redundant information in MLflow. Second, this code still uses Ray Train reporting for metrics and checkpointing. If you wish, you can transition all reporting and checkpointing to MLflow.
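To make the rank check concrete, here is a small, hypothetical helper that captures the pattern. In the real worker function, the rank comes from ray.train.get_context().get_world_rank(), and log_fn would be the REST wrapper that logs a metric; taking both as parameters keeps the sketch free of Ray dependencies.

```python
def maybe_log_metric(config: dict, world_rank: int, epoch: int,
                     loss: float, log_fn) -> bool:
    """Log the epoch loss to MLflow from the rank-0 worker only.

    `log_fn` has the signature of the REST wrapper:
    (mlflow_url, run_id, key, value, step).
    Returns True if the metric was logged.
    """
    if world_rank != 0:
        # The other workers hold synchronized copies of the model,
        # so logging from them would be redundant.
        return False
    log_fn(config['mlflow_url'], config['run_id'],
           'training_loss', loss, step=epoch)
    return True
```

In the worker's epoch loop this would be called once per epoch, after the loss for the epoch has been computed and before ray.train.report().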
In this post, I showed how to add MLflow Tracking to a machine learning pipeline that uses distributed training and distributed preprocessing. If you want to learn more about what you can do with MinIO, Ray Data, Ray Train, and MLflow, check out the following related posts.
Incorporating these technologies into your ML pipeline is the first step toward building a complete AI Infrastructure. You will have:
- MinIO - A high-performance Data Lake
- Ray Data - Distributed preprocessing
- Ray Train - Distributed Training
- MLflow - MLOps
As a next step, consider adding a Modern Datalake to your infrastructure.