AI/ML Reproducibility with lakeFS and MinIO

This post was written in collaboration with Amit Kesarwani from lakeFS.

The reality of running multiple machine learning experiments is that managing them can become unpredictable and complicated, especially in a team environment. During the research process, teams constantly change configuration and data between experiments: they may try several training sets, several hyperparameter values, and, when large datasets are involved, different configurations of distributed compute engines such as Apache Spark.

Part of the ML engineer’s work requires going back and forth between these experiments for comparison and optimization. When engineers manage all these experiments manually, they are less productive.

How can engineers run ML experiments confidently and efficiently? 

In this article, we dive into reproducibility to show you why it’s worth your time and how to achieve it with lakeFS and MinIO.

Why data practitioners need reproducibility

What is reproducibility?

Reproducibility ensures that teams can repeat experiments using the same procedures and get the same results. It’s the foundation of the scientific method and, therefore, a handy approach in ML. 

In the context of data, reproducibility means that you have everything needed to recreate the model and its results, such as data, tools, libraries, frameworks, programming languages, and operating systems. That way, you can produce identical results.

Why do data teams need reproducibility?

ML processes aren’t linear. Engineers usually experiment with various ML methods and parameters iteratively and incrementally to arrive at a more accurate ML model. Because of the iterative nature of development, one of the most difficult challenges in ML is ensuring that work is repeatable. For example, training an ML model meant to detect cancer should return the same model if all inputs and systems used are the same.

Additionally, reproducibility is a key ingredient for regulatory compliance, auditing, and validation. It also increases team productivity, improves collaboration with nontechnical stakeholders, and promotes transparency and confidence in ML products and services.

As stated previously, the ML pipeline can get complicated. You must manage code, data sets, models, hyperparameters, pipelines, third-party packages, and environment-specific configurations. Repeating an experiment accurately is challenging. You need to recreate the exact conditions used to generate the model. 

Benefits of reproducible data

Consistency

Given the same data, you want a model to deliver the same outcome. This is how you establish confidence in your data products. If repeating the experiment yields the same result, the users’ experience will also be consistent.

Moreover, as part of the research process, you want to be able to update a single element, such as the model's core, while keeping everything else constant - and then see how the outcome has changed. 

Security and Compliance

Another consideration is security and compliance. In many business verticals such as banking, healthcare, and security, organizations are required to maintain and report on the exact process that led to a given model and its result. Ensuring reproducibility is a common practice in these verticals. 

To accomplish this, you need to version all the inputs to your ML pipeline such that reverting to a previous version reproduces a prior result. Regulations often require you to recreate the former state of the pipeline. This includes the model, data, and a previous result.

Easier management of changing data

Data is always changing. This makes it difficult to keep track of its exact status over time. People frequently keep only one state of their data: the present state.

This has a negative impact on the work since it makes the following tasks incredibly difficult:

  • Debugging a data problem.
  • Validating the correctness of machine learning training (re-running a model on different data yields different results).
  • Supporting data audits.

How do you achieve reproducibility?

To achieve reproducibility, data practitioners often keep several copies of the ML pipeline and data. But copying enormous training datasets each time you wish to explore them is expensive and just isn’t scalable. 

Furthermore, there is no method to save atomic copies of many model artifacts and their accompanying training data. Add to that the challenges of handling many types of structured, semi-structured, and unstructured training data, such as video, audio, IoT sensor data, tabular data, and so on.

Finally, when ML teams make duplicate copies of the same data for collaboration, it's difficult to implement data privacy best practices and data access limits. 

What you need is a data versioning tool that has a zero-copy mechanism and lets you create and track multiple versions of your data. Version control is the process of recording and controlling changes to artifacts like code, data, labels, models, hyperparameters, experiments, dependencies, documentation, and environments for training and inference.

Can you get away with using Git for versioning data? This might sound like a good idea, but Git was not built for data: it is neither adequate, secure, nor scalable for large datasets.

The version control components for data science are more complicated than those for software projects, making reproducibility more challenging. Moreover, since raw training data is often stored in cloud object stores (S3, GCS, Azure Blob), teams need a versioning solution that works for data in-place (in object stores). 

Luckily, there is an open-source tool that does just that: lakeFS.

Data version control with lakeFS

lakeFS is an open-source tool that enables teams to manage their data using Git-like procedures (commit, merge, branch), supporting billions of files and petabytes of data. lakeFS adds a management layer to your object storage, such as S3, and transforms your entire bucket into something akin to a code repository. Although lakeFS handles only a portion of the MLOps flow, it strives to be a good citizen within the MLOps ecosystem by integrating with the other tools in that ecosystem - especially data quality tools.

Reproducibility means that team members can time travel between multiple versions of the data - snapshots taken at various points in time and with varying degrees of modification.

To ensure data reproducibility, we recommend committing to a lakeFS repository every time the data in it changes. Once a commit has been made, replicating a given state is as simple as reading data from a path that includes the unique commit_id produced for that commit.

Getting the current state of a repository is straightforward: we use a static path made up of the repository name and branch name. For example, if you have a repository called example with a branch called main, reading the most recent state of this data into a Spark DataFrame looks like this:

df = spark.read.parquet("s3://example/main/")

Note: This code snippet assumes that all items in the repository under this path are in Parquet format. If you’re using a different format, use the appropriate Spark read method.
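Note also that for the path above to resolve through lakeFS, Spark needs to be pointed at the lakeFS S3-compatible endpoint rather than at MinIO directly. Here is a minimal sketch of that configuration, assuming a local lakeFS instance at http://localhost:8000 and placeholder credentials (substitute the values from your own setup):

from pyspark.sql import SparkSession

# Point Spark's S3A filesystem at the lakeFS S3-compatible gateway instead of
# the raw object store. The endpoint and keys below are placeholders.
spark = (
    SparkSession.builder.appName("lakefs-reproducibility")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs_access_key_id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs_secret_access_key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The repository acts as the bucket and the first path element is the ref.
# Depending on your Hadoop configuration the scheme may be s3:// or s3a://.
df = spark.read.parquet("s3a://example/main/")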

However, we can also read from any previous commit in a lakeFS repository. Any commit can be reproduced, and if a commit can be reproduced, the results are repeatable.

In this repository, each time the model training script is run, a new commit is made, and the commit message records the exact run number. What if we wanted to re-run the model training script and get the same results as a previous run? As an example, let’s say we want to reproduce the results of run #435. To do this, we simply copy the commit ID associated with that run and read the data into a DataFrame as follows:

df = spark.read.parquet("s3://example/296e54fbee5e176f3f4f4aeb7e087f9d57515750e8c3d033b8b841778613cb23/training_dataset/")

The ability to reference a single commit_id in code makes it easy to reproduce the specific state of a data collection - or of several collections. This supports several typical data development tasks, such as historical debugging, discovering deltas in a data collection, audit compliance, and more.
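Branches, commit IDs, and tags are all interchangeable refs in the path. For instance, if a tag had been created for run #435 (the tag name below is hypothetical), the same read could be expressed against that tag instead of the raw commit ID:

# "run-435" is a hypothetical tag assumed to point at the commit for run #435;
# a tag gives the commit a human-friendly, immutable name.
df = spark.read.parquet("s3a://example/run-435/training_dataset/")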

Object storage with MinIO 

MinIO is a high-performance, S3-compatible object store. It is built for large-scale AI/ML, data lake, and database workloads. It runs on-prem and on any cloud (public or private), from the data center to the edge. MinIO is software-defined and open source under the GNU AGPL v3. Enterprises use MinIO to deliver against ML/AI, analytics, backup, and archival workloads - all from a single platform. Remarkably simple to install and manage, MinIO offers a rich suite of enterprise features targeting security, resiliency, data protection, scalability, and identity management. In the end-to-end example presented here, MinIO serves as the underlying object store for the lakeFS repository.

Putting it all together: lakeFS + MinIO

lakeFS provides object-storage-based data lakes with version control capabilities in the form of Git-like operations. It can work on top of your MinIO storage environment and integrates with modern data frameworks such as Apache Spark, Hive, Presto, Kafka, R, and native Python, among others.

Using lakeFS on top of MinIO, you can:

  • Create a development environment that keeps track of experiments.
  • Efficiently modify and version data with zero-copy branching and commits for every experiment.
  • Build a robust data pipeline for delivering new data to production.
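Under the hood, lakeFS exposes an S3-compatible API, so any S3 client can address a repository as if it were a bucket, with the ref (branch, commit ID, or tag) as the first path element. Here is a minimal sketch using boto3 against the example-repo repository created later in this walkthrough, assuming a local lakeFS endpoint at http://localhost:8000 and placeholder credentials:

import boto3

# Placeholder endpoint and credentials - substitute the values from your lakeFS setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",
    aws_access_key_id="<lakefs_access_key_id>",
    aws_secret_access_key="<lakefs_secret_access_key>",
)

# The repository acts as the bucket; the branch name is the leading key prefix.
resp = s3.list_objects_v2(Bucket="example-repo", Prefix="main/")
for obj in resp.get("Contents", []):
    print(obj["Key"])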

Here’s how you can set up lakeFS over MinIO and make data processing easier.

Prerequisites

  • MinIO Server installed from here.
  • mc installed from here.
  • Docker installed from here.

Installation

Let’s start by installing lakeFS locally on your machine. More installation options are available in the lakeFS docs.

An installation fit for production calls for a persistent PostgreSQL installation. But in this example, we will use a local key-value store within a Docker container.

Run the following command by replacing <minio_access_key_id>, <minio_secret_access_key> and <minio_endpoint> with their values in your MinIO installation:

docker run --name lakefs \
            --publish 8000:8000 \
            -e LAKEFS_BLOCKSTORE_TYPE=s3 \
            -e LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true \
            -e LAKEFS_BLOCKSTORE_S3_ENDPOINT=<minio_endpoint> \
            -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=<minio_access_key_id> \
            -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=<minio_secret_access_key> \
            treeverse/lakefs:latest \
            run --local-settings

Configuration

Go to lakeFS and create an admin user: http://127.0.0.1:8000/setup

Take note of the generated access key and secret.

Log in to lakeFS using the access key and secret key: http://127.0.0.1:8000

We will use the lakectl binary to carry out lakeFS operations. Find the distribution suitable for your operating system here, and extract the lakectl binary from the tar.gz archive. Place it somewhere in your $PATH and run lakectl --version to verify.

Then run the following command to configure lakectl, using the credentials generated during setup:

lakectl config
# output:
# Config file /home/janedoe/.lakectl.yaml will be used
# Access key ID:
# Secret access key:
# Server endpoint URL: http://127.0.0.1:8000/api/v1

Make sure that lakectl can access lakeFS with the command:

lakectl repo list

If you don’t get any errors, you’re ready to set a MinIO alias for lakeFS (using the lakeFS access key and secret key from the setup step):

mc alias set lakefs http://s3.local.lakefs.io:8000 <lakefs_access_key_id> <lakefs_secret_access_key>

If you don’t already have one, set a MinIO alias for your MinIO storage as well (replacing the placeholders with your MinIO endpoint and credentials):

mc alias set myminio <minio_endpoint> <minio_access_key_id> <minio_secret_access_key>

Now that we understand the basic concepts and have everything installed, let’s walk through an end-to-end example that demonstrates how easy this is to incorporate into your AI/ML engineering workflow. You will also notice that using lakeFS is very Git-like: if your engineers know Git, they will have an easy time learning lakeFS.

Zero-copy cloning and reproducibility example

One of the key advantages of combining MinIO with lakeFS is the ability to achieve parallelism without incurring additional storage costs. lakeFS takes a zero-copy approach: different versions of your ML datasets and models are managed efficiently without duplicating the data. This functionality is demonstrated in this section.

Let’s start by creating a bucket in MinIO! 

Note that this bucket will be created directly in your MinIO installation. Later on, we’ll use lakeFS to enable versioning on this bucket.

mc mb myminio/example-bucket

Next, create a repository in lakeFS backed by that bucket:

lakectl repo create lakefs://example-repo s3://example-bucket

Generate two example files:

echo "my first file" > myfile.txt
echo "my second file" > myfile2.txt

Create a branch named experiment1, copy a file to it and commit:

lakectl branch create lakefs://example-repo/experiment1 --source lakefs://example-repo/main

mc cp ./myfile.txt lakefs/example-repo/experiment1/

lakectl commit lakefs://example-repo/experiment1 -m "my first experiment"

Let’s create a tag for the committed data in the experiment1 branch (your ML models can access your data by this tag later):

lakectl tag create lakefs://example-repo/my-ml-experiment1 lakefs://example-repo/experiment1

Now, let’s merge the branch back to main:

lakectl merge lakefs://example-repo/experiment1 lakefs://example-repo/main

Create a branch named experiment2, copy a file to it, commit it and tag it:

lakectl branch create lakefs://example-repo/experiment2 --source lakefs://example-repo/main

mc cp ./myfile2.txt lakefs/example-repo/experiment2/

lakectl commit lakefs://example-repo/experiment2 -m "my second experiment"

lakectl tag create lakefs://example-repo/my-ml-experiment2 lakefs://example-repo/experiment2

Now, let’s merge the branch back to main:

lakectl merge lakefs://example-repo/experiment2 lakefs://example-repo/main

List the data for different experiments by using tags:

mc ls lakefs/example-repo/my-ml-experiment1
# only myfile.txt should be listed

mc ls lakefs/example-repo/my-ml-experiment2
# both files should be listed
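The same tags can be pinned inside a training script, so every run records exactly which snapshot of the data it consumed. Here is a sketch, again with placeholder lakeFS gateway credentials, that reads the file behind the my-ml-experiment1 tag:

import boto3

# The lakeFS S3 gateway endpoint and credentials are placeholders for your setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",
    aws_access_key_id="<lakefs_access_key_id>",
    aws_secret_access_key="<lakefs_secret_access_key>",
)

# Pin the run to a tag: re-running with the same ref reads exactly the same bytes.
DATA_REF = "my-ml-experiment1"
obj = s3.get_object(Bucket="example-repo", Key=f"{DATA_REF}/myfile.txt")
print(obj["Body"].read().decode())  # -> "my first file"

Reading through a tag rather than a branch guarantees the input cannot drift, because a tag always resolves to the same commit.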

Full example

You can review the full ML reproducibility example, along with multiple ML experiments, here.

Setup instructions for this example are here.

Next steps

If you’re ready to extend your MinIO object storage with Git-like features, take the installation and configuration steps outlined above and try it yourself!

Head over to this documentation page to get started. 

Summary 

By bringing lakeFS and MinIO together, you can take advantage of the power of Git branching to create reproducible experiments. 

Check out the lakeFS documentation to learn more and join the vibrant lakeFS community on their public Slack channel.

If you have questions about MinIO, then drop us a line at hello@min.io or join the discussion on MinIO’s general Slack channel.
