Git-like versioning for your AI Data

AJ on AI/ML
You’ve surely version-controlled code before. But have you version-controlled your data? Have you ever wanted to collaborate on large datasets with several teams without committing one large chunk at a time? Imagine using git-like commands in a repository-like ecosystem where you can commit data, create branches, check history, and track changes throughout the data's lifecycle. Ultimately, this lets teams in large organizations collaborate on data the same way they collaborate on code.

Pachyderm offers exactly this. Its backbone is the Pachyderm File System (PFS), which is built on top of PostgreSQL and an object store such as MinIO. This keeps the data secure and consistent across all requests, and it lets users version their data with branches and commits to manage and track changes over time.
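To make the branch-and-commit idea concrete, here is a minimal sketch of what a versioning session could look like with `pachctl`, Pachyderm's command-line client, once the deployment described later in this post is running. The repo name `images` and the branch name `experiment` are assumptions chosen for illustration, and the `run` wrapper is a hypothetical helper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print each command instead of executing it,
# unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch of a git-like workflow for data. Repo and branch names are illustrative.
version_data() {
  repo="$1"
  file="$2"
  run pachctl create repo "$repo"                          # like `git init`
  run pachctl put file "$repo@master:/$file" -f "$file"    # add a file as a new commit on master
  run pachctl list commit "$repo@master"                   # like `git log`
  run pachctl create branch "$repo@experiment" --head master   # like `git branch`
}

version_data images myfile.csv
```

Running it as-is only prints the four commands, which makes it safe to try before pointing it at a real cluster with `DRY_RUN=0`.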

Let’s take a look at how to set up Pachyderm with AIStor as its backbone.

AIStor and Pachyderm

You should already have a Kubernetes cluster running a supported version of Kubernetes.

Once you have that going, go ahead and download and untar the Kubernetes YAML files for AIStor.

wget https://dl.min.io/enterprise/console.tar.gz


tar xvf console.tar.gz


Launch the Global Console

kubectl apply -k console

Next let's install Pachyderm.

Add Helm chart repo and update

helm repo add pachyderm https://helm.pachyderm.com

helm repo update

Create a MinIO bucket using the steps below

<div>

  <script async src="https://js.storylane.io/js/v2/storylane.js"></script>

  <div class="sl-embed" style="position:relative;padding-bottom:calc(79.17% + 25px);width:100%;height:0;transform:scale(1)">

<iframe loading="lazy" class="sl-demo" src="https://app.storylane.io/demo/cesgrcyf9wnq?embed=inline" name="sl-embed" allow="fullscreen" allowfullscreen style="position:absolute;top:0;left:0;width:100%!important;height:100%!important;border:1px solid rgba(63,95,172,0.35);box-shadow: 0px 0px 18px rgba(26, 19, 72, 0.15);border-radius:10px;box-sizing:border-box;"></iframe>

  </div>

</div>

Update the Pachyderm Helm values file with the MinIO endpoint, bucket name, access key ID, and secret key.

pachd:
  storage:
    backend: "AMAZON"
    storageURL: "s3://pachyderm-test?endpoint=minio.default.svc.cluster.local:9000&disableSSL=true&region=dummy-region"
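The `storageURL` value packs the bucket name, endpoint, and region into a single string, so it is easy to mistype. As a rough sketch, the hypothetical helper below (`build_storage_url` is not part of Pachyderm or MinIO) assembles the same string from its parts, assuming the same query parameters used in the values file above.

```shell
#!/bin/sh
# Hypothetical helper: assemble a storageURL from bucket, endpoint, and region.
# The region defaults to the placeholder used in the values file above.
build_storage_url() {
  bucket="$1"
  endpoint="$2"
  region="${3:-dummy-region}"
  printf 's3://%s?endpoint=%s&disableSSL=true&region=%s\n' "$bucket" "$endpoint" "$region"
}

# Reproduces the value shown in the values file.
build_storage_url pachyderm-test minio.default.svc.cluster.local:9000
```

The printed string can be pasted into `values.yaml` as the `storageURL` value; keep it quoted, since it contains `?` and `&`.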

Deploy Pachyderm

helm install pachyderm -f values.yaml pachyderm/pachyderm --version <your_chart_version>

Adding and Retrieving Data

There are two ways to add and retrieve data.

MC

MC, the MinIO Client, is the simplest way.

You add a Pachyderm endpoint just like any other S3 endpoint:

mc alias set pachyderm_minio <pachyderm-address> <YOUR-PACHYDERM-AUTH-TOKEN> <YOUR-PACHYDERM-AUTH-TOKEN>

List the contents of the Pachyderm repo and project, using the alias you just created:

mc ls pachyderm_minio/master.<repo>.<project>
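The bucket name in the command above follows the pattern `<branch>.<repo>.<project>`. A small sketch of a helper that builds it is shown below; `s3_bucket_name` is a hypothetical function, the repo name `images` is illustrative, and the defaults of `master` for the branch and `default` for the project are assumptions you should adjust to your setup.

```shell
#!/bin/sh
# Hypothetical helper: build the bucket name <branch>.<repo>.<project>.
# Defaults (assumed): project "default", branch "master".
s3_bucket_name() {
  repo="$1"
  project="${2:-default}"
  branch="${3:-master}"
  printf '%s.%s.%s\n' "$branch" "$repo" "$project"
}

s3_bucket_name images
# Example use with the alias from above (not executed here):
#   mc ls "pachyderm_minio/$(s3_bucket_name images)"
```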

AWS CLI

You can also put data into MinIO using the AWS CLI. The S3 URI must name a bucket, not the MinIO host:

aws --endpoint-url <pachyderm-address> s3 cp myfile.csv s3://master.<repo>.<project>/myfile.csv

Data retrieval from AIStor is just as simple. Pass a local destination as the second argument:

aws --endpoint-url <pachyderm-address> s3 cp s3://master.<repo>.<project>/myfile.csv .


If you are outside the Kubernetes cluster, you can use port forwarding, although I would recommend limiting that to testing.

Versioning for AI Data

We version code, and by now it's obvious why. We also version infrastructure as code. This wasn't always the norm, but even for small setups, versioning your infrastructure code has become just as important as versioning your application code. Fundamentally, we do this to collaborate. Big Data and AI/ML are two sides of the same coin: you cannot have one without the other, and they feed into each other as models evolve. So you want the data you are generating to be usable by other teams in a meaningful way, without anyone having to regenerate the entire dataset. Imagine if someone overwrote your code every time they made a change, without a proper git commit or merge.

At MinIO, we care not only about simplicity but also about best practices for managing your infrastructure, so you don’t get those 3 AM pager calls. If you have any questions about AIStor, or about AI/ML or Big Data topics in general, be sure to reach out to us on Slack!