Containerized data analytics at scale, with MinIO and Pachyderm
Containers running on orchestration platforms like Kubernetes, Docker Swarm, and DC/OS offer powerful, versatile ways to deploy applications. Containers let you deploy isolated application instances, and you can launch multiple such instances to scale up your serving capacity. You don't even need to worry about individual server capacities and scheduling, thanks to orchestration tools, which provide declarative deployments.
However, data analytics has remained difficult to achieve with a cloud-native strategy. This is mainly because data analytics involves large-scale processing that is tightly coupled to large amounts of data. A single container can't possibly mount all the data, and if multiple containers work on the data, it's difficult to track which data is being processed where.
How do you apply cloud-native principles to a compute- and data-intensive field like big data?
Pachyderm is an open source framework that enables distributed data versioning and data pipelining via containers. We worked closely with the Pachyderm team, and we're very happy to announce that MinIO is now integrated as a backend data store for Pachyderm. In this post, we'll share how to set up Pachyderm to create containerized data analytics backed by MinIO cloud storage. But before that, let me explain why this is a good idea to begin with!
Why Pachyderm and MinIO?
MinIO is versatile enough to serve as the backend for both the application stack and data analytics platform. This way, you can plug your application to store unstructured data to MinIO, while also running analytics on some of the MinIO buckets in parallel.
MinIO is cloud-native, S3 compatible, robust, scalable and can be deployed anywhere — on premise or on cloud.
Pachyderm enables version controlled data and containerized processing. While the data may live in MinIO, Pachyderm automatically assigns slices of data to different containers for parallel processing — helping you make the best use of container technology for data analysis.
Set up MinIO with Pachyderm
To begin with, you'll need MinIO up and running. You can install MinIO either on a cloud server or on premises; check out the details here. Once you have MinIO ready, create a bucket (to be used as the data source for Pachyderm) and keep the accessKey, secretKey, bucketName and endPoint handy. We'll need these while deploying Pachyderm.
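If you use the MinIO client (mc), the bucket can be created from the command line. The alias myminio, the endpoint, and the bucket name pachyderm-data below are placeholder values for illustration:
# Point mc at your MinIO server (endpoint and credentials are examples)
$ mc config host add myminio http://minio.example.com:9000 <access_key> <secret_key>
# Create the bucket that Pachyderm will use as its data store
$ mc mb myminio/pachyderm-data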
The next step is to deploy Pachyderm. Note that Pachyderm needs Kubernetes and the command line utility pachctl as prerequisites.
Kubernetes is used to orchestrate your data processing via containers. You can deploy Kubernetes locally via Minikube or on cloud providers like AWS/GCP. pachctl enables interaction with the Pachyderm application running on Kubernetes. You can install pachctl by:
# For macOS:
$ brew tap pachyderm/tap && brew install pachctl
# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://pachyderm.io/pachctl.deb && sudo dpkg -i /tmp/pachctl.deb
Finally, you can deploy Pachyderm. Assuming you are deploying on one of the common cloud providers, have Kubernetes running, and have created a persistent disk on that cloud provider (used to store various metadata), execute the following command to deploy Pachyderm backed by MinIO:
$ pachctl deploy custom \
--persistent-disk <cloud_provider> \
--object-store s3 ${STORAGE_NAME} ${STORAGE_SIZE} \
<bucket_name> <access_key> <secret_key> <endpoint> \
--static-etcd-volume=${STORAGE_NAME}
Note that cloud_provider can be google, azure or aws. You can even deploy Pachyderm on-premise; refer to this doc for on-premise deployment.
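For illustration, a deployment on AWS backed by a MinIO server could look like the following. The disk name and size, bucket name, credentials, and endpoint are all placeholder values:
$ pachctl deploy custom \
    --persistent-disk aws \
    --object-store s3 pach-disk 10 \
    pachyderm-data <access_key> <secret_key> minio.example.com:9000 \
    --static-etcd-volume=pach-disk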
After a couple of minutes, Pachyderm will be up and running! Then forward the port to the running Pachyderm cluster, so pachctl can talk to the Pachyderm deployment.
$ pachctl port-forward &
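If pachctl fails to connect, you can first confirm that the Pachyderm pods came up using standard Kubernetes tooling:
# All Pachyderm pods (pachd and its metadata store) should show STATUS Running
$ kubectl get pods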
Test that the communication works by checking the version:
$ pachctl version
COMPONENT VERSION
pachctl 1.4.0
pachd 1.4.0
You could also configure other object stores with Pachyderm. Follow this document for more information.
What next?
Now that you have the setup ready, let's see how you actually create data analysis pipelines with Pachyderm and MinIO.
Add data to Pachyderm: Whenever you add data to Pachyderm, it backs the data up to a MinIO bucket (the one provided during Pachyderm deployment). But this is not a simple backup: the data is version controlled by Pachyderm using a Git-like system for data. This way, any data manipulation is encapsulated in a commit. Your team members can reproduce your setup just by referring to a commit, and they can even revert to old states if needed.
Pachyderm offers several ways to add data to a Pachyderm repo. Read more about it here.
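As a quick sketch using the pachctl 1.x command names (the repo and file names here are invented for illustration, and flags vary slightly between pachctl versions):
# Create a versioned data repository
$ pachctl create-repo sales
# Add a local file; this records the change as a new commit on the master branch
$ pachctl put-file sales master /2017-03.csv -f 2017-03.csv
# Inspect the commit history of the repo
$ pachctl list-commit sales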
Creating analysis pipelines: Pachyderm also allows you to create DAG pipelines, and it can shard your large data across parallel containers. Each container automatically has access to its slice of the data at /pfs/<repo_name>. Data analysis code running in a container can simply read the data at /pfs/<repo_name> and write its output to /pfs/out. Pachyderm again takes care that the data output by each container is segregated and ends up in the proper place. This makes data analysis scalable and easy to manage. A minimal pipeline spec is sketched below.
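To make this concrete, here is a hypothetical pipeline spec in the Pachyderm 1.x JSON format. The pipeline name, image, command, and repo name are made up for illustration, and the exact field layout may differ between Pachyderm versions. The glob pattern tells Pachyderm how to split the input across containers:
# Write a pipeline spec that counts words in each file of the sales repo
$ cat > wordcount.json <<'EOF'
{
  "pipeline": { "name": "wordcount" },
  "transform": {
    "image": "ubuntu:16.04",
    "cmd": ["bash"],
    "stdin": ["wc -w /pfs/sales/* > /pfs/out/counts.txt"]
  },
  "inputs": [ { "repo": { "name": "sales" }, "glob": "/*" } ]
}
EOF
# Register the pipeline; Pachyderm runs containers whenever new data is committed
$ pachctl create-pipeline -f wordcount.json
With a glob of /*, each top-level file or directory in the repo can be handed to a separate container, which is how Pachyderm parallelizes the work.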
Read more about creating data analysis pipelines using Pachyderm here.
Summary
In this post, we learned how the MinIO and Pachyderm integration helps you set up a scalable and reliable data analytics pipeline. We saw that Pachyderm takes care of sharding your data across data-processing containers, while MinIO serves as the cloud-native, reliable, scalable backend for both the processed and unprocessed data.
Since both MinIO and Pachyderm are cloud-native applications and can run on orchestration tools like Kubernetes, there is not much Ops effort involved.
While you're at it, help us understand your use case and how we can serve you better! Fill out our best of MinIO deployment form (it takes less than a minute) for a chance to be featured on the MinIO website and showcase your MinIO private cloud design to the MinIO community.
You can visit the Pachyderm team at: http://slack.pachyderm.io