Data Migration Tools to Get You Into MinIO

AJ on DevOps, 10 April 2023

MinIO runs on anything – bare metal, Kubernetes, Docker, Linux and more. Organizations choose to run MinIO to host their data on any of these platforms, and increasingly rely on multiple platforms to satisfy multiple requirements. The choice of underlying hardware and OS is based on a number of factors, primarily the amount of data to be stored in MinIO plus requirements for integration with other cloud-native software, performance and security.

Many of our customers run MinIO on bare metal, while the majority run on Kubernetes. Running multiple instances of MinIO in a containerized architecture that is orchestrated by Kubernetes is extremely efficient. MinIO customers roll out new regions and update services without disruption, with separate Kubernetes clusters running in each region, and the operational goal of shared-nothing for greatest resiliency and scalability.

Customers switch to MinIO for a variety of reasons, including:

S3-compatible API
Multi-cloud, cloud-agnostic deployments
S3-style and IAM-style ACL management
Distributed, fault-tolerant storage using erasure coding
Tiering and versioning of objects across multiple clusters
Bucket and site-to-site replication
Batch replication via the Batch Framework
Server-side object and client data encryption
Transport-layer network encryption of data

Given the range of reasons for adopting MinIO and the environments where it can be installed, it's realistic to assume that you already have data stored elsewhere that you would want to get into MinIO.

In this post, let's review some of the tools available to get data out of S3, local FileSystem, NFS, Azure, GCP, Hitachi Content Platform, Ceph, and others, and into MinIO clusters where it can be exposed to cloud-native AI/ML and analytics packages.

MinIO Client

To get started, we’ll be using the MinIO Client (mc) during the course of this post for a few of these options. Please be sure to install it and set the alias to your running MinIO Server.

mc alias set destminio https://myminio.example.net minioadminuser minioadminpassword

We will be adding some more “source” aliases as we go through the different methods.

FileSystems

The majority of use cases for migrating data into MinIO start with a mounted filesystem or NFS volume. In this simple configuration, you can use mc mirror to sync the data from the source to the destination. Think of mc mirror as a Swiss Army knife for data synchronization. It takes the burden off the user to determine the best way to interact with the source from which you are fetching the objects. It supports a number of sources and, based on the source you are pulling from, uses the appropriate functions to read from it.

For example, let's start with a simple filesystem that is mounted from a physical hard disk, virtual disk, or even something like a GlusterFS mount. As long as it's a filesystem readable by the OS, the MinIO Client can read it too.

filesystem            kbytes     used       avail      capacity  mounted on
/dev/root             6474195    2649052    3825143    41%       /
/dev/stand            24097      5757       18340      24%       /stand
/proc                 0          0          0          0%        /proc
/dev/fd               0          0          0          0%        /dev/fd
/dev/_tcp             0          0          0          0%        /dev/_tcp
/dev/dsk/c0b0t0d0s4   10241437   4888422    5353015    48%       /home
/dev/dsk/c0b0t1d0sc   17422492   12267268   5155224    71%       /home2

Let’s assume your objects are in /home/mydata. You would then run the following command to mirror the objects (if the mydata bucket does not already exist, you have to create it first).

mc mirror /home/mydata destminio/mydata

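This command copies objects that exist in the source but not yet in the destination, so running it again picks up any objects added to the source in the meantime. By default it neither deletes nor overwrites anything on the destination. If you want objects that are no longer in the source to be removed from the destination, pass the --remove flag, and if you want to overwrite existing objects that were modified in the source, pass the --overwrite flag.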
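As a minimal sketch, the bucket creation and the mirror step can be wrapped in one shell function. The function name mirror_to_minio is hypothetical, and we are assuming your installed mc release supports mc mb --ignore-existing and the --overwrite, --remove and --dry-run flags of mc mirror, so check mc mirror --help before relying on it.

```shell
# Hypothetical helper: create the destination bucket if needed, then mirror.
# Usage: mirror_to_minio SOURCE DEST [extra mc mirror flags...]
mirror_to_minio() {
    if [ "$#" -lt 2 ]; then
        echo "usage: mirror_to_minio SOURCE DEST [mc mirror flags...]" >&2
        return 2
    fi
    src=$1
    dest=$2
    shift 2
    # --ignore-existing turns bucket creation into a no-op if the bucket exists.
    mc mb --ignore-existing "$dest" || return 1
    # Remaining arguments (for example --overwrite or --remove) are passed
    # through to mc mirror unchanged.
    mc mirror "$@" "$src" "$dest"
}
```

For example, mirror_to_minio /home/mydata destminio/mydata --dry-run should preview the copy without writing anything, while adding --overwrite --remove makes the destination match the source, deletions included, so use that combination with care.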

NFS

Network File System (NFS) is generally used to store objects or data that are not accessed often because, while ubiquitous, the protocol is often very slow across the network. Nonetheless, a lot of ETL and some legacy systems use NFS as a repository for data to be used for operations, analytics, AI/ML, and additional use cases. It would make better sense for this data to live on MinIO because of the scalability, security and high performance of a MinIO cluster, coupled with MinIO’s ability to provide services to cloud-native applications using the S3 API.

Install the required packages to mount the NFS volume

apt install nfs-common

On the NFS server, be sure to add the /home directory to /etc/exports

/home client_ip(rw,sync,no_root_squash,no_subtree_check)

Note: Be sure to restart your NFS server, for example on Ubuntu servers

systemctl restart nfs-kernel-server

Create a directory to use as the mount point for the NFS volume

mkdir -p /nfs/home

Mount the NFS volume

mount <nfs_host>:/home /nfs/home

Copy the data from NFS to MinIO

mc mirror /nfs/home destminio/nfsdata

There you go, now you can move your large objects from NFS to MinIO.
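One thing worth guarding against: if the mount step failed, /nfs/home is just an empty local directory and the mirror would silently copy nothing. Here is a small sketch that checks the mount first. The function name mirror_nfs is hypothetical, and we are assuming the mountpoint utility (part of util-linux on most Linux distributions) is installed.

```shell
# Hypothetical helper: mirror an NFS mount only if it is really mounted.
# Usage: mirror_nfs MOUNT_DIR DEST
mirror_nfs() {
    src=$1
    dest=$2
    # mountpoint -q exits non-zero when the directory is not a mount point.
    if ! mountpoint -q "$src"; then
        echo "error: $src is not a mounted filesystem" >&2
        return 1
    fi
    mc mirror "$src" "$dest"
}
```

With the paths from above, that would be mirror_nfs /nfs/home destminio/nfsdata.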

S3

As we mentioned earlier, mc mirror is a Swiss Army knife of data synchronization. In addition to filesystems, it also copies objects from S3 or S3 API compatible stores and mirrors them to MinIO. One of the more popular use cases of this is mirroring an Amazon S3 bucket.

Follow these steps to create an AWS S3 bucket in your account. If you already have an existing account with data we could use that too.

Once a bucket has been created or data has been added to an existing bucket, create a new IAM user with an access key and secret key, and attach a policy allowing access only to our bucket. Save the generated credentials for the next step.

We can work with any S3 compatible storage using the MinIO Client. Next let’s add an alias for Amazon S3 using the credentials we downloaded

mc alias set s3 https://s3.amazonaws.com BKIKJAA5BMMU2RHO6IBB V7f1CwQqAcwo80UEIJEjc5gVQUSSx5ohQ9GSrr12 --api S3v4

Use mc mirror to copy the data from S3 to MinIO

mc mirror s3/mybucket destminio/mydata

Depending on the amount of data, network speeds and the physical distance from the region where the bucket data is stored, it might take a few minutes or more for you to mirror all the data. You will see a message when mc is done copying all the objects.
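Once the mirror finishes, it is worth confirming that source and destination match. The following sketch uses mc diff for that. The function name verify_mirror is hypothetical, and it assumes that mc diff prints one line per differing object and prints nothing when both sides are in sync, so confirm that behavior against your mc release.

```shell
# Hypothetical helper: report whether two mc targets differ.
# Assumption: mc diff prints nothing when source and destination match.
# Usage: verify_mirror SOURCE DEST
verify_mirror() {
    src=$1
    dest=$2
    diffs=$(mc diff "$src" "$dest")
    status=$?
    # Treat either a failed mc call or any reported difference as "not in sync".
    if [ "$status" -ne 0 ] || [ -n "$diffs" ]; then
        echo "differences or errors between $src and $dest:" >&2
        printf '%s\n' "$diffs" >&2
        return 1
    fi
    echo "$src and $dest are in sync"
}
```

For the S3 example above, you would call verify_mirror s3/mybucket destminio/mydata.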

HDFS

For the next set of tools, we write dedicated scripts to satisfy some of the non-standard edge case data migration requirements that we need to fulfill. One of these is migrating from HDFS and Hadoop. Many enterprises have so much data stored in Hadoop that it's impossible to ignore it and start fresh with a cloud-native platform. It is more feasible to transfer that data to something more modern (and cloud-native) like MinIO and run your ETL and other processes that way. It's rather simple to set up.

Create a file called core-site.xml with the following contents

<configuration>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>https://minio:9000</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>minio-sample</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>minio-sample123</value>
</property>
</configuration>

Set the following environment variables

export HDFS_SOURCE_PATH=hdfs://namenode:8080/user/minio/testdir
export S3_DEST_PATH=s3a://mybucket/testdir

Download the following file, chmod +x and run it

curl -LSs -o hdfs-to-minio.sh https://raw.githubusercontent.com/minio/hdfs-to-minio/master/hdfs-to-minio.sh
chmod +x hdfs-to-minio.sh
./hdfs-to-minio.sh

If you’ve been storing data in Hadoop for several years, then this process might take several hours. If it's on a production cluster, then we recommend migrating data in off hours during maintenance windows to minimize the impact of any performance degradation to your Hadoop cluster while data is being mirrored.
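To give a feel for what a copy like this involves, here is a sketch of an equivalent transfer using Hadoop's DistCp tool. This is not the content of hdfs-to-minio.sh, which may work differently. The function name hdfs_to_minio is hypothetical, and the sketch assumes the hadoop command is on your PATH and picks up the s3a settings from the core-site.xml shown above.

```shell
# Hypothetical helper: copy an HDFS path to an s3a:// path with DistCp.
# Reads HDFS_SOURCE_PATH and S3_DEST_PATH from the environment.
hdfs_to_minio() {
    if [ -z "${HDFS_SOURCE_PATH:-}" ] || [ -z "${S3_DEST_PATH:-}" ]; then
        echo "error: set HDFS_SOURCE_PATH and S3_DEST_PATH first" >&2
        return 1
    fi
    # DistCp runs the copy as a distributed job on the Hadoop cluster.
    hadoop distcp "$HDFS_SOURCE_PATH" "$S3_DEST_PATH"
}
```

Because the copy runs as a job on the Hadoop cluster itself, the advice about off hours and maintenance windows applies here as well.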

More details about migrating from HDFS to MinIO are available in this GitHub Repo, and we’ve got a blog post as well, Migrating from HDFS to Object Storage.

HCP

We previously wrote an amazing blog post on Hitachi Content Platform and how to migrate your data to a MinIO cluster. I would recommend reading the blog post for full details but the crux is as follows.

Once you have the necessary HCP cluster and input file configured, download the migration tool and run the following command to start the migration process

hcp-to-minio migrate --namespace-url https://finance.europe.hcp.example.com \
  --auth-token "HCP bXl1c2Vy:3f3c6784e97531774380db177774ac8d" \
  --host-header "s3testbucket.sandbox.hcp.example.com" \
  --data-dir /mnt/data \
  --bucket s3testbucket \
  --input-file /tmp/data/to-migrate.txt

More details are available in this Blog Post.

Ceph

Last but not least, we’ve kept the elephant in the room until the end. Although aging, Ceph is a popular store for data and it has an S3 compatible API. Kubernetes projects such as Rook use it as the backend for object storage. Ceph, however, is an unwieldy behemoth to set up and run. So it's natural that folks would want to move their data to something simpler, easier to maintain and with greater performance.

There are two ways to copy data from Ceph

Bucket replication: creates the object on the destination, but if the object is later deleted from the source, it is not deleted on the destination. https://min.io/docs/minio/linux/administration/bucket-replication.html
mc mirror: synchronizes objects and versions, and can even delete objects on the destination that no longer exist on the source. https://min.io/docs/minio/linux/reference/minio-mc/mc-mirror.html

Similar to S3, since Ceph has an S3 compatible API, you can add an alias to the MinIO Client

mc alias set ceph http://ceph_host:port cephuser cephpass

You can then use mc mirror to copy the data to your MinIO cluster

mc mirror ceph/mydata destminio/mydata

We suggest that you run the mc mirror command with the --watch flag to continuously monitor for objects and sync them to MinIO.
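A minimal sketch of that continuous sync follows. The function name watch_mirror is hypothetical, and combining --watch with --overwrite (so that objects modified on the source replace their older copies) is our assumption about a sensible default rather than a requirement.

```shell
# Hypothetical helper: keep DEST in sync with SOURCE until interrupted.
# Usage: watch_mirror SOURCE DEST
watch_mirror() {
    src=$1
    dest=$2
    # --watch keeps mc running and copies new or changed objects as they
    # appear; --overwrite lets changed objects replace older copies.
    mc mirror --watch --overwrite "$src" "$dest"
}
```

Running watch_mirror ceph/mydata destminio/mydata in a terminal multiplexer or as a service keeps MinIO current while applications are still writing to Ceph, which makes the final cutover much shorter.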

Migrate Your Data to MinIO Today!

These are just a few examples to show you how easy it is to migrate your data to MinIO. It doesn’t matter if you are using older legacy protocols such as NFS or the latest and greatest such as S3, MinIO is here to support you.

In this post we went into detail on how to migrate from filesystems and other data stores such as NFS, GlusterFS, S3, HDFS, HCP, and last but not least Ceph. Regardless of the tech stack running against it, MinIO provides a performant, durable, secure, and scalable yet simple software-defined object storage backend.

If you have any questions feel free to reach out to us on Slack!
