Data Migration from HDFS to MinIO

In the previous blog post of this series, Migrating from HDFS to Object Storage, we focused on moving applications that were using HDFS to use S3. The next step is the migration of data.

Only when all the data required by applications is consumed from Object Storage can the architecture be considered disaggregated and the migration complete.

In this blog post, we will focus on migrating data from HDFS to Object Storage.

Planning

The first step is to define which data needs to be migrated. Since data migration is an expensive operation, it is best to migrate only the data that applications will require immediately. Historical data can be moved into a backup store and loaded into object storage as needed; the steps to migrate historical data are the same as those for immediately required data.
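
To decide what falls into the immediately required set, it helps to measure how much data each application directory holds. For example (the paths here are illustrative):

hdfs dfs -du -h /apps                 # size of each application directory
hdfs dfs -du -s -h /apps/hive         # summary total for a single application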

Once the data to be migrated is clearly identified, the following concerns need to be considered:

  • Data Layout
  • Migration Speed
  • Migration Architecture

The final step is to perform the migration. Skip to the Data Migration section below if you are just interested in the commands to perform migration.

Data Layout

The general structure of data in HDFS follows the scheme:

hdfs://apps/$app_name/$db_name/$table_name/$record

where,
$app_name is the name of the application managing the data (e.g., hive)
$db_name is the name of the database
$table_name is the name of the table
$record is a row in the table

A simple rule of thumb is to replicate the exact structure in object storage: create a bucket per application, a folder within it for each database in that application, a subfolder for each table, and store the records as objects under their table folder.

The structure in object storage would be:

s3a://$app_name/$db_name/$table_name/$record
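
If you follow this mirrored layout, the buckets can be created up front with the MinIO client (mc). A minimal sketch, assuming a single cluster reachable under the placeholder alias and endpoint below:

mc alias set myminio https://minio.example.net:9000 $ACCESS_KEY $SECRET_KEY
mc mb myminio/hive          # one bucket per application
mc mb myminio/spark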

To accommodate usage patterns where different applications scale at different rates, it is common to create a separate MinIO cluster per application, and use a bucket per database. This will allow you to manage, scale and secure each MinIO cluster as a separate unit. The structure in object storage would be:

MinIO Cluster 1 - Apache Hive

s3a://endpoint-hive:9000/$hive_db_name/$table_name/$record

MinIO Cluster 2 - Apache Spark

s3a://endpoint-spark:9000/$spark_db_name/$table_name/$record
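
In this layout, each cluster gets its own mc alias and a bucket per database. The endpoints, credentials, and database names below are placeholders:

mc alias set hive-cluster http://endpoint-hive:9000 $HIVE_ACCESS_KEY $HIVE_SECRET_KEY
mc alias set spark-cluster http://endpoint-spark:9000 $SPARK_ACCESS_KEY $SPARK_SECRET_KEY
mc mb hive-cluster/sales_db           # one bucket per database
mc mb spark-cluster/events_db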

If your tables contain a large number of records and each record is stored in a separate file, consider bucketing before migrating the data. Bucketing groups records into a fixed number of files, which avoids the problem of having “too many files” and improves query performance on tables with a large number of records.
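
As a sketch, a Hive table can be rewritten into a bucketed copy before migration; the table, columns, and bucket count below are illustrative and should be tuned to your dataset:

hive -e "
  CREATE TABLE sales_bucketed (id BIGINT, amount DOUBLE)
  CLUSTERED BY (id) INTO 64 BUCKETS
  STORED AS ORC;
  INSERT OVERWRITE TABLE sales_bucketed SELECT id, amount FROM sales;
"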

Migration Speed

In order to migrate data at maximum speed, the migration parameters need to be set such that the underlying hardware capabilities are saturated.

For instance, consider an HDFS cluster with 10 nodes, each with 8 hard drives (200 MB/s max throughput each) and a 10 Gbps network link. Here is how to calculate the maximum achievable throughput:

Max per-node throughput = Min(Max drive throughput, Max network throughput)

In the scenario defined above,

Max drive throughput = 8 * 200 MB/s = 1600 MB/s = 12.8 Gbps
Max network throughput = 10 Gbps
Max per-node throughput = Min(12.8 Gbps, 10 Gbps) = 10 Gbps

Max throughput for the entire HDFS cluster will be:

Num nodes * Max per-node throughput = 10 * 10 Gbps = 100 Gbps

NOTE: This assumes that the receiving cluster has capabilities to ingest data at this rate.

The minimum amount of time to transfer the entire dataset can be calculated as:

Total data size (MB) / Max cluster-wide throughput (MB/s)
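
As a quick sanity check, the same arithmetic can be scripted. The figures below correspond to the 100 TB row of the table that follows:

# 100 TB at 12.5 GB/s (100 Gbps) cluster-wide, using decimal units
data_size_mb=$((100 * 1000 * 1000))       # 100 TB expressed in MB
throughput_mb_per_s=12500                 # 12.5 GB/s
echo "$((data_size_mb / throughput_mb_per_s)) seconds"    # 8000 s = 2h 13m 20s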

Here is a table that shows the minimum time taken for a given data size and max throughput:

| Data Size | Max Throughput        | Time Taken   |
|-----------|-----------------------|--------------|
| 1 TB      | 12.5 GB/s (100 Gbps)  | 1m 20s       |
| 100 TB    | 12.5 GB/s (100 Gbps)  | 2h 13m 20s   |
| 1 PB      | 12.5 GB/s (100 Gbps)  | 22h 13m 20s  |

Migration Architecture

Once you have determined the desired throughput, choose the minimum number of nodes required to achieve it. The data transfer can then be initiated from these nodes to MinIO.

If you have a multi-node MinIO cluster, it is possible to aggregate the throughput of all the nodes by pointing each data transfer task to a different endpoint in the cluster.

For instance, if MinIO is running on 10 nodes, namely minio-{1...10}:9000, then the first migration runner talks to minio-1, the second to minio-2, and so on. That way, the load is distributed evenly across all MinIO servers.

This can be achieved with round-robin DNS or with an /etc/hosts file. The latter approach is described below.

On each of the HDFS nodes, the /etc/hosts file should map the hostname ‘minio’ to a different node in the MinIO cluster.

| /etc/hosts - hdfs1  | /etc/hosts - hdfs2  | /etc/hosts - hdfs3  | ... |
|---------------------|---------------------|---------------------|-----|
| 192.168.1.10 minio  | 192.168.1.20 minio  | 192.168.1.30 minio  | ... |
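
One way to lay this out is with a small loop from an admin host; the hostnames, addresses, and node count below are illustrative:

# Map the hostname 'minio' to a different MinIO server on each HDFS node
for i in $(seq 1 10); do
  ssh hdfs$i "echo '192.168.1.$((i * 10)) minio' | sudo tee -a /etc/hosts"
done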

Data Migration

The distcp (distributed copy) utility provided by Hadoop can be used to perform the data migration. The basic copy command is:

hadoop distcp			\
-direct				\
-update				\
-m $num_copiers			\
hdfs://apps/$app_name		\
s3a://$app_name

where,
-direct		write directly to the destination without first staging the data in a temporary directory
-update		copy only files that are missing from the destination or that differ from it in size or checksum
-m		maximum number of simultaneous copy (map) tasks. Increase this to spread the copy across more nodes and reach the intended throughput.
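
The S3A connector also needs to know how to reach MinIO. If the endpoint and credentials are not already configured in core-site.xml, they can be passed on the command line; the endpoint, credentials, and mapper count below are placeholders:

hadoop distcp \
  -Dfs.s3a.endpoint=http://minio:9000 \
  -Dfs.s3a.access.key=$ACCESS_KEY \
  -Dfs.s3a.secret.key=$SECRET_KEY \
  -Dfs.s3a.path.style.access=true \
  -direct -update -m 96 \
  hdfs://apps/hive \
  s3a://hive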

Conclusion

Start your journey from HDFS to MinIO with careful planning that works towards clear goals.

Further notes on data migration from HDFS to MinIO can be found in our benchmarking guide, Performance comparison between MinIO and HDFS for MapReduce Workloads.

Reach out to us at hello@min.io if you have any questions, or visit min.io to download and get started right away!
