Cloud native, Kubernetes-oriented, microservices based architectures are driving the need for over-the-network storage such as object storage. The benefits of object storage are numerous in a cloud native environment - it allows elastic scaling of compute hardware independent of storage hardware. It makes applications stateless, as state is stored over the network, and allows applications to achieve higher scale than ever possible before, by easing operational complexity.
In contrast to the legacy approach of bringing your data to the compute, the pattern of storing data over-the-network from compute workloads epitomizes the modern disaggregated architecture. The most prominent standard for writing and reading data from an over-the-network object storage system is S3. MinIO is a fully S3-compliant, high performance, hybrid and multi-cloud ready object storage solution.
As most sophisticated Hadoop admins know, high performance object storage backends have become the default storage architecture for modern implementations. This post details how to bring the benefits of object storage to Hadoop - changing the storage protocol, data migration and performance tuning.
In the following sections, we will describe the process of migrating from HDFS to MinIO - our own high performance, kube-native, s3 compatible object storage system.
hdfs:// to s3a://
Any big data platform from the Hadoop ecosystem supports S3 compatible object storage backends by default. This support goes all the way back to 2006 when the emerging technology embedded an s3 client implementation. The hadoop-aws module and aws-java-sdk-bundle are used by all Hadoop-related platforms to provide support for the s3 API.
Applications can seamlessly switch between HDFS and S3 storage backends by specifying the appropriate protocol. In case of S3, the protocol scheme is s3a://, and in case of HDFS, the scheme is hdfs://.
The S3 client implementation in Hadoop SDK has evolved over the years, each with different protocol scheme names such as s3://, s3n://, and s3a://. Currently s3:// denotes Amazon’s EMR client. The most prominent s3 client available in the Hadoop ecosystem is s3a:// which is meant for all other S3 backends.
NOTE: s3n:// is defunct and no longer supported by any major Hadoop vendor.
The first step in migration is changing the protocol that Hadoop uses to communicate with backend storage from hdfs:// to s3a://. In the core-site.xml file for your platform, change the following parameter Hadoop.defaultFS to point to a s3 backend.
There are several ways to approach the migration to object storage. You could leave old data in HDFS to be accessed by Hadoop, while new data is saved in MinIO to be accessed by cloud-native applications like Apache Spark. You could move everything to MinIO where it would be accessed by Hadoop and cloud-native applications. Or you can choose to conduct a partial migration. You’ll have to choose the best one for your organization. I’ll describe how to do a full migration below, and dive deeper into planning your migration in a future blog post.
Migrating data from HDFS to S3
Data can be migrated between different storage backends using a Hadoop-native tool called distcp - which stands for distributed copy. It takes two parameters, source and destination. The source and destination can be any storage backend supported by Hadoop.
In this example, in order to move data from HDFS to s3, the source would have to be set to hdfs://192.168.1.2:9000 and destination would be s3a://minio:9000 .
Depending on the size of data, and speed of transfer, distcp itself can be scaled and data can be migrated using a massively parallel infrastructure.
The number of mappers, i.e. the number of parallel tasks copying the data, can be configured using the -m flag. A good rule of thumb is to set it to the number of CPU cores free across all nodes in your infrastructure.
For instance, if you have 8 free nodes with 8 cores each, then the number of CPU cores would be 64.
NOTE: The number of mappers should correspond to the number of free cores in your infrastructure, and not the total number of cores in your entire cluster. This is to ensure that other workloads have resources available for their operations.
Tuning for performance
The access pattern of data between Hadoop and object storage are vastly different. By design, object storage systems do not support edits. This plays a key role in its ability to achieve multi-petabyte scale.
Secondly, copying data from one location to another in object storage systems is expensive, because the operation incurs a server-side copy. Some object storage systems are not strictly consistent, which can confuse Hadoop because a file may not show up, or a deleted file may show up during a listing operation if it is eventually consistent.
NOTE: MinIO does not have the consistency drawbacks because it is strictly consistent.
Taking these factors in consideration, it is easy to tune your application to become object storage native. A massive effort was already taken to help speed-up this journey, and that was the introduction of S3 committers to Hadoop. As the name suggests, S3 committers promise consistent, reliable and high performance commitment of data to S3.
Committers change the read/write access pattern of data from S3. Firstly, they avoid server side copies, which is otherwise employed extensively by Hadoop applications to allow multiple Hadoop workers to atomically write data. Some committers even use local drives as cache, and only write final output to object storage in order to improve performance.
There are three committers, each with different trade-offs to address various use cases. They are:
- Directory Committer
- Partitioned Committer
- Magic Committer
In order to enable committers in your application, set the following configuration in your core-site.xml file:
This committer changes the access pattern to write data locally (cache drive) first, and once the final version of the data-to-be-written is collected, the write is performed. This style of writing is much better suited to distributed computing and object storage connected by a fast network, and greatly improves performance by preventing server side copies.
In order to choose this committer, set the following parameter fs.s3a.committer.name to directory.
This committer is similar to the directory committer, except in how it handles conflicts. The directory committer handles conflicts of different Hadoop workers writing to the same file by taking the entire directory structure into consideration. In the case of the partitioned committer, the conflicts are handled on a partition-by-partition basis.
This committer provides higher performance compared to the directory committer if the directory structure is deeply nested or very large in general. It is only recommended to be utilized for Apache Spark workloads.
The inner workings of this committer is less well-known, therefore the name Magic committer. It automatically chooses the best strategy to achieve the highest possible performance. It is only applicable for strictly consistent S3 stores.
Since MinIO is strictly consistent, the Magic committer can be safely used. It is recommended to try this committer with your workload to compare performance with the other committers.
A good rule of thumb for choosing committers is to start with the simplest and most predictable directory committer, and if your applications needs are not satisfied by it, try the other two committers as applicable.
Once the appropriate committer is chosen, your application can be put to the test for both performance and correctness.
Further notes on performance tuning can be found in our benchmarking guide here. More information on S3 committers can be found here.
Reach out to us at firstname.lastname@example.org if you have any questions, or visit min.io to download and get started right away!