Architecting a Modern Data Lake

Approximately 90% of all the data in the world is replicated data, with only 10% being genuine, new data. This has significant implications for an enterprise's data strategy — particularly when you consider the growth rates. For example, in 2020, the total amount of data generated and consumed was 64.2 zettabytes. In 2021, it was forecast that the overall amount of data created worldwide would reach 79 zettabytes, and by 2025 the number is expected to grow to 160 zettabytes.

As organizations build modern data lakes, here are some of the key factors we think they should be considering:

1. Separation of compute and storage

2. Disaggregation of monolithic frameworks into best-of-breed frameworks

3. Seamless performance across small and large files/objects

4. Software-defined, cloud-native solutions that scale horizontally

This paper discusses the rise and fall of Hadoop HDFS and why high-performance object storage is a natural successor in the big data world.

Adoption of Hadoop

With the expansion of internet applications, the first major data storage and aggregation challenges for advanced tech companies began roughly 15 years ago. Traditional RDBMSs (Relational Database Management Systems) could not scale to handle such large amounts of data. Then came Hadoop, a highly scalable model. In the Hadoop model, a large amount of data is divided across multiple inexpensive machines in a cluster and then processed in parallel. The number of these machines, or nodes, can be increased or decreased according to the enterprise's requirements.

Hadoop was open source and ran on cost-effective commodity hardware, which provided a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with big data. Because it was so expensive to scale in the RDBMS model, enterprises started to discard raw data. This led to suboptimal outcomes across a number of dimensions.

In this regard, Hadoop provided a significant advantage over the RDBMS approach. It was more scalable from a cost perspective, without sacrificing performance.

The End of Hadoop

The advent of newer technologies like change data capture and streaming data, primarily generated by social media companies like Twitter and Facebook, altered how data is ingested and stored. This triggered challenges in processing and consuming these even larger volumes of data.

A key challenge was with batch processing. Batch processes run in the background and do not interact with the user. Hadoop was efficient with batch processing when it came to very large files, but suffered with smaller files, both from an efficiency perspective and a latency perspective. This effectively rendered it obsolete as enterprises sought out processing and consumption frameworks that could ingest varied datasets, large and small, in batch, CDC, and real time.

Separating compute and storage simply makes sense today. Storage needs can outpace compute needs by as much as ten to one, which makes the Hadoop model, where every storage node requires a compute node, highly inefficient. Separating them means each can be tuned individually. The compute nodes are stateless and can be optimized with more CPU cores and memory. The storage nodes are stateful and can be I/O optimized with a greater number of denser drives and higher bandwidth.
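To make the separation concrete, here is a minimal sketch of a stateless compute process reading a Parquet object directly from an S3-compatible MinIO endpoint with PyArrow; the endpoint, credentials, bucket, and object path are illustrative placeholders.

```python
# Stateless compute: a client process that pulls data over the network
# from a disaggregated object store, rather than relying on data locality.
import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical MinIO endpoint and credentials -- adjust for your deployment.
minio = fs.S3FileSystem(
    endpoint_override="minio.example.internal:9000",
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    scheme="http",
)

# Read a Parquet object straight from the storage tier; the compute side
# holds no persistent state and can be scaled (or torn down) independently.
table = pq.read_table("datalake/events/2024/01/part-0000.parquet", filesystem=minio)
print(table.num_rows)
```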

By disaggregating, enterprises can achieve superior economics, better manageability, improved scalability, and enhanced total cost of ownership.

HDFS cannot make this transition. When you leave data locality behind, Hadoop HDFS's strength becomes its weakness. Hadoop was designed for MapReduce computing, where data and compute had to be co-located. As a result, Hadoop needs its own job scheduler, resource manager, storage, and compute. This is fundamentally incompatible with container-based architectures, where everything is elastic, lightweight, and multi-tenant.

In contrast, MinIO was born in the cloud and is designed for containers and orchestration via Kubernetes, making it the ideal technology to transition to when retiring legacy HDFS instances.

This has given rise to the modern data lake. The modern data lake takes advantage of using the commodity hardware approach inherited from Hadoop but disaggregates storage and compute — thereby changing how data is processed, analyzed, and consumed.

Building a Modern Data Lake with MinIO

MinIO is a high performance object storage system that was built from scratch to be scalable and cloud-native. The team that built MinIO also built one of the most successful file systems, GlusterFS, before evolving their thinking on storage. The deep understanding of file systems and which processes were expensive or inefficient informed the architecture of MinIO — delivering performance and simplicity in the process.

MinIO uses erasure coding, which provides a better set of algorithms for managing storage efficiently while providing resiliency. Erasure coding typically requires about 1.5x the raw capacity of the stored data, unlike the 3x replication used in Hadoop clusters. This alone provides storage efficiency and reduces cost compared to Hadoop.
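As a rough illustration of the capacity math, the sketch below compares the raw storage needed for 100 TiB of usable data under 3x replication versus a hypothetical 8-data/4-parity erasure coding layout (a 1.5x overhead); actual stripe sizes depend on how a cluster is configured.

```python
# Back-of-the-envelope comparison of raw capacity requirements.
usable_tib = 100

# HDFS-style triple replication: every block stored three times.
replication_factor = 3
raw_replication = usable_tib * replication_factor

# Erasure coding with k data shards and m parity shards:
# overhead = (k + m) / k. An 8+4 layout gives 12 / 8 = 1.5x.
k, m = 8, 4
ec_overhead = (k + m) / k
raw_erasure = usable_tib * ec_overhead

print(f"3x replication: {raw_replication:.0f} TiB raw for {usable_tib} TiB usable")
print(f"EC {k}+{m}       : {raw_erasure:.0f} TiB raw for {usable_tib} TiB usable")
# -> 300 TiB vs 150 TiB: the erasure-coded layout halves the raw capacity bill.
```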

MinIO was, from inception, designed for the cloud operating model. As a result, it runs on every cloud — public, private, on-prem, bare-metal, and edge. This makes it ideal for multi-cloud and hybrid-cloud deployments. In a hybrid configuration, MinIO enables the migration of data analytics and data science workloads in accordance with approaches like the Strangler Fig pattern popularized by Martin Fowler.

Here are several other reasons why MinIO is the foundational building block for enterprise data lake, data analytics, and data science platforms.

Modern Data Ready

Hadoop was purpose-built for machine data, where “unstructured data” means large (GiB- to TiB-sized) log files. When used as a general-purpose storage platform where truly unstructured data is in play, the prevalence of small objects (KB to MB) greatly impairs Hadoop HDFS, because the name nodes were never designed to scale in this fashion. MinIO excels at any file/object size (8 KiB to 5 TiB).
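For example, writing a large number of small objects is a routine operation against MinIO's S3 API. The sketch below uses the MinIO Python SDK against a hypothetical local deployment; the endpoint, credentials, and bucket name are placeholders.

```python
# Writing many small (KB-sized) objects -- the workload pattern that
# strains HDFS name nodes but is routine for an object store.
import io
from minio import Minio

# Hypothetical local MinIO deployment; substitute your own endpoint/credentials.
client = Minio("localhost:9000", access_key="ACCESS_KEY",
               secret_key="SECRET_KEY", secure=False)

bucket = "sensor-readings"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

for i in range(1000):
    payload = f'{{"sensor": {i}, "value": 0.42}}'.encode()
    client.put_object(
        bucket,
        f"readings/device-{i:04d}.json",
        io.BytesIO(payload),
        length=len(payload),
        content_type="application/json",
    )
```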

Open Source

The enterprises that adopted Hadoop did so out of a preference for open source technologies. The ability to inspect, the freedom from lock-in, and the comfort that comes from tens of thousands of users have real value. MinIO is also 100% open source, ensuring that organizations can stay true to their goals while upgrading their experience.

Simple

Simplicity is hard. It takes work, discipline, and above all, commitment. MinIO’s simplicity is legendary and is the result of a philosophical commitment to making our software easy to deploy, use, upgrade, and scale. Even Hadoop’s fans will tell you it is complex. To do more with less, you need to migrate to MinIO.

Performant

Hadoop rose to prominence on its ability to deliver big data performance. It was, for the better part of a decade, the benchmark for enterprise-grade analytics. Not anymore. MinIO has proven in multiple benchmarks that it is materially faster than Hadoop. This means better performance on Spark, Presto, Flink, and other modern analytic workloads.
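As an illustration, analytic engines typically reach MinIO through the standard Hadoop S3A connector. The PySpark sketch below points S3A at a hypothetical MinIO endpoint (it assumes the hadoop-aws dependency is on the classpath); the endpoint, credentials, and bucket are placeholders.

```python
# Pointing Spark's S3A connector at a MinIO endpoint so analytic jobs
# read and write objects instead of HDFS blocks.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("minio-analytics")
    # Hypothetical endpoint and credentials for a MinIO deployment.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read a Parquet dataset from an illustrative bucket and run a simple aggregation.
df = spark.read.parquet("s3a://datalake/events/")
df.groupBy("event_type").count().show()
```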

Lightweight

MinIO’s server binary is all of <100 MB. Despite its size, it is powerful enough to run the datacenter, yet still small enough to live comfortably at the edge. There is no such alternative in the Hadoop world. For enterprises, this means your S3 applications can access data anywhere, anytime, with the same API. By deploying MinIO at edge locations with replication enabled, you can capture and filter data at the edge and ship it to a central cluster for aggregation and further analytics.
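MinIO's built-in bucket replication handles data movement on its own; the sketch below is simply an illustrative application-level pattern for filtering at the edge before shipping selected objects to a central cluster with the MinIO Python SDK. All endpoints, credentials, bucket names, and the filter itself are hypothetical.

```python
# Illustrative edge-filtering pattern: read raw objects from an edge MinIO,
# keep only the ones that pass a filter, and push those to a central cluster.
import io
from minio import Minio

edge = Minio("edge-site.example:9000", access_key="EDGE_KEY",
             secret_key="EDGE_SECRET", secure=False)
central = Minio("central.example:9000", access_key="CENTRAL_KEY",
                secret_key="CENTRAL_SECRET", secure=False)

for obj in edge.list_objects("raw-telemetry", recursive=True):
    response = edge.get_object("raw-telemetry", obj.object_name)
    try:
        data = response.read()
    finally:
        response.close()
        response.release_conn()

    # Hypothetical filter: only forward non-empty readings to the central cluster.
    if data:
        central.put_object("aggregated-telemetry", obj.object_name,
                           io.BytesIO(data), length=len(data))
```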

Resilient

MinIO protects data with per-object, inline erasure coding, which is far more efficient than the erasure coding HDFS added after replication, an approach that never gained broad adoption. In addition, MinIO’s bitrot detection ensures that it will never read corrupted data — capturing and healing corrupted objects on the fly. MinIO also supports cross-region, active-active replication. Finally, MinIO supports a complete object locking framework offering both Legal Hold and Retention (with Governance and Compliance modes).
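Because these controls are exposed through the standard S3 API, a retention period and legal hold can be applied at write time from any S3 client. The boto3 sketch below targets a hypothetical MinIO endpoint and assumes the bucket was created with object locking enabled; names and credentials are placeholders.

```python
# Writing an object with a Governance-mode retention period and a legal hold
# via the standard S3 API. Assumes the target bucket was created with
# object locking enabled.
from datetime import datetime, timedelta, timezone
import boto3

# Hypothetical MinIO endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.put_object(
    Bucket="audit-records",
    Key="2024/q1/ledger.csv",
    Body=b"txn_id,amount\n1,100.00\n",
    ObjectLockMode="GOVERNANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    ObjectLockLegalHoldStatus="ON",
)
```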

Software Defined

Hadoop HDFS’s successor isn’t a hardware appliance; it is software running on commodity hardware. That is what MinIO is — software. Like Hadoop HDFS, MinIO is designed to take full advantage of commodity servers. With the ability to leverage NVMe drives and 100 GbE networking, MinIO can shrink the datacenter — improving operational efficiency and manageability.

Secure

MinIO supports multiple, sophisticated server-side encryption schemes to protect data — wherever it may be — in flight or at rest. MinIO’s approach assures confidentiality, integrity, and authenticity with negligible performance overhead. Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC, ensuring application compatibility. Furthermore, MinIO supports industry-leading key management systems (KMS).
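As an illustration, the boto3 sketch below writes and reads an object using SSE-C (a client-supplied key) through the S3 API; the endpoint, credentials, bucket, and key handling are placeholders rather than a production pattern.

```python
# Server-side encryption with a customer-provided key (SSE-C): the client
# supplies the key on every request, and the server encrypts/decrypts.
import os
import boto3

# Hypothetical MinIO endpoint and credentials; SSE-C generally requires TLS
# in production since the key travels with the request.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

customer_key = os.urandom(32)  # 256-bit key managed by the application

s3.put_object(
    Bucket="secure-data",
    Key="pii/customers.parquet",
    Body=b"...object bytes...",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)

# The same key must be presented to read the object back.
obj = s3.get_object(
    Bucket="secure-data",
    Key="pii/customers.parquet",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
```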

Migrating from Hadoop to MinIO

The MinIO team has deep expertise in migrating from HDFS to MinIO with the appropriate tools, and with an Enterprise license our engineers can guide you through the migration process. To learn more about using MinIO to replace HDFS, check out this collection of resources.

Conclusion

Every enterprise is a data enterprise at this point. The storage of that data and the subsequent analysis need to be seamless, scalable, secure, and performant. The analytical tools spawned by the Hadoop ecosystem, like Spark, are more effective and efficient when paired with object storage-based data lakes. Newer technologies like Flink improve overall performance by providing a single runtime for both streaming and batch processing, something that did not work well in the HDFS model. Frameworks like Apache Arrow are redefining how data is stored and processed, and Iceberg is redefining how table formats allow efficient querying of data.

These technologies all require a modern, object storage-based data lake where compute and storage are disaggregated and workload optimized.
