Why the Modern Datalake is Being Built Privately

The modern datalake (or perhaps "lakehouse," as you prefer) is remaking enterprise data architecture. These datalakes have grown to unprecedented sizes, scaling from petabytes to exabytes, while becoming more performant and multi-engine capable through the adoption of cloud-native principles and open standards. These datalakes don't fit in the public cloud for several reasons, the biggest being economics. This mismatch has led to a surge in datalakes being built in, and migrated to, private cloud environments and colocation data centers.

This blog post details why this shift to the private cloud is happening and the motivations driving this strategic change.

The Ascendance of the Cloud Operating Model

Let’s start with the concept of the cloud operating model. The cloud operating model is a set of principles spanning containerization, orchestration, and automation; it is not a place. It asks us to rethink how we structure and manage our IT environments to take full advantage of the technologies used to build clouds. Instead of thinking of the cloud as a location, the cloud operating model invites us to think of it as an incredibly valuable business tool.

The ability to apply the cloud operating model to business problems is a competitive advantage for enterprise IT teams. Approach challenges like a hyperscaler: architect cloud-native, portable software; orchestrate with Kubernetes; automate operations; and scale with technology, not people, all while staying agile and efficient.

The principles of the cloud operating model may have been learned in the public cloud, but the economics don’t support staying there. In their rush to implement a cloud strategy, many enterprises locked themselves into paying sky-high fees for mediocre software bundled together in the public clouds. The right approach is to pick and choose best-of-breed tools based on their attributes and your enterprise’s requirements.

Here is a breakdown of the characteristics cloud technologists are looking for: 

High Performance: Prioritize tools that emphasize speed and efficiency, adopting a software-first approach to enhance both performance and developer experience.

Decoupled Compute and Storage: Unlinking these components offers increased flexibility and scalability, enabling your chosen services and tools to excel in their respective areas of expertise.

Open Standards: Open standards not only encourage interoperability but also future-proof your investments. This encompasses not just open-source solutions but also open table formats, as we will explore below.

Compatibility with RESTful APIs: Interconnectivity is a must. Your tools should share a common language, with the S3 API serving as the lingua franca for cloud storage (see the sketch following this list).

Software-Driven/Infrastructure-as-Code: Automate everything and let Kubernetes orchestrate your infrastructure, abstracting away the complexity of manual management and enabling rapid, efficient scaling.

By embracing the cloud operating model, organizations optimize their application and data architectures for the cloud, allowing for scalability, flexibility, and efficiency. 
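
To make the "S3 as lingua franca" point above concrete, here is a minimal sketch of what interoperability looks like in practice: the same client code talks to any S3-compatible endpoint simply by changing the URL and credentials. The endpoint, bucket, keys, and credentials below are hypothetical placeholders, not references to real infrastructure.

```python
import boto3

# The same S3 client code works against AWS S3, MinIO, or any other
# S3-compatible endpoint; only the endpoint URL and credentials change.
# All values below are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.private.example.com",  # private cloud endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

# Write and read an object exactly as you would on the public cloud.
s3.put_object(Bucket="datalake", Key="raw/events/2024/01/events.json", Body=b"{}")
response = s3.get_object(Bucket="datalake", Key="raw/events/2024/01/events.json")
print(response["Body"].read())
```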

Truly Cloud-Native Portable Formats

One of the key drivers behind the success of the cloud operating model as a philosophy is the emergence of truly cloud-native, portable formats. Technologies like Apache Iceberg, Apache Hudi and Delta Lake are leading this charge. These open table formats (OTFs) free your data from vendor lock-in, making it possible to deploy your data and applications in multi-cloud and private cloud environments. Increasingly, this is exactly how modern enterprises are building their infrastructure.

The versatility and freedom these portable formats provide are invaluable for enterprises seeking to maximize their reporting, analytics and AI/ML application options. In essence, OTFs have the potential to serve as cost-effective alternatives to the expensive, strained and proprietary relational database management systems (RDBMS) that have outgrown their effectiveness in the modern era.
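
As a rough illustration of this portability, the sketch below uses PyIceberg to read an Iceberg table whose data files live on a private, S3-compatible object store. The catalog URI, endpoint, credentials and table name are assumptions made for the example, not real systems.

```python
from pyiceberg.catalog import load_catalog

# Connect to an Iceberg REST catalog; the metadata and data files can live on
# any S3-compatible object store, public or private.
# All endpoints, credentials, and names below are hypothetical.
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://iceberg-catalog.internal:8181",
        "s3.endpoint": "http://objectstore.internal:9000",
        "s3.access-key-id": "EXAMPLE_ACCESS_KEY",
        "s3.secret-access-key": "EXAMPLE_SECRET_KEY",
    },
)

# Load a table and scan it into an Arrow table; any engine that speaks Iceberg
# (Spark, Trino, DuckDB, ...) can read the same data without copying it.
table = catalog.load_table("analytics.page_views")
arrow_table = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
print(arrow_table.num_rows)
```

Because the table's schema, partitioning and snapshots live in the open format itself rather than in any one engine, the same data can be queried by whichever tool fits the workload.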

The Significance of Query Engines in Data Lakes

Query engines sit on top of the data in data lakes and allow users to execute queries against that data. More often than not, query engines are coupled with object lambdas that allow users to fully automate analytics pipelines. The emergence of query engines that can access data wherever it lives – on-prem, in the public clouds as part of a hybrid deployment, at the edge or on bare metal – has inarguably proved pivotal in keeping data accessible and analyzable.

These modern query engines have fully decoupled storage from compute in order to focus on query performance and the user experience. Databases large and small that have embraced this pattern include Snowflake, Teradata, Microsoft SQL Server, DuckDB and others. More and more applications join the cloud-native ranks as it becomes apparent that decoupled storage and compute is a critical part of the modern cloud operating model.
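
To illustrate decoupled storage and compute with one of the engines mentioned above, here is a minimal DuckDB sketch that queries Parquet files sitting in an S3-compatible bucket in place. The endpoint, credentials and object paths are assumptions for the example.

```python
import duckdb

# DuckDB is the compute; the object store is the storage.
# The two scale and live independently of one another.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Point the engine at an S3-compatible endpoint (all values are hypothetical).
con.execute("SET s3_endpoint = 'objectstore.internal:9000';")
con.execute("SET s3_use_ssl = false;")
con.execute("SET s3_url_style = 'path';")
con.execute("SET s3_access_key_id = 'EXAMPLE_ACCESS_KEY';")
con.execute("SET s3_secret_access_key = 'EXAMPLE_SECRET_KEY';")

# Query Parquet objects where they live; no data is loaded into a database first.
result = con.execute("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM read_parquet('s3://datalake/raw/events/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(result)
```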

The Future-Proof Move: Query Engines Going S3-Compatible

S3-compatibility is no longer a luxury but a necessity. As the demand for S3 compatibility grows, major datalake, data storage, query engine and ETL platform makers are driving the transition. This shift is a response to the evolving needs of businesses that require interoperability between the modular components of the cloud-native data stack. Everything must work together by design or be left behind in the pursuit of actionable data insights.
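
In practice, "going S3-compatible" often means little more than exposing the storage endpoint as a configuration knob. The sketch below shows the idea with PyArrow, where changing a single endpoint_override parameter moves the same dataset-reading code between public and private object storage. The endpoint, bucket path and credentials are placeholders, not real infrastructure.

```python
import pyarrow.dataset as ds
from pyarrow import fs

# The only thing that distinguishes a private S3-compatible store from AWS S3
# here is the endpoint_override; the rest of the code is identical.
# All values below are hypothetical placeholders.
private_store = fs.S3FileSystem(
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET_KEY",
    endpoint_override="objectstore.internal:9000",
)

# Read a partitioned Parquet dataset straight out of the bucket.
dataset = ds.dataset("datalake/curated/orders", filesystem=private_store, format="parquet")
print(dataset.head(5))
```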

Cost Savings and Freedom

Shifting operations from the public cloud to on-premises infrastructure can lead to substantial cost savings. The success of companies like X, which achieved a 60% reduction in monthly cloud costs through its cloud repatriation initiative, is an inspiring example of what's possible. However, there are benefits beyond simple cost effectiveness to building on private clouds.

Private clouds enable:

Regulatory Compliance: Often, an organization's internal policies drive the decision of where to build the modern data lake. Organizations handling financial, health care, national security, or other sensitive data simply cannot store it on the public clouds for legal and compliance reasons. To move forward with a cloud strategy, they must build privately.

Security Controls: Enterprises concerned about the security and privacy of sensitive data have increasingly found that moving data in-house allows them to implement stringent security measures and exert complete control over data access.

Integrations: The best-of-breed tools for your enterprise don’t come bundled together on the public clouds. They are found wherever innovation lives. Don’t get hobbled by the public cloud’s offerings.

Breaking Down Data Silos: An overarching cloud architecture backed by highly performant object storage such as MinIO unites data that was previously siloed inside individual applications. This model reaches data wherever it lives, allowing your teams to work with the latest tools and collaborate securely.

In the end, these benefits boil down to flexibility and control over your infrastructure and, by extension, your data. The key is to keep control of your data, because when you let go of it, you also give up the benefits it can bring to your business.

Conclusion

The pace of change in the data world can be exhausting, but that doesn’t change the fact that we have to continuously adapt and innovate. The architecture and design of datalakes are undergoing a profound transformation as more and more of them are deployed to private clouds. This change is driven by truly cloud-native portable formats, the rising significance of query engines over traditional RDBMS, the importance of S3 compatibility, and the rise of software-defined infrastructure, all of which combine to yield the freedom and flexibility of the private cloud. More importantly, the long-term result of the cloud operating model is that it challenges enterprises to build tools and services that are actually cloud-native, not just labeled as such. Embracing these changes leads to cost savings, greater flexibility, and a future-proof data management strategy.

Understanding and harnessing these shifts is your path to success in the ever-evolving data ecosystem.