This summer both Databricks and Apache Iceberg rolled out enhancements to their open table formats. Databricks announced that Delta Lake 3.0 can read and write data across the most popular open table formats: Delta Lake, Apache Iceberg and Apache Hudi. Delta Universal Format (UniForm) makes these formats interoperable, so teams no longer need to create and store extra copies of data in each format. Data teams whose query engines, such as DuckDB or Dremio, already read Iceberg or Hudi files can read Delta tables directly without conversion.
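As a rough illustration of what enabling UniForm looks like, it is switched on per table through a table property. The sketch below only builds the Spark SQL statement as a string, so it runs without a Spark cluster. The helper name, table name and columns are hypothetical; the property `delta.universalFormat.enabledFormats` is the one documented for Delta Lake 3.0, and later releases may require additional properties.

```python
def uniform_create_table_sql(table, columns, formats=("iceberg",)):
    """Build a Spark SQL CREATE TABLE statement for a Delta table with UniForm enabled.

    `table` and `columns` are illustrative placeholders, not names from any real schema.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    enabled = ",".join(formats)
    return (
        f"CREATE TABLE {table} ({cols}) USING DELTA "
        f"TBLPROPERTIES ('delta.universalFormat.enabledFormats' = '{enabled}')"
    )


# Hypothetical table used only for illustration.
sql = uniform_create_table_sql("sales.orders", {"id": "BIGINT", "amount": "DOUBLE"})
print(sql)
```

In a real Spark session with Delta Lake installed, a statement like this would be passed to `spark.sql(...)`; an Iceberg-capable engine could then read the same underlying data files.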
Around the same time, Iceberg announced support for a slew of additional query engines and platforms, including Snowflake, AWS Athena, Apache Doris and StarRocks. With these announcements from Databricks and Iceberg, interoperability now goes hand in hand with data portability. Open table formats by design promote the idea that you should be able to access, control, share, and operate on your data with whatever tool you want, wherever you want, whether that is in the public cloud, in a private cloud, at the edge, or on bare metal.
Understanding Open Table Formats
Let’s put these announcements in context. Open table formats allow data lakes to attain performance and compliance standards that in the past could only be achieved by traditional data warehouses or databases, all the while preserving the flexibility of a data lake environment.
There are three major open table formats:
Iceberg was originally designed by Netflix specifically for handling substantial data volumes within data lakes. This open table format offers distinctive features such as time travel, dynamic schema evolution, and partition evolution. These capabilities allow multiple query engines to operate concurrently and safely on the same dataset.
Delta Lake is an open-source storage framework in the Lakehouse architecture that empowers data lakes on object storage like MinIO. It provides ACID transactions, scalable metadata handling, and unified batch and streaming processing on Apache Spark, offering reliability and scalability. Delta Lake addresses the performance and correctness challenges of complex Spark workloads, especially under heavy concurrency, where non-atomic updates and metadata operations would otherwise cause significant bottlenecks.
Hudi is rooted in the Hadoop ecosystem. Its primary purpose is to decrease latency during the ingestion of streaming data, and it offers features such as tables, transactions, upserts/deletes, advanced indexes, and compatibility with various storage implementations, including cloud-native object storage like MinIO.
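To make features like time travel and upserts/deletes more concrete, here is a minimal sketch of the idea they share: every commit produces a new immutable snapshot, and readers can pick which snapshot to query. This is a toy in-memory model under our own assumptions (the `ToyTable` class and its methods are hypothetical), not how Iceberg, Delta Lake or Hudi actually lay out metadata on object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table whose history is a list of immutable snapshots."""

    # Snapshot 0 is the empty table; each commit appends a new snapshot.
    snapshots: list = field(default_factory=lambda: [{}])

    def upsert(self, rows):
        """Insert new keys and overwrite existing ones; return the new snapshot id."""
        new_state = {**self.snapshots[-1], **rows}  # copy, never mutate history
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1

    def delete(self, keys):
        """Remove the given keys in a new snapshot; return the new snapshot id."""
        doomed = set(keys)
        new_state = {k: v for k, v in self.snapshots[-1].items() if k not in doomed}
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1

    def read(self, snapshot_id=-1):
        """Read the latest snapshot by default, or an older one for time travel."""
        return dict(self.snapshots[snapshot_id])


table = ToyTable()
v1 = table.upsert({1: "alice", 2: "bob"})
v2 = table.upsert({2: "robert", 3: "carol"})  # updates key 2, inserts key 3
v3 = table.delete([1])

print(table.read())    # current state
print(table.read(v1))  # time travel to the first commit
```

Because older snapshots are never modified, a reader pinned to `v1` keeps seeing a consistent view while later commits land, which is the intuition behind concurrent engines working safely on the same table.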
Much has been written about choosing between different formats, with some asserting up to 80% functional equivalence among the three primary open table formats. This blurring of distinctions makes sense given the environment of interoperability in which these open table formats were created and continue to thrive. The creators of these formats prioritized capability over traditional notions of vendor lock-in and operational control.
Open Table Formats as Part of the Modern Data Stack
Even before these recent announcements, open table formats had already become integral to modern data lake design. And reciprocally, data lakes have been integral to the modern data stack. A recent survey by Dremio found that 70% of respondents said that more than half of their analytics is or would be in a data lake within three years. This pervasive adoption signifies a paradigm shift in how organizations structure and manage their data, placing a strong emphasis on interoperability, flexibility and performance.
It’s no surprise really that cloud-native data lakes and their components and technologies like open table formats have taken center stage in the modern data stack. This stands in stark contrast to traditional, monolithic legacy hardware and software sold wholesale to organizations hoping to slap the phrase ‘cloud technology’ onto their aging systems. Becoming cloud-native is more than adding an API – the modern data stack is a modular and specialized ensemble of tools tailored for various data handling facets. It is built for adaptability, born in the cloud and held to high-performance standards. These features make the modern data stack a compelling choice for organizations. The stack's modularity provides a range of options, allowing organizations to craft a bespoke data infrastructure that aligns with their specific needs, fostering agility in the continually evolving data landscape.
Despite this continuously evolving range of options, there are defining characteristics that weave through the components of the stack:
- Cloud-Native: The modern data stack is designed to seamlessly scale across diverse cloud environments, ensuring compatibility with multiple clouds to prevent vendor lock-in.
- Optimized Performance: Engineered for efficiency, the stack incorporates components that take a software-first approach and design for performance.
- RESTful API Compatibility: The stack establishes a standardized communication framework between its components. This promotes interoperability and supports the creation of microservices.
- Disaggregated Storage and Compute: The stack enables independent scaling of computational resources and storage capacity. This approach optimizes cost efficiency and enhances overall performance by allowing each aspect to scale according to specific needs.
- Commitment to Openness: Beyond supporting open table formats, the modern data stack embraces openness in the form of open-source solutions. This commitment eliminates proprietary silos and mitigates vendor lock-in, fostering collaboration, innovation, and improved data accessibility. The dedication to openness reinforces the stack's adaptability across various platforms and tools, ensuring inclusivity.
Data Portability and Interoperability as a Business Standard
Truly embracing data portability and interoperability means being able to create and access data wherever it is. This approach facilitates flexibility, allowing organizations to harness the capabilities of diverse tools without being constrained by either vendor lock-in or data silos. The goal is to enable universal access to data, promoting a more agile and adaptable data ecosystem within organizations.
Understanding that the cloud as an operating model is built on principles of cloud-native technology rather than a specific location is critical to achieving data portability. Some organizations struggle in this endeavor and attempt to buy their way into the cloud at a tremendous cost. The reality is that while cloud adoption presents an opportunity for the average company to increase profitability by 20 to 30 percent, the real impact and true cost savings come from embracing the cloud operating model on private infrastructure.
Many established organizations are actively adopting this philosophy, choosing to repatriate workloads from the cloud and achieving substantial cost savings, with companies like X.com, 37Signals, and a major enterprise security firm saving an average of 60% from cloud exits. The cloud operating model allows for the coexistence of seemingly contradictory ideas: companies can benefit both from migrating to the cloud and from repatriating workloads. The key determinant is the adoption of the cloud operating model, which fundamentally transforms how organizations approach infrastructure, development and technical efficiency. This model optimizes for flexibility, efficiency and long-term success – whether in the public cloud or beyond – and dovetails precisely with the concept of the modern data stack, enabling data portability and interoperability with open table formats.
Recent strides in open table formats by Databricks, Apache Iceberg and Hudi signify a pivotal moment in data management. Delta Lake 3.0's universal compatibility and the expanded engine support for Apache Iceberg showcase a commitment, by both data infrastructure companies and on-the-ground implementers, to seamless data portability and interoperability.
These developments align with the inherent modularity of the modern data stack, where open table formats play a central role in achieving performance and compliance standards. This shift is not isolated but intersects with the cloud operating model. Beyond the allure of public clouds, real impact and cost savings emerge by embracing the cloud operating model on private infrastructure.
The confluence of open table formats, the modern data stack, and the cloud operating model signifies a transformative era in data management. This approach ensures adaptability across various environments, whether public or private, on-prem or at the edge. For those navigating data lake architecture complexities, our team at MinIO is ready to assist. Join us at firstname.lastname@example.org or on our Slack channel for collaborative discussions as you embark on your data journey.