The Architect’s Guide to Data and File Formats

This blog post originally ran on The New Stack.

There is a revolution occurring in the data world. Driven by technological advancements, the current wave of open source data formats is changing the game for the entire ecosystem — from vendor to enterprise.

Data needs to be organized and analyzed to generate the timely insights that help companies make better decisions and improve operations. Unstructured data such as images, PDFs, audio and video presents its own challenges, while structured and semi-structured data such as CSV, XML and JSON is difficult to compress, optimize and store for long durations.
The ability to generate insights from these datasets depends on how the data is organized. Due to the growing complexity of the business data enterprises record, they might find that instead of collecting 20 fields for each data event, they are now capturing hundreds. While this data is easy to store in a data lake, querying it in a row-based format requires scanning a significant amount of data.

The ability to query and manipulate data with data processing, artificial intelligence/machine learning (AI/ML), business intelligence and reporting workloads becomes critical. These workloads will invariably need to fetch small data ranges from enormous datasets. Hence the need for data formats.

If you have worked with data engineers, you have probably heard of formats like Parquet, ORC, Avro, Arrow, Protobuf, Thrift and MessagePack. What are all these file and data formats? Where does each apply? How do you choose the right one for the job? These are some of the questions we will answer in this article.

Data Formats for Microservices

Before diving deep into the formats built for efficient storage and retrieval in a data lake, let’s look at the formats that matter less for the data lake itself. Data formats like Protobuf, Thrift and MessagePack are more relevant to communication between microservices.

Protocol Buffers

Google’s Protocol Buffers (also known as Protobuf) is a language- and platform-agnostic data serialization format used for efficiently encoding structured data for transmission over a network or storage in a file. It is designed to be extensible, allowing users to define their own data types and structures, and provides a compact binary representation that can be efficiently transmitted and stored. Protocol Buffers are used in various Google APIs and are supported in many programming languages, including C++, Python and Java. They are commonly found in network communication protocols, data storage and data integration, wherever data needs to be transmitted or stored in a compact format.
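
To make this concrete, here is a minimal Python sketch. The user.proto definition and the generated user_pb2 module are hypothetical examples, but SerializeToString and ParseFromString are the standard calls in the official Protobuf Python library.

# A minimal sketch. Assumes a hypothetical user.proto compiled with
# `protoc --python_out=. user.proto`:
#
#   syntax = "proto3";
#   message User {
#     string name  = 1;
#     int32  id    = 2;
#     string email = 3;
#   }

import user_pb2  # module generated by protoc (hypothetical name)

# Build a message and encode it to Protobuf's compact binary representation.
user = user_pb2.User(name="Ada", id=1, email="ada@example.com")
payload = user.SerializeToString()

# Decode the bytes back into a message on the receiving side.
decoded = user_pb2.User()
decoded.ParseFromString(payload)
print(decoded.name, len(payload))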

Implementation Overview

gRPC is a modern, open source, high-performance RPC (remote procedure call) framework that can run in any environment. It lets a client application invoke methods on a server application directly, much like a method call in object-oriented programming. gRPC uses HTTP/2 as its transport protocol and Protobuf to encode request and response messages. It is designed to be efficient, with support for bidirectional streaming and low-latency communication. gRPC is used in various Google APIs and is supported in many programming languages, including C++, Python and Java.

gRPC is used in a variety of contexts and applications where low latency, high performance and efficient communication are important. It is often used for connecting polyglot systems, such as connecting mobile devices, web clients and backend services. It is also used in microservices architectures to connect services and to build scalable, distributed systems. In addition, gRPC is used in IoT (Internet of Things) applications, real-time messaging and streaming, and for connecting cloud services.
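
As a rough sketch, a Python gRPC client might look like the following. The user_pb2 and user_pb2_grpc modules, the UserService service and its GetUser method are hypothetical names standing in for whatever protoc and its gRPC plugin generate from your .proto file; grpc.insecure_channel and the generated-stub pattern are standard grpc-python usage.

import grpc

# Hypothetical modules generated by protoc with the gRPC plugin from a
# .proto file that defines a UserService with a GetUser RPC.
import user_pb2
import user_pb2_grpc

# Open an HTTP/2 channel to the server and create a client stub.
channel = grpc.insecure_channel("localhost:50051")
stub = user_pb2_grpc.UserServiceStub(channel)

# Calling the remote method looks like a local method call; the request and
# response messages are encoded with Protobuf and carried over HTTP/2.
response = stub.GetUser(user_pb2.GetUserRequest(id=1))
print(response.name)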

Source: https://developers.google.com/protocol-buffers

Thrift

Another data format in the same family as Protobuf is Thrift. Thrift is an interface definition language (IDL) and communication protocol that allows for the development of scalable services that can be used across multiple programming languages. It is similar to other IDLs such as CORBA and Google’s Protocol Buffers, but is designed to be lightweight and easy to use. Thrift uses a code-generation approach: code is generated for the desired programming language based on the IDL definitions, allowing developers to easily build client and server applications that communicate with each other using Thrift’s binary protocol. Thrift is used in a variety of applications, including distributed systems, microservices and message-oriented middleware.

Languages Supported

Apache Thrift supports a wide range of programming languages, including functional languages like Erlang and Haskell. This allows you to define a service in one language, then use Thrift to generate the necessary code to implement the service in a different language, as well as client libraries that can be used to call the service from yet another language. This makes it possible to build interconnected systems that use a variety of programming languages.

Implementation Overview

The code generated by the IDL compiler creates client and server stubs that, under the hood, communicate with each other through Thrift’s native protocol and transport layers, thus enabling RPC between the processes.
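
For illustration, a Python client built from Thrift-generated stubs might look roughly like this. The user_service module, the UserService interface and its getUser method are hypothetical names you would get from your own .thrift definition; TSocket, TBufferedTransport and TBinaryProtocol come from the standard Apache Thrift Python library.

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Hypothetical module generated by `thrift --gen py user_service.thrift`.
from user_service import UserService

# Native transport and protocol layers: a buffered TCP socket carrying
# Thrift's binary wire protocol.
socket = TSocket.TSocket("localhost", 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# The generated client stub turns local method calls into RPCs.
client = UserService.Client(protocol)
transport.open()
print(client.getUser(1))
transport.close()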

Source: https://thrift.apache.org/

MessagePack

MessagePack is a data serialization format that provides a compact binary representation of structured data. It is designed to be more efficient and faster than other serialization formats, such as JSON, by using a binary representation of data instead of a text-based one. MessagePack can be used in a variety of applications, including distributed systems, microservices and data storage. It is supported in many programming languages, including C++, Python and Java, and is often used in situations where data needs to be transmitted over a network or stored in a compact format. In addition to its efficiency and speed, MessagePack is also designed to be extensible, allowing users to define their own data types and structures.
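
A quick Python sketch using the msgpack package shows the basic round trip; the record contents are made up for illustration:

import json
import msgpack

record = {"user_id": 42, "event": "login", "duration_ms": 137}

# Encode to MessagePack's compact binary form and decode it back.
packed = msgpack.packb(record)
unpacked = msgpack.unpackb(packed)

# The binary encoding is noticeably smaller than the equivalent JSON text.
print(len(packed), len(json.dumps(record).encode()))
assert unpacked == record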

Languages Supported

MessagePack has implementations in many programming languages, largely thanks to its simplicity. See the list of implementations on its portal.

Implementation Overview

We at MinIO chose MessagePack as our serialization format. It retains JSON’s extensibility by allowing keys to be added and removed. The initial implementation is a header followed by a MessagePack object with the following structure:

{
  "Versions": [
    {
      "Type": 0, // Type of version, object with data or delete marker.
      "V1Obj": { /* object data converted from previous versions */ },
      "V2Obj": {
          "VersionID": "",  // Version ID for delete marker
          "ModTime": "",    // Object delete marker modified time
          "PartNumbers": 0, // Part Numbers
          "PartETags": [],  // Part ETags
          "MetaSys": {}     // Custom metadata fields.
          // More metadata
      },
      "DelObj": {
          "VersionID": "", // Version ID for delete marker
          "ModTime": "",   // Object delete marker modified time
          "MetaSys": {}    // Delete marker metadata
      }
    }
  ]
}

Metadata from previous versions is converted to the new format, and new versions include a "V2Obj" or "DelObj" depending on the operation in flight when an update request is received. In cases where we only need the metadata, we can simply stop reading the file once we reach the end of the metadata, and updates can be achieved with at most two continuous reads.
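
The sketch below illustrates the general idea in Python with the msgpack package. It is not MinIO's actual xl.meta layout, just a toy example of writing two MessagePack objects back to back and decoding only the first one:

import io
import msgpack

# Illustration only -- not MinIO's actual xl.meta layout. Two MessagePack
# objects are written back to back: metadata first, then the bulk payload.
buf = io.BytesIO()
buf.write(msgpack.packb({"Versions": 3, "ModTime": "2023-01-01T00:00:00Z"}))
buf.write(msgpack.packb({"payload": list(range(100000))}))
buf.seek(0)

# A streaming unpacker lets us decode just the first object and stop,
# without ever touching the bytes that follow it.
unpacker = msgpack.Unpacker(buf)
metadata = next(unpacker)
print(metadata)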

The representation on disk has also changed to accommodate this. Previously, all metadata was stored as one big object that included every version. Now, instead, we write it like this:

  • Signature with version
  • Header data version (integer)
  • Metadata version (integer)
  • Version count (integer)

The overall structure of xl.meta

If you’d like to better understand how versioning with this MessagePack works in MinIO, read this excellent blog.

Data Formats for Streaming

Avro

Apache Avro is a data serialization system that provides a compact and fast binary representation of data structures. It is designed to be extensible, allowing users to define their own data types and structures, and it supports both dynamically and statically typed languages. Avro is common in the Hadoop ecosystem and is used in a variety of applications, including data storage, data exchange and data integration. Avro uses a schema-based approach, in which a schema is defined for the data and is used to serialize and deserialize it. This allows Avro to support rich data structures and evolution of the data over time. Avro also provides a container file format for persistent data, an RPC mechanism and integrations with multiple languages.

Avro relies on schemas. The schema is saved in the file when Avro data is written. An Avro schema is a JSON document that defines the structure of the data stored in an Avro file or transmitted in an Avro message. It defines the data types of the fields, as well as their names and order. It can also include information about the encoding and compression of the data, along with any associated metadata.

Avro schemas are used to ensure that data is stored and transmitted in a consistent and predictable format. When data is written to an Avro file or transmitted in an Avro message, the schema is included with the data, so that the recipient of the data knows how to interpret it.

The schema approach allows for serialization to be both small and fast, while supporting dynamic scripting languages.

Files may be processed later by any program. If the program reading the data expects a different schema, this can be quickly resolved since both schemas are present.
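
Here is a small Python sketch using the fastavro package (one of several Avro implementations). The User schema and the records are made up for illustration:

from fastavro import writer, reader, parse_schema

# Example schema: a JSON document describing field names and types.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# The schema is written into the file alongside the data...
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# ...so any reader can interpret the records without outside knowledge.
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)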

Languages Supported

Apache Avro is the leading serialization format for record data and the first choice for streaming data pipelines. It offers excellent schema evolution and has implementations for the JVM (Java, Kotlin, Scala, …), Python, C/C++/C#, PHP, Ruby, Rust, JavaScript and even Perl.

To see more about how Avro compares to other systems, check out this helpful comparison in the documentation.

Apache Kafka and the Confluent Platform offer first-class support for Avro (most notably through the Confluent Schema Registry), though they work with other data formats as well.

Big Data File Formats for Data Lakes

Parquet

Apache Parquet is a columnar storage format for big data processing. It is designed to be efficient for both storage and processing, and it is widely used in the Hadoop ecosystem as a data storage and exchange format. Despite the decline of Hadoop, the format is still relevant and widely used due in part to its continued support from key data processing systems including Apache Spark, Apache Flink and Apache Drill.

Parquet stores data in a columnar format, organizing data in a way that is optimized for column-wise operations such as filtering and aggregation. Parquet uses a combination of compression and encoding techniques to store data in a highly efficient manner, and it allows for the creation of data schemas that can be used to enforce data type constraints and enable fast data processing. Parquet is a popular choice for storing and querying large datasets because it enables fast querying and efficient data storage and processing.
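
A short Python sketch with the pyarrow package shows the column-pruning benefit; the table contents and file name are made up for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

# A small example table; in practice this would be millions of rows.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["DE", "US", "US", "IN"],
    "amount": [9.99, 4.50, 12.00, 3.25],
})

# Write with a columnar layout and Snappy compression.
pq.write_table(table, "events.parquet", compression="snappy")

# Reading back only the columns a query needs avoids scanning the rest.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts)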

Implementation Overview

Parquet’s file layout captures metadata efficiently, enables the file format to evolve and simplifies data storage. Its compression and encoding schemes reduce storage requirements and enable faster retrieval, and the format is supported by many frameworks. There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. Parquet leverages Thrift’s TCompactProtocol for serialization and deserialization of these metadata structures.
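
These metadata layers can be inspected programmatically. For example, with pyarrow, continuing with the hypothetical events.parquet file from the earlier sketch:

import pyarrow.parquet as pq

# File metadata: schema, row-group count, total rows.
pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)
print(pf.metadata.num_row_groups, pf.metadata.num_rows)

# Column-chunk metadata for the first row group: compression codec and the
# min/max statistics that enable predicate pushdown.
col = pf.metadata.row_group(0).column(0)
print(col.compression, col.statistics)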

Courtesy: https://parquet.apache.org

ORC

ORC, or Optimized Row Columnar, is a data storage format designed to improve the performance of data processing systems, such as Apache Hive and Apache Pig, by optimizing the storage and retrieval of large volumes of data. ORC stores data in a columnar format, which means data is organized and stored by column rather than by row. This allows for faster querying and analysis, as only the relevant columns need to be accessed instead of reading through entire rows. ORC also includes features such as compression, predicate pushdown, improved concurrency through separate RecordReaders for the same files and type inference, which further improve performance.

Another improvement in ORC compared with the RCFile format, specifically in Hadoop deployments, is that it dramatically reduces the load on NameNodes.

It should be noted that ORC is tuned to the Hadoop architecture and workloads. Given most modern data stacks are moving away from Hadoop, ORC has limited utility in cloud native environments.

Implementation Overview

The row data in an ORC file is organized into groups called stripes, each of which contains a series of rows along with metadata about those rows. Within a stripe, each row is divided into columns, and the data for each column is stored separately, which allows ORC to read only the columns needed for a particular query rather than entire rows. ORC also stores metadata about the data, such as data types and compression information, which helps improve read performance. ORC supports a variety of compression formats, including zlib, LZO and Snappy, which can reduce the size of the data on disk and improve read and write performance. It also supports several types of indexes, such as row indexes and Bloom filters, which can further improve read performance.
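
A brief Python sketch using pyarrow's ORC module (available in recent pyarrow releases) shows writing a file, inspecting its stripes and reading back a single column; the table and file name are illustrative:

import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Write an ORC file; rows are grouped into stripes internally.
orc.write_table(table, "events.orc")

# Inspect file-level metadata and read back only one column.
f = orc.ORCFile("events.orc")
print(f.nrows, f.nstripes, f.schema)
print(f.read(columns=["amount"]))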

Courtesy: https://orc.apache.org

The output of orcfiledump includes a summary of the file’s metadata, such as the number of rows, columns and stripes in the file. It also includes the data types of the columns and any index information that is stored in the file.

In addition to the metadata, orcfiledump also prints the actual data in the ORC file. This includes the values of each column in each row, as well as any NULL values that are present. Each row represents a record, with each column appearing as a named field, and together the rows form a tabular structure.

Arrow

Relatively new, Apache Arrow is an open source in-memory columnar data format designed to accelerate analytics and data processing tasks. It is a standardized format used to represent and manipulate data in a variety of systems, including data storage systems, data processing frameworks and machine learning libraries.

One of the key benefits of Apache Arrow is its ability to efficiently transfer data between different systems and processes. It allows data to be shared and exchanged between systems without the need for serialization or deserialization, which can be time-consuming and resource-intensive. This makes it well-suited for use in distributed and parallel computing environments, where large volumes of data need to be processed and analyzed quickly.
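
The Python sketch below uses pyarrow's IPC stream format to show the idea: the bytes produced are already in Arrow's columnar layout, so the receiver can map them back into a table without a separate deserialization step. The table contents are illustrative.

import pyarrow as pa

table = pa.table({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})

# Serialize the table into Arrow's IPC stream format in memory.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# On the receiving side the buffer is read back as Arrow data directly;
# no row-by-row decoding into language objects is required.
received = pa.ipc.open_stream(buf).read_all()
print(received.equals(table))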

In addition to its performance benefits, Apache Arrow also has a rich feature set that includes support for a wide range of data types, support for nested and hierarchical data structures, and integration with a variety of programming languages. The Arrow format is often used in big data and analytics applications, where it can be used to efficiently transfer data between systems and process data in-memory. It is also used in the development of distributed systems and real-time analytics pipelines.

Implementation Overview

The challenge most data teams face is that, without a standard columnar data format, they must deal with the internal data format of every database and language, serializing and deserializing at each boundary, which leads to inefficiency and performance problems. Another major challenge is having to reimplement standard algorithms for each data format, resulting in duplication and bloat.

Courtesy: https://arrow.apache.org/

Arrow’s in-memory columnar data format enhances data analytics significantly by reducing I/O to and from disk and enabling query acceleration. It uses a fixed-width binary representation for data, which allows for fast reading and writing of data in memory. In addition to the operational gains from in-memory computing, the framework also allows for adding custom connectors to simplify data aggregation across varied formats.
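
For example, pyarrow's compute module can filter and aggregate directly on the in-memory columnar data, without converting it to native Python objects; the small table below is illustrative:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"country": ["DE", "US", "US"], "amount": [9.99, 4.50, 12.00]})

# Filter rows and aggregate a column while staying in Arrow's columnar format.
us_rows = table.filter(pc.equal(table["country"], "US"))
total = pc.sum(us_rows["amount"])
print(us_rows.num_rows, total.as_py())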

Courtesy: https://arrow.apache.org/overview/

Dremio is a query acceleration tool built on the Arrow framework. It is designed to provide fast query performance on a variety of data sources, such as databases, data lakes and cloud storage systems. It uses the Arrow in-memory columnar data format to store and process data, which allows it to take advantage of the performance and efficiency benefits of Arrow. For further reading, refer to “Apache Arrow and how it fits into today’s data landscape.”

Horses for Courses — Picking the Best Format

Organizations have invested in microservices architectures for better software management, eschewing monolithic approaches. Adopting containerization and deploying on large Kubernetes clusters is a natural evolution. These microservices are built in heterogeneous languages, making the overall stack polyglot by design.

Adopting Protocol Buffers, Thrift or MessagePack is helpful in these scenarios. They simplify communication between microservices and enable faster event processing. They also make it easier to deploy applications independently and frequently, which improves supportability.

With the advent of streaming- and messaging-driven architectures, the widespread need for efficient data formats and compression has brought Avro to the forefront. Avro, as mentioned above, is remarkably flexible: it can be adopted across microservices, streaming applications and message-driven architectures, and it is heavily used in data lake and lake house architectures. Open table formats like Iceberg and Hudi leverage Avro, with schemas enabling snapshot isolation.

Parquet has the adoption lead for data lake and lake house architectures and remains a standard in the space. ORC’s ability to support large stripe sizes made it widespread in the Hadoop Distributed File System (HDFS) world, but as we know, that world is shrinking; still, it has utility for backup use cases. Arrow, the new kid on the block, excels at in-memory processing and pairs well with object storage in use cases where objects need to persist for shorter periods.

Conclusion

The modern data stack is one of choice, and there are more and more good options when it comes to data and file formats. MinIO supports all of them, leaving the choice up to you and your cloud architect. Feel free to try them out with MinIO by downloading it here, checking out our Slack channel or engaging directly with the engineering team through the commercial license.
