The Challenge in Big Data is Small Files

Small files can cause big problems for storage platforms and the applications they power. A Google search for “small files performance” yields 2M+ results. This blog post will take a deeper look at the small files problem, digging into its origins and concluding with solutions.

Problem Statement

For the purposes of this discussion, a small file is generally considered to be any file less than 64 KB. When we’re working with customers to tune their clusters, we are seeing more and more (as in billions and trillions) files between 16 KB and 1 MB. Small files such as this are frequently the result of saving machine-generated event-based streams. Writing small files to object storage is straightforward, but queries over them will run much slower, or even fail to complete. Querying many small files incurs overhead to read metadata, conduct a non-contiguous disk seek, open the file, close the file and repeat. The overhead is only a few milliseconds per file, but when you’re querying thousands, millions or even billions of files, those milliseconds add up.

Analytics engines struggle to run queries over large volumes of small files. Many enterprises face this challenge when dealing with streaming sources such as IoT devices, servers, networking equipment and application logs - all of which can generate thousands of event logs per second, each log stored in a separate JSON, XML or CSV file. Querying just a day’s worth of logs can take hours.

Technologies built to solve yesterday’s big data problems cannot rise to the challenge of large numbers of small files. Hardware and applications, designed to work with small numbers of big files, are ill equipped to ingest, catalog, and query large numbers of small files. The key measure of a system's ability to flourish while storing large numbers of small files is IOPS, the amount of input and output (reads and writes) per second. An IOP comprises seek time, read time and data transmission time. For mechanical media, such as a hard drive, sequential reads and writes are significantly faster than random reads and writes. The efficiency of randomly reading and writing a single file is lower than reading and writing multiple files continuously.

Metadata management, data distribution across nodes and disks, I/O management, cache management, and network overhead can cause low performance and decreased storage efficiency. These are the areas to focus on when optimizing for large numbers of small files. Optimization requires a total understanding of system engineering, including the combination of and interaction between hardware and software. The problem caused by a large number of small files must be attacked at multiple levels, and bottlenecks must be corrected to achieve significant optimization.

Metadata management, in particular, can cripple a storage system’s ability to effectively store large numbers of small files. When operating on large contiguous files, the metadata operation overhead is offset by the much larger data operation overheads. When the number of small files increases dramatically, metadata operations begin to seriously detract from system performance.

Hadoop and Small Files

Hadoop, in particular, has been hard hit by the shift to small files. Hadoop is efficient for storing and processing a small number of large files, rather than a large number of small files. The default block size for HDFS is now 128MB (it was previously 64MB). Storing a 128MB file takes the same one 128MB block as storing a 16KB file. In addition, every file, directory and block in HDFS is tracked in metadata that occupies between 150 and 300 bytes per record of the NameNode’s memory. A hundred million small files will consume hundreds of GB of namenode memory, and wastes more than 10 TB by storing blocks that are mostly empty. Efficiency is further compromised as the amount of communication between nodes increases as greater numbers of files must be written, mapped and queried.

SAN / NAS and Small Files

SAN and NAS solutions also fall short when it comes to large numbers of small files. Both technologies were designed to provide high IOPS, but neither was designed for multitudes of concurrent reads from applications and writes from data sources. Both rely on RAID and replication to achieve durability and high availability, both of which add latency to writes and decrease storage efficiency. SAN, provides very low latency and high throughput, but only to servers directly connected to it. NAS, as a network mounted volume, faces the inefficiencies of block storage and the limitations of the file system when storing massive numbers of small files. But the major weakness of NAS is that it just can’t provide sufficient performance at scale, and performance degrades when faced with high numbers of concurrent requests.

Using Conventional Databases

A typical response to the small files problem is to write these tiny bits of data to a conventional relational database. Unfortunately, this will not solve performance woes either. It will for a time, but no database can provide durability and performance over a petabyte of small files. Yes, historically using a database to store and query small files was a sound idea - databases offer ACID transactions, indexes and can execute detailed queries over those records, but they can't do either quickly when faced with the enormous number of records required to solve the large numbers of small files problem that organizations face today.

Databases don’t do a very good job ingesting massive numbers of small files quickly, but that’s exactly what is needed in streaming data use cases. Small objects representing data records, log entries, or device telemetry come in from countless applications and devices at massive scale and speed. This data can’t be written to a database. No database can operate at the speed and scale required to support analytics in real time.

Architectures are moving away from conventional databases and file systems for storing and querying large numbers of small files. Databases are excellent tools for schema on write, partitioning/sharding, building indexes in advance to speed queries, but none of this works with large numbers of small files.

Data Lakehouses for Small Files

A data Lakehouse is one one-part data warehouse and one one-part data lake, and both parts use object storage under the hood for storage. This gives engineers options when deciding what to do with large numbers of small files. Files arriving as Parquet, AVRO, or ORC can be easily ingested into the data warehouse side of the data lakehouse. Other files can be sent to the data lake where they can be analyzed or transformed for ingestion into the data warehouse.

The data warehouse is not your ordinary data warehouse—it is based on an open table format that provides modern features such as time travel, schema evolution, partition evolution, zero-copy branching, external tables, and acid transactions. Of particular note with respect to small files is that an OTF-based data warehouse is schema-on-read, providing performance benefits when ingesting large numbers of small files.

This is a powerful emerging storage solution that leverages object storage for structured and unstructured data. Since a data lakehouse is built upon a distributed object store, it can be easily scaled out. Additionally, compute and storage are disaggregated in a data lakehouse, allowing further optimizations for the processing engines that process the SQL used to query the data warehouse.

MinIO as the Storage Layer for a Data Lakehouse

MinIO excels as the storage layer for a data lakehouse. MinIO has created a comprehensive blueprint for data infrastructure to support exascale AI and other large scale data lake workloads. It is called the MinIO DataPod. Why? Because exascale data is the reality that is common today in today's enterprise. MinIO stores metadata and objects in line, negating the need to query an external metadata database. MinIO can auto-extract .tar files after upload and download individual files from ZIP archives.

MinIO’s implementation of erasure coding is a key component of leading performance, storage efficiency and functionality for small objects. Fast erasure coding allows small objects to be captured at high scale and distributed with parity across multiple drives and nodes to immediately be protected for durability and high availability. For example, with maximum erasure code parity, you can lose half the drives in your MinIO cluster and still maintain durability.

The Small Files Solution

Many of today’s workloads - especially streaming and log analytics - place great demands on applications and storage systems by forcing them to work with massive numbers of small files. Big data rarely means analyzing a big file. More frequently, big data means millions or billions of files that are less than 1 MB. Databases and file systems can’t scale to provide the performance needed for real time analytics.

A data lakehouse built with MinIO is the answer to the problem of small files. Industry-leading performance speeds ingestion, querying, and retrieval, while erasure coding provides durability. Never lose data or cause a query to time out again.

If your systems are struggling (or even failing) to keep up with the demands of ingesting, storing, querying, and retrieving massive numbers of small files, then learn more about Data Lakehouses and download MinIO today.

Have a question? Please feel free to reach out to us on our Slack channel.