The Challenge in Big Data is Small Files

The Challenge in Big Data is Small Files

Small files can cause big problems for storage platforms and the applications they power. A Google search for “small files performance” yields 2M+ results. This blog post will take a deeper look at the small files problem, digging into its origins and concluding with solutions.

For the purposes of this discussion, a small file is generally considered to be any file less than 64 KB. When we’re working with customers to tune their clusters, we are seeing more and more (as in billions and trillions) files between 16 KB and 1 MB. Small files such as this are frequently the result of saving machine-generated event-based streams. It’s straightforward to write small files to object storage, but queries over them will run much slower, or even fail to complete. Querying many small files incurs overhead to read metadata, conduct a non-contiguous disk seek, open the file, close the file and repeat. The overhead is only a few milliseconds per file, but when you’re querying thousands, millions or even billions of files, those milliseconds add up.

Analytics engines struggle to run queries over large volumes of small files. Many enterprises face this challenge when dealing with streaming sources such as IoT devices, servers, networking equipment and application logs - all of which can generate thousands of event logs per second, each log stored in a separate JSON, XML or CSV file. Querying just a day’s worth of logs can take hours.

Technologies built to solve yesterday’s big data problems cannot rise to the challenge presented by large numbers of small files. Hardware and applications, designed to work with small numbers of big files, are ill equipped to ingest, catalog and query large numbers of small files.  The key measure of a system's ability to flourish while storing large numbers of small files is IOPS, the amount of input and output (reads and writes) per second. An IOP is made up of seek time, read time and data transmission time. For mechanical media, such as a hard drive, sequential reads and writes are significantly faster than random read and write. The efficiency of randomly reading and writing a single file is lower than reading and writing multiple files continuously.  

Low performance and decreased storage efficiency can be caused by metadata management, data distribution across nodes and disks, I/O management, cache management and network overhead. When optimizing for large numbers of small files, these are the areas to focus on. Optimization requires a total understanding of system engineering, including the combination of and interaction between hardware and software. The large numbers of small files problem is one that must be attacked at multiple levels and bottlenecks to achieve significant optimization.  

Metadata management, in particular, can cripple a storage system’s ability to effectively store large numbers of small files. When operating on large contiguous files, the metadata operation overhead is offset by the much larger data operation overheads. When the number of small files increases dramatically, metadata operations begin to seriously detract from system performance.

Hadoop, in particular, has been hard hit by the shift to small files. Hadoop is efficient for storing and processing a small number of large files, rather than a large number of small files. The default block size for HDFS is now 128MB (it was previously 64MB). Storing a 128MB file takes the same one 128MB block as storing a 16KB file. In addition, every file, directory and block in HDFS is tracked in metadata that occupies between 150 and 300 bytes per record of the NameNode’s memory. A hundred million small files will consume hundreds of GB of namenode memory, and wastes more than 10 TB by storing blocks that are mostly empty. Efficiency is further compromised as the amount of communication between nodes increases as greater numbers of files must be written, mapped and queried.      

SAN and NAS solutions also fall short when it comes to large numbers of small files. Both technologies were designed to provide high IOPS, but neither was designed for multitudes of concurrent reads from applications and writes from data sources. Both rely on RAID and replication to achieve durability and high availability, both of which add latency to writes and decrease storage efficiency. SAN, provides very low latency and high throughput, but only to servers directly connected to it. NAS, as a network mounted volume, faces the inefficiencies of block storage and the limitations of the file system when storing massive numbers of small files.  But the major weakness of NAS is that it just can’t provide sufficient performance at scale, and performance degrades when faced with high numbers of concurrent requests.

A typical response to the small files problem is to write these tiny bits of data to a database instead of as separate files. Unfortunately, this will not solve performance woes either. It will for a time, but no database can provide durability and performance over a petabyte of small files. Yes, historically using a database to store and query small files was a sound idea - databases offer the ability to store records as ACID transactions, build indexes and execute detailed queries over those records, but they can't do either quickly when faced with the enormous number of records required to solve the large numbers of small files problem that faces organizations today.

Architectures are moving away from databases and file systems for storing and querying large numbers of small files. Databases are excellent tools for schema on write, partitioning/sharding, building indexes in advance to speed queries, but none of this works with large numbers of small files.

A powerful emerging paradigm is one in which systems are built around schema on read and self-describing small files using formats such as parquet, AVRO and ORC, combined with applications that can query them after they are written to object storage. Distributed object storage can store huge numbers of small files and share them between many applications concurrently. The trend towards schema on read sends writes directly to object storage. This allows databases to become stateless, elastic and cloud-native - eliminating the need for them to manage storage.

Databases shouldn’t have to be responsible for data storage. They don’t do a very good job  ingesting massive numbers of small files quickly, but that’s exactly what is needed in streaming data use cases. Small objects - records, log entries, files - are coming in at massive scale and speed from countless applications and devices. This can’t be written through a database, no database can operate at the speed and scale required to support analytics in real time.

Currently, the most successful strategy for optimizing storage of massive numbers of small files is to write to object storage, removing dependencies on databases and file systems.  

MinIO excels at storing, querying and retrieving small objects. In a recent performance benchmark, we measured PUT throughput of 1.32 Tbps and GET throughput of 2.6 Tbps. MinIO stores metadata and objects in line, negating the need to query an external metadata database. MinIO can auto-extract .tar files after upload and download individual files from ZIP archives.

MinIO’s implementation of erasure coding is a key component of leading performance, storage efficiency and functionality for small objects. Fast erasure coding allows small objects to be captured at high scale and distributed with parity across multiple drives and nodes to immediately be protected for durability and high availability. For example, with maximum erasure code parity, you can lose half the drives in your MinIO cluster and still maintain durability.  

In contrast, databases replicate for durability and high availability, leaving them at a severe disadvantage when operating at scale. Three copy multiphase commit cannot operate at the speed required at the tremendous scale that organizations are ingesting small files. Replication involves, at a minimum, creating and distributing three copies of data. This is resource and performance intensive, plus, if you lose two copies you no longer have durability guarantees. Having a hot tier or a cache for streaming data is a good start towards improving ingestion, but you can still lose data before it is written and replicated.

The Small Files Solution

Many of today’s workloads - especially streaming and log analytics - place great demands on applications and storage systems by forcing them to work with massive numbers of small files. Big data rarely means analyzing a big file. More frequently, big data means millions or billions of files that are less than 1 MB. Databases and file systems can’t scale to provide the performance needed for real time analytics.

MinIO is the answer to the small files problem. Industry-leading performance speeds ingestion, querying and retrieval while erasure coding provides durability. Never lose data or cause a query to time out again.  

There is an added advantage to solving the small files problem with object storage - you are creating converged storage that is available to all of your cloud native applications, not just to a specific database. You can run all of your analytics and AI/ML apps against object storage. Your data is now free to be accessed by any cloud-native database, not locked inside a specific format or product.


If your systems are struggling (or even failing) to keep up with the demands of having to ingest, store, query and retrieve massive numbers of small files, then download MinIO today.