Why Small Objects Are Such a Big Deal

Jonathan Symonds on Performance 22 December 2020

Over the last decade or so, object storage use cases have evolved considerably as they replace traditional file and block use cases. Specifically, the need to work with small data objects is becoming commonplace. Yes, there are still plenty of large objects, but small objects are becoming more prevalent than large ones for specific workloads and application environments.

Traditional object storage systems were designed for large objects that were infrequently accessed. With today’s smaller objects, object storage systems need to be much more dynamic and active. For legacy systems, that transition is proving difficult - even impossible. The reason lies in the architectural design. But before we get to that let’s define what a small object is and look at what’s driving the need for them.

A small object can generally be defined as one that is smaller than 1MB. Objects of this size pose two challenges. First, at scale (10s of PB), the aggregate number of small objects quickly reaches the billions, even trillions. This is well beyond the capability of traditional data storage systems. Second, the sheer number of objects creates an additional challenge - that of chattiness (LIST, PUT, GET, PUT, LIST, DELETE, etc.). Metadata management at this scale is what breaks most systems. Doing this at scale without losing data or compromising performance is the grand challenge.
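To make the "billions, even trillions" figure concrete, here is a rough back-of-the-envelope sketch. The 10 PB capacity and the sample object sizes are illustrative assumptions, as is the use of decimal units (1 MB = 10^6 bytes); the function name is hypothetical.

```python
# Rough object-count arithmetic. Decimal units are an assumption for simplicity.
PB = 10**15
MB = 10**6
KB = 10**3


def object_count(capacity_bytes: int, avg_object_bytes: int) -> int:
    """How many objects of a given average size fill the given capacity."""
    return capacity_bytes // avg_object_bytes


# Illustrative sizes only: every one of them is at or below the 1MB threshold.
for size, label in [(MB, "1 MB"), (64 * KB, "64 KB"), (4 * KB, "4 KB")]:
    n = object_count(10 * PB, size)
    print(f"10 PB of {label} objects = {n:,} objects")
```

Even at the 1MB upper bound, 10 PB works out to 10 billion objects, and at 4 KB the same capacity holds 2.5 trillion. Each of those objects carries its own metadata entry, which is where the pressure lands.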

Let’s turn our attention to the drivers of the small object revolution:

AI/ML/DL workloads: New AI-driven applications are trained with massive amounts of small data files. These are typically created once and read multiple times for training and validating DL models that classify data, recognize images, translate languages, etc. But that just gets you an initial inferencing model. New data comes in all the time. This means inferencing models will need periodic (re-)training using new data as well as select subsets of the original data. All this training data takes the form of small objects. By contrast, HDFS uses a 128MB block size, and the files themselves range from GB to TB. This is incompatible with the streaming data approaches present today and is why organizations have moved away from HDFS and onto high-performance object storage.
Archival/Backup: The traditional thinking around archival and backup is that objects are large and reads are infrequent. That thinking does not hold in the modern data world. Now, we see an expectation for 1MB chunks (this is the default from Veeam, Actifio, etc.). This size is a function of the de-dupe environment. Further, these workloads are replacing SAN storage, where small blocks were the very strength. The reason for the migration is that the scale of data has increased many-fold and object storage has proven to be a better fit.
Monitoring and Log Data: IT always generated massive log files of individual events, but recently, frameworks to process each individual event, in real time, have become commonplace. Logs were always sequences of event data that were batched up and at best processed offline once and then purged or archived. But then data analytics apps started to analyze this data and batching event data didn’t work as well. And as analysis became more important, organizations wanted it done more often. With the new frameworks, performing analysis on individual events could easily be done and in some cases, in real time. Due to the data structure employed here (JSON documents) they are inherently small objects.
IoT applications: As the IoT becomes more prevalent, sensors, imaging systems, strain gauges, etc., are all generating data to be processed. For some IoT applications such as self-driving cars, this has to be done in real time as close to the edge as possible, but for others, IoT data can be sent and be processed elsewhere. Even for processing done at the edge in real time, further validation and analysis may well require processing elsewhere. Text, numbers and pictures are small by definition and account for the bulk of IoT data.
Analytic Databases: Traditional databases were made up of tables and rows in monolithic data sets. But recently, new databases have emerged which hold and index JSON or other structured document data. This includes the likes of Elastic, Splunk and other databases whose data structures (table segments) are inherently small and numerous.

These are just some examples of the new small objects phenomenon. Additional ones include web, mobile and messaging applications. As a result, small objects are becoming the norm, at least for organizations using these new applications, workloads and systems.

With small objects proliferating, what does this mean for object storage systems?

Read:Write ratios are changing - Historically, object storage was almost considered WORM storage, or write once read many times. Yes, there are some types of small object data that fit as WORM data, such as AI-ML-DL training data sets. But that's certainly not the case for streaming data-in-motion processing, and probably not for IoT or new document database data either.
Access latencies must shrink - With WORM data, latencies could be measured in seconds or longer and no one cared. But with near-real-time streaming data-in-motion processing, object access must be measured in objects per second rather than the traditional measure of throughput (GB per second). IoT data processing and AI-ML-DL training can also be very sensitive to access latencies.
Storage efficiency matters – Large objects are typically broken into multiple chunks, which works well for storing them. But small objects may be much smaller than the chunk size used for large objects. Storage efficiency is the ratio of actual data stored to the raw capacity consumed. Traditional object storage implementations (and HDFS) depend on replication (3x) for small objects, whereas modern approaches like MinIO employ erasure coding and are far more efficient (12:4 being the recommended ratio).
Strict Consistency - To be strictly consistent, all objects must be committed to the storage media before acknowledging back to the applications. It is hard to achieve strict consistency and low latency at the same time but it is very much the requirement when you are replacing block storage systems for the use cases outlined above.
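The storage efficiency comparison above can be checked with a little arithmetic. This is a minimal sketch that assumes "12:4" means 12 data shards plus 4 parity shards per stripe; the function names are hypothetical and not part of any MinIO API.

```python
def replication_efficiency(copies: int) -> float:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return 1 / copies


def erasure_efficiency(data_shards: int, parity_shards: int) -> float:
    """Usable fraction of raw capacity for a data+parity erasure-coded stripe."""
    return data_shards / (data_shards + parity_shards)


rep = replication_efficiency(3)
ec = erasure_efficiency(12, 4)  # assumption: 12:4 = 12 data + 4 parity

# Raw capacity needed per usable TB is the inverse of the efficiency.
print(f"3x replication:      {rep:.1%} usable, {1 / rep:.2f} TB raw per usable TB")
print(f"12+4 erasure coding: {ec:.1%} usable, {1 / ec:.2f} TB raw per usable TB")
```

Under that assumption, 3x replication leaves a third of raw capacity usable, while a 12+4 stripe leaves three quarters usable and can still lose up to four shards without losing data.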

It’s our view that these new applications, workloads and systems that make use of small objects represent the leading edge of what should be happening to every organization in the future. At some point, all organizations are going to be deploying streaming data-in-motion processing, AI-ML-DL applications and new document database systems. Perhaps IoT is less generalizable than these other activities, but it too will become much more pervasive in time.

The challenge is that some object storage systems handle small objects better than others. The only way to truthfully tell is to do a PoC that loads up an object storage system with bunches of small objects and runs applications that use them and see how well it stores, performs and works. Only then will you have a good understanding of how well an object storage system handles the oncoming flow of small objects.
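As a starting point for such a PoC, a small harness along these lines can report objects per second for each operation. This is only a sketch: the function name, key prefix, object count and 4 KB payload size are all hypothetical choices, and the in-memory dict is a stand-in so the example runs anywhere. In a real test you would pass in callables that wrap your object store client instead.

```python
import os
import time
from typing import Callable, Dict


def run_small_object_test(
    put: Callable[[str, bytes], None],
    get: Callable[[str], bytes],
    delete: Callable[[str], None],
    count: int = 10_000,
    size: int = 4096,
) -> Dict[str, float]:
    """Time PUT, GET and DELETE over `count` objects; return objects/second for each."""
    payload = os.urandom(size)  # random bytes so the payload is not trivially compressible
    keys = [f"poc/obj-{i:08d}" for i in range(count)]
    operations = (
        ("PUT", lambda key: put(key, payload)),
        ("GET", get),
        ("DELETE", delete),
    )
    rates: Dict[str, float] = {}
    for name, op in operations:
        start = time.perf_counter()
        for key in keys:
            op(key)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
        rates[name] = count / elapsed
    return rates


# Stand-in store (assumption): a plain dict, so the numbers it produces mean nothing
# about real storage. Swap in functions that call your object store's client.
store: Dict[str, bytes] = {}
rates = run_small_object_test(
    store.__setitem__, store.__getitem__, store.__delitem__, count=1000
)
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} objects/second")
```

A single sequential loop like this understates what a real deployment sees. A meaningful PoC would also run many clients concurrently, include LIST operations, and use object sizes and counts drawn from the actual workload.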

MinIO is particularly well suited for small objects. This is a by-product of our design choices and not necessarily something that we set out to do. Because we are relentless in our pursuit of simplicity, we have fewer moving parts, most notably the absence of a database to manage metadata.

This is perhaps the biggest advantage MinIO possesses in the small objects realm.

Metadata databases are functionally incompatible with large numbers of small objects. You cannot list them at scale, you cannot delete them at scale. Small objects are corrosive to external metadata database architectures.

The world will continue to produce more small objects - this we know. As an architect, it may be possible to create a series of workarounds to address these issues, but ultimately, adopting an object store that will scale seamlessly with your small objects is a better path - and one that should be undertaken sooner rather than later.

We are going to follow up on this post with benchmark data and instructions on how to measure small object performance. This will include detail on the storage media as well as the bandwidth requirements.

In the interim, if you are keen to get started, MinIO can be downloaded here. We are ready to help and have a Slack channel that supports the more than 10K members of our ever-expanding community. If you want an SLA or 24/7 direct-to-engineer support, the answer can be found in the MinIO Subscription Network. Priced on capacity and billed monthly, it brings the tools, talent and technology that power MinIO to your deployments.

Feel free to shoot us a note at hello@min.io if you have any questions.
