Recent Launch of Amazon S3 Express One Zone Validates That Object Storage is Primary Storage for AI


We have made the case for several years that in modern data stacks, object storage is primary storage. This is even more true in the age of AI, where enterprises focus almost exclusively on object storage. The modern data stack relies on disaggregated compute and storage alongside cloud-native microservices running in containers on Kubernetes. As more enterprises shift to this architecture, object storage becomes primary storage, upping the stakes for performance and scalability.

Performance is king when it comes to primary storage, and this is why MinIO is frequently used as the on-premises primary storage for AI/ML and data lakes. MinIO is capable of tremendous performance: a recent benchmark achieved 325 GiB/s on GETs and 165 GiB/s on PUTs with just 32 nodes of off-the-shelf NVMe SSDs. MinIO more than delivers the performance needed to power demanding workloads like Apache Spark, Kubeflow, Ray Data and just about any other cloud-native AI framework you can think of.

Amazon recently announced Amazon S3 Express One Zone, a high-performance version of its venerable S3. S3 Express One Zone is optimized for high throughput and low latency. Able to process millions of requests per second, it was designed to accommodate the large-scale parallel operations required for machine learning training and real-time machine learning. Amazon claims that S3 Express One Zone provides 10x the data access speed of S3 Standard, with single-digit millisecond latency and lower request costs. S3 Express One Zone buckets are confined to a single Availability Zone. Storage pricing is on a consumption basis at $0.16/GB/month, 8x the cost of S3 Standard. Amazon's intention is for customers to "bring your frequently accessed data next to your high-performance computing resources."

Amazon S3 Express One Zone opens up the possibility of fast serverless computing within AWS. Stream processing gets a shot in the arm with lower latency and greater concurrency, and WarpStream is already taking advantage of this. Applications built on open table formats like Apache Hudi, Iceberg and Delta also benefit from faster object storage. AI, which requires massive amounts of data to be read, benefits tremendously from high-performance object storage.

Analysis

Let's unpack the details around the Amazon S3 Express One Zone announcement. 

Amazon S3 Express One Zone is a temporary object store that exposes data to local compute. It is not meant to replace a data lake. Amazon customers will continue to store data in S3 Standard. The only difference is that moving forward they'll replicate or tier it into S3 Express to work with it and then delete it from S3 Express when processing completes. The original data stored in S3 Standard remains intact. 
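The stage, process and delete pattern described above could look roughly like the sketch below. It is an illustration under assumptions, not Amazon's or MinIO's prescribed workflow: the function name stage_process_cleanup, the bucket names and the process callback are all hypothetical, and the s3 argument is assumed to be a boto3-style client (for example the result of boto3.client("s3")) exposing copy_object and delete_object.

```python
from typing import Callable, Iterable


def stage_process_cleanup(
    s3,
    source_bucket: str,
    express_bucket: str,
    keys: Iterable[str],
    process: Callable[[str, str], None],
) -> None:
    """Copy objects into the fast tier, process them, then delete the copies.

    `s3` is assumed to be any object with boto3-style `copy_object` and
    `delete_object` methods. All bucket names are hypothetical.
    """
    staged = []
    try:
        for key in keys:
            # Server-side copy; the original in source_bucket is left intact.
            s3.copy_object(
                Bucket=express_bucket,
                Key=key,
                CopySource={"Bucket": source_bucket, "Key": key},
            )
            staged.append(key)
        for key in staged:
            # The workload reads only from the fast tier.
            process(express_bucket, key)
    finally:
        # Remove only the temporary copies, even if processing failed.
        for key in staged:
            s3.delete_object(Bucket=express_bucket, Key=key)
```

Because the function only deletes what it staged, the data in the source bucket stays the system of record throughout.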

This is somewhat of a requirement, because S3 Express is not a viable option for long-term storage. At 8x the cost of S3 Standard, it provides between 3x and 10x better performance, and it is anywhere between 30% and 200% more expensive than EBS SSD. Pricing like this defeats one of the greatest drivers of early S3 growth: affordability. Enterprises must therefore carefully select which workloads justify the premium.
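To make the trade-off concrete, here is a back-of-the-envelope sketch of the storage arithmetic. It uses only the two figures quoted in this post ($0.16/GB/month and the 8x multiple), so the S3 Standard price is an implied assumption rather than a published rate. It also assumes a 30-day month and ignores request, transfer and compute charges. The function names and the workload sizes are hypothetical.

```python
EXPRESS_PER_GB_MONTH = 0.16  # price quoted in this post
COST_MULTIPLE = 8            # this post's "8x the cost of S3 Standard"
# Implied S3 Standard price: an assumption derived from the two figures above.
STANDARD_PER_GB_MONTH = EXPRESS_PER_GB_MONTH / COST_MULTIPLE


def storage_cost(gb: float, price_per_gb_month: float, days: float,
                 days_per_month: float = 30.0) -> float:
    """Storage-only cost in dollars for holding `gb` for `days`."""
    return gb * price_per_gb_month * days / days_per_month


def monthly_cost_with_staging(dataset_gb: float, hot_gb: float,
                              hot_days: float) -> float:
    """Whole dataset in Standard for a month, plus a subset staged in Express."""
    base = storage_cost(dataset_gb, STANDARD_PER_GB_MONTH, 30)
    staged = storage_cost(hot_gb, EXPRESS_PER_GB_MONTH, hot_days)
    return base + staged


# Hypothetical workload: a 100 TB data lake, with 10 TB staged for 3 days.
all_in_express = storage_cost(100_000, EXPRESS_PER_GB_MONTH, 30)
with_staging = monthly_cost_with_staging(100_000, 10_000, 3)
```

Under these assumptions, keeping everything in the fast tier costs about $16,000 for the month, while keeping the lake in Standard and staging only the working set costs about $2,160. That gap is why the fast tier is used as a temporary store rather than a replacement for the data lake.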

Yet, the introduction of this new storage class takes advantage of disaggregated modern data architecture and uses this modularity to provide enterprises with the ability to tune each workload for low latency and higher cost (S3 Express) or high latency and lower cost (S3 Standard). This modularity is enabled by object storage. There is no reason for an enterprise to ever store huge data sets on local filesystems or block storage – at Amazon or anywhere else.

This is a critical point: when it comes to modern workloads, the introduction of S3 Express has further exposed file and block storage as obsolete at AWS and everywhere else. Enterprises can now architect and build cloud-native systems that only work with data via the S3 API. A single programming interface simplifies architecture. No special code needs to be written to push AI training data around, because the data is now only migrating temporarily to a faster object storage tier.

Welcome to the Party

Nothing echoes our case that "object storage is primary storage for AI" more than the world's largest cloud provider bringing out a new service designed to meet the needs of data-intensive AI/ML applications. It's even built to work best with large numbers of small objects, which is a common workload profile for AI/ML. ML training at scale must rely on object storage because it runs in parallel across hundreds of compute nodes, often relying on expensive GPUs for computations.
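As a rough illustration of that access pattern, the sketch below reads many small objects concurrently from a single worker. It is a minimal example under assumptions, not a tuned data loader: fetch_object and fetch_many are hypothetical helper names, the s3 argument is assumed to be a boto3-style client whose get_object returns a dict with a readable "Body", and the worker count is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def fetch_object(s3, bucket: str, key: str) -> bytes:
    """Read one object fully into memory via a boto3-style get_object call."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


def fetch_many(s3, bucket: str, keys: Iterable[str],
               max_workers: int = 32) -> Dict[str, bytes]:
    """Fetch many small objects in parallel and return them keyed by name."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its payload.
        payloads = pool.map(lambda k: fetch_object(s3, bucket, k), keys)
        return dict(zip(keys, payloads))
```

With small objects, per-request latency rather than bandwidth tends to dominate, so issuing many requests at once is what keeps the compute nodes and GPUs fed.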

We can be close to certain that all major cloud providers will bring similar high-performance object storage options to market, priced similarly. This is a great upsell opportunity for them to add a more expensive storage option. It probably won’t stop the trend towards data repatriation, a cost-savings phenomenon that also enables greater AI/ML performance and control over data, but it is a calculated attempt to slow it. The real losers are the block and file folks (see NetApp’s recent quarter).

Summary

We are ultimately flattered by the introduction of S3 Express. It validates much of the work we have done over the past few years, on the performance front but also on the scalability, resiliency and security fronts. More importantly, we think it is an important signal to the market that file and block are increasingly obsolete technologies and that the modern data stack starts and ends with object storage.

The recent rise of object storage as primary storage has been driven by performance. Data-hungry AI/ML applications need low-latency, high-throughput and high-concurrency object storage. Amazon S3 Express One Zone looks to be a valuable service for those already invested in the AWS ecosystem.

If you want on-premises or colocated high-performance object storage for use as primary storage for AI/ML, then MinIO is your best choice.

You don't have to take my word for it: download MinIO and explore it yourself. If you have any questions, please join our community Slack channel.
