The Architect’s Guide to Using AI/ML with Object Storage
This post first appeared in The New Stack.
With the constant evolution of the enterprise, machine learning and artificial intelligence have become board-level initiatives.
Marketing claims aside, capabilities that seemed almost mythical a few years ago are now taken for granted as AI/ML becomes baked into every software stack and architecture. This is becoming known as AI-first architecture.
In the world of AI/ML, the focus is on finding models that accurately capture the intricate relationships in the data and on using those models to create business value through accurate predictions.
The reality of AI/ML is that before that lofty goal can be achieved, there needs to be a LOT of data munging. Although the hype around AI/ML centers on using the latest, coolest modeling techniques, it has been shown over and over again that the greatest gains in modeling a complex relationship come from proper data curation and from presenting the data to a model in a way that lets it learn the nuances during training.
In short, it’s mostly about the data, not the model.
There are several key requirements that arise when building an AI-first architecture, particularly as it relates to storage. In this installment of the Architect’s Guide, we will outline what needs consideration and why.
Scalability
The first consideration in designing an AI/ML architecture is scalability. While scalability has multiple facets, we are focused on the scalability of the data infrastructure. Some very interesting work is being done in constrained environments where little training data is available, yet the best results continue to come from use cases that involve very large-scale training.
Large-scale training is not hundreds of terabytes — it is often tens to hundreds of petabytes. This volume exceeds the capabilities of legacy SAN/NAS architectures, both from a management and a performance perspective.
Once you get past a few PBs, you are looking at object storage. Object storage is uniquely qualified for this scale of problem as it can scale infinitely, do so across the network and deliver linear performance as you grow.
Additionally, object storage is inherently comfortable with different types of data — semi-structured, unstructured, structured. As the AI/ML framework that accesses the data seeks to create features, more and different types of data matter and the ability to store, version and manage all of them in a single place takes on real importance.
Moreover, as these varying types of data grow into the many-petabyte range, it becomes very expensive to stand up and maintain separate storage solutions for each type of data. Consolidating persistence into object storage saves on infrastructure costs.
RESTful APIs/S3
Because of the aforementioned requirement around scalability, virtually every AI/ML platform supports object storage. Object storage provides a single repository for all types of training data and can scale almost infinitely. Having a single storage architecture simplifies the deployment and decreases operational cost.
The S3 API is the de facto standard for object storage, and as a result, it is the de facto standard in the AI/ML data architecture world. In fact, most of the modern AI/ML platforms were built for the S3 API and later extended, often by the community, to support legacy SAN/NAS solutions.
The reasoning is simple: RESTful APIs are the modern approach to designing distributed software systems, and for object persistence, the S3 API fits that definition precisely. Add to that the prevalence of AI/ML projects deployed on AWS and built against S3, and it becomes clear that the S3 API, and therefore object storage, is effectively a requirement for large-scale AI/ML projects.
Can you do small-scale work with POSIX (Portable Operating System Interface)? Yes, but that is more sandbox work. For real AI/ML at scale, S3 will be the API of your data infrastructure.
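To make the point concrete, here is a minimal sketch of reading a training object over the S3 API using Python and boto3. The endpoint, credentials, bucket and key are placeholders, and the same call shape works against any S3-compatible store.

```python
import boto3

# Minimal sketch: read a training object over the S3 API.
# The endpoint, credentials, bucket and key below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.example.com",  # any S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Stream a training file directly, no POSIX mount or local copy required.
obj = s3.get_object(Bucket="training-data", Key="features/train.parquet")
payload = obj["Body"].read()
print(f"read {len(payload)} bytes from object storage")
```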
Object Locking (Regulatory or Compliance Holds)
In regulated environments such as financial services, health care and government, object locking is table stakes. Having said that, not all object stores support object locking and few are optimized for operational deployment.
The core capability is to ensure that an object cannot be deleted or overwritten for a set period of time. There are different modes that need to be accommodated, but the general goal is to ensure no tampering can occur on the source. Versioning can be accommodated easily.
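As a rough sketch of what this looks like in practice, the S3 API exposes object locking both at bucket creation and per object. The example below uses boto3 with a hypothetical bucket and a one-year COMPLIANCE-mode retention; the actual mode and period should come from your regulatory requirements.

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")  # assumes the endpoint and credentials are configured elsewhere

# Object locking must be enabled when the bucket is created.
s3.create_bucket(Bucket="training-archive", ObjectLockEnabledForBucket=True)

# Default retention: objects are immutable for one year in COMPLIANCE mode,
# so they cannot be deleted or overwritten until the period expires.
s3.put_object_lock_configuration(
    Bucket="training-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)

# Retention can also be set explicitly on an individual object at write time.
with open("train-v1.parquet", "rb") as data:
    s3.put_object(
        Bucket="training-archive",
        Key="datasets/train-v1.parquet",
        Body=data,
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )
```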
This is especially important for AI/ML models and training files where the goal is a scientific experiment that will be operationalized. Ensuring the validity of training data is as important as validating the model itself.
Object Life Cycle Management
Models are not static in modern enterprises. As time passes and more and different data becomes available, models need to be updated accordingly. This cannot be a manual process; if it were, the models would effectively be static from the start.
Object storage can provide full life cycle management capabilities. This includes tiering data from hot to warm storage as models age, as well as managing policies for the update, transition and deletion of data. An example lifecycle policy is sketched below.
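Here is one way such a policy might look over the S3 API with boto3. The bucket, prefixes, retention windows and storage-class name are illustrative; on a non-AWS store, the transition target would be whatever tier the administrator has configured.

```python
import boto3

s3 = boto3.client("s3")  # assumes the endpoint and credentials are configured elsewhere

# Illustrative lifecycle policy: model artifacts move to a cheaper warm tier
# after 30 days, and scratch training output is deleted after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="models",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-aging-model-artifacts",
                "Filter": {"Prefix": "artifacts/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-scratch-output",
                "Filter": {"Prefix": "scratch/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```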
Related to this area is the nearly infinite scalability of object storage. In a world where you can have as much storage as you can imagine, all of it can exist within a single namespace. This opens up myriad possibilities from an object life cycle management perspective, all automated through RESTful APIs.
Having differing data types all within a single namespace significantly simplifies the data curation and validation process. At scale, this increases operational efficiencies and saves money.
Performance
Like scale, performance has multiple facets. Let’s look at READ and WRITE performance before turning to performance at scale.
Discovering the set of hyper-parameters that allows a given model to train most effectively is challenging. There is no way to determine the optimal hyper-parameters for a given model and a given set of training data a priori.
Hyper-parameter tuning is an art more than a science and often comes down to an intelligent or non-intelligent search of discrete points through the ranges of each parameter until a decent set is discovered (a “grid-search”).
Making it more complex, the rate at which a model converges during training, given a chosen set of hyper-parameters, is not linear. That means that as each set of hyper-parameters is evaluated for a given model on a given training set, the candidate must be trained all the way to convergence before the fitness of the resulting model, and the desirability of that hyper-parameter set, can be judged.
Simply put: It can be a LOT of repetitive trial-and-error training. With very large data sets, this is a lot of reading of the training files.
Much of this work is hidden from the data scientist or developer inside the current “Auto ML” libraries. Just because it is hidden doesn’t mean it isn’t happening. And as we increase the sizes of our training clusters to hundreds or thousands of compute nodes in order to parallelize the “Auto ML” process, we create a situation where a given training file is read hundreds or thousands of times.
If that training file is large, the amount of I/O grows at a rate roughly equal to the number of models being evaluated multiplied by the number of hyper-parameter combinations in the grid, which is itself the number of discrete points tested per hyper-parameter raised to the power of the number of hyper-parameters.
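A toy sketch makes the arithmetic visible. The grid, bucket and key below are hypothetical, and the training call is left as a placeholder; the point is simply that every combination in the grid triggers another read of the same training object.

```python
import itertools
import boto3

s3 = boto3.client("s3")  # assumes the endpoint and credentials are configured elsewhere

# A toy grid over two hyper-parameters: 3 x 4 = 12 combinations,
# so a naive search reads the same training file 12 times per model.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128, 256],
}

combinations = list(itertools.product(*grid.values()))
print(f"{len(combinations)} training runs, {len(combinations)} reads of the training set")

for learning_rate, batch_size in combinations:
    # Each trial pulls the same object again; with large files and hundreds of
    # parallel workers, READ throughput of the store becomes the bottleneck.
    body = s3.get_object(Bucket="training-data", Key="features/train.parquet")["Body"].read()
    # train_and_score(body, learning_rate=learning_rate, batch_size=batch_size)
    # is a placeholder for whatever framework performs the actual training.
```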
In short, the READ performance of the training file from the persistence store matters. You can optimize code all you want, but model training will still be dependent on READ performance. Caching helps, certainly. But ultimately, it’s a file I/O challenge.
How fast is fast? For context, MinIO running on 32 nodes with NVMe drives reads at 325 GiB/sec. That is the kind of throughput an AI/ML data infrastructure should target.
A More Complex AI/ML Use Case — Lambda Compute Eventing
Once a model has been developed that seems to work well, it typically needs to be validated before being put into production. In financial services organizations, this is usually done by a separate model validation team, which is not part of the data science development effort. They are intentionally separate and are tasked with validating the correctness of the math/models that the organization uses. In addition to validating the correctness of the model, the model validation team often is responsible for testing and understanding how the model will behave in various unanticipated adverse economic conditions that might not have been part of the model’s training.
As an example, if we are talking about financial models and the training data used is recent historical data, which is common, the model validation team might run the models against adverse data: historical data from the Great Depression, from periods of global conflict such as a war, or from periods of extreme market volatility, an inverted yield curve or negative real interest rates. They may also test the model with theoretical data to assess its stability. The model validation team has a role in assessing the behavior of the math/model and the overall risk to the organization. This is not a small effort.
To operationalize AI/ML with object storage, a really powerful feature is Lambda Compute Eventing (LCE). LCE facilitates automating this complex model validation workflow. Generally, separate buckets are created for each step in the life cycle of the modeling process, and LCE is used to notify interested parties of the arrival of a new object into each of the buckets. The event triggers the appropriate processing for that stage of the progression of the model, along with whatever business-level auditing is required for satisfying compliance requirements or internal checks.
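A sketch of how that wiring might look over the S3 notification API with boto3 is below. The bucket, prefix and target ARN are placeholders; the actual target (a webhook, queue or function) depends on how notification endpoints are configured on your object store.

```python
import boto3

s3 = boto3.client("s3")  # assumes the endpoint and credentials are configured elsewhere

# Illustrative eventing for the validation workflow: any candidate model written
# under the "awaiting-validation/" prefix publishes an event to a queue target
# that the model validation pipeline consumes.
s3.put_bucket_notification_configuration(
    Bucket="model-lifecycle",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "Id": "notify-model-validation",
                "QueueArn": "arn:minio:sqs::VALIDATION:webhook",  # placeholder target
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "awaiting-validation/"}
                        ]
                    }
                },
            }
        ]
    },
)
```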
Summary
Although recent technology hype would have us all believe that finding the next great, complex modeling approach is the Holy Grail of data science, in practical terms, it’s the collection and proper curation of the data, along with proper MLOps to guarantee safety and reproducibility of the modeling process, that really creates value for an organization. MinIO intrinsically provides the capabilities needed to facilitate the creation and use of large-scale AI/ML in the modern enterprise.