Hungry GPUs Need Fast Object Storage

A chain is only as strong as its weakest link - and your AI/ML infrastructure is only as fast as your slowest component. If you train machine learning models with GPUs, then your weak link may be your storage solution. The result is what I call the “Starving GPU Problem.” The Starving GPU Problem occurs when your network or your storage solution cannot serve training data to your training logic fast enough to fully utilize your GPUs. The symptoms are fairly obvious. If you are monitoring your GPUs, then you will notice that they never get close to being fully utilized. If you have instrumented your training code, then you will notice that total training time is dominated by IO.

Unfortunately, there is bad news for those who are wrestling with this issue. Let’s look at some advances being made with GPUs to understand how this problem is only going to get worse in the coming years.

GPUs Are Getting Faster

GPUs are getting faster. Not only is raw performance getting better, but memory capacity and memory bandwidth are also increasing. Let’s take a look at these three characteristics of Nvidia’s most recent GPUs: the A100, the H100, and the H200.

| GPU  | Performance  | Memory | Memory Bandwidth |
|------|--------------|--------|------------------|
| A100 | 624 TFLOPS   | 40GB   | 1,555GB/s        |
| H100 | 1,979 TFLOPS | 80GB   | 3.35TB/s         |
| H200 | 1,979 TFLOPS | 141GB  | 4.8TB/s          |

(Note: the table above uses the statistics that align with a PCIe (Peripheral Component Interconnect Express) socket solution for the A100 and the SXM (Server PCI Express Module) socket solution for the H100 and the H200. SXM statistics do not exist for the A100. With respect to performance, the Floating Point 16 Tensor Core statistic is used for the comparison.)

A few observations on the statistics above are worth calling out. First, the H100 and the H200 have the same performance (1,979 TFLOPS), which is 3.17 times that of the A100. The H100 has twice as much memory as the A100, and its memory bandwidth increased by a similar amount - which makes sense; otherwise, the GPU would starve itself. The H200 can handle a whopping 141GB of memory, and its memory bandwidth also increased proportionally relative to the other GPUs.

Let’s look at each of these statistics in more detail and discuss what they mean for machine learning.

Performance - A teraflop (TFLOP) is one trillion (10^12) floating-point operations per second. That is a 1 with 12 zeros after it (1,000,000,000,000). It is hard to equate TFLOPS to IO demand in gigabytes because the floating-point operations that occur during model training involve simple tensor math as well as first derivatives against the loss function (a.k.a. gradients). However, relative comparisons are possible. Looking at the statistics above, we see that the H100 and the H200, which both perform at 1,979 TFLOPS, are roughly 3 times faster than the A100 - potentially consuming data 3 times faster if everything else can keep up.
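To make that relative math concrete, here is a minimal back-of-the-envelope sketch in Python; the 2 GB/s baseline is purely an assumed figure for illustration, not a measurement.

```python
# Back-of-the-envelope scaling: if compute grows ~3.17x and everything else
# keeps up, the storage layer may need to serve data ~3.17x faster.
a100_tflops = 624
h100_tflops = 1979

compute_ratio = h100_tflops / a100_tflops  # ~3.17

# Hypothetical baseline: assume your A100 pipeline consumed 2 GB/s of
# training data (an assumption for illustration only).
baseline_throughput_gbps = 2.0

required_throughput_gbps = baseline_throughput_gbps * compute_ratio
print(f"Compute ratio: {compute_ratio:.2f}x")
print(f"Storage throughput needed to keep pace: ~{required_throughput_gbps:.1f} GB/s")
```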

GPU Memory - Also known as video RAM (VRAM) or graphics RAM. GPU memory is separate from the system's main memory (RAM) and is specifically designed to handle the intensive processing tasks performed by the graphics card. GPU memory dictates batch size when training models. In the past, batch size was decreased when moving training logic from a CPU to a GPU. However, as GPU memory catches up with CPU memory in terms of capacity, the batch size used for GPU training will increase. When performance and memory capacity increase at the same time, the result is larger requests where each gigabyte of training data gets processed faster.
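As a rough sketch of how memory capacity translates into batch size, the snippet below estimates a feasible batch size from total GPU memory using PyTorch; the per-sample footprint, model overhead, and headroom factor are assumptions you would measure for your own model.

```python
import torch

def estimate_batch_size(per_sample_bytes: int, model_overhead_bytes: int,
                        headroom: float = 0.8) -> int:
    """Rough batch-size estimate from total GPU memory.

    per_sample_bytes and model_overhead_bytes are assumptions you would
    measure for your own model (activations, weights, optimizer state).
    """
    total = torch.cuda.get_device_properties(0).total_memory
    usable = int(total * headroom) - model_overhead_bytes
    return max(usable // per_sample_bytes, 1)

# Hypothetical numbers: ~50 MB of activations per sample, ~8 GB of model
# weights plus optimizer state. More GPU memory -> larger batches.
print(estimate_batch_size(per_sample_bytes=50 * 2**20,
                          model_overhead_bytes=8 * 2**30))
```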

Memory Bandwidth - Think of GPU memory bandwidth as the "highway" that connects the memory and computation cores. It determines how much data can be transferred per unit of time. Just like a wider highway allows more cars to pass in a given amount of time, a higher memory bandwidth allows more data to be moved between memory and the GPU. As you can see, the designers of these GPUs increased the memory bandwidth of each new version in proportion to memory; therefore, the internal data bus of the chip will not be the bottleneck.
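One way to see this highway in action is to time a large on-device copy; the sketch below uses PyTorch CUDA events to estimate effective memory bandwidth, assuming a CUDA-capable GPU is available. It is a rough estimate, not a formal benchmark.

```python
import torch

# Estimate effective GPU memory bandwidth by timing a large on-device copy.
# A copy reads and writes every byte, so bytes moved is roughly 2x the tensor size.
n_bytes = 2 * 2**30  # a 2 GiB tensor
x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
y = x.clone()          # device-to-device copy through GPU memory
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000   # elapsed_time() reports milliseconds
print(f"~{(2 * n_bytes) / seconds / 1e9:.0f} GB/s effective bandwidth")
```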

A Look into the Future

In August 2023, Nvidia announced its next-generation platform for accelerated computing and generative AI - The GH200 Grace Hopper Superchip Platform. The new platform uses the Grace Hopper Superchip, which can be connected with additional Superchips by NVIDIA NVLink, allowing them to work together during model training and inference.

While all the specifications on the Grace Hopper Superchip represent an improvement over previous chips, the most important innovation for AI/ML engineers is its unified memory. Grace Hopper gives the GPU full access to the CPU’s memory. This is important because, in the past, engineers wishing to use GPUs for training had to first pull data into system memory and then from there, move the data to the GPU memory. Grace Hopper eliminates the need to use the CPU’s memory as a bounce buffer to get data to the GPU.
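In PyTorch terms, the traditional two-hop path looks something like the sketch below - data staged in host memory, pinned, and then explicitly copied to the GPU. That staging and copy is exactly the step a unified CPU-GPU address space removes; the tensor shapes here are arbitrary assumptions.

```python
import torch

# Traditional path: data lands in host (CPU) memory first, then gets copied
# across to GPU memory. Pinning the host buffer lets the copy run
# asynchronously, but the extra hop is still there.
batch_cpu = torch.randn(64, 3, 224, 224)              # staged in system RAM
batch_cpu = batch_cpu.pin_memory()                    # page-locked bounce buffer
batch_gpu = batch_cpu.to("cuda", non_blocking=True)   # explicit host-to-device copy

# On a unified-memory design like Grace Hopper, the GPU can address the
# CPU's memory directly, so this explicit staging/copy step is what goes away.
```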

The simple comparison of a few key GPU statistics, as well as the capabilities of Grace Hopper, has got to be a little scary to anyone responsible for upgrading GPUs and making sure everything else can keep up. A storage solution will absolutely need to serve data at a faster rate to keep up with these GPU improvements. Let’s look at a common solution to the Starving GPU Problem.

A Common Solution

There is a common and obvious solution to this problem that does not require organizations to replace or upgrade their existing storage solution. You can keep your existing storage solution intact so that you can take advantage of all the enterprise features your organization requires. This storage solution is most likely a Data Lake that holds all of your organization’s unstructured data - therefore, it may be quite large, and the total cost of ownership is a consideration. It also has a lot of features enabled for redundancy, reliability and security, all of which impact performance.

What can be done, however, is to set up a storage solution that is in the same data center as your compute infrastructure - ideally, this would be in the same cluster as your compute. Make sure you have a high-speed network with the best storage devices available. From there, copy only the data needed for ML training. 

Amazon’s recently announced Amazon S3 Express One Zone exemplifies this approach. It is a bucket type optimized for high throughput and low latency and is confined to a single Availability Zone (no replication). Amazon’s intention is for customers to use it to hold a copy of data that requires high-speed access, which makes it well suited to model training. According to Amazon, it provides 10x the data access speed of S3 Standard at 8x the cost. Read more about our assessment of Amazon S3 Express One Zone here.

The MinIO Solution

The common solution I outlined above required AWS to customize its S3 storage solution by offering specialty buckets at an increased cost. Additionally, some organizations (that are not MinIO customers) are buying specialized storage solutions that do the simple things I described above. Unfortunately, this adds complexity to an existing infrastructure since a new product is needed to solve a relatively simple problem. 

The irony in all of this is that MinIO customers have always had this option. You can do exactly what I described above with a new installation of MinIO on a high-speed network with NVMe drives. MinIO is a software-defined storage solution - the same product runs on bare metal or the cluster of your choice using a variety of storage devices. If your corporate Data Lake uses MinIO on bare metal with HDDs and it is working fine for all of your non-ML data - then there is no reason to replace it. However, if the datasets used for ML require faster IO because you are using GPUs, then consider the approach I outlined in this post. Be sure to make a copy of your ML data for use in your high-speed instance of MinIO - a gold copy should always exist in a hardened installation of MinIO. This will allow you to turn off features like replication and encryption in your high-speed instance of MinIO, further increasing performance. Copying data is easy using MinIO’s mirroring feature.
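In practice you would reach for MinIO’s mirroring for this, but as a minimal sketch of the pattern, the snippet below copies only the objects under a training prefix from the hardened gold-copy instance to the high-speed instance using the MinIO Python SDK; the endpoints, credentials, bucket, and prefix are placeholders.

```python
import io
from minio import Minio

# Placeholder endpoints and credentials - substitute your own.
gold = Minio("datalake.example.com", access_key="...", secret_key="...")
fast = Minio("hot-tier.example.com", access_key="...", secret_key="...")

bucket, prefix = "training-data", "imagenet/"   # hypothetical bucket and prefix

if not fast.bucket_exists(bucket):
    fast.make_bucket(bucket)

# Copy only the objects needed for this training run.
for obj in gold.list_objects(bucket, prefix=prefix, recursive=True):
    resp = gold.get_object(bucket, obj.object_name)
    try:
        data = resp.read()
    finally:
        resp.close()
        resp.release_conn()
    fast.put_object(bucket, obj.object_name, io.BytesIO(data), length=len(data))
```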

MinIO is capable of the performance needed to feed your hungry GPUs - a recent benchmark achieved 325 GiB/s on GETs and 165 GiB/s on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.
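To put that throughput to work, a training loop can stream objects straight from the high-speed bucket. Below is a minimal sketch of a PyTorch dataset backed by the MinIO Python SDK; the endpoint, credentials, bucket, prefix, and the assumption that each object is a torch-serialized sample are all illustrative.

```python
import io
import torch
from minio import Minio
from torch.utils.data import Dataset, DataLoader

class MinioDataset(Dataset):
    """Minimal sketch: each object under `prefix` is one pre-serialized sample."""

    def __init__(self, endpoint: str, bucket: str, prefix: str):
        self.client = Minio(endpoint, access_key="...", secret_key="...")  # placeholders
        self.bucket = bucket
        self.keys = [o.object_name
                     for o in self.client.list_objects(bucket, prefix=prefix, recursive=True)]

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int) -> torch.Tensor:
        resp = self.client.get_object(self.bucket, self.keys[idx])
        try:
            buffer = io.BytesIO(resp.read())
        finally:
            resp.close()
            resp.release_conn()
        return torch.load(buffer)   # assumes samples were saved with torch.save

# Multiple workers keep object-store requests in flight so the GPU stays busy.
loader = DataLoader(MinioDataset("hot-tier.example.com", "training-data", "imagenet/"),
                    batch_size=64, num_workers=8)
```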

Download MinIO today and learn just how easy it is to build a data lakehouse. If you have any questions be sure to reach out to us on Slack!