Iterable-Style Datasets using Amazon’s S3 Connector for PyTorch and MinIO
In November of 2023, Amazon announced the S3 Connector for PyTorch. The connector provides implementations of PyTorch's dataset primitives (Datasets and DataLoaders) that are purpose-built for S3 object storage. It supports map-style datasets for random data access patterns and iterable-style datasets for streaming, sequential data access patterns.
In a previous post, I introduced the S3 Connector for PyTorch and described in detail the problem it is intended to solve. I also described the legacy libraries that are being deprecated in favor of the S3 Connector. Specifically, do not use the Amazon S3 Plugin for PyTorch or the CPP-based S3 IO DataPipes. Finally, I covered map-style datasets. I will not recap all of that introductory information here, so if you have yet to read my previous post, check it out at your earliest convenience. In this post, I will focus on iterable-style datasets. The documentation for this connector only shows examples of loading data from Amazon S3; here, I will show you how to use it against MinIO.
The S3 Connector for PyTorch also includes a checkpointing interface for saving and loading checkpoints directly to an S3 bucket without first saving them to local storage. This is a really nice touch if you are not ready to adopt a formal MLOps tool and just need an easy way to save your models. I will cover this feature in a future post.
Building an Iterable-Style Dataset Manually
An iterable-style dataset is created by implementing a class that overrides the __iter__() method in PyTorch’s IterableDataset base class. Unlike a map-style dataset, there are no __len__() and __getitem__() methods. If you query an iterable-style dataset using Python’s len() function, you will get an error because the __len__() method does not exist.
An iterable-style dataset can return multiple samples each time it is used in a training loop. Specifically, __iter__() returns an iterator object that the data loader iterates over to create the needed batches. Let’s build a very simple custom iterable-style dataset to better understand how this works. The code below shows how to override the __iter__() method. The full code download can be found here.
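Here is a minimal sketch of such a class. It follows the same pattern as the iterable-style example in the PyTorch documentation and simply yields integers from a range; the class name and the range bounds are illustrative and are not taken from the code download.

```python
import math

from torch.utils.data import IterableDataset, get_worker_info


class MyIterableDataset(IterableDataset):
    def __init__(self, start: int, end: int):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process data loading: iterate over the full range.
            iter_start, iter_end = self.start, self.end
        else:
            # Multi-process data loading: give each worker its own shard.
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            iter_start = self.start + worker_info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))
```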
Notice that you have to keep track of sharding yourself. If you create a data loader with more than one worker for this dataset, it is up to you to determine each worker’s share of the data (its shard). This is done using the worker_info object and a little simple math.
We can create this dataset and loop through it using the code below, which is similar to a training loop.
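Continuing the sketch, the range, batch size, and worker count below are arbitrary values chosen for illustration.

```python
from torch.utils.data import DataLoader

if __name__ == "__main__":
    dataset = MyIterableDataset(start=0, end=8)
    loader = DataLoader(dataset, batch_size=4, num_workers=2)

    for batch in loader:
        # In a real training loop you would move the batch to a device
        # and run a forward and backward pass here.
        print(batch)
```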
The output will be:
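```
tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
```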
Now that we understand iterable datasets, let’s use the S3 connector’s iterable dataset. But before we do that, let’s look at how to get the S3 Connector to connect to MinIO.
Connecting the S3 Connector to MinIO
Connecting the S3 Connector to MinIO is as simple as setting up environment variables; once they are in place, everything will just work. The trick is setting the correct environment variables to the correct values.
The code download for this post uses a .env file to set up environment variables, as shown below. This file also shows the environment variables I used to connect to MinIO directly using the MinIO Python SDK. Notice that the AWS_ENDPOINT_URL needs the protocol, whereas the MinIO variable does not. Also, you may notice some odd behavior with the AWS_REGION variable. Technically, it is not needed when accessing MinIO, but internal checks within the S3 Connector may fail if you pick the wrong value for this variable. If you get one of these errors, read the message carefully and specify the value it requests.
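Below is a sketch of what such a .env file might look like. The endpoint, credentials, and the MINIO_* variable names are placeholders for illustration; substitute the values for your own MinIO deployment.

```
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_ENDPOINT_URL=http://localhost:9000
AWS_REGION=us-east-1

# Used by the MinIO Python SDK directly - note the endpoint has no protocol.
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
```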
Creating an Iterable-Style Dataset with the S3 Connector
To create an iterable-style dataset using the S3 Connector, you do not need to write a class as we did previously. The S3IterableDataset.from_prefix() function will do everything for you. This function assumes that you have set up the environment variables needed to connect to your S3 object store, as described in the previous section. It also requires that your objects can be found via an S3 prefix. A snippet showing how to use this function is below.
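The sketch below assumes a bucket named mnist with training objects under the train prefix, and it uses a hypothetical MNISTTransform callable (defined in the next snippet); adjust the names and batch size for your own data.

```python
import os

from s3torchconnector import S3IterableDataset
from torch.utils.data import DataLoader

# Every object found under this prefix is treated as a training sample.
uri = 's3://mnist/train'

train_dataset = S3IterableDataset.from_prefix(
    uri,
    region=os.environ['AWS_REGION'],
    enable_sharding=True,        # needed when the DataLoader uses num_workers > 1
    transform=MNISTTransform(),  # callable class shown below
)
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=2)
```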
Notice that the URI is an S3 path. Every object that can be recursively found under the path mnist/train is expected to be part of the training set. If you want to use more than one worker in your data loader (the num_workers parameter), be sure to set the enable_sharding parameter on your dataset. The function above also needs a transform to convert each object into a tensor and to determine its label. This is done via an instance of the callable class shown below.
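Here is a sketch of such a callable class. It assumes each object is a PNG image and that the label can be parsed from the object key (for example, mnist/train/3/0001.png); adjust the key parsing and the transforms to match how your data is actually laid out.

```python
from PIL import Image
from torchvision import transforms


class MNISTTransform:
    """Convert an S3 object into an (image tensor, label) pair."""

    def __init__(self):
        self.to_tensor = transforms.ToTensor()

    def __call__(self, obj):
        # obj is a file-like S3Reader; obj.key holds the object's key.
        label = int(obj.key.split('/')[-2])
        image = Image.open(obj).convert('L')
        return self.to_tensor(image), label
```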
That is all you need to do to create an iterable-style dataset using the S3 Connector for PyTorch.
Conclusion
The S3 Connector for PyTorch is easy to use, and engineers will write less data access code when using it. In this post, I showed how to configure it to connect to MinIO using environment variables. Once configured, a few lines of code created an iterable-style dataset object, whose objects were transformed using a simple callable class.
Next Steps
If your network is the weakest link in your training pipeline, consider creating objects that contain multiple samples, which you can even tar or zip. Unfortunately, the S3 Connector does not have a way to do this with either map-style or iterable-style datasets. In a future post, I will show how this can be done using a custom-built iterable-style dataset.
If you have any questions, be sure to reach out to us on Slack.