An Easier Path to Scalable AI: Intel Tiber Developer Cloud + MinIO Object Store

One of the biggest challenges facing organizations today for AI and data management is access to reliable infrastructure and compute resources. The Intel Tiber Developer Cloud is purpose-built for engineers who need an environment for proof-of-concepts, experimentation, model training, and service deployments. Unlike other clouds, which can be unapproachable and complex, the Intel Tiber Developer Cloud is simple and easy to use. The platform is especially valuable for AI/ML engineers developing models of all types. Using Intel’s cloud, AI/ML engineers can easily acquire compute and storage for running training and inference workloads, and deploying applications and services.

Intel selected MinIO as the object store for its cloud because it brings simplicity, scalability, performance, and native integration with the cloud and AI ecosystems. We are delighted to be the object store of choice and happy to produce this “how-to” post to accelerate your adoption of this platform. I’ll show how to train a model using an Intel Gaudi AI accelerator and how to set up and use MinIO (object storage) in the Intel Tiber Developer Cloud.

Let’s get started. The complete code demo for this post can be found here.

Creating an Account and Starting an Instance

Intel’s cloud documentation contains step-by-step guides for setting up your account and acquiring resources for your AI/ML experiments, optimizations, and deployments. The code presented in this post assumes you completed the following guides.

Get Started - This guide walks you through creating and signing into an account. It also shows you how to use the cloud’s Jupyter server. What is nice about the Jupyter Lab is that you do not need an SSH Key to start it up and use it.

SSH Keys - To create a compute instance, you will need to upload an SSH public key to your account. This guide shows you how to create the key according to the cloud’s specifications and upload it to your account. Be sure to save public and private keys in a safe place. You will need the private key to SSH into a compute instance. This guide also shows SSH commands for connecting to an instance.

Manage Instance - This guide shows how to select a compute node, start it, and shut it down when you are done with your experiments.

Object Storage - This guide walks you through creating a bucket within your account.

Once you have an account you will have access to the hardware, software, and service choices, as shown below. We will use a compute instance and object storage in this post.

Getting MinIO Ready for Programmatic Access

The screenshot below shows the dialog used to create a bucket. Notice that your bucket name is prefixed with a unique identifier. This is necessary because MinIO is a platform service for Intel’s cloud and it supports all accounts in the cloud’s region. Hence, the unique identifier prevents name collisions with other accounts.

Enter a name for your bucket and enable versioning if needed. Once you click the Create button, you should see your new bucket, as shown below. This post will use the MNIST dataset and access it programmatically to train a model. Consequently, the endpoint is needed for accessing our new bucket, an access key and a private key.

To see the endpoint address, click on the new bucket and then select the Details tab as shown in the below image. Copy the private endpoint displayed in this dialog, as it is needed when setting up a MinIO SDK configuration file.

Since you need to access your bucket programmatically, create a principal and associate an access key and secret key with it. To do this, click on the Principles tab. All existing principles will show.

Next, click the Manage Principles and permissions button. This opens a dialog where you can edit the principles shown above.

Click the Create principle button. You should now see the dialog for creating a principle (see below). Select the permissions you want for the new principle and click the Create button.

Once the principal is created, go to the Manage Principals and Permissions page and click on the newly created principle.

Once you click on the principle, you should see a dialog like the one below.

This is where you create the access key and the secret key to access the bucket using the MinIO SDK (or any other S3 complaint library). Click the Generate password button to create the keys, as shown below. Copy them to a configuration file immediately, as you cannot display them again.

The configuration file used by the code sample for this post is a .env file for setting up environment variables. As shown below, put your private endpoint, access key, secret key, and bucket name in this file.

MINIO_URL=s3-phx04-5.tenantiglb.us-region-2.cloud.intel.com:9000
MINIO_ACCESS_KEY={Put access key here.}
MINIO_SECRET_KEY={Put secret key here.}
MINIO_SECURE=false
BUCKET_NAME={Put full bucket name here.}

Now, once you created a compute instance and a bucket for your training data, you are ready to write some code. Let’s upload the MNIST dataset to our new bucket.

Uploading Data to MinIO

The torchvision package makes retrieving the images in the MNIST dataset easy. The function below uses this package to download a compressed set of files, extract the images, and send them to MinIO. This is shown in the code sample below. A few supporting functions were omitted for brevity. The complete code can be found in the data_utlities.py module within the code download for this post.

import PIL.Image
from dotenv import load_dotenv
from minio import Minio
from minio.error import S3Error
import numpy as np
import PIL
import torch
from torchvision import datasets, transforms

def load_mnist_to_minio(bucket_name: str) -> Tuple[int,int]:
''' Download and load the training and test samples.'''
logger = create_logger()

train = datasets.MNIST('./mnistdata/', download=True, train=True)
test = datasets.MNIST('./mnistdata/', download=True, train=False)

train_count = 0
for sample in train:
random_uuid = uuid.uuid4()
object_name = f'/train/{sample[1]}/{random_uuid}.jpeg'
put_image_to_minio(bucket_name, object_name, sample[0])
train_count += 1
if train_count % 100 == 0:
logger.info(f'{train_count} training objects added to {bucket_name}.')

test_count = 0
for sample in test:
random_uuid = uuid.uuid4()
object_name = f'/test/{sample[1]}/{random_uuid}.jpeg'
put_image_to_minio(bucket_name, object_name, sample[0])
test_count += 1
if test_count % 100 == 0:
logger.info(f'{test_count} testing objects added to {bucket_name}.')

return train_count, test_count

def put_image_to_minio(bucket_name: str, object_name: str,
image: PIL.Image.Image) -> None:
'''
Puts an image byte stream to MinIO.
'''
logger = create_logger()

url, access_key, secret_key, secure = get_minio_credentials()

try:
# Create client with access and secret key
client = Minio(url, # host.docker.internal
access_key,
secret_key,
secure=secure)

image_byte_array = image_to_byte_stream(image)
content_type = 'application/octet-stream'
response = client.put_object(bucket_name, object_name, image_byte_array,
-1, content_type, part_size = 1024*1024*5)

except S3Error as s3_err:
logger.error(f'S3 Error occurred: {s3_err}.')
raise s3_err
except Exception as err:
logger.error(f'Error occurred: {err}.')
raise err

Now that your dataset is loaded into your bucket let’s look at how to access it to train a model.

Using MinIO from a Data Loader

Within your model training pipeline, there are two places where you can load data from durable storage. If your data fits entirely into memory, you can load everything into memory at the beginning of your training pipeline before calling a training function. (We will create this training function in the next section.) This works if your dataset is small enough to fit entirely into memory. If you load data in this fashion, your training function will be compute-bound since it will not have to make any IO calls to get data from object storage.

However, if your dataset is too large to fit into memory, you will need to retrieve data every time you send a new batch of samples to your model for training. This results in an IO-bound training function.

Since we want to demonstrate the benefits of model training using Gaudi accelerators, we will create a training function that is compute-bound. The Pytorch code to create a custom Dataset and load it into a Dataloader is shown below. Notice that all data is loaded in the constructor of the ImageDatasetFull class. All the MNIST images are retrieved from MinIO and stored in the properties of the object created from this class. This all occurs when the object is created for the first time. If we wanted to load the images every time a batch of data is sent to our model for training then we would need to create this object with just a list of object names and move the loading of the actual images to the __getitem__() function.

class ImageDatasetFull(Dataset):
def __init__(self, bucket_name: str, X, y, transform=None):
self.bucket_name = bucket_name
self.y = y
self.transform = transform

raw_images = du.get_images_from_minio(bucket_name, X)
images = torch.stack([transform(x) for x in raw_images], dim=0)
self.X = images

def __len__(self):
return len(self.y)

def __getitem__(self, index):
return self.X[index], self.y[index]

def create_mnist_training_loader(bucket_name: str, loader_type:str, batch_size:int,

smoke_test_size: float=0) -> Tuple[Any]:
# Start of load time.
start_time = time.perf_counter()

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))])

# Get a list of objects and split them according to train and test.
X_train, y_train, _, _ = du.get_mnist_lists(bucket_name)

if smoke_test_size > 0:
train_size = int(smoke_test_size*len(X_train))
X_train = X_train[0:train_size]
y_train = y_train[0:train_size]

train_dataset = ImageDatasetFull(bucket_name, X_train, y_train, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,

num_workers=1)

return train_loader, (time.perf_counter()-start_time)

Now that we have data loaded into memory, let’s look at how we can use Gaudi2 and the Intel Developer Cloud to train a model.

Using Intel Gaudi accelerators from PyTorch

The PyTorch HPU package supports Intel’s multiarchitecture processing (HPU) utilities. In general, HPU allows developers to write PyTorch applications optimized for Intel's range of hardware, such as CPUs, Gaudi accelerators, GPUs, and any future accelerators Intel may build. The HPU abstraction uses a common interface provided by PyTorch. Consequently, developers can write code that dynamically switches between different hardware accelerators without extensive refactoring. This post will use it to detect Gaudi and move tensors to Gaudi’s memory.

A common coding pattern when using accelerators of any type is to first check for the existence of a GPU or AI accelerator, if one exists, then move your model, training set, validation set, and test set to the processor’s memory. This is usually done within a function that trains your model. The function below is from the code download for this post. (This function is the training function I referred to earlier.) The highlighted code shows how to check for the existence of Intel accelerators and move the model and training set to the device. For this check to work correctly with Gaudi you will need the following import. You will not use this module directly but it needs to be imported.

import habana_frameworks.torch.core as htcore

Notice that moving the tensors that hold the features and labels from your training set is done within the batch loop. The PyTorch data loaders do not have a “to” function for moving an entire dataset to a target device. There is a good reason for this: large datasets would quickly use up a processor’s memory. This is especially true for older GPUs with less memory than newer ones. It is best practice to move only the tensors needed for the current batch of training to the accelerator just before the model needs them for training.

def train_model(model: nn.Module, loader: DataLoader,

training_parameters: Dict[str, Any]) -> List[float]:
logger = du.create_logger()

device = torch.device('hpu' if torch.hpu.is_available() else 'cpu')
model.to(device)
logger.info(f'Model created on device {device}')

loss_func = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=training_parameters['lr'],

momentum=training_parameters['momentum'])

# Epoch loop.
compute_time_by_epoch = []
for epoch in range(training_parameters['epochs']):
total_loss = 0
batch_count = 0
total_epoch_compute_time = 0
# Batch loop.
for images, labels in loader:
# Start of compute time for the batch.
start = time.perf_counter()

# Move to the specified device.
# shape = [32, 1, 28, 28]
images, labels = images.to(device), labels.to(device)

# Flatten MNIST images into a 784 long vector.
# shape = [32, 784]
images = images.view(images.shape[0], -1)

# Training pass
optimizer.zero_grad()
output = model(images)
loss = loss_func(output, labels)
loss.backward()
optimizer.step()

# Loss calculations
total_loss += loss.item()
batch_count +=1

# Track compute time
total_epoch_compute_time += time.perf_counter() - start

compute_time_by_epoch.append(total_epoch_compute_time)
logger.info(f'Epoch {epoch+1} - Training loss: {total_loss/batch_count}.')

return compute_time_by_epoch

Now that we know how to move a model and tensors from our data loaders to Gaudi and have a way to get data from object storage, let’s put everything together and run a couple of experiments.

Putting it all together

The function below pulls everything together. It will create our data loaders and pass them to our train_model function. Notice that everything is instrumented to get performance metrics from our code. Once we run this function, we will see IO time vs. compute time. We will also be able to run this same code using only the CPU and then again using Gaudi.

def setup_local_training(training_parameters: Dict[str, Any], loader_type: str):
logger = du.create_logger()

device = torch.device('hpu' if torch.hpu.is_available() else 'cpu')
logger.info(f'PyTorch Version: {torch.__version__}')
logger.info(f'Using device: {device}')

#train_data, test_data, load_time_sec = ru.get_ray_dataset(training_parameters)
train_loader, load_time_sec = tu.create_mnist_training_loader(
training_parameters['bucket_name'], loader_type,
training_parameters['batch_size'])
logger.info(f'Data Loader Creation Time (in seconds) = {load_time_sec}')

# Train the model and log training metrics.
model = tu.MNISTModel(training_parameters['input_size'],
training_parameters['hidden_sizes'],
training_parameters['output_size'])

start_time = time.perf_counter()
compute_time_by_epoch = train_model(model, train_loader, training_parameters)
training_time_sec = time.perf_counter() - start_time

compute_time_sec = 0
for epoch_time in compute_time_by_epoch:
compute_time_sec += epoch_time

logger.info(f'Compute Time (in seconds) = {compute_time_sec}')
logger.info(f'I/O Time (in seconds) = {training_time_sec - compute_time_sec}')
logger.info(f'Total Training Time (in seconds) = {training_time_sec}')

test_loader, load_time_sec = tu.create_mnist_testing_loader(

training_parameters['bucket_name'], loader_type,

training_parameters['batch_size'])
tu.test_model_local(model, test_loader, training_parameters)

The snippet below will invoke this function and pass in the appropriate hyperparameters.

# Hyperparameters
training_parameters = {
'batch_size': 32,
'bucket_name': BUCKET_NAME,
'epochs': 3,
'hidden_sizes': [1024, 1024, 1024, 1024],
'input_size': 784,
'lr': 0.025,
'momentum': 0.5,
'output_size': 10,
'smoke_test_size': 0,
'use_gpu': False,
}
setup_local_training(training_parameters, 'full')

Running the setup function above on our compute instance with use_gpu = False results in the following output.

INFO | Data Set Size: 60000 samples.

INFO | Using device: cpu

INFO | Model moved to device: cpu

INFO | Epoch 1 - Training loss: 0.4534 - Compute time: 15.3998 IO time: 7.3033.

INFO | Epoch 2 - Training loss: 0.1475 - Compute time: 14.7520 IO time: 7.4030.

INFO | Epoch 3 - Training loss: 0.1046 - Compute time: 15.1024 IO time: 7.6316.

INFO | Compute Time (in seconds) = 45.2544

INFO | I/O Time (in seconds) = 22.3410

INFO | Total Training Time (in seconds) = 67.5955

INFO | Total Experiment Time (in seconds) = 67.6832

Running the same code using Gaudi results in output that shows our compute time is significantly reduced.

INFO | Data Set Size: 60000 samples.

INFO | Using device: hpu

INFO | Model moved to device: hpu

INFO | Epoch 1 - Training loss: 0.4521 - Compute time: 3.5001 IO time: 8.0007.

INFO | Epoch 2 - Training loss: 0.1481 - Compute time: 3.4549 IO time: 7.8584.

INFO | Epoch 3 - Training loss: 0.1035 - Compute time: 3.8153 IO time: 8.0103.

INFO | Compute Time (in seconds) = 10.7705

INFO | I/O Time (in seconds) = 29.7047

INFO | Total Training Time (in seconds) = 40.4752

INFO | Total Experiment Time (in seconds) = 40.5629

The results above are especially interesting because a fast accelerator can turn a compute-bound training workload (compute takes the longest) into an IO-bound training workload (data access takes the longest). Proof that fast accelerators must be used hand in hand with fast networks and fast storage.

Summary

In this post, I showed how to set up Intel Tiber Developer Cloud for machine learning experiments. This entailed creating an account, setting up a compute instance, creating a MinIO bucket, and setting up an SSH key. Once our resources were created, I showed how to write a few functions to upload and retrieve data. I also discussed data loading considerations for small datasets that can fit into memory and large datasets that cannot.

Using Intel’s Gaudi accelerator is straightforward, and developers will recognize the interface of the hpu package found in PyTorch. I showed the basic code to detect Gaudi and move tensors to it. I concluded this post by training an actual model using both a CPU and a Gaudi accelerator. These two experiments demonstrated Gaudi's performance gains and made a case for using fast storage and fast networks with fast accelerators.