How to Repatriate From AWS S3 to MinIO

Best Practices, 26 July 2023

Cloud repatriation is something we’re frequently asked about. The public cloud is a terrifically helpful place, filled with elastic turnkey services, and that makes it a massive accelerant for enterprise adoption of cloud-native technologies. Yet those productivity gains come at increased cost, especially for data-intensive applications, where egress fees can balloon. If you’ve got more than a petabyte in the public cloud, then the economics likely favor repatriating to an on-premise private cloud. We discussed the trend toward repatriation of cloud data in an earlier blog post, The Lifecycle of the Cloud.

Enterprises have repatriated many exabytes of data to on-premise MinIO, either on baremetal or Kubernetes. The majority run on Kubernetes, taking advantage of MinIO’s containerized architecture. They run multiple Kubernetes clusters, on-premise, at the edge – anywhere across the multi-cloud they need cloud-native object storage.

This blog post discusses the steps needed to migrate data from AWS S3 to on-premise MinIO.

Estimating Costs

The place to scale is the private cloud, using the same technologies used in the public cloud – S3-API compatible object storage, high-speed networking, Kubernetes, containers and microservices – but without the elevated monthly cost.

We conducted a TCO analysis in a previous blog post, The Lifecycle of the Cloud, which demonstrated that MinIO offers a 61% cost savings ($23.10/TB/month vs $58.92/TB/month) over public cloud for 100PB of object storage. The model compared AWS costs (Standard S3 with monthly data transfer costs, region-to-region replication, S3 API operations, KMS, lifecycle costs) to MinIO on private cloud (hardware, MinIO licensing, storage efficiency, rack space, power, cooling).

It is likely that the most expensive part of the process will be data egress fees from AWS. Transfer costs from AWS out onto the Internet range between five cents and nine cents per GB. Some quick math will help you budget. For example, pulling 500TB from S3 in us-west-1 to an on-premise MinIO in Redwood City, CA will cost you roughly $30k.

You can choose to download only the hot tier of your S3 data to MinIO, and send the cold tier to S3 Glacier instead of downloading it. This will certainly decrease the cost of egress fees, and, depending on the amount of data that stays in AWS, you would realize a considerable cost savings. For example, if we have 500TB of data in total, transferring 250TB to S3 Glacier and 250TB to MinIO would cut the data transfer fees roughly in half to $16k, while the monthly fee for S3 Glacier is around $300. Tiering to inexpensive cloud storage makes running MinIO in your own datacenter an even more attractive option for object storage.
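The two estimates above can be reproduced with a few lines of shell. This is a rough sketch, not a quote: the tier breakpoints and per-GB rates in it are assumptions that fall within the five-to-nine-cent range mentioned above, and the function name estimate_egress_usd is ours. Check the current AWS price list for your region before budgeting.

```shell
#!/usr/bin/env bash
# Rough egress estimate in USD for a given number of terabytes.
# ASSUMED tiers (verify against current AWS pricing):
#   first 10 TB at $0.09/GB, next 40 TB at $0.085/GB,
#   next 100 TB at $0.07/GB, everything beyond 150 TB at $0.05/GB.
estimate_egress_usd() {
  local terabytes="$1"
  awk -v tb="$terabytes" 'BEGIN {
    gb = tb * 1000                         # decimal units: 1 TB = 1000 GB
    split("10000 40000 100000", size, " ") # tier sizes in GB
    split("0.09 0.085 0.07", rate, " ")    # matching per-GB rates
    cost = 0
    for (i = 1; i <= 3 && gb > 0; i++) {
      chunk = (gb < size[i]) ? gb : size[i]
      cost += chunk * rate[i]
      gb -= chunk
    }
    cost += gb * 0.05                      # remainder beyond the first 150 TB
    printf "%.0f\n", cost
  }'
}

estimate_egress_usd 500   # all 500 TB repatriated
estimate_egress_usd 250   # hot tier only, cold tier stays in AWS
```

Under these assumed rates the script prints 28800 and 16300, which lines up with the roughly $30k and $16k figures above.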

Of course, you will be more familiar with your organization’s cloud costs and should develop your own analysis. Hopefully, we’ve given you a framework to use to characterize costs.

Deploying MinIO

MinIO is the highest performing object storage on the planet (349 GB/s GET and 177 GB/s PUT on 32 nodes of NVMe), capable of backing even the most demanding datalake, analytics and AI/ML workloads. Data is written to MinIO with strong consistency, with immutable objects. All objects are protected with inline erasure-code, bitrot hashing and encryption.

To determine the hardware needed to support your on-premise MinIO deployment, please consult our Reference Hardware page.

To deploy MinIO on baremetal Linux, please see Deploy MinIO: Multi-Node Multi-Drive — MinIO Object Storage for Linux.

To deploy MinIO on Kubernetes, please see MinIO Object Storage for Kubernetes.

Planning and Configuration Needed to Repatriate

A little up-front planning goes a long way in preventing service interruptions during data migration. In order to transfer configurations from S3 to MinIO, you will first need to understand how your organization has configured its S3.

First, make note of the buckets currently in S3 that you want on MinIO. Next, create these buckets in MinIO using either the MinIO Console or mc, the MinIO Client.
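Creating the buckets with mc can be scripted. The sketch below assumes an mc alias for your MinIO deployment already exists (this post creates one named minio1 further down) and that the bucket names are placeholders for your own. The helper name create_buckets is hypothetical.

```shell
#!/usr/bin/env bash
# Create every listed bucket on the target alias.
# --ignore-existing makes the script safe to rerun.
create_buckets() {
  local target_alias="$1"
  shift
  local bucket
  for bucket in "$@"; do
    mc mb --ignore-existing "${target_alias}/${bucket}"
  done
}
```

For example, running create_buckets minio1 mybucket logs images would create three buckets on the minio1 alias.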

Next, within the AWS console or client, you must understand and list all of the IAM-related and bucket-metadata-related information to recreate it in MinIO. Make a note of all IAM policies that you’ll bring from AWS to your own data center, including users, groups, service accounts, user mappings, group mappings, service account mappings and service group mappings. Then create them in MinIO using either the MinIO Console or mc.
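As one possible way to recreate a user and its policy with mc, the sketch below wraps three mc admin calls in a hypothetical helper named recreate_user. It assumes a recent mc release where the subcommands are policy create and policy attach (older releases used different names), and that you have already saved the policy JSON to a local file.

```shell
#!/usr/bin/env bash
# Recreate one policy and one user on MinIO, then bind them together.
# Arguments: alias, user name, secret, policy name, path to policy JSON.
# ASSUMPTION: "policy create" / "policy attach" match your mc release.
recreate_user() {
  local target="$1" user="$2" secret="$3" policy_name="$4" policy_file="$5"
  mc admin policy create "$target" "$policy_name" "$policy_file" &&
    mc admin user add "$target" "$user" "$secret" &&
    mc admin policy attach "$target" "$policy_name" --user "$user"
}
```

A call such as recreate_user minio1 analytics-app "$APP_SECRET" analytics-readwrite ./analytics-readwrite.json would then cover one application identity; the names in that call are placeholders.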

If you have configured S3 event notifications or S3 Lifecycle rules for tiering, then you must also make sure that you recreate them in MinIO.
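A tiering rule could be recreated along these lines. This is only a sketch: it assumes a recent mc release with the ilm rule add subcommand, and it assumes a remote tier has already been configured on the MinIO deployment. The helper name add_tiering_rule and the tier name used in the example call are hypothetical.

```shell
#!/usr/bin/env bash
# Add a lifecycle rule that transitions objects to a remote tier
# after a number of days.
# Arguments: alias/bucket path, days before transition, tier name.
# ASSUMPTION: the named tier already exists on the deployment.
add_tiering_rule() {
  local bucket_path="$1" days="$2" tier="$3"
  mc ilm rule add --transition-days "$days" --transition-tier "$tier" "$bucket_path"
}
```

Calling add_tiering_rule minio1/mybucket 30 COLDTIER would move objects older than 30 days to the tier named COLDTIER.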

Time to Repatriate

First, make sure you have IAM policies set up in S3 that allow access only to the buckets that you’re migrating. Save the credentials; we’ll use them to create aliases in a moment.
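A read-only policy scoped to a single bucket might look like the output of the sketch below. The exact list of actions your migration needs is an assumption on our part, so treat it as a starting point and adjust it after testing. The helper name s3_readonly_policy is hypothetical.

```shell
#!/usr/bin/env bash
# Print an AWS IAM policy document granting read-only access to one bucket.
# ASSUMPTION: the action list is a plausible minimum for copying objects;
# verify it against your own migration run.
s3_readonly_policy() {
  local bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:GetObjectTagging"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}
```

Redirect the output to a file, for example s3_readonly_policy mybucket > policy.json, and attach it to the migration user in the AWS console or with the AWS CLI.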

You can use Batch Replication or mc mirror to copy the most recent version of objects from S3 to MinIO.

Both options can be simplified by creating aliases. First, download mc:

curl https://dl.min.io/client/mc/release/linux-amd64/mc \
--create-dirs \
-o $HOME/minio-binaries/mc

chmod +x $HOME/minio-binaries/mc
export PATH=$PATH:$HOME/minio-binaries/

Next, create an alias for the MinIO cluster you deployed on-premise, replacing the variables with the values you set during deployment.

mc alias set minio1 $MINIO_ENDPOINT $MINIO_ACCESS_KEY $MINIO_SECRET_KEY

Next, create an alias for AWS S3, where your data is, replacing the variables with your S3 endpoint and the credentials you saved earlier.

mc alias set aws-s3 $S3_ENDPOINT $S3_ACCESS_KEY $S3_SECRET_KEY
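Before copying anything, it can help to confirm that both aliases actually reach the buckets you care about. The sketch below lists each path quietly and reports the first one that fails. It checks alias/bucket paths rather than bare aliases because a policy restricted to specific buckets may not permit listing all buckets. The helper name check_access is hypothetical.

```shell
#!/usr/bin/env bash
# Verify that each alias/bucket path can be listed with the stored credentials.
# Stops at the first path that cannot be reached.
check_access() {
  local path
  for path in "$@"; do
    if mc ls "$path" >/dev/null 2>&1; then
      echo "ok: $path"
    else
      echo "FAILED: $path"
      return 1
    fi
  done
}
```

A typical call would be check_access aws-s3/mybucket minio1/mybucket.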

Using S3 to MinIO Batch Replication, introduced in release RELEASE.2023-05-04T21-44-30Z, is efficient and speedy because it is a simple one-way copy of the newest version of an object and its metadata. The only caveat is that the object version ID and Modification Time cannot be preserved at the target. This is a great way to get data out of an S3-compatible source – including AWS S3 – and into MinIO. You could also configure S3-MinIO Batch Replication to make repeated point-in-time copies of objects stored in S3.

Batch Replication has several advantages over mc mirror (described below):

Removes the client-to-cluster network as a potential throughput bottleneck
A user only needs permissions to start the batch job, not permissions to the data itself
The job is automatically retried in the event of failure
Batch jobs provide granular control over replication

The user account configuring and running Batch Replication must include the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "admin:CancelBatchJob",
                "admin:DescribeBatchJob",
                "admin:ListBatchJobs",
                "admin:StartBatchJob"
            ]
        }
    ]
}

The first step is to create and customize a YAML description file:

mc batch generate minio1/ replicate > replication.yaml

Then edit replication.yaml to configure the replication job with credentials, endpoints, origin bucket, filters/flags and a destination bucket. Sources are all on S3, targets are all on MinIO.

replicate:
  apiVersion: v1
  # source of the objects to be replicated
  source:
    type: TYPE # valid values are "s3"
    bucket: BUCKET
    prefix: PREFIX
    # NOTE: if source is remote then target must be "local"
    # endpoint: ENDPOINT
    # credentials:
    #   accessKey: ACCESS-KEY
    #   secretKey: SECRET-KEY
    #   sessionToken: SESSION-TOKEN # Available when rotating credentials are used
  # target where the objects must be replicated
  target:
    type: TYPE # valid values are "s3"
    bucket: BUCKET
    prefix: PREFIX
    # NOTE: if target is remote then source must be "local"
    # endpoint: ENDPOINT
    # credentials:
    #   accessKey: ACCESS-KEY
    #   secretKey: SECRET-KEY
    #   sessionToken: SESSION-TOKEN # Available when rotating credentials are used
  # optional flags based filtering criteria
  # for all source objects
  flags:
    filter:
      newerThan: "7d" # match objects newer than this value (e.g. 7d10h31s)
      olderThan: "7d" # match objects older than this value (e.g. 7d10h31s)
      createdAfter: "date" # match objects created after "date"
      createdBefore: "date" # match objects created before "date"
      ## NOTE: tags are not supported when "source" is remote.
      # tags:
      #   - key: "name"
      #     value: "pick*" # match objects with tag 'name', with all values starting with 'pick'
      ## NOTE: metadata filter not supported when "source" is non MinIO.
      # metadata:
      #   - key: "content-type"
      #     value: "image/*" # match objects with 'content-type', with all values starting with 'image/'
    notify:
      endpoint: "https://notify.endpoint" # notification endpoint to receive job status events
      token: "Bearer xxxxx" # optional authentication token for the notification endpoint
    retry:
      attempts: 10 # number of retries for the job before giving up
      delay: "500ms" # least amount of delay between each retry

You will have to create a configuration file for each S3 bucket you wish to replicate. The good news is that you can run multiple Batch Replication jobs at once and receive notifications when each completes.
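Generating one file per bucket by hand gets tedious, so a small script can help. The sketch below writes a minimal job file for each bucket and starts it. Several things in it are assumptions: the type values s3 and minio, the choice to keep the same bucket name on both sides, and the omission of the optional prefix and flags sections. Compare the output with the template your mc release generates before relying on it. The helper names are hypothetical, and the S3 variables are the ones used for the aws-s3 alias.

```shell
#!/usr/bin/env bash
# Write a minimal Batch Replication job file for one bucket.
# Reads S3_ENDPOINT, S3_ACCESS_KEY and S3_SECRET_KEY from the environment.
write_replication_yaml() {
  local bucket="$1" file="replicate-$1.yaml"
  cat > "$file" <<EOF
replicate:
  apiVersion: v1
  # remote source on AWS S3 (type value is an assumption, check your template)
  source:
    type: s3
    bucket: ${bucket}
    endpoint: ${S3_ENDPOINT}
    credentials:
      accessKey: ${S3_ACCESS_KEY}
      secretKey: ${S3_SECRET_KEY}
  # local target on the MinIO deployment that runs the job
  target:
    type: minio
    bucket: ${bucket}
EOF
  chmod 600 "$file"   # the file contains credentials
}

# Write and start one job per bucket on the given MinIO alias.
start_replication() {
  local target="$1"
  shift
  local bucket
  for bucket in "$@"; do
    write_replication_yaml "$bucket" &&
      mc batch start "$target" "./replicate-${bucket}.yaml"
  done
}
```

Running start_replication minio1/ mybucket logs would produce replicate-mybucket.yaml and replicate-logs.yaml in the current directory and start a job for each.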

Start Batch Replication with:

mc batch start minio1/ ./replication.yaml
Successfully start 'replicate' job `E24HH4nNMcgY5taynaPfxu` on '2023-06-26 17:19:06.296974771 -0700 PDT'

You can check the status of batch jobs:

mc batch status minio1/ E24HH4nNMcgY5taynaPfxu
●∙∙
Objects:        28766
Versions:       28766
Throughput:     3.0 MiB/s
Transferred:    406 MiB
Elapsed:        2m14.227222868s
CurrObjName:    share/doc/xml-core/examples/foo.xmlcatalogs
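If you have lost track of a job ID, mc batch can list the jobs on a deployment and print the definition of a specific one. The helper below is a small hypothetical wrapper around those two subcommands.

```shell
#!/usr/bin/env bash
# List all batch jobs on a deployment, then show the definition of one job.
# Arguments: alias path (for example minio1/), job ID.
show_batch_job() {
  local target="$1" job_id="$2"
  mc batch list "$target" &&
    mc batch describe "$target" "$job_id"
}
```

For the job started above, the call would be show_batch_job minio1/ E24HH4nNMcgY5taynaPfxu.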

Alternatively, you can also use the MinIO Client and run mc mirror, a powerful and flexible tool for data synchronization. You can use it to copy objects from S3 or S3 API-compatible stores and mirror them to MinIO. What you’re about to do, mirror from S3 to MinIO, is one of the most common use cases for the command.

Use mc mirror to copy the data from S3 to MinIO, repeating for each bucket. You could execute all mirror commands together programmatically.

mc mirror aws-s3/mybucket minio1/mybucket
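To run the mirror for several buckets in one go, a loop along these lines is one option. It assumes the bucket names are identical on both sides, and the helper name mirror_buckets is hypothetical. A failure on one bucket is reported but does not stop the remaining buckets from being copied.

```shell
#!/usr/bin/env bash
# Mirror each listed bucket from the source alias to the target alias.
# Returns non-zero if any bucket failed, after attempting all of them.
mirror_buckets() {
  local src="$1" dst="$2"
  shift 2
  local bucket failed=0
  for bucket in "$@"; do
    if ! mc mirror "${src}/${bucket}" "${dst}/${bucket}"; then
      echo "mirror failed for ${bucket}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example, mirror_buckets aws-s3 minio1 mybucket logs images.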

How long it takes depends on the amount of data, network speeds and the physical distance between sites. If you’re moving 500TB, it could take 24 hours, which works out to roughly 46 Gbps of sustained throughput; petabytes will take days. You will see a message when mc is done copying all the objects.
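A quick back-of-the-envelope calculation shows what to expect on your own link. The sketch below assumes decimal terabytes and a throughput you can actually sustain end to end, which is usually lower than the nominal link speed. The function name transfer_hours is hypothetical.

```shell
#!/usr/bin/env bash
# Estimate transfer time in hours.
# Arguments: data size in TB, sustained throughput in Gbit/s.
transfer_hours() {
  awk -v tb="$1" -v gbps="$2" 'BEGIN {
    bits = tb * 8e12                 # 1 TB = 10^12 bytes = 8 * 10^12 bits
    seconds = bits / (gbps * 1e9)
    printf "%.1f\n", seconds / 3600
  }'
}

transfer_hours 500 10    # 500 TB over a sustained 10 Gbit/s
```

At a sustained 10 Gbit/s the same 500TB takes about 111 hours, or a little over four and a half days.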

Whether you’ve used Batch Replication or mc mirror, you can compare bucket contents to verify that the source and target contain the same objects. The command below will return nothing if the buckets contain the same objects.

mc diff aws-s3/mybucket minio1/mybucket
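For scripted checks, the empty-output behavior can be turned into an exit status. The sketch below treats any output from mc diff, or a non-zero exit from it, as a mismatch; how mc diff sets its own exit status is not something we rely on here. The helper name verify_bucket is hypothetical.

```shell
#!/usr/bin/env bash
# Compare two bucket paths and report whether they match.
# Returns 0 only when mc diff succeeds and prints nothing.
verify_bucket() {
  local source_path="$1" target_path="$2"
  local differences
  differences="$(mc diff "$source_path" "$target_path")"
  local status=$?
  if [ "$status" -eq 0 ] && [ -z "$differences" ]; then
    echo "match: $source_path $target_path"
  else
    [ -n "$differences" ] && echo "$differences"
    return 1
  fi
}
```

Calling verify_bucket aws-s3/mybucket minio1/mybucket prints a match line on success and the list of differing objects otherwise.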

The only remaining task is to replace references to your S3 endpoint with your MinIO endpoint in application configurations. Your applications won’t ever realize they’re not running on S3 because of MinIO’s market-leading S3 compatibility.

If you want to leave AWS, but don’t want to purchase your own hardware, then please consult Migrate From S3 to MinIO on Equinix Metal.

Repatriate to On-Premise MinIO

This blog post showed you how to plan and repatriate data from AWS S3 to on-premise MinIO. There are many reasons why an enterprise would do this – to control cloud spend, improve control over infrastructure, improve local performance, and more. It doesn’t matter why; all that matters is that you’re about to take control of your object storage deployment.

If you have any questions about migrating from AWS S3 to on-premise MinIO, or even if you just want to sanity check your repatriation plan, be sure to reach out to us on Slack.