How do I know replication is up to date?

Customers run MinIO wherever they need fast, resilient, scalable object storage. MinIO includes several types of replication to make sure that every application is working with the most recent data regardless of where it runs. We’ve gone into great detail about the various replication options available and their best practices in previous posts about Batch Replication, Site Replication and Bucket Replication.

Replications complete successfully nearly all the time, but it is always necessary in prod to ensure that replication operations are completed by tracking and graphing them using metrics, which we’ll delve deeper into in a future post.

MinIO maintains a replication queue to manage multiple concurrent replications.
MinIO monitors the queue to either replicate or remove objects while scanning for newer ones.
MinIO attempts to replicate an object up to three times before dequeuing replication operations on that failed object.

Please note that these conditions don't happen often. When they do, the most common culprits are the physical hardware, WAN network, or some other physical infrastructure issue. But for day-to-day general operations, this is not a concern.

MinIO has no problem replicating many buckets and objects, but how can you determine if replication is running as expected and track it? How can you tell if there are any pending objects that need to be replicated? Before you launch that ML algorithm running in another cloud, how do you know replication is up to date?

In this post, we’ll take a look at the various states an object can be in during the replication process and how to get back up and running as quickly as possible among other tidbits so you have a pleasant experience Day 2 of replication.

Replication States

MinIO supports two types of replication: Asynchronous and Synchronous.

Asynchronous Replication: MinIO will first return a successful response as soon as the request to replicate the object is made. This does not wait for the object to be successfully replicated before returning a response. This is the default, but why? Sometimes the object might be too large, the WAN network between the two clusters might be slow, or there could be a number of reasons it might take a few minutes for the object to get replicated. We are so confident in the resilient nature of MinIO replication that we have checks in place to ensure objects in the replication queue will get replicated successfully. With this type of replication, the processes need not wait for the data to get replicated before adding more objects to the queue. One thing to be aware of this type of replication is that it could result in stale or missing objects while mitigating slow write operations.

Synchronous Replication: When an object is replicated synchronously, MinIO will wait for the entire object to be replicated before returning a successful response. This ensures that the objects have replicated successfully to the remote cluster with no missing objects. The downside is that for every object there can be additional overhead to determine whether that object has completed the replication and to move to the next object, not just set it and forget it in the queue.

To put it another way, think of Synchronous Replication like a TCP protocol handshake where the initial packet that is sent is acknowledged before additional packets are transmitted (such as application traffic). The object is sent from one MinIO Server to another (SYN), the receiving server acknowledges the object has been replicated (SYN-ACK), then the client responds with an acknowledgment (ACK). While TCP is more robust, the “talking” back and forth causes a lot of traffic on the network. On the other hand, think of Asynchronous Replication like a UDP protocol (such as SNMP) where the objects are sent to the receiver without waiting for a SYN-ACK. It is useful to replicate in this manner in cases where the volume of the data is a lot so having to wait for ACKs between objects can cause a lot of overhead that is not conducive to running efficient replication.

For this reason, you must explicitly enable Synchronous Replication when configuring the remote target using the mc admin bucket remote add command with the add flag.

To know the actual state of the object, MinIO sets the X-Amz-Replication-Status metadata field according to the replication state of the object. You can check this using mc stat.

The object could be in one of the following:

PENDING: The object has not yet been replicated. MinIO applies this state if the object meets one of the configured replication rules on the bucket. MinIO continuously scans for PENDING objects not yet in the replication queue and adds them to the queue as space is available.

For multi-site replication, objects remain in the PENDING state until replicated to all configured remotes for that bucket or bucket prefix.

COMPLETED: When the object has successfully replicated to the remote cluster.

FAILED: MinIO continuously scans for FAILED objects not yet in the replication queue and adds them to the queue as space is available. This is the state when the object has failed to replicate to the remote cluster.

REPLICA: If the object has been replicated from another remote cluster, then the REPLICA state is assigned to differentiate it from objects that have been added by the client directly to the cluster.

The replication process generally has one of the following flows

PENDING -> COMPLETED

PENDING -> FAILED -> COMPLETED

Replicating Objects

Generally we recommend setting up replication from the beginning to eliminate the potential large overhead caused by the initial transfer of objects. When replication is enabled on an existing cluster with data, MinIO does not replicate the existing data so DevOps engineers can control how and when the data should be replicated. They replicate manually over time depending on the nature of the data and applications that access it.

MinIO provides an option to replicate existing objects. For new replication rules, include "existing-objects" to the list of replication features specified to mc replicate add --replicate. For existing replication rules, add "existing-objects" to the list of existing replication features using mc replicate update --replicate. You must specify all desired replication features when editing the replication rule.

The speed with which existing objects are replicated depends on a number of factors such as networking, network speed, disk speed and number of objects (especially during the very first replication) – among other factors. MinIO uses the same scanner and queue for existing objects as well, there is no separate “QoS” for different objects, everyone has to get in the same line. That being said, you can adjust how MinIO balances scanner performance with read/write operations using either the MINIO_SCANNER_SPEED environment variable or the scanner speed configuration setting. If versioning was not previously enabled when configuring bucket replication, existing objects have a versionid = null. These objects do replicate.

MinIO peer sites can proxy GET/HEAD requests for an object to other peers to check if it exists. This allows a site that is healing or lagging behind other peers to still return an object persisted to other sites. For delete marker replication, MinIO begins the replication process after a delete operation creates the delete marker. MinIO uses the X-Minio-Replication-DeleteMarker-Status metadata field for tracking delete marker replication status. MinIO requires explicitly enabling versioned deletes and delete marker replication. Use the mc replicate add --replicate field to specify both or either delete and delete-marker to enable versioned deletes and delete marker replication, respectively.

$ /opt/minio-binaries/mc admin replicate add minio1 minio2

Requested sites were configured for replication successfully.

This can be best illustrated with an example:

A client issues GET("data/invoices/january.xls") to Site1
Site1 cannot locate the object
Site1 proxies the request to Site2
Site2 returns the latest version of the requested object
Site1 returns the proxied object to the client

For GET/HEAD requests that do not include a unique version ID, the proxy request returns the latest version of that object on the peer site. This may result in retrieval of a non-current version of an object, such as if the responding peer site is also experiencing replication lag.

Run the command below to check if replication is successfully enabled for all sites

/opt/minio-binaries/mc admin replicate info minio1

SiteReplication enabled for:

Deployment ID | Site Name | Endpoint

f96a6675-ddc3-4c6e-907d-edccd9eae7a4 | minio1 | http://<loadbalancer_public_ip>

0dfce53f-e85b-48d0-91de-4d7564d5456f | minio2 | http://<loadbalancer_public_ip>

Once replication is setup, you can check the status on an ongoing basis

/opt/minio-binaries/mc admin replicate status minio1

Bucket replication status:

No Buckets present

Policy replication status:

● 5/5 Policies in sync

User replication status:

No Users present

Group replication status:

No Groups present

Disaster Recovery Support

It's paramount to understand the baseline performance of replication and how additional replication rules affect each others' performance. Sometimes there might be unforeseen issues, for example, replicating many small objects might go smoothly, but data-intensive replication pushes the capacity and capability of some of the largest network pipes out there. MinIO runs as fast as the underlying hardware will allow – one of the most common causes of latency in a MinIO cluster is not the disk but network latency. You can have the world's fastest NVMe disks but if the network is subpar or experiences small but frequent disruptions, then the overall cluster’s performance will be affected. Enterprise customers receive a tool called Performance Test to run checks on their clusters to ensure the baseline performance meets expectations and, in case there are any issues with replication it can be compared against past data for further analysis.

NetPerf: ✔

NODE RX TX

http://minio1:9000 1.5 GiB/s 1.3 GiB/s

http://minio2:9000 1.6 GiB/s 1.6 GiB/s

http://minio3:9000 1.6 GiB/s 1.5 GiB/s

http://minio4:9000 1.4 GiB/s 1.7 GiB/s

DrivePerf: ✔

NODE PATH READ WRITE

http://minio1:9000 /disk1 445 MiB/s 150 MiB/s

http://minio1:9000 /disk2 451 MiB/s 150 MiB/s

http://minio3:9000 /disk1 446 MiB/s 149 MiB/s

http://minio3:9000 /disk2 446 MiB/s 149 MiB/s

http://minio2:9000 /disk1 446 MiB/s 149 MiB/s

http://minio2:9000 /disk2 446 MiB/s 149 MiB/s

http://minio4:9000 /disk1 445 MiB/s 149 MiB/s

http://minio4:9000 /disk2 447 MiB/s 149 MiB/s

ObjectPerf: ✔

THROUGHPUT IOPS

PUT 461 MiB/s 7 objs/s

GET 1.1 GiB/s 17 objs/s

MinIO 2023-02-27T18:10:45Z, 4 servers, 8 drives, 64 MiB objects, 6 threads

MinIO recommends configuring the load balancer to not route traffic to clusters and nodes that are offline. If needed, the resynchronization process can take a long time depending on the number and size of the objects and the network speed, so by routing the traffic to only healthy nodes/clusters the client operations will not see any downtime.

In the event that you require help, MinIO SUBNET users can log in and create a new issue related to resynchronization. Coordination with MinIO Engineering via SUBNET can ensure successful resynchronization and restoration of normal operations, including performance testing and health diagnostics.

If you have any questions regarding MinIO replication be sure to reach out to us on Slack!