MinIO Introduces Continuous Availability and Active-Active Bucket Replication

One of the key requirements driving enterprises towards cloud-native object storage platforms is the ability to consume storage in a multi-data center setup. Multiple data centers provide resilient, highly available storage clusters, capable of withstanding the complete failure of one or more of those data centers. Multi-data center support brings private and hybrid cloud infrastructure closer to how the public cloud providers architect their services to achieve high levels of resilience.

This has traditionally been the domain of enterprise SAN and NAS vendors, with technologies like NetApp SnapMirror and MetroCluster.

While object storage is superior to these legacy technologies in many ways, it could not, until now, deliver active-active replication across two data center locations. We believe that MinIO is the only company offering this capability.

MinIO offers two different ways of achieving this: server-side bucket replication and client-side replication with mc mirror. While both work, the “enterprise-grade” solution is server-side replication, and that is what we will focus on in this post.

Understanding the Scenarios

Let us start by looking at the different deployment scenarios where this capability would be valuable. There are at least four:

  • Same-DC replication
  • Cross-DC replication
  • Same-Region replication
  • Cross-Region replication

Of particular note are the last three. In each of these scenarios, it is imperative that the replication be as close to strictly consistent as possible (taking into account bandwidth considerations and the rate of change).

Basic Architectural Considerations

At the most basic level, any design needs to account for infrastructure, bandwidth, latency, architecture and scale. Let’s take them in order:

Infrastructure: MinIO recommends identical hardware on both sides of the replication endpoints. While similar hardware will likely perform adequately, heterogeneous hardware profiles introduce complexity and slow issue identification. Southwest Airlines only buys 737s to eliminate operational complexity. Follow their lead.

Bandwidth: Determining the appropriate bandwidth happens at multiple levels (between sites, and between client, server and replication target). The key is to understand the rate of change and how much data actually changes; a clear understanding of these determines the bandwidth requirement. We recommend building in a buffer. For example, if 10% of a 100 TB dataset changes, the raw requirement is 10 TB of transfer capacity, but to account for burstiness we recommend planning for a 20% change rate, i.e. 20 TB. Needless to say, each organization will have its own take on this.
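As a rough illustration of this sizing exercise, here is a small Python sketch using the numbers above. The dataset size, change rate, buffer factor and replication window are assumptions for illustration only; plug in your own measurements.

# Back-of-the-envelope bandwidth sizing for replication (illustrative only).

dataset_tb = 100          # total data in the source cluster, in TB (assumed)
daily_change_rate = 0.10  # observed fraction of data that changes per day (assumed)
burst_buffer = 2.0        # plan for 2x the observed change rate to absorb bursts
replication_window_h = 24 # hours available to replicate a day's worth of changes

changed_tb = dataset_tb * daily_change_rate * burst_buffer   # 20 TB with these numbers
bits = changed_tb * 8 * 10**12                               # TB -> terabits -> bits
gbps = bits / (replication_window_h * 3600) / 10**9          # sustained Gbit/s needed

print(f"Plan for roughly {changed_tb:.0f} TB/day, i.e. ~{gbps:.2f} Gbit/s sustained")

With these assumptions the link needs to sustain roughly 1.85 Gbit/s, before accounting for client traffic on the same links.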

Latency: After bandwidth, latency is the most important consideration in designing an active-active model. It represents the round-trip time (RTT) between the two MinIO clusters. The goal should be to drive latency down to the smallest possible figure within the budgetary constraints imposed by bandwidth. The lower the latency, the lower the risk of data loss in the case of a two-sided outage. We recommend an RTT threshold of 20 ms at the top end, and ideally less. Further, packet loss on the network links should not exceed 0.01%. Both packet loss and latency should be tested thoroughly before going to production, as they directly impact throughput.
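For a quick first approximation of RTT before a full network qualification, a sketch like the one below can help. It times TCP connections to the remote MinIO endpoint (the hostname and port are placeholders) as a rough proxy for round-trip latency; it is not a substitute for proper ping, jitter and packet-loss testing.

# Rough RTT probe: time TCP connection setup to the remote MinIO endpoint.
# A TCP handshake is roughly one round trip, so this approximates RTT.
import socket
import statistics
import time

REMOTE_HOST = "replica-endpoint"   # placeholder: remote MinIO hostname
REMOTE_PORT = 9000                 # placeholder: remote MinIO port
SAMPLES = 10

rtts_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with socket.create_connection((REMOTE_HOST, REMOTE_PORT), timeout=2):
        pass
    rtts_ms.append((time.perf_counter() - start) * 1000)
    time.sleep(0.2)

median = statistics.median(rtts_ms)
print(f"median connect time: {median:.1f} ms (target: under 20 ms)")
if median > 20:
    print("WARNING: RTT exceeds the recommended 20 ms threshold")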

Architecture: At present, MinIO recommends replication across only two data centers. Replication across more than two data centers is possible, but the complexity involved and the tradeoffs required make it rather difficult.

Scale considerations: While MinIO can support very large deployments in each data center, both for source and target, the considerations outlined above will dictate scale. There are no changes to how MinIO scales at either location (i.e. seamlessly via zones, with no rebalancing).

Server Side Replication

Multi-site replication starts with configuring which buckets need to be replicated. It should be noted that MinIO will not replicate objects that existed before the replication rule was enacted: you can configure a bucket for replication, but objects that predate that configuration will not be replicated.

To replicate objects in a bucket to a destination bucket on a target site, either on the same cluster or a different cluster, start by enabling versioning on both the source and destination buckets. Next, the target site and destination bucket need to be configured on the MinIO server by setting:

mc admin bucket remote add myminio/srcbucket https://accessKey:secretKey@replica-endpoint:9000/destbucket --service "replication" --region "us-east-1"


MinIO can replicate:

  • Objects and their metadata (which is written atomically with the object in MinIO). Those objects can either be encrypted or unencrypted. This is subject to the constraints outlined above regarding older objects. The owner will need the appropriate permissions.
  • Object versions.
  • Object tags, if there are any (see the sketch after this list).
  • S3 Object Lock retention information, if there is any. It should be noted that the retention information of the source overrides anything on the replication side. If no retention information is in place, the object takes on the retention period of the destination bucket. For more information on object locking, see this blog post or the documentation.
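To make the tag replication item above concrete, here is a hedged sketch using boto3 against MinIO's S3-compatible API. The endpoints, credentials, bucket names and object name are placeholders, and it assumes a replication rule like the one configured later in this post is already active.

# Upload a tagged object to the source bucket and verify the tags on the replica.
# Assumes replication from srcbucket to destbucket is already configured.
import time
import boto3

src = boto3.client("s3", endpoint_url="https://source-endpoint:9000",
                   aws_access_key_id="SRC_ACCESS_KEY", aws_secret_access_key="SRC_SECRET_KEY")
dst = boto3.client("s3", endpoint_url="https://replica-endpoint:9000",
                   aws_access_key_id="DST_ACCESS_KEY", aws_secret_access_key="DST_SECRET_KEY")

src.put_object(Bucket="srcbucket", Key="hello.txt", Body=b"hello",
               Tagging="team=storage&tier=gold")

time.sleep(5)  # give near-synchronous replication a moment to complete

tags = dst.get_object_tagging(Bucket="destbucket", Key="hello.txt")
print(tags["TagSet"])  # expect the same team/tier tags on the replica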

What is exciting about this implementation is how easy it has become to provide resilience at scale. Some key features we have implemented in this regard include:

  • The ability for source and destination buckets to have the same name. This is particularly important for applications to transparently fail over to the remote site without any disruption; the load balancer or DNS simply directs application traffic to the new site. If the remote bucket has a different name, transparent failover is not possible. This is a crucial availability requirement for enterprise applications like Splunk or Veeam.
  • MinIO also supports automatic object locking/retention replication across the source and destination buckets natively out of the box. This is in stark contrast to other implementations which make it very difficult to manage.
  • MinIO does not require configuration or permissions for AccessControlTranslation, Metrics and SourceSelectionCriteria, significantly simplifying operation and reducing the opportunity for error.
  • MinIO uses near-synchronous replication to update objects immediately after any mutation on the bucket. Other vendors may take up to 15 minutes to update the remote bucket.  MinIO follows strict consistency within the data center and eventual-consistency across the data centers to protect the data.  Replication performance is dependent on the bandwidth of the WAN connection and the rate of mutation. As long as there is sufficient bandwidth, the changes are propagated immediately after the commit. Versioning capability enables MinIO to behave like an immutable data store to easily merge changes across the active-active configuration. The ability to push changes without delay is critical to protecting enterprise data in the event of total data center failure.
  • MinIO has also extended the notification functionality to push replication failure events. Applications can subscribe to these events and alert the operations team. Documentation on this can be found here.

As we noted, MinIO’s mc mirror feature can also offer similar functionality. Why then, did we invest the time and effort to go the extra mile?

Performance and simplicity. Moving the replication functionality to the server side enables replication to track changes at the source and push objects directly to the remote bucket. In contrast, mc mirror has to subscribe to lambda event notifications for changes and download each object before pushing it. Ultimately, server-side replication is faster and more efficient. It is also simpler to set up and manage: no additional containers, servers, tooling or services are required.

As a result, we recommend server-side replication moving forward.


The HowTo

This section shows how all uploads to the bucket srcbucket on srcAlias can be replicated to the destbucket bucket on a target MinIO cluster at the endpoint https://replica-endpoint:9000, identified by the alias destAlias. Both the source and target clusters need to be running MinIO in erasure-coded or distributed mode. As a prerequisite to setting up replication, ensure that the source and destination buckets have versioning enabled using the `mc version enable` command.
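If you prefer to script this prerequisite, the following hedged sketch uses boto3 against the S3-compatible API to enable and confirm versioning on both buckets. The endpoints and credentials are placeholders; `mc version enable` accomplishes the same thing.

# Enable and verify versioning on the source and destination buckets.
import boto3

clusters = [
    ("https://source-endpoint:9000", "SRC_ACCESS_KEY", "SRC_SECRET_KEY", "srcbucket"),
    ("https://replica-endpoint:9000", "DST_ACCESS_KEY", "DST_SECRET_KEY", "destbucket"),
]

for endpoint, access_key, secret_key, bucket in clusters:
    s3 = boto3.client("s3", endpoint_url=endpoint,
                      aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    s3.put_bucket_versioning(Bucket=bucket,
                             VersioningConfiguration={"Status": "Enabled"})
    status = s3.get_bucket_versioning(Bucket=bucket).get("Status")
    print(f"{bucket} on {endpoint}: versioning {status}")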

The source bucket needs to be configured with the following minimal policy:

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": [
    "s3:GetReplicationConfiguration",
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:GetBucketVersioning"
   ],
   "Resource": [
    "arn:aws:s3:::srcbucket"
   ]
  }
 ]
}

On the target side, create a replication user `repluser` and set up a user policy for this user on destbucket that grants, at a minimum, the permissions for the actions listed in the following policy:

$ mc admin user add destAlias repluser repluserpwd
$ cat > replicationPolicy.json << EOF
{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": [
    "s3:GetBucketVersioning"
   ],
   "Resource": [
    "arn:aws:s3:::destbucket"
   ]
  },
  {
   "Effect": "Allow",
   "Action": [
    "s3:ReplicateTags",
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:GetObjectVersionTagging",
    "s3:PutObject",
    "s3:ReplicateObject"
   ],
   "Resource": [
    "arn:aws:s3:::destbucket/*"
   ]
  }
 ]
}

EOF

$ mc admin policy add destAlias replpolicy ./replicationPolicy.json
$ mc admin policy set destAlias replpolicy user=repluser

Create a replication target on the source cluster for the replication user created above:

$ mc admin bucket remote add srcAlias/srcbucket https://repluser:repluserpwd@replica-endpoint:9000/destbucket --service "replication" --region "us-west-1"
Replication ARN = 'arn:minio:replication:us-west-1:28285312-2dec-4982-b14d-c24e99d472e6:destbucket'

Note that the admin running this command needs s3:PutReplicationConfiguration permission on the source cluster in addition to the permissions specified for srcbucket. Once successfully created and authorized, the server generates a replication target ARN. The command below lists all the currently authorized replication targets:

$ mc admin bucket remote ls srcAlias/srcbucket --service replication

Using this replication ARN, you can enable server-side replication from the source bucket to the target destbucket.

Add a replication rule to srcbucket on srcAlias using the replication ARN generated above:

$ mc replicate add srcAlias/srcbucket --remote-bucket destbucket --priority 1 --arn arn:minio:replication:us-west-1:28285312-2dec-4982-b14d-c24e99d472e6:destbucket

Multiple rules can be set with the above command with optional prefix and tag filters to selectively perform replication on a subset of objects in the bucket. In the event of multiple overlapping rules, the matching rule with highest priority is used.

The replication configuration just created can be viewed with the `mc replicate export` command:

{
  "Role" : "arn:minio:replication:us-west-1:28285312-2dec-4982-b14d-c24e99d472e6:destbucket",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter" : { "Prefix": ""},
      "Destination": {
        "Bucket": "arn:aws:s3:::destbucket",
        "StorageClass": "STANDARD"
      }
    }
  ]
}

MinIO’s bucket replication API and the JSON replication configuration document are compatible with Amazon S3’s specification. MinIO uses the Role ARN here to support replication to another MinIO target. Any objects uploaded to the source bucket that meet the replication criteria will now be automatically replicated by the MinIO server to the remote destination bucket. Replication can be disabled at any time by disabling specific rules in the configuration or deleting the replication configuration entirely. The MinIO client utility (mc) provides all the necessary commands for convenient DevOps tooling and automation to manage the server-side bucket replication feature.
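Because the configuration is S3-compatible, it can also be managed programmatically. The hedged sketch below uses boto3's put_bucket_replication to apply a rule equivalent to the exported one above, but restricted to a prefix to illustrate selective replication. The endpoint, credentials and prefix are placeholders, the ARN is taken from the example output, and `mc replicate add` remains the recommended path.

# Apply an S3-compatible replication configuration to the source bucket.
# The Role ARN must be the one generated by `mc admin bucket remote add`.
import boto3

src = boto3.client("s3", endpoint_url="https://source-endpoint:9000",
                   aws_access_key_id="SRC_ACCESS_KEY", aws_secret_access_key="SRC_SECRET_KEY")

replication_arn = "arn:minio:replication:us-west-1:28285312-2dec-4982-b14d-c24e99d472e6:destbucket"

src.put_bucket_replication(
    Bucket="srcbucket",
    ReplicationConfiguration={
        "Role": replication_arn,
        "Rules": [
            {
                "Status": "Enabled",
                "Priority": 1,
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Filter": {"Prefix": "invoices/"},   # replicate only this prefix
                "Destination": {"Bucket": "arn:aws:s3:::destbucket",
                                "StorageClass": "STANDARD"},
            }
        ],
    },
)
print(src.get_bucket_replication(Bucket="srcbucket")["ReplicationConfiguration"])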

When an object is deleted from the source bucket, the replica will not be deleted unless delete marker replication is enabled. An upcoming feature permits fully active-active replication by replicating delete markers and versioned deletes to the target when the `mc replicate add` command specifies the --replicate flag with the “delete-marker” or “delete” options, or both.

When object locking is used in conjunction with replication, both the source and destination buckets need to have object locking enabled. Similarly, objects encrypted with SSE-S3 on the server side will be replicated if the destination also supports encryption.

Replication status can be seen in the metadata of the source and destination objects with the `mc stat` command. On the source side, X-Amz-Replication-Status changes from PENDING to COMPLETE or FAILED once the replication attempt succeeds or fails respectively. On the destination side, an X-Amz-Replication-Status of REPLICA indicates that the object was replicated successfully. Any failed replication operation is re-attempted periodically at a later time. MinIO’s bucket replication feature is resilient to network and remote data center outages.
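For scripted checks, the same status is visible through the S3 API. Here is a hedged sketch using boto3's head_object; the endpoints, credentials and object name are placeholders.

# Inspect replication status on both sides of the link via HEAD requests.
import boto3

src = boto3.client("s3", endpoint_url="https://source-endpoint:9000",
                   aws_access_key_id="SRC_ACCESS_KEY", aws_secret_access_key="SRC_SECRET_KEY")
dst = boto3.client("s3", endpoint_url="https://replica-endpoint:9000",
                   aws_access_key_id="DST_ACCESS_KEY", aws_secret_access_key="DST_SECRET_KEY")

key = "hello.txt"  # assumed object name
print("source:", src.head_object(Bucket="srcbucket", Key=key).get("ReplicationStatus"))
print("destination:", dst.head_object(Bucket="destbucket", Key=key).get("ReplicationStatus"))
# Expect PENDING -> COMPLETE/FAILED on the source, and REPLICA on the destination.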

[Diagrams: replication flow for put and head operations]

The Gritty Details

As we noted, we believe we are the first to deliver active-active replication for object storage. That means there are some details we want to cover to ensure your success. We are going to frame them as questions. Feel free to drop us a note at hello@min.io if you would like additional questions addressed:

What happens when the replication target goes down?

If the target goes down, the source will cache the changes and start syncing once the replication target comes back up. There may be some delay in reaching full sync depending on the length of the outage, the number of changes, bandwidth and latency.

What happens if the crawler goes down or is disabled?

This behavior is controlled by an environment variable that was added at a specific customer’s request; we plan to remove it in a future release.

What are the parameters on immutability?

Immutability is an immensely valuable feature and one that MinIO is pleased to support. We suggest familiarizing yourself with the concepts and how we have implemented them in this post. It should be noted that in active-active replication mode, immutability is only guaranteed if the objects are versioned. Versioning cannot be disabled on the source, and if versioning is suspended on the target, MinIO will start to fail replication. In short, immutability requires versioning.

What are the other implications if versioning is suspended or there is a mismatch?

In these cases, replication can fail. For example, if you attempt to disable versioning on the source bucket, an error is returned; you must remove the replication configuration before you can disable versioning on the source bucket. Additionally, if you disable versioning on the destination bucket, replication fails and the source object will show the replication status FAILED.


How is object locking handled if it is not enabled on both sides?

There is a potential for inconsistency if object locking is not configured on both ends, and MinIO will silently fail in this case. Object locking must be enabled on both the source and the target. There is one corner case: if, after bucket replication has been configured, the target bucket is deleted and a new one is created without object locking enabled, replication can fail.


What happens if credentials change?

If the credentials for the target change, replication will fail because the credentials stored on the source are no longer valid. The credentials stored on the source must be kept current for replication to continue to work.


Conclusion

In this post we demonstrated how to design an active-active, two data center MinIO deployment that delivers a resilient and scalable system capable of withstanding a data center failure without any downtime for end clients. This architecture is proven, already deployed in the wild by our customers and users, and provides a simple yet efficient mechanism for the modern enterprise to build large-scale storage systems.

As always, we encourage you to try it out for yourself by downloading MinIO today. If you have questions, check out our documentation and our amazing Slack channel. To understand how much it costs to get a commercial license for MinIO, check out the pricing page.