Business Continuity/Disaster Recovery (BC/DR)
Dumb ways to die
Everyone remembers the wildly catchy ad campaign designed to get people to take train safety seriously. A similar theme applies to protecting your data: a little preplanning and a bit of awareness combine to keep you from finding yourself at the wrong end of a data disaster. Most data disasters are avoidable or preventable with a bit of thought, planning, and execution.
Gartner has determined that business downtime can cost companies over $5,600 USD per minute, which works out to more than $336,000 per hour, so preplanning is critical. MinIO’s approach is that your business should be running 24x7, and the focus of any good BC/DR strategy should be minimizing downtime, no matter the source of disruption, and minimizing data loss, no matter the nature of the crisis. Read on to discover why.
Data as the primary business asset
Your business runs on data. It is the primary asset, the thing you leverage to be a business at all, so protecting it, and taking the time to think through its resilience, is foundational to your bottom line. Thinking through data protection means treating continuity and recovery as different aspects of the same planning.
- Business Continuity (BC) deals with the operations side. It involves designing and creating policies and procedures that ensure that essential business functions and processes are available during and after a disaster.
- Disaster Recovery (DR) is primarily focused on the IT side. It defines how an organization’s IT assets will recover from a disaster, whether natural or man-made. The processes within this phase can include server and network restoration, restoring data from backup copies, and provisioning standby systems.
The cost of downtime is your credibility
Disaster recovery is often a hard sell when pitched purely on the numbers. That’s unfortunate, because the first casualty of downtime is your credibility as a corporation. Be clear on one point: the cost of backup and replication will never be higher than the cost of business lost to poor BC/DR planning.
As mentioned above, MinIO’s approach is to minimize downtime. It starts with erasure coding to protect data against drive and node failure and silent corruption. Then, a focus on replication creates a stronger BC/DR position than simple backup: your downtime is however long it takes to point your load balancer at the replicated data. This software-based approach means that your Recovery Time Objective (RTO) can be reduced to the smallest possible increment, rather than the entire arc of restoration and then validation, and that assumes it all comes back up right on the first try.
RTO vs RPO: A disaster recovery plan is not enough
Figuring out your business’s specific disaster response goals, and what counts as acceptable recovery, is an exercise you should run regularly with a broad array of stakeholders, just like assessing security risk tolerance, because it’s an aspect of that same practice. Every business is different in its tolerances, not only for downtime but for cost. That’s why Recovery Level Objective (RLO) is your base metric, augmented by an understanding of Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Recovery Point Objective (RPO) speaks to data loss; it’s the maximum age of the data you can afford to lose, measured backward from the moment of failure. In practice, it’s a function of how frequently you run backups or snapshots versus how much your data changes between them.
- Recovery Time Objective (RTO) speaks to data recovery; it’s the amount of time between when you go down and when you get back up again. Or, to state it a little less like Chumbawamba, RTO is the assessment of how long it will take to retrieve and reinstate your data after an event. (The sketch after this list puts rough numbers on both metrics.)
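To put rough numbers on both, here’s a back-of-the-envelope sketch in Python. Every figure in it is a hypothetical placeholder; substitute your own measured change rate, data volume, and restore throughput.

```python
# Back-of-the-envelope RPO/RTO math. All figures are hypothetical
# placeholders -- plug in your own measurements.

snapshot_interval_hours = 6            # how often backups/snapshots run
change_rate_gb_per_hour = 50           # how fast your data changes
total_data_tb = 100                    # total data to restore after a disaster
restore_throughput_gb_per_hour = 500   # sustained restore speed you've measured
validation_hours = 8                   # time to verify the restored environment

# Worst-case RPO exposure: everything written since the last snapshot is gone.
data_at_risk_gb = snapshot_interval_hours * change_rate_gb_per_hour

# Backup-model RTO: full restore plus validation, assuming it works first try.
restore_hours = (total_data_tb * 1024) / restore_throughput_gb_per_hour
backup_rto_hours = restore_hours + validation_hours

# Replication-model RTO: roughly the time to repoint your load balancer.
replication_rto_minutes = 5

print(f"Data at risk between snapshots: {data_at_risk_gb} GB")
print(f"Backup-model RTO: ~{backup_rto_hours:.0f} hours")
print(f"Replication-model RTO: ~{replication_rto_minutes} minutes")
```

At those made-up numbers, the backup path is roughly nine days of restore and validation before you’re whole again, while the replication path is measured in minutes. Rerun the math with your own figures; the gap only widens as data grows.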
Different businesses have different tolerances in both dimensions. For example, a Slack outage disturbs businesses all over the world, so tolerance for any downtime whatsoever is close to zero, whereas a small business website going down can probably tolerate a longer outage.
In thinking about how your choice of storage impacts these metrics, you want to ask questions like:
- How much data is lost if you go down between snapshots, and is that loss acceptable?
- If your business has more than a terabyte of data stored, how long will it take to bring that back up, and is that time acceptable? Or is every minute of downtime chipping away at your business’s credibility?
If some data loss and downtime are acceptable to your business, then backup is an acceptable choice. However, if data loss and downtime sound like problems, then business continuity is your largest value metric, and replication is absolutely the correct choice for your situation.
Replication = Uptime
Backups are the default, bare minimum that is usually considered to “check the box” of BC/DR. The problem, though, is that backups are rarely tested, so you never really know, really know, how fast you can get them back online – if you can get them back online at all.
Between the two, replication is the superior option because it never sacrifices data availability. Both deployments are always up, always ready, and identical. Replication does require more hardware, but that would be necessary even in a pure backup situation, because you’d require the space to test your backup anyway. And you always test your backup before you need it, right? Right?
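An always-on replica also makes that testing trivially easy. As a minimal sketch, assuming the MinIO Python SDK (pip install minio) with placeholder endpoints, credentials, and bucket name, a spot check that the replica actually matches its source might look like this:

```python
# Spot-check that a replica matches its source by comparing object
# inventories. Endpoints, credentials, and the bucket name are placeholders.
from minio import Minio

primary = Minio("primary.example.com:9000",
                access_key="PRIMARY_KEY", secret_key="PRIMARY_SECRET")
replica = Minio("replica.example.com:9000",
                access_key="REPLICA_KEY", secret_key="REPLICA_SECRET")

BUCKET = "critical-data"

def inventory(client: Minio) -> dict:
    """Map object name -> (etag, size) for every object in the bucket."""
    return {obj.object_name: (obj.etag, obj.size)
            for obj in client.list_objects(BUCKET, recursive=True)}

src, dst = inventory(primary), inventory(replica)

# Comparing ETags assumes server-side replication, which preserves them;
# backup tools that re-upload objects may produce different ETags.
missing = src.keys() - dst.keys()
mismatched = [name for name in src.keys() & dst.keys() if src[name] != dst[name]]

print(f"{len(src)} objects on primary, {len(dst)} on replica")
print(f"Missing on replica: {len(missing)}, mismatched: {len(mismatched)}")
```

Run something like this on a schedule and you’ve answered the “right? Right?” question with data instead of hope.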
Limitations on backup strategy
There’s always something more urgent demanding attention than the status and robustness of your backups, so you end up unaware of the limitations of your strategy until it’s too late to remedy them.
A replicated copy, even a passive one, is much better at any data volume. With replication, your downtime is however long it takes for you to redirect your load balancer. Otherwise, your downtime is the full span of restoration and validation… and that assumes it’s right on the first shot.
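What does “redirect your load balancer” actually involve? In most deployments the load balancer’s own health checks handle it automatically, but the decision logic is simple enough to sketch in a few lines of Python. The site URLs here are hypothetical; the probe path (/minio/health/live) is MinIO’s documented liveness endpoint:

```python
# Minimal failover decision logic: prefer the primary site, fall back to
# the replica when the primary stops answering health checks.
import requests

SITES = [
    "https://minio-primary.example.com",   # preferred
    "https://minio-replica.example.com",   # failover target
]

def healthy(site: str) -> bool:
    """Probe MinIO's liveness endpoint; any error counts as unhealthy."""
    try:
        resp = requests.get(f"{site}/minio/health/live", timeout=2)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def active_site() -> str:
    """Return the first healthy site in priority order."""
    for site in SITES:
        if healthy(site):
            return site
    raise RuntimeError("no healthy site available")

print(f"Routing traffic to: {active_site()}")
```

That is essentially the whole recovery procedure under replication: detect, repoint, carry on.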
Most businesses have ever-increasing amounts of data that need to be attended to and included in whatever BC/DR solution they are running. It’s important that you test your backups and ensure that bringing them back up into a fully functioning production environment can be done within an acceptable RTO for your business. Be aware, though, that while a smallish business running within terabytes of data can restore backups relatively quickly, a petabyte of data can take a very long time to bring up, test, and validate, which adds cycles between disaster and the full restoration of a functioning environment. Additionally, in a large, regionally dispersed enterprise, the logistics of rolling out a solution become much more complicated, pushing your RTO out even further.
MinIO has a customer that previously suffered a ransomware attack (read more about how MinIO protects from ransomware). They discovered, painfully, that it took months to return to normal. Think about that: months. The mere process of getting an “all clear” from the security side to begin restoring took weeks.
The customer had to rethink everything from prioritizing sites to building replication into their strategy.
Simply put, the more your business depends on its data (and the more data you presumably have), the longer it will take you to come back from disaster if you’re only using a backup model, whereas a replication model, running in an active/active state, has you back up and running almost immediately.
Just Replicate
MinIO’s approach is that your business should be running 24x7 no matter what, and that minimal downtime (RTO) and minimal loss (RPO) are paramount. The way to ensure this is replication, because whether you run it in active or passive mode, you’ll need it not only in case of disaster, but also to test your backup capability.
Think of it like this:
- You’re not going to test in your production environment, because taking down your main business just isn’t a thing.
- You’ll need a replicated environment with which to test anyway – and really, it could stop right there.
- If you want to take it further and keep a cold backup, you’d still need to test it, and that requires the same hardware your replication would use. In other words, backup and replication have the same cost to the organization, but with disparate outcomes.
So either way, with the hardware in place, MinIO lets you achieve a strategy that meets all of your compliance, regulatory, and BC/DR needs.
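If replication is your standard, it’s worth periodically auditing that every bucket is actually covered by a replication rule. Here’s a minimal sketch using the MinIO Python SDK; the endpoint and credentials are placeholders, and it treats an S3 error on lookup as “no replication configured”:

```python
# Audit every bucket for a replication rule. Endpoint and credentials
# are placeholders -- substitute your own.
from minio import Minio
from minio.error import S3Error

client = Minio("minio.example.com:9000",
               access_key="AUDIT_KEY", secret_key="AUDIT_SECRET")

for bucket in client.list_buckets():
    try:
        config = client.get_bucket_replication(bucket.name)
        rules = len(config.rules) if config else 0
    except S3Error:
        rules = 0  # no replication configuration found on this bucket
    status = "OK" if rules else "NOT REPLICATED"
    print(f"{bucket.name}: {rules} replication rule(s) -- {status}")
```

A report like this, dropped into your compliance cadence, turns “we replicate everything” from an assumption into evidence.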
Continuity > Recovery
A key factor that creates problems post-disaster is that BC/DR is not considered a primary budgetary concern. It’s incredibly common for the lowest-cost option, static backup on hard media, to be considered adequate. But this misses a significant piece of the recovery picture — what happens when your untested backup fails, or your restore takes longer than you expect or than your business can absorb?
Continuity is always better than recovery, and because object storage has become primary storage, it is entirely straightforward, not to say forward-thinking and resilience-first, to establish a topology in your environment that includes replication as your standard BC/DR solution.
It’s always harder to plan and architect your recovery strategy after the disaster than before it, but without the precipitating event of a disaster, it can be a hard sell to the bottom line. We hope this blog gives you some of the tools you need to make a stronger argument for rethinking and implementing a solution far ahead of need.