HDD Data Durability: Block-level RAID vs. Object Storage Erasure Coding

We have blogged extensively about why object storage with erasure coding is a better approach than block storage RAID when it comes to data durability, fault tolerance and restore time. The rest of the storage technology world is starting to realize that we're right, and that RAID, essentially a 1980s technology, has outstayed its welcome and cannot provide the data protection needed in the always-on, data-driven, cloud-native datacenter.

Two recent large-scale studies, one from Secure Data Recovery and one from Backblaze, found that the average HDD fails at 2 years 10 months and 2 years 6 months of age, respectively. Secure Data Recovery studied failure rates in 2,007 hard drives, while Backblaze studied failure rates in 17,155 drives across 72 different models. The two results are consistent, and the Backblaze study is statistically significant.

Backblaze broke out failures by manufacturer, model number and size; the full breakdown is in their report.

Backblaze noted in their report that the 12TB Seagate ST12000NM0007 fails much earlier than average, lasting only 1 year and 6 months, and it has failed more often than almost any other drive in their datacenter. The only HDD with more failures was the 4TB Seagate ST4000DM000, although those drives averaged a 3 year 3 month lifespan. Looking at all the data from all HDDs at Backblaze, not just the most recent quarter, the overall annualized failure rate (AFR) is 1.4 percent, while some models, such as the 4TB Seagate ST4000DM000, displayed an AFR twice as high. Backblaze's full data set is available on its Hard Drive Test Data page.

Storage admins know from experience that heavily used HDDs fail, and now we can put a number on it: the average HDD fails in less than 3 years, and in less than 2 for some Seagate models. The likelihood of losing multiple drives in a single RAID array increases with the number of hard drives in the datacenter (more HDDs mean more failures). Another problem with RAID is that drives must be replaced as soon as they fail, which increases operational costs. Running data scrubbing every few months to detect and heal silent data corruption further increases the operational effort required while degrading performance.

A better solution is to rely on distributed erasure coding with bitrot protection, as found in MinIO. RAID doesn't have enough parity to protect against what we now know is the statistical rate of HDD failure. Erasure coding masks these failures and greatly reduces operational cost while extending infrastructure life.

Even though it is widely implemented, block-level RAID causes a lot of pain in large deployments. The technology was designed for a single-server environment, not to support distributed data storage. Cloud-native architectures require object storage, but block-level RAID has no idea what an object is. RAID can't heal objects saved on drives distributed across multiple nodes; in fact, it doesn't heal objects at all. It heals blocks, and that means it must rebuild an entire drive to heal a single corrupt object.

Rebuilding drives using block-level RAID is time consuming. Take RAID 6 for example. When a drive fails, it is physically taken out of service and replaced with a new one, and the RAID controller rebuilds the block-level data from the surviving drives onto the new drive. As the Backblaze data shows, plenty of drives in use today are 10+ TB. Rebuild time can be estimated as drive capacity divided by rebuild throughput; for a 10 TB HDD this works out to about 14 hours.
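To make the arithmetic concrete, here is a minimal Go sketch of that estimate. The ~200 MB/s sustained rebuild throughput is an assumption chosen for illustration; real arrays under live load often rebuild more slowly:

```go
package main

import "fmt"

func main() {
	// Assumed figures: a 10 TB drive and ~200 MB/s sustained
	// rebuild throughput (hypothetical; arrays that keep serving
	// I/O during the rebuild typically run slower than this).
	const capacityBytes = 10e12         // 10 TB
	const throughputBytesPerSec = 200e6 // 200 MB/s

	hours := capacityBytes / throughputBytesPerSec / 3600
	fmt.Printf("Estimated rebuild time: %.1f hours\n", hours) // ~13.9 hours
}
```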

Now we can assess the impact of HDD failure at scale. HDDs fail in about 2.5 years on average, and it takes roughly a work day to rebuild one (if not longer; we're seeing plenty of 20 TB drives these days). A large datacenter like the one operated by Backblaze has about 225,000 HDDs. At a 1.4 percent AFR, staff need to swap out roughly nine hard drives (225,000 x 1.4% / 365 ≈ 8.6) and rebuild as many RAID arrays every day. That's a lot of downtime.
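The same back-of-the-envelope arithmetic in Go, using the fleet size and AFR discussed above:

```go
package main

import "fmt"

func main() {
	// Figures from the Backblaze discussion above.
	const fleetSize = 225000 // total HDDs in the datacenter
	const afr = 0.014        // 1.4% annualized failure rate

	failuresPerDay := fleetSize * afr / 365.0
	fmt.Printf("Expected drive failures per day: %.1f\n", failuresPerDay) // ~8.6
}
```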

MinIO implements erasure coding at the object level in a distributed manner and, in the event of hardware failure, rebuilds objects without incurring downtime or hindering performance. Erasure coding is better suited to object storage than RAID because objects are immutable once written and frequently read. MinIO encodes and decodes as fast as the underlying hardware can support.
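To illustrate the idea, here is a minimal sketch using the klauspost/reedsolomon Go library (the Reed-Solomon implementation MinIO builds on). The 8 data + 4 parity layout is chosen purely for illustration; MinIO selects its own layout per erasure set:

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 8 data + 4 parity shards: the object survives the loss
	// of any 4 shards.
	enc, err := reedsolomon.New(8, 4)
	if err != nil {
		log.Fatal(err)
	}

	object := []byte("example object payload ...")

	// Split the object into 8 data shards (padded as needed)
	// and allocate the 4 parity shards.
	shards, err := enc.Split(object)
	if err != nil {
		log.Fatal(err)
	}

	// Compute the parity shards from the data shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("all shards consistent:", ok)
}
```

In a distributed deployment each shard lands on a different drive or node, so any four drives can fail without losing the object.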

MinIO heals at the object level across multiple drives and nodes, a significant advantage over RAID, which heals at the volume level. A corrupted object can be restored in MinIO in seconds, a major improvement over the hours it would take RAID to rebuild the entire volume.
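Continuing the sketch above, healing amounts to reconstructing missing shards: clear the shards that lived on the failed drives and ask the codec to rebuild them from the survivors. Only the affected object's shards are touched, never a whole volume:

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(8, 4)
	if err != nil {
		log.Fatal(err)
	}

	shards, err := enc.Split([]byte("example object payload ..."))
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing three drives by dropping their shards.
	shards[0], shards[5], shards[11] = nil, nil, nil

	// Rebuild only the missing shards from the survivors;
	// no volume-wide rebuild, just this object's shards.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("object healed:", ok) // true
}
```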

And then there's bitrot, or silent data corruption. MinIO uses the HighwayHash algorithm to compute a hash when an object is written and to verify it when the object is read back by the application. MinIO does this in a highly efficient manner and can achieve hashing speeds of 10+ GB/sec on a single Intel CPU core. Neither study referenced above distinguished between bitrot and other types of failure, but we know that as HDDs age the likelihood of bitrot increases. Because bitrot corrupts blocks and RAID is a block-level technology, it is possible to lose the entire RAID group after a single drive fails, because the corrupt blocks will be used to rebuild the new drive. This is a non-issue with object storage erasure coding.
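Here is a minimal sketch of that verify-on-read check using the minio/highwayhash Go package. The all-zero 32-byte key and the object contents are placeholders; in practice the hash is computed at write time and checked on every read:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 256-bit (32-byte) key. An all-zero
	// key is used here purely for illustration.
	key := make([]byte, 32)

	object := []byte("object contents ...")

	// On write: compute the hash and store it with the object.
	h, err := highwayhash.New(key)
	if err != nil {
		log.Fatal(err)
	}
	h.Write(object)
	stored := h.Sum(nil)

	// On read: recompute and compare. A mismatch signals bitrot,
	// and the object can be healed from its erasure-coded shards.
	h2, _ := highwayhash.New(key)
	h2.Write(object)
	fmt.Println("object intact:", bytes.Equal(stored, h2.Sum(nil)))
}
```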

MinIO Erasure Coding for Greater Data Durability

This blog post explained that RAID doesn't have enough parity to protect against what we now know is the statistical rate of HDD failure. Datacenters that rely on RAID arrays of HDDs will inevitably experience drive failures and the hours of downtime needed to rebuild the RAID set with a replacement drive. This is not the way to compete in the modern always-on world.

MinIO erasure coding protects and heals objects with zero downtime and negligible impact on performance.

Check out our erasure code calculator to learn more about MinIO erasure coding.
