Using RAID-5 Means the Sky is Falling!
Why disk URE rate does not guarantee rebuild failure.
Editorial article by Olin Coles for Highly Reliable Systems
Today’s appointment brought me out to a small but reliable business, where I’m finishing the hard drive upgrades for their cold storage backup system. It was an early morning drive into the city, with enough ice on the roads to contribute towards the more than 30,000 fatal accidents that occur each year [1]. The backup appliance I’m servicing has received 6TB desktop hard disks to replace an old set with a fraction of the capacity, so rebuilding the array has taken considerable time.
Their primary storage spans eight disks in a RAID-10 set, which gets archived to the server backup appliance for long-term retention. That backup appliance has a unique cartridge system that safely holds three disks in a redundant array. Later this evening when the project is finally finished, I’ll count myself lucky for surviving the treacherous roadways and lethal cold, but I won’t give a second thought to the risks I took by using RAID-5 on their cold storage devices.
You might not agree, but there are people out there who believe we should not be driving because the statistics indicate it’s clearly a dangerous activity. Nearly every driver in America will be involved in an auto accident at some point in their life [3], and some of those accidents will cause serious injury or death. Those of us not involved in an accident this year, which is more than 96% of all licensed drivers, will drive to our destinations unharmed. Sure, a statistical risk exists, but it’s not an absolute guarantee I’ll be killed on the drive home. The sky is not falling.
Every year, no matter where you live, it gets cold in the winter. This natural occurrence drops temperatures, which could lead to hypothermia and, for a very small portion of people, death. Winter temperatures in the Sierra Nevada can chill you to the bone, which helps explain how more than 1,500 people succumb to hypothermia annually [2]. When I walked from the parking lot to the client’s office this morning it was extremely cold, but just because there is a statistical risk of hypothermia does not mean I’m surely going to freeze to death. The sky is still not falling.
For some strange reason people seem to think that everything changes when we talk about hard disk drives, and that a statistical possibility becomes absolute certainty. Manufacturers conduct abbreviated testing on hard disk components [4], sampling a set number of drives to determine a relative mean time between failures (MTBF) or a maximum unrecoverable read error (URE) rate. Nevertheless, there are people using fear tactics who claim that redundant arrays of large capacity disks, such as the 6TB hard disks I used in those RAID-5 sets, are risky business. Some even go so far as to say RAID-5 will stop working in a particular year, reminiscent of pre-apocalypse Y2K.
In reality, most hard disks seldom see operating temperatures below the chill of a server room or beyond the warmth of rack space, and most disks will not commit a URE that crashes a RAID-5 rebuild. While it is agreed that better parity schemes exist, the exception is not the rule. My customer could have retained cold storage data on individual removable drives, with no redundancy at all. In fact, most organizations already use a single removable disk or cloud container for their nightly backup routine. My customer chose a special backup appliance that fits three disks into a single cartridge, further protecting archived data and proving RAID-5 still has business applications.
But if the opinion of an Internet personality vocal on storage technology is to be revered as the gospel truth [5], then we must forgo these large capacity disks because they’re all purported to carry an “almost certain” unrecoverable read error rate… something to the tune of one error per 10^14 bits read. A guaranteed URE, you ask? Well, it’s not as certain as freezing to death or being killed in an accident, or even both of these statistics combined, but according to the often-cited but seldom verified test methodology, your hard drive will fail to read a sector roughly once every 12TB of data. Such a failure could happen as a RAID-5 array is being rebuilt, with a guaranteed URE striking a sector on one of the surviving disks at exactly 100,000,000,000,000 bits – unless it doesn’t.
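For reference, the 12TB figure is just a unit conversion. The short sketch below assumes the rating is read as one error per 10^14 bits; the function names are illustrative and not taken from any manufacturer's documentation.

```python
def bits_to_terabytes(bits: float) -> float:
    """Convert a bit count to decimal terabytes (1 TB = 10**12 bytes)."""
    return bits / 8 / 10**12


def bits_to_tebibytes(bits: float) -> float:
    """Convert a bit count to binary tebibytes (1 TiB = 2**40 bytes)."""
    return bits / 8 / 2**40


# The quoted consumer-drive rating, read as "one error per this many bits".
URE_INTERVAL_BITS = 10**14

print(f"{bits_to_terabytes(URE_INTERVAL_BITS):.1f} TB")
print(f"{bits_to_tebibytes(URE_INTERVAL_BITS):.1f} TiB")
```

This works out to 12.5 TB in decimal units, or about 11.4 TiB in the binary units most operating systems report, which is where the commonly rounded "12TB" comes from.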
Some writers build their reputation by making audacious claims that create controversy, done solely to help propel traffic onto the website they write for. Common sense and real-world experience be damned; let the lack of evidence claiming otherwise and the use of complex math help prove their confusing point! After all, it’s not like anybody knows exactly how any particular manufacturer came up with a 10^14 error rate, which arbitrarily changes from time to time, or where people can find these clearly documented test procedures. You’re not supposed to question the numbers – you’re just supposed to believe what the manufacturer tells you, and know that, regardless of capacity per disk or number of drives involved, after reading 12TB you will experience an unrecoverable read error. Oh, and that RAID-5 also stopped working in 2009 – except that it didn’t.
We all survived Y2K unscathed, and not surprisingly the end of RAID-5 did not actually happen as predicted. That same author later wrote a follow-up article, and instead of admitting defeat he doubled down and claimed RAID-5 was as doomed as ever because URE rates remained the same in the largest capacity drives. Never mind that there are countless real-world scenarios where RAID-5 continues to be used with great success well into 2015; that’s not important. People forget that the 10^14 bit URE rate is not an absolute; it’s a predictive failure specification, measured for a single disk based on an unknown test sample size of disks. It’s also a marketing ploy, since nearly all consumer desktop hard drives typically receive the same failure rate while enterprise drives magically receive a 10^15 bit URE rate… an entire order of magnitude greater reliability, all without quantified explanation.
It’s possible that people who claim the sky will fall have failed to envision a future with solid state storage, or they’ve misinterpreted a suggested error rate as a predictable mechanical function. Both are likely, yet facts being facts, they still weren’t able to prevent an entire subculture from embracing the notion that RAID-5 does not work, or that all desktop hard disks will suffer a read failure precisely at that 10^14th bit. All that is necessary to disprove this is the successful rebuilding of a RAID-5 set with 12TB or greater capacity, as many primary and backup storage administrators have done countless times.
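The difference between a rate and a guarantee can be put into numbers. The sketch below is a deliberately pessimistic model, not a description of how any drive or controller behaves: it assumes every bit read fails independently at exactly the rated maximum, and that a rebuild reads every surviving disk in full. The helper names are hypothetical, and whether a single URE actually aborts a rebuild depends on the controller, which is not modeled here.

```python
import math

TB = 10**12  # decimal terabyte, as used on drive labels


def raid5_rebuild_read_bytes(disks: int, disk_bytes: float) -> float:
    """Bytes read during a rebuild: every surviving disk, end to end."""
    return (disks - 1) * disk_bytes


def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Chance of one or more UREs, assuming independent per-bit errors.

    Computes 1 - (1 - p)**bits using log1p/expm1 so the tiny per-bit
    probability is not lost to floating-point rounding.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Three 6TB disks per cartridge, as in the appliance described above.
read = raid5_rebuild_read_bytes(disks=3, disk_bytes=6 * TB)
for label, rate in (("10^14", 1e-14), ("10^15", 1e-15)):
    p = p_at_least_one_ure(read, rate)
    print(f"rated {label}: {p:.1%} chance of a URE over {read / TB:.0f}TB")
```

Even taking the consumer rating at face value, this worst-case model gives roughly a 62% chance of hitting a URE across the 12TB read, and roughly 9% at the enterprise rating. That is a risk, not a certainty, and because the published figure is a maximum, drives that perform better than their rating would lower those odds further.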
As we approach an era where solid state drive products reach multi-terabyte capacity with built-in error checking and data management technologies, the argument for unrecoverable errors and subsequent RAID rebuild failures becomes even less valid. It’s foolish to claim a proven technology will one day fail in the far-off future [7], when that future involves dramatic improvements with every product cycle that nobody can predict. If the sky really is falling, next time they’ll just have to shout louder and use proven math.
1. NHTSA, 2012: http://www-nrd.nhtsa.dot.gov/Pubs/811856.pdf
2. CDC, 2010: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6151a6.htm
3. Karen Aho, 2011: http://www.carinsurance.com/Articles/How-many-accidents.aspx
4. Adrian Kingsley-Hughes, 2007: http://www.zdnet.com/article/making-sense-of-mean-time-to-failure-mttf
5. Robin Harris, 2007: http://www.zdnet.com/article/why-raid-5-stops-working-in-2009
6. Robin Harris, 2013: http://www.zdnet.com/article/has-raid5-stopped-working
7. Robin Harris, 2010: http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019