RAID 5 nearly useless for 1.5TB drives?

So, the main question right now is: what would a good RAID controller do if it ran into a URE? Right?

Is there any proof to support either side at this time?

All I know is rebuilding a 1TB disk sucks, lol.
 
Would this then be an argument for keeping multiple smaller (<12TB) arrays? If there is a problem with UREs at that 12TB mark... would RAID 50/60 be something to look into? I would like to see someone actually test this out with a number of different RAID cards and software RAID using 1.5TB drives.
And isn't this also why they recommend using <8 drives in a RAID 5 array?
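For a rough sense of scale, here is a back-of-envelope sketch (my own numbers, not from the article; the drive counts are just examples) of how many UREs you would expect during a rebuild if you take the published 1-in-10^14-bit consumer spec at face value. A rebuild only has to read the surviving members of the affected RAID 5 set:

# Expected UREs during a RAID 5 rebuild, assuming the published 1e-14/bit consumer spec.
URE_PER_BIT = 1e-14
BITS_PER_TB = 8e12                     # decimal TB: 10^12 bytes = 8*10^12 bits

def expected_ures(drives, tb_per_drive):
    # A rebuild reads every bit of the (drives - 1) surviving members.
    return (drives - 1) * tb_per_drive * BITS_PER_TB * URE_PER_BIT

for n in (4, 6, 8, 12):
    print(f"{n} x 1.5TB RAID 5: ~{expected_ures(n, 1.5):.2f} expected UREs per rebuild")

A RAID 50 built from two 6-drive sets only exposes one 6-drive set per rebuild, which is presumably the appeal of keeping the sub-arrays small.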
 
Maybe I'm dense, but if huge drives in a RAID statistically fail before they can be rebuilt, how do they get built in the first place without errors?
 
Maybe I'm dense, but if huge drives in a RAID statistically fail before they can be rebuilt, how do they get built in the first place without errors?

Don't worry about it; it's really paranoid people who are worried about drives statistically failing because of the rebuild time their size requires... by this logic, a person would have a drive failing every week.
 
Don't worry about it; it's really paranoid people who are worried about drives statistically failing because of the rebuild time their size requires... by this logic, a person would have a drive failing every week.

I guess it depends though, Ockie. While the guy from the article might be off on his numbers, if you are a business these probabilities matter because they will help determine your maintenance costs over the next year(s). If you start approaching 100% on rebuild failure... it may end up costing you weeks of downtime trying to get the array back online. It is a risk assessment.

In terms of the home, I guess it really is coming to the point where the average somewhat competent but still ignorant user no longer gets to put a half dozen drives in RAID 5 and have an awesome chance of success even if a drive fails down the road. I think we are going to start seeing more and more people who set up RAID 5 arrays and have failures pop up on the forums as HDD density increases.
 
And I'm sure one drive somewhere probably does fail every week.........

All things aside, it's a possibility. One that manufacturers and people in the business seem to care about a lot more than you or I.

It definitely makes for some interesting theorycrafting though.
 
I guess it depends though, Ockie. While the guy from the article might be off on his numbers, if you are a business these probabilities matter because they will help determine your maintenance costs over the next year(s). If you start approaching 100% on rebuild failure... it may end up costing you weeks of downtime trying to get the array back online. It is a risk assessment.

In terms of the home, I guess it really is coming to the point where the average somewhat competent but still ignorant user no longer gets to put a half dozen drives in RAID 5 and have an awesome chance of success even if a drive fails down the road. I think we are going to start seeing more and more people who set up RAID 5 arrays and have failures pop up on the forums as HDD density increases.

That's why SAS drives are there; enterprise systems should use SAS, not SATA, to store business-critical data.

If you do use SATA for your data in business, be sure to have a darn good proven backup system.
 
You're right, most of this pertains to RAID 5 in a desktop environment it seems. But a lot of the argument isn't based on when a drive will fail; it's based on UREs.
 
That's why SAS drives are there; enterprise systems should use SAS, not SATA, to store business-critical data.

If you do use SATA for your data in business, be sure to have a darn good proven backup system.

I do not disagree with you...even in the slightest.

Still on topic, I think what people may not completely understand about what I am saying is that there has been a proven track record of people running RAID 5 controllers and moderately sized arrays with consumer-grade drives (as proven by Ockie). It didn't matter whether what they were doing was in their best interest or not... it ended up being that the cost of a RAID 5 array with consumer drives was good and the chance of failure was low. I think Ockie has already pre-empted quite a bit of this by having full copies of his data on a different server.

He was ahead of the game...whether it was intentional or gut instinct, that I do not know.
 
I think Kipper may have said a gem without realizing it. Most of the people here seem to realize RAID isn't a backup solution. Correct me if I am wrong, but RAID 5 is meant to allow you to lose one drive and keep running temporarily in a degraded state. I think the problem is when people rely on RAID 5 rebuilds as the only means of backup. That, coupled with larger drive sizes and the URE rate factored in, equals disaster. I also read that since it takes so long to rebuild an array you have a far better chance of a second drive failing due to the stress of long periods of high usage; while that seems to make sense, I didn't find any facts to back it up, so if anyone has any facts on that I would be interested.
 
I think Kipper may have said a gem without realizing it. Most of the people here seem to realize RAID isn't a backup solution. Correct me if I am wrong, but RAID 5 is meant to allow you to lose one drive and keep running temporarily in a degraded state. I think the problem is when people rely on RAID 5 rebuilds as the only means of backup. That, coupled with larger drive sizes and the URE rate factored in, equals disaster. I also read that since it takes so long to rebuild an array you have a far better chance of a second drive failing due to the stress of long periods of high usage; while that seems to make sense, I didn't find any facts to back it up, so if anyone has any facts on that I would be interested.

Technically, you could improve your odds by using RAID 5 with a hot-swap spare. Still, I don't think this is a HUGE deal, but it's something to keep in mind. I really don't know why you would use RAID 5 with greater-than-1TB drives, but that's me.

If you're going with drives that large, use RAID 10 so you have the performance.
 
That's why SAS drives are there; enterprise systems should use SAS, not SATA, to store business-critical data.

If you do use SATA for your data in business, be sure to have a darn good proven backup system.
The interface of a drive has little to do with how reliable it is. Seagate makes the same physical drive with either a SAS or SATA interface; would you argue that one is more reliable than the other?
You know the answer as well as I do: they're called backups. And RAID is NOT a backup.
Agreed, but if your business depends on uptime, backups are not the end-all be-all here, and some level of RAID (or other uptime/replication mechanism) is necessary.

Running backups is damn important, but that's not what I see as the problem here. If you have a 20TB RAID 5 array and it goes down, restoring from tape, even at 100 MB/s, will still take 200k seconds = two months! Even if you are restoring from another 20TB array that can transfer over the network at 10GigE speeds (1 GB/s), that *still* takes 20k seconds = five days! RAID is sufficient to keep such an array functioning while disks are resilvered, if it's used right (single-digit-sized RAID 6 groups are a good rule of thumb), but most businesses can't deal with five days of downtime to get the data back online, let alone two months.
 
The interface of a drive has little to do with how reliable it is. Seagate makes the same physical drive with either a SAS or SATA interface; would you argue that one is more reliable than the other?

Agreed, but if your business depends on uptime, backups are not the end-all be-all here, and some level of RAID (or other uptime/replication mechanism) is necessary.

Running backups is damn important, but that's not what I see as the problem here. If you have a 20TB RAID 5 array and it goes down, restoring from tape, even at 100 MB/s, will still take 200k seconds = two months! Even if you are restoring from another 20TB array that can transfer over the network at 10GigE speeds (1 GB/s), that *still* takes 20k seconds = five days! RAID is sufficient to keep such an array functioning while disks are resilvered, if it's used right (single-digit-sized RAID 6 groups are a good rule of thumb), but most businesses can't deal with five days of downtime to get the data back online, let alone two months.

Uhm yea your math is a little off...=). 20k seconds is ~6 hours...not 5 days...you forgot there's more than 1 hour in a day =p.
 
I guess it depends though, Ockie. While the guy from the article might be off on his numbers, if you are a business these probabilities matter because they will help determine your maintenance costs over the next year(s). If you start approaching 100% on rebuild failure... it may end up costing you weeks of downtime trying to get the array back online. It is a risk assessment.

In terms of the home, I guess it really is coming to the point where the average somewhat competent but still ignorant user no longer gets to put a half dozen drives in RAID 5 and have an awesome chance of success even if a drive fails down the road. I think we are going to start seeing more and more people who set up RAID 5 arrays and have failures pop up on the forums as HDD density increases.

Well, enterprise storage systems with well-known reliability are starting to sell these drives.

Also, within the scope being argued here, you can't have a drive failing weekly unless you have one massive array, but then you wouldn't be a business, and you'd be an idiot for running RAID 5 arrays across hundreds of disks.

And I'm sure one drive somewhere probably does fail every week.........

All things aside, it's a possibility. One that manufacturers and people in the business seem to care about a lot more than you or I.

It definitely makes for some interesting theorycrafting though.

Nothing fails every week within the scope presented in this thread. Yes, at work we lose a drive somewhere between every couple of days and once a week, but we have several THOUSAND.

That's why SAS drives are there; enterprise systems should use SAS, not SATA, to store business-critical data.

If you do use SATA for your data in business, be sure to have a darn good proven backup system.

Not accurate. You get enterprise SATA storage devices. SAS is becoming more of a performance-oriented storage option ideal for virtualization environments, while SATA is becoming more of an enterprise standard for bulk storage.
 
I just don't see what the interface has to do with reliability in the context of the article.

How could you say that my 1TB Seagate ES2 SATA drive is less reliable than my 1TB Seagate ES2 SAS drive? That's completely irrational.
 
I just don't see what the interface has to do with reliability in the context of the article.

How could you say that my 1TB Seagate ES2 SATA drive is less reliable than my 1TB Seagate ES2 SAS drive? That's completely irrational.

The drive itself isn't more or less reliable; the SAS version may be built slightly better, but not enough to make it matter in this thread.

I'm referring more to SAS controllers for enterprise systems. This article seemed to be directed toward SATA drives only and the SATA URE issue. What would this article look like if written about the SAS interface?
 
Uhm yea your math is a little off...=). 20k seconds is ~6 hours...not 5 days...you forgot there's more than 1 hour in a day =p.
Whoops, yeah. Divide the numbers by 24 and they're more reasonable. The point stands, though: even with backups there can be significant downtime when an array fails.
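Spelled out, with the same hypothetical 20TB array and assumed (not benchmarked) transfer rates:

# Restore-time arithmetic for a hypothetical 20 TB array.
ARRAY_BYTES = 20e12                                  # 20 TB
for label, rate in (("tape/GigE at ~100 MB/s", 100e6),
                    ("10GigE at ~1 GB/s", 1e9)):
    seconds = ARRAY_BYTES / rate
    print(f"{label}: {seconds / 3600:.1f} hours ({seconds / 86400:.1f} days)")
# -> roughly 2.3 days at 100 MB/s and about 5.6 hours at 1 GB/s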
the SATA URE issue
The unrecoverable error rate exists because the bits are stored on a hard drive. All drives, regardless of interface, get unrecoverable errors. SAS drives are usually quoted in the range of 10^15 BER, compared to 10^14 for SATA, because of stronger ECC or better firmware that somehow prevents errors. This makes them safer, perhaps, but still not completely reliable.
What would this article look like if written about the SAS interface?
About the same. SAS and SATA both have decent checksums on every block sent over the wire, so corruption on the wire isn't usually a big problem.

The solutions to this problem, IMO, are double parity of some sort and software-based checksums on the OS side. ZFS gives me both and I'm very happy with it.
 
The solutions to this problem, IMO, are double parity of some sort and software-based checksums on the OS side. ZFS gives me both and I'm very happy with it.

Right; even with checksums in place, HDs will always have errors. The key is how the controller handles them, so the bottom line is that with a good controller with ECC/checksums, rebuilding an array would be completely possible and would handle any errors that do happen.

So in the end, RAID 5 still has a happy home.
 
Right; even with checksums in place, HDs will always have errors. The key is how the controller handles them, so the bottom line is that with a good controller with ECC/checksums, rebuilding an array would be completely possible and would handle any errors that do happen.

So in the end, RAID 5 still has a happy home.


Define "good controller"
 
So all the low-dollar cards and mobo RAID are no good for RAID 5 anymore, maybe?? I mean, it seems like a game of chance if you're running a controller with no ECC or some type of system built in to overcome an error.
 
If you do any form of streaming media playback, I'd avoid these drives until there's a firmware update. See http://forums.seagate.com/stx/board...&thread.id=2390&view=by_date_ascending&page=4. I bought 6 of these drives to stick in a ReadyNAS Pro, and with write cache enabled, playing QuickTime video podcasts results in periodic pauses. Even with FLAC files. The price point of these drives is damn good. I just hope they fix this with firmware soon.

Edit: Sorry if this seems like a hijack. Just realized this thread is more on RAID 5's usefulness with high-capacity drives. All I know is I'd rather have RAID 5 vs. no RAID at all and lose a drive. :)
 
So all the low-dollar cards and mobo RAID are no good for RAID 5 anymore, maybe?? I mean, it seems like a game of chance if you're running a controller with no ECC or some type of system built in to overcome an error.

They are "ok". Basically, let me put it into car analogy.

You want a car thats serious for track racing, while the Civic is a great car and is good, it's not ideal, you'd need something like a Lotus Elise or a Corvette.


Same goes for the controller, while the cheap ones will do the trick, if you are serious and you really value your data, you will get a better controller.

IMO good controllers start at $300. Great controllers starts at 700.
 
I agree with you 100% on that; you definitely get what you pay for.

And I am not big on home RAID, as I don't use it, except for a RAID 0 stripe on a backup PC that I rarely use. I am just going with what I do know and the numbers. I am sure the better RAID 5 setups (the ones with some sort of way to overcome a URE) won't have problems. I think a lot of the problem is gonna come from uninformed or misinformed people running larger RAID 5 arrays. I mean, with almost all the mobos on the market putting an onboard RAID controller there, it opens up a huge can of worms. I know all the mobo manufacturers do it to keep it a level playing field, that's simple business, BUT they almost make it too easy.

And just to clarify, although I don't use any RAID, I definitely see its worth when implemented properly. Just for my needs (small network of 3 PCs, 2.5TB total storage) it isn't worth the hassle.

Edit: Also, to add to my point about the uninformed, to date, if you wanted to use RAID as a backup (I know you're not supposed to, but it happens), RAID 5 has been the standard to use. While I don't believe every RAID 5 array over 12TB will fail, I believe we will see more "Lost my RAID 5 and can't recover, help me please" threads here.

In summary, the facts are there: if you're running RAID 5 in a "lower end" setup, you MAY have problems recovering. I guess it's up to the end user to decide how valuable their data is, but they should definitely be aware that RAID 5 in a desktop environment is changing.
 
That article about drive errors was extremely misleading. The unrecoverable read error rate as referenced, 10^14, is a drive manufacturer's spec, which is spindle-based - NOT the total amount of data the RAID system must read (from multiple drives) during a rebuild. So, every 12 terabytes read from any given drive will result in one URE. Your RAID 5 system will be just fine.
 
That article about drive errors was extremely misleading. The unrecoverable read error rate as referenced, 10^14, is a drive manufacturer's spec, which is spindle-based - NOT the total amount of data the RAID system must read (from multiple drives) during a rebuild. So, every 12 terabytes read from any given drive will result in one URE. Your RAID 5 system will be just fine.

Your logic is also misleading. Sorta like saying condoms are 99% effective and thus you can have sex 99 times before they will fail. It just doesn't work that way.
 
That article about drive errors was extremely misleading. The unrecoverable read error rate as referenced, 10^14, is a drive manufacturer's spec, which is spindle-based - NOT the total amount of data the RAID system must read (from multiple drives) during a rebuild. So, every 12 terabytes read from any given drive will result in one URE. Your RAID 5 system will be just fine.

This isn't like some counter that ticks down to a URE; it's an estimate of the probability that one will happen. So 12 terabytes read from any *set* of similar drives will, on average, result in one URE. If your disks are 2TB and you have seven, RAID 5 recovery from the six disks that remain is problematic.
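Putting rough numbers on that 7 x 2TB example, taking the published URE specs at face value (real drives may do better or worse):

# Chance the rebuild reads all six surviving 2 TB drives without hitting a URE.
bits_to_read = 6 * 2e12 * 8                          # six drives x 2 TB x 8 bits/byte

for label, p_ure in (("consumer SATA spec, 1e-14", 1e-14),
                     ("enterprise/SAS spec, 1e-15", 1e-15)):
    p_clean = (1 - p_ure) ** bits_to_read            # every bit must read back cleanly
    print(f"{label}: {p_clean:.0%} chance of a URE-free rebuild")
# -> roughly 38% at 1e-14 versus roughly 91% at 1e-15, if you believe the specs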
 
After reading the article a few times and understanding its spirit, I think the article, for the most part, is correct. While it is a brilliant level of FUD, it does touch on the truth that "the larger the individual disks in a consumer-drive RAID 5 array get, the larger the chance of losing a file on a rebuild". This used to be ignorable for the most part, since the URE rate was high compared to the array size. That ratio, though, has shrunk considerably in the past 3-4 years.

Consider that video files are getting stupidly large, with absolutely no form of protection on them. While some formats are somewhat "okay" with a flipped bit or two, if we start having numerous 4+GB files from camcorders or other "videos"... the chance of getting enough corrupted data could approach "oh crap".

I have tested this a few times and it is quite easy to do. What you need to do is get about 100GB or so of torrents that are actively running and save the ".torrent" files. Copy the files to a new drive and then reload those torrents on the copied drive. When it does a "verify", you will find out quite often that at least ONE of the files now has corrupted data on it and will need to rebuild the corrupted pieces. It truly is scary IMO.
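For anyone who wants to try that copy-and-verify test without a torrent client, here is a rough sketch of the same idea using plain SHA-256 hashes (the mount points are placeholders, not real paths):

# Hash every file under the original tree and compare against the copied tree.
import hashlib, os

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def compare_trees(src_root, dst_root):
    mismatches = []
    for dirpath, _, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            if sha256_of(src) != sha256_of(dst):
                mismatches.append(src)
    return mismatches

print(compare_trees("/mnt/original", "/mnt/copy"))   # placeholder mount points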

I would like to see the OS, or even the file formats themselves, cover "data integrity" in the future. There is just no excuse anymore not to put some sort of "parity feature" in any file format. Whether it is a JPEG or a large H.264 file, why are we letting small amounts of data corruption completely destroy a file when neither space nor CPU power is a problem?
 
After doing the math, you guys are right: a 2TB disk drive holds 16×10^12 data bits. With an average bit error rate of 1 per 10^14 bits, reading six such drives uses up that whole budget: (1×10^14)/6 ≈ 16×10^12 bits per drive, which is exactly one drive's worth of data. Theoretically, the rebuild will fail.

My guess is that just as drive manufacturers upped the MTBF number, the same will occur for the URE spec.
 
Let's assume "rebuild" means using the same set of drives and replacing the failed one. Simply "recovering" an N-1 RAID 5 is as simple as copying all the files to another set of HDs. That's a once over read of all the space on each of the disks. And you guys are saying if the array is too large, simply reading from all these drives will result in a drive failure? BS.

If so, why the hell wouldn't these arrays fail during normal use? Surely accesses over normal use amount to over 100x the amount during a rebuild.

Bit errors (flipped bits) are irrelevant to RAID 5 because they cannot be detected or corrected. You'll get them in a working array or an N-1 array, so files may become corrupt regardless. RAID 5 protects against a hard drive failure.
 
Let's assume "rebuild" means using the same set of drives and replacing the failed one. Simply "recovering" an N-1 RAID 5 is as simple as copying all the files to another set of HDs. That's a once over read of all the space on each of the disks. And you guys are saying if the array is too large, simply reading from all these drives will result in a drive failure? BS.

If so, why the hell wouldn't these arrays fail during normal use? Surely accesses over normal use amount to over 100x the amount during a rebuild.

Bit errors (flipped bits) are irrelevant to RAID 5 because they cannot be detected or corrected. You'll get them in a working array or an N-1 array, so files may become corrupt regardless. RAID 5 protects against a hard drive failure.

In normal use the file will be repaired via parity; in a degraded array the file will become corrupt. However, some controllers really do fail the rebuild on a URE. Not only in RAID 5, but at all RAID levels.

A quick Google search for:

+"unrecoverable read error" +"rebuild" +"Failed" +site:hp.com

Gives the following support hits:

http://forums11.itrc.hp.com/service...447626+1225202140752+28353475&threadId=830507

http://forums11.itrc.hp.com/service/forums/bizsupport/questionanswer.do?threadId=716121
 
Man, go away for a few days...

After some more investigation, it seems like most controllers out there would fail on a URE during rebuild. They would have nothing to fall back on, and the data integrity would be ruined. I think this affects parity calculations far worse than it affects everyday usage of a hard drive. I don't believe a URE by itself will cause a RAID to fail when it's in good health.

Again, I think the lottery analogy is sound in this case. I can read 200 trillion bits and not run into a URE, but I can also read 1 bit and get a URE.

This really just underlines that RAID is "not a backup". But I don't know how sound this guy's numbers are. What are the real odds of an 8x1.5TB RAID array failing a rebuild? I do believe this is much more related to the parity checking of RAID 5/6 than it is to RAID 0/1.
 
But I don't know how sound this guy's numbers are.

His numbers are sound based upon published figures. However, their accuracy is definitely not known. HDD manufacturers say their failure rate is less than 1%, but Google's study showed 3%. Other studies have shown that same 3% failure rate. Therefore the published URE spec might be high or low; we just don't know. But... considering that every reliability number seems to be spotty as of late in many sectors of technology, I would assume 10^14 is the best you are gonna get unless you go with enterprise-grade drives.

As I said above, better EDAC or ECC needs to be added to these large drives...it is just getting scary.
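As a side note on those failure-rate figures, here is what they mean for a small array (the 8-drive count is just an example; 1% is the manufacturer-ish number, 3% is roughly what the Google study reported):

# Chance that at least one drive in an 8-drive array fails within a year.
drives = 8
for label, afr in (("manufacturer spec, ~1%", 0.01),
                   ("Google study, ~3%", 0.03)):
    p_any = 1 - (1 - afr) ** drives
    print(f"{label}: {p_any:.1%} chance of at least one failure per year")
# -> about 7.7% versus about 21.6%, which is why the published numbers matter for planning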
 
Drive manufacturers could provide disks with a much lower probability of UREs; it's a matter of cost. Disk drive media have inherent bad spots that are remapped by the manufacturer. Anything that cannot be corrected with ECC is mapped to an area that is defect-free. However, these bad spots can be bit-sensitive, data-sensitive, and/or on the edge of the ECC burst error capability. After a finite number of transfers, one of these spots could result in a URE.

Even most commercial storage system suppliers, who purchase drives under OEM contracts with much tighter specifications than we're getting from Newegg, still end up testing their systems prior to shipping, running a series of varying data patterns for anywhere from 24 to 72 hours (sometimes longer) to weed out the spots the manufacturer may have missed.
 
Let's keep this going: what exactly would we lowly consumers have to do in order to find these "weak spots"? Where is our margin of "return"? With a program and an eSATA docking bay (like Kyle uses), we could put ourselves in a much better position for the long run. Which controllers won't "stop" on a single URE during a rebuild? Which of the controllers that don't stop on a URE will actually report useful information to help us "recover" the failed file?

There is a chance here to do something good for the community and potentially come up with another "sticky". I just don't have that level of knowledge....yet.
 
What does WHS do when you hit a URE and you have folder duplication enabled? Is it intelligent enough to repair the corrupted file on one disk using the copy on the disk that is not corrupted?

This whole debacle seems to make quite the case for going to WHS.
 
I posted a question to the wegotserved forums with very specific questions. However, I do not think they are "applicable" directly. I think they are covered in other ways.

WHS does a CHKDSK every night on every disk.

If it finds a "bad drive", it will tell you. You can then tell it to "repair". It will do things to try and use the disk that is there. If it cannot do it... it will tell you. However, if you get a failure on a file that is NOT duplicated, you will probably have lost that file if you do not have it backed up somewhere else (this makes sense).

One note is that WHS does not assume that every "error" is the sign of a "debacle". It assumes that if something goes wrong with the drive, it is trivial; it might just be a loose cable, etc. That is why after a failed CHKDSK, the first thing it does is another CHKDSK.

If the repair "fails", nothing more occurs other than giving you an error message of the appropriate type. If you want WHS to re-duplicate the files from the failed drive without replacing it, you will need to remove the "bad drive" so WHS can do its thing. If it can't duplicate all the files again... well, sucks to be you, another error message. Therefore, if you have the space to duplicate again, go for it. If you don't, get another HDD on order ASAP.

I guess what it comes down to is no different than RAID 5: if you run your system in a degraded mode, you are just one more failure away from the shit hitting the fan. WHS does cover a lot of the issues.

Note: WHS does some pretty significant checks when duplicating data, so as long as the data that was written was valid, the duplicate should be valid. There are probably some coffin-corner conditions... but none that appear probable.

EDIT: Ockie's favorite Supermicro card can still be bought --> http://www.newegg.com/Product/Product.aspx?Item=N82E16815121009

It can also be plugged into standard PCI slots if you have the room. ;) 4-port non-RAID SATA cards are also cheap.
 
I'm having a problem with this URE scare. If this is such an issue, let me know how to duplicate it. Any suggestions are fine.

The claim is one read error per 12TB read?

How about:

while true ; do md5sum /dev/sdc ; done

On an (obviously unmounted) 1TB drive, for a week or however long you want. It should be able to read ~6TB per day assuming 60MB/sec. I could do 4 drives or something to speed things up; that'd be 24TB/day. The chance of a single read error occurring and not causing a different checksum is close to impossible, and at 24TB/day I should get a couple of mismatches per day... it would be beyond just impossible for every error to go uncaught.

Would this be invalid for any reason? I can populate the drives with random data (/dev/urandom) first... it's damn slow, but whatever.
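If anyone would rather script it than loop md5sum by hand, here is a rough Python take on the same experiment (the device path is an example; run it as root against a disk you can afford to hammer):

# Repeatedly read an unmounted drive end to end, hash it, and log read errors or
# checksum changes. A URE should surface here as an OSError (EIO) on read.
import hashlib, time

DEVICE = "/dev/sdc"                 # example device; must be unmounted
CHUNK = 4 * 1024 * 1024

def hash_device(path):
    h = hashlib.md5()
    with open(path, "rb", buffering=0) as dev:
        for block in iter(lambda: dev.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

baseline = hash_device(DEVICE)
passes = 0
while True:
    passes += 1
    try:
        digest = hash_device(DEVICE)
    except OSError as e:
        print(f"pass {passes}: read error: {e}")
        continue
    status = "OK" if digest == baseline else "MISMATCH"
    print(f"pass {passes}: {status} {digest} ({time.ctime()})")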
 