Using RAID-5 Means The Sky Is Falling

JoeComp · Mar 12, 2015

I see that Backblaze is using triple-parity (their own implementation) with their latest design.

https://www.backblaze.com/blog/vault-cloud-storage-architecture/

Aesma · Mar 13, 2015

If you have high I/O needs, why are your still using hard drives ?

What about high-availability storage ?

Zarathustra[H] · Mar 13, 2015

So RAID1 mirrors are essentially just a full duplicate of the data on a separate drive.

One thing I have never understood about this is when the two drives eventually differ on a read or scrub. (At some point they will) how does the system know which one is correct and heal the pair?

Zarathustra[H] · Mar 13, 2015

Darth Millennial said:
EMC still recommends RAID 5. I trust them far more than any journalist or forum scrub.

They probably also recommend nightly offsite backups

I mean, this stuff all comes down to a risk based approach.

The higher level of redundancy/parity, the higher the cost, and the lower risk of data loss.

You have to balance your needs with the costs associated with them.

If your backups are difficult and time consuming to restore, and are mission critical, you will probably want a higher level of redundancy. if your backups are easily and quickly restored, and it isn't the end of the world if the data is offline for a day or two, you can get away with higher risk.

Because of this, there will always be a use for all of these levels of redundancy.

No one should be relying on RAID solutions to replace backups, even if you go to the very high level of redundancy. RAID protects you from hard drive failures, not from file system damage, accidental deletion, virus/malware data loss, faulty RAM, fire, flooded server room, etc. etc.

Because of this everyone should have frequent backups, and because everyone should have frequent backups, the very high levels of redundancy should be completely unnecessary, unless your downtime cost during restoring from a backup is exorbitant.

RAID doesn't protect you from data loss. It protects you from the inconvenience/time/cost it takes to restore your backup.

Child of Wonder · Mar 13, 2015

Darth Millennial said:
EMC still recommends RAID 5. I trust them far more than any journalist or forum scrub.

This. They also recommend RAID 6 on NL-SAS drives since the very large drives can take over a day to complete a rebuild.

I'm not going to be "that guy" who went against the manufacturer's recommendations and caused data loss for a client.

SirMaster · Mar 13, 2015

Zarathustra[H];1041482398 said:
So RAID1 mirrors are essentially just a full duplicate of the data on a separate drive.

One thing I have never understood about this is when the two drives eventually differ on a read or scrub. (At some point they will) how does the system know which one is correct and heal the pair?

It doesnt, same with RAID 5.

You need checksums. This is why a RAID system like Linux kernel MD RAID is not immune to bitrot and why filesystems like ZFS and BTRFS were invented in the first place.

Sure there is a scrub function in something like MD RAID to reconcile the mismatched bits, but it only has a 50% chance per bit to be the right data and "undo" the bitrot.

Zarathustra[H] · Mar 13, 2015

SirMaster said:
It doesnt, same with RAID 5.

You need checksums. This is why a RAID system like Linux kernel MD RAID is not immune to bitrot and why filesystems like ZFS and BTRFS were invented in the first place.

Sure there is a scrub function in something like MD RAID to reconcile the mismatched bits, but it only has a 50% chance per bit to be the right data and "undo" the bitrot.

I know that distinction, but I was thinking in ZFS terms. How does a ZFS mirror handle this same concern? or indeed, how does it handle it in RAIDz1, RAIDz2 or RAIDz3?

If I understand it properly (and my understanding is admittedly pretty weak of the underlying math) a check-sum against the parity will only tell you if the block of data matches or not. it won't tell you what it should be.

That's why I've always considered ZFS to kind of be "black magic"

omniscence · Mar 13, 2015

SirMaster said:
It doesnt, same with RAID 5.

You need checksums. This is why a RAID system like Linux kernel MD RAID is not immune to bitrot and why filesystems like ZFS and BTRFS were invented in the first place.

Sure there is a scrub function in something like MD RAID to reconcile the mismatched bits, but it only has a 50% chance per bit to be the right data and "undo" the bitrot.

I have been using Linux mdraid for almost 15 years now and basically used external checksums for media files (CRC32) since the beginning. I never encountered a single corrupted file on a software RAID5/6.
There have been many unreadable sectors on disks, but since the disk does not read such sectors but gives an error, the RAID layer can reliably restore that from parity.
Thus, the chance to get the right content back is not 50% but 100% in such case.

I now use ZFS, but the checksumming was a secondary reason.I have not seen a single case where ZFS detected a checksum error while the disks did not have an URE at the same time.

Aesma · Mar 13, 2015

Zarathustra[H];1041480502 said:
And by doing so you go explicitly against the ZFS recommendations, which generally state not to use more than 10 drives in any single vdev.

It might work fine, but I see violating the projects recommendations as a unacceptable risk when dealing with my data

Isn't that recommendation linked to performance concerns ?

Zarathustra[H];1041482532 said:
I know that distinction, but I was thinking in ZFS terms. How does a ZFS mirror handle this same concern? or indeed, how does it handle it in RAIDz1, RAIDz2 or RAIDz3?

If I understand it properly (and my understanding is admittedly pretty weak of the underlying math) a check-sum against the parity will only tell you if the block of data matches or not. it won't tell you what it should be.

That's why I've always considered ZFS to kind of be "black magic"

ZFS doesn't checksum against the parity, but against the checksum data that is stored on the pool. If the data doesn't check out, then it will use the parity to restore, if you have parity, or copies>1 activated on the pool. Restoring with parity involves doing some maths, a bit like solving an equation. It does know what it should be.

brutalizer · Mar 13, 2015

omniscence said:
I have been using Linux mdraid for almost 15 years now and basically used external checksums for media files (CRC32) since the beginning. I never encountered a single corrupted file on a software RAID5/6.
There have been many unreadable sectors on disks, but since the disk does not read such sectors but gives an error, the RAID layer can reliably restore that from parity.
Thus, the chance to get the right content back is not 50% but 100% in such case.

Some errors the disk even notice:
http://en.wikipedia.org/wiki/ZFS#Data_integrity
"phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc."

Anyway, the more data you have, the higher the probability of corrupted files. If you have only, say 1TB data in total (like a couple of years ago) the chances are small that you get data corruption (unless you wait a couple of years, look at old Amiga discs today, not many will work given enough time).

In the future, everybody will have 10TB data or more. Large and fast raids will be common, so it is easy to transfer more than 10^16 bits. And when that happens more and more frequently, you WILL see data corruption. Because a enterprise disk has typically an bit error rate on every 10^16 bits.

As long as you stay far below 10^16 bits, there is no point in having checksums. But when you frequently go into this large domain, you will see data corruption. And that is when you should switch to a checksummed solution like ZFS. Sure, you can checksum files manually and update the checksums manually when you edit a file, or you can let ZFS do that for you and spend your valuable time on funnier things than brain dead book keeping?

SirMaster · Mar 13, 2015

Zarathustra[H];1041482532 said:
I know that distinction, but I was thinking in ZFS terms. How does a ZFS mirror handle this same concern? or indeed, how does it handle it in RAIDz1, RAIDz2 or RAIDz3?

If I understand it properly (and my understanding is admittedly pretty weak of the underlying math) a check-sum against the parity will only tell you if the block of data matches or not. it won't tell you what it should be.

That's why I've always considered ZFS to kind of be "black magic"

It's not magic. Whether you ar using a mirror or using parity you hve more than 1 "copy" of each block. Whether the copy comes from a literal mirror copy or comes from a copy generated by solving the parity equation makes no difference.

If the checksum does not match for one of your copies of the block, then ZFS checks the other block copy and verifies that it still matches the original checksum. If it does match then it rewrites a new copy of the corrupted block with the correct one which it knows is correct because it matches the checksum.

If the other copy block does not match, then ZFS will log permanent damage to the pool and let you know which file this block is contained in via "zpool status".

You then need to delete the file from your pool and presumably restore it from backup.

SirMaster · Mar 13, 2015

omniscence said:
I have been using Linux mdraid for almost 15 years now and basically used external checksums for media files (CRC32) since the beginning. I never encountered a single corrupted file on a software RAID5/6.
There have been many unreadable sectors on disks, but since the disk does not read such sectors but gives an error, the RAID layer can reliably restore that from parity.
Thus, the chance to get the right content back is not 50% but 100% in such case.

I now use ZFS, but the checksumming was a secondary reason.I have not seen a single case where ZFS detected a checksum error while the disks did not have an URE at the same time.

Well if you were using RAID 6 it's not 50%. RAID 6 has 3 copies so it could tell which is correct because when 1 copy becomes corrupted it can check the other 2 and it knows that the correct copy is the one that 2 copies match. Unless 2 copies of the same block become corrupted, but this is very unlikely.

But for RAID 5 I am not sure exactly how it can know which copy (the primary copy, or the parity copy) is the corrupt copy. Same with mirror.

My intuitive suspicion is that the disk should usually know when a bit has become corrupt due to internal disk ECC. So the RAID system must be asking the disks about each copy of the block and if the disk knows that one must have become corrupt because it doesn't match the disk's ECC, then it tells the RAID layer that and the RAID knows that is the corrupted one.

So to determine which is correct it still requires checksums in the end.

So if you were to use DD to edit bits on a disk in a Linux MD RAID I do not believe that the MD RAID could fix this (since DD will cause the internal disk ECC to be updated appropriately). But obviously ZFS can detect this error.

I believe there are studies that show that the disk-internal ECC is not completely reliable either, thus the whole reason for filesystems like ZFS and BTRFS.

Zarathustra[H] · Mar 13, 2015

brutalizer said:
As long as you stay far below 10^16 bits, there is no point in having checksums. But when you frequently go into this large domain, you will see data corruption. And that is when you should switch to a checksummed solution like ZFS. Sure, you can checksum files manually and update the checksums manually when you edit a file, or you can let ZFS do that for you and spend your valuable time on funnier things than brain dead book keeping?

Not really how it works.

I'll use my WD Red's as an example.

They have a speced error rate as 1 in 10^14.

What this means is on average you will have one error in 10^14.

10^14 bits is ~11.37 TB

So lets assume you have a RAIDz1 with 5 4TB disks.

If during normal operations you have an error, you are safe, because you are using ZFS it heals itself on read. (if you had traditional RAID this would not be the case)

Now, lets assume one disk fails. You still have all your data, but you no longer have any redundancy or self healing capabilities.

The next step is to rebuild it with a new 4TB drive. During the rebuild you are reading 4x4TB = 16TB (well, 14.55 TB because of 1000 vs. 1024) without parity from the remaining drives.

Per published error rates you average 1 bit error on average per 11.37TB, so on average if you do a rebuild many times, you will have 14.55/11.37 ~1.28 errors during the rebuild.

Now lets assume you have smaller disks. Instead of 4TB you use 2TB.

Again, a disk fails, and you have to read 8TB (actually 7.28TB) in order to rebuild your parity.

You arent magically safe from errors, just because this figure is smaller than the 11.37TB above. it simply means that on average, doing this same rebuild many times, per run you will have 7.27/11.37 ~ 0.64 errors.

Some notes on the above.

You can't have a fraction of a bit error. This is an average, and presumably follows a near normal distribution. You might do the rebuild and have 0 errors, and you might do a rebuild and have 5, 6 or even 10 errors. It's an average, not an absolute.

Furthermore, even if you do have a a bit or two of errors during rebuild, you MIGHT never notice. They may be in a file somewhere you never open again, or it might be embedded in some media and shows ip like a tiny blip you don't even notice.

Then again, it might fall somewhere important, completely destroying an important file, or tanking the file system all together.

Also, presumably, the 10^14 figure above is a worst case. They guarantee it has this average error rate or better. Each disk will be slightly different, some better than others, but they are specifying it will be no worse than 1 in 10^14.

This obviously is going to start to slip as the drive gets old and close to completely failing too. They don't guarantee this figure for all eternity. presumably (unless you swap your drives religiously as soon as they get a couple of years old) it is when they age like this you are going to ahve the most problems, and when you'll need this parity the most.

Either way, for my purposes RAIDZ (ZFS RAID5 equivalent) is insufficient. If I had a disk failure in a RAID5 array, no matter how small, I would not be 100% sure that no data corruption exists in the array, even if I successfully resilver it without losing another.

RAIDZ2 gives me this comfort. A drive can fail, and I can resilver the vdev while still having auto-healing parity. If another drive were to fail before my rebuild is complete - however - I would have this same uncertainty.

For my purposes, a resilver takes only a few hours, and I notice and order replacement drives quickly, so I consider this to be a very small chance, especially since my vdevs only contain 6 disks each.

RAIDz3 would be overkill for me IMHO.

In the unlikely event that I ahve a second drive failure in the same vdev, I can restore my data from my backup. It will take a long time and be a big pain, but it will work, and the need to do it is so incredibly unlikely that I am OK with that risk.

SirMaster · Mar 13, 2015

Zarathustra[H];1041482899 said:
Not really how it works.

I'll use my WD Red's as an example.

They have a speced error rate as 1 in 10^14.

What this means is on average you will have one error in 10^14.

10^14 bits is ~11.37 TB

They actually have a speced error rate of < 1 in 10^14 according to the datasheet. Not 1 in 10^14 but "less than" 1 in 10^14. How much less than? Well we don't really know.

From my own experience I have scrubbed more than 100TB in my zpool before in a row without a single bit repaired (according to ZFS) on my WD Reds which lead me to believe they are operating much closer to 1 in 10^15, at least when they are operating normally that is. But this is a small sample size.

Also to point out is that even if you are using ZFS and RAIDZ2 or more you still should have a backup. ZFS is also smart enough to not give up a resilver in the case you described too. If you had a RAIDZ1, lost a disk and then had a URE during a resilver, ZFS would complete the silver and then tell you which files(s) are corrupted due to the URE while there was no redundancy. In this case you would then delete from the pool and restore the file(s) in question from backup which doesn't seem to be that big of a hassle.

Losing a whole disk during a resilver is still possible, but its far less likely than a URE.

JoeComp · Mar 13, 2015

You are both confusing the unrecoverable read error rate (URE) with the bit error rate (BER).

The URE rate, which is usually specified as better than 1 in 10^14 or 1 in 10^15, is the probability that the HDD will be unable to read a bit and will therefore return an error when a read command is sent.

The bit error rate (BER), referring to a type of error that is also sometimes called "silent", is when a read command returns with actual data that has one or more bit errors, but no read error is thrown. Hence the "silent" error name. I have never seen an HDD spec sheet that puts a number on this error rate. But the evidence I have seen over the years is that the HDD bit error rate is MUCH lower than the URE rate. This is because HDDs have excellent ECC to detect and correct errors. Even if a sector is bad enough that the ECC cannot correct the error, the HDD usually still detects the error and then returns a URE.

Some ZFS people will claim the silent error rate is quite high, but their "evidence" for this claim is mostly a study of a large site that had some faulty RAID cards and/or network-connected storage. I have seen no evidence that the HDDs themselves have silent error rates worth worrying about (i.e., the HDD silent error rates are well below the URE rates).

JoeComp · Mar 13, 2015

SirMaster said:
They actually have a speced error rate of < 1 in 10^14 according to the datasheet. Not 1 in 10^14 but "less than" 1 in 10^14. How much less than? Well we don't really know.

EDIT:

Never mind, I was thinking of your post #20 where you wrote it differently than you did here

SirMaster · Mar 13, 2015

JoeComp said:
Do the spec sheets actually have it wrong, the way you have written it? Because less than 10^14 would be something like 10^13, which would actually be a worse error rate.

It should be written as >10^14, or <10^-14

http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf

Non-recoverable read errors per bits read <1 in 10^14

Isn't that correct?

1 error in 10^14 would be 1 error in 100,000,000,000,000

less than 1 error in 10^14 could be 0.5 errors in 100,000,000,000,000.

But we can't have 0.5 errors (but error rates are obviously an average) so we could also write that as 1 error in 200,000,000,000,000 by multiplying by 2.

And 1 error in 200,000,000,000,000 is certainly less than 1 error in 100,000,000,000,000.

JoeComp · Mar 13, 2015

SirMaster said:
http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf

Non-recoverable read errors per bits read <1 in 10^14

<1 in 10^14 makes sense, because of the "less than 1 error" part. I was still thinking of your post #20 where you wrote <10^14, which is incorrect.

SirMaster · Mar 13, 2015

JoeComp said:
<1 in 10^14 makes sense, because of the "less than 1 error" part. I was still thinking of your post #20 where you wrote <10^14, which is incorrect.

Ah, sorry about that.

But yes I do not believe the URE rate of WD Reds to be anywhere close to 1 in 10^14 (that must be the BER rate and WD even uses the term "bits" in it's datasheet). Since to my understanding a URE would certainly cause ZFS to report an error to my pool and I have not seen any errors in my pool in at least the last 5 scrubs which is 25TB per scrub so about 125TB of data read.

The BER is obviously not useful for calculation about chances of an application-level error like RAID rebuilding since many BER's (even though they are happening) are caught and fixed by the on-disk ECC as you mentioned.

I question how the BER is even a useful number to provide to consumers since they have no direct impact on what comes out the data interface which is all the user can even use.

JoeComp · Mar 13, 2015

SirMaster said:
But yes I do not believe the URE rate of WD Reds to be anywhere close to 1 in 10^14 (that must be the BER rate and WD even uses the term "bits" in it's datasheet).

The HDD spec sheets that I have seen all specify the unrecoverable (or non-recoverable) read error rate. Whether they specify it in bits, bytes, or sectors, it is still the same thing -- and it is NOT a silent error rate. It is the rate at which the HDD will report a read error. And my experience matches yours (and drescherjm), in that my HDDs have reported read errors significantly less than 1 in 10^14 bits.

But the point I wanted to make is that the spec sheets are NOT reporting the bit error rate (BER). There are at least two relevant BERs for HDDs. One would be the raw BER, which is the BER observed when the bits are first read from the platter. That is actually pretty bad, which is why HDDs have a lot of error detection and error correction coding. So another BER would be that AFTER the HDD finishes applying its error detection and correction codes. In other words, how often does a low-level bit error slip through the HDD error detection coding and get returned as a supposedly valid read? This is what I was talking about for the BER of an HDD, which is also sometimes called "silent" error rates. And as I said, I have never seen this specified on an HDD spec sheet. Probably because it is so much lower than the URE rate that it is difficult to measure.

TeeJayHoward · Mar 13, 2015

Zarathustra[H];1041482532 said:
If I understand it properly (and my understanding is admittedly pretty weak of the underlying math) a check-sum against the parity will only tell you if the block of data matches or not. it won't tell you what it should be.

This is correct, but in theory you COULD use the checksum to rebuild the data. I'll make up some numbers...

Original: 01101010 (Checksum: 1234)
Current: 01100010 (Checksum: 4221)

As you can see, only one bit is flipped. In theory, you could create a program which takes the existing data, flips one bit, then checksums. If the checksums match, you've fixed the issue. If not, move on to the next bit. This would be ridiculously slow, but entirely possible. I've never seen an application that could do this, though. Might be a fun coding project for someone bored!

If ZFS does block-level checksumming, it's even better. You now know that only the one block is bad. That's what, up to 4K of data? How long would it take to run through 16 million combinations if it were all in RAM? Worst case scenario, at 1 second per checksum, it'd be 185 days to compute all possible checksums serially. It's a highly parallelizable task, though. If the data's really THAT important, and no backups are available... Get a Tesla K80... You have 4992 cores to do it on. You could recompute that block in a maximum of 1.8 hours using just a stupid brute-force method. Oh, and you only have to compute that table once, because EVERY combination of data is in there. From there, it's just a matter of reading it. Actually, why not just look up an already-checksummed table? Maybe when you format the disk, it adds in a known table of all 16M checksums, usable by the filesystem. You'd lose a couple meg of drive space, but you could recover from any invalid checksum error without a RAID.

edit: I've been looking at this all wrong. ZFS uses the checksums to repair in a much, much simpler manner. It looks at both copies of the data in a mirror, and finds the one that contains the correct checksum, then overwrites the bad data with the good. Same with a RAIDZ. Instead of sorting through 16M possible combinations, it chooses between two.

Zarathustra[H] · Mar 13, 2015

JoeComp said:
The HDD spec sheets that I have seen all specify the unrecoverable (or non-recoverable) read error rate. Whether they specify it in bits, bytes, or sectors, it is still the same thing -- and it is NOT a silent error rate. It is the rate at which the HDD will report a read error. And my experience matches yours (and drescherjm), in that my HDDs have reported read errors significantly less than 1 in 10^14 bits.

But the point I wanted to make is that the spec sheets are NOT reporting the bit error rate (BER). There are at least two relevant BERs for HDDs. One would be the raw BER, which is the BER observed when the bits are first read from the platter. That is actually pretty bad, which is why HDDs have a lot of error detection and error correction coding. So another BER would be that AFTER the HDD finishes applying its error detection and correction codes. In other words, how often does a low-level bit error slip through the HDD error detection coding and get returned as a supposedly valid read? This is what I was talking about for the BER of an HDD, which is also sometimes called "silent" error rates. And as I said, I have never seen this specified on an HDD spec sheet. Probably because it is so much lower than the URE rate that it is difficult to measure.

There is plenty of literature pointing towards silent data corruption rates well in excess of URE rates. In some cases at the 10^-7 level.

SirMaster · Mar 13, 2015

TeeJayHoward said:
This is correct, but in theory you COULD use the checksum to avenge the data. I'll make up some numbers...

Original: 01101010 (Checksum: 1234)
Current: 01100010 (Checksum: 4221)

The problem I see with this is checksums are on blocks of data that are significantly bigger like 128k in ZFS. There are an unimaginably large number of possible 128k blocks.

I don't think that this concept would ever be workable in any usable scale.

TeeJayHoward · Mar 13, 2015

SirMaster said:
The problem I see with this is checksums are on blocks of data that are significantly bigger like 128k in ZFS. There are an unimaginably large number of possible 128k blocks.

I don't think that this concept would ever be workable in any usable scale.

Hrm... You're probably right. Now if we could do sector-checksumming...

JoeComp · Mar 13, 2015

Zarathustra[H];1041483110 said:
There is plenty of literature pointing towards silent data corruption rates well in excess of URE rates. In some cases at the 10^-7 level.

No, there is not. As I already explained, I am talking about the HDDs themselves. I am not sure why you referenced a study that I already explained was irrelevant to the question of HDD silent error rates.

JoeComp · Mar 13, 2015

SirMaster said:
The problem I see with this is checksums are on blocks of data that are significantly bigger like 128k in ZFS. There are an unimaginably large number of possible 128k blocks.

I don't think that this concept would ever be workable in any usable scale.

On the other hand, I can easily imagine 1,048,576 bits, which is 128 KiB. The assertion was that, if the error is assumed to be a single bit-flip, then you compute checksums for all blocks that are 1-bit different than the corrupted block. There are good checksum hashes that can be computed at rates greater than 10 GB/sec, so computing 1 million hashes for a 128 KiB block could be done in 13 seconds at that rate. Or a few seconds if you multi-thread the computation.

Of course, if most of the errors are more than a single bit flip, then this technique would no longer be feasible. Indeed, the biggest problem with the idea is that the most common HDD errors probably involve an entire LBA being unreadable, which is far more than a single bit error.

SirMaster · Mar 13, 2015

JoeComp said:
Of course, if most of the errors are more than a single bit flip, then this technique would no longer be feasible. Indeed, the biggest problem with the idea is that the most common HDD errors probably involve an entire LBA being unreadable, which is far more than a single bit error.

This is precisely what I was thinking. Not a single bit error.

Zarathustra[H] · Mar 13, 2015

JoeComp said:
No, there is not. As I already explained, I am talking about the HDDs themselves. I am not sure why you referenced a study that I already explained was irrelevant to the question of HDD silent error rates.

I don't understand why you feel it is irrelevant. An error is an error regardless of its source. If it comes from the controller, or cables et, etc doesn't make a difference.

JoeComp · Mar 13, 2015

Zarathustra[H];1041483256 said:
I don't understand why you feel it is irrelevant. An error is an error regardless of its source. If it comes from the controller, or cables et, etc doesn't make a difference.

It makes a big difference to a person trying to estimate their chances of an error. If the source of the error is a faulty RAID card, and the person is not using a RAID card at all, then that is obviously irrelevant for that person.

It makes a big difference to me, because I can tell you that the silent error rate on several systems I have used is no where near the crazy figure of 1 in 10^7 that is quoted in the CERN article. In fact, on the systems I have used where data was checksummed, I have never seen a silent error, despite scrubbing hundreds of Terabytes over many years.

It also makes a big difference to the HDD manufacturers. Any errors that are not coming from the HDD are irrelevant to the HDD spec sheet, which is what I was talking about.

Aesma · Mar 14, 2015

Zarathustra[H];1041482899 said:
Either way, for my purposes RAIDZ (ZFS RAID5 equivalent) is insufficient. If I had a disk failure in a RAID5 array, no matter how small, I would not be 100% sure that no data corruption exists in the array, even if I successfully resilver it without losing another.

What you're saying is true for RAID5, not RAIDZ. With RAIDZ you would be 100% sure, thanks to the checksums. You would be sure that you're good, or you would be sure that you had some corruption.

TeeJayHoward said:
edit: I've been looking at this all wrong. ZFS uses the checksums to repair in a much, much simpler manner. It looks at both copies of the data in a mirror, and finds the one that contains the correct checksum, then overwrites the bad data with the good. Same with a RAIDZ. Instead of sorting through 16M possible combinations, it chooses between two.

Parity and mirrors are two very different things. With parity you don't have 2 copies, certainly not 3 ! You only have one version of a block, with some bits of parity that, when computed, will correct an error.

brutalizer · Mar 16, 2015

JoeComp said:
Some ZFS people will claim the silent error rate is quite high, but their "evidence" for this claim is mostly a study of a large site that had some faulty RAID cards and/or network-connected storage. I have seen no evidence that the HDDs themselves have silent error rates worth worrying about (i.e., the HDD silent error rates are well below the URE rates).

The silent error rate - which includes BER, but also other things such as disk firmware bugs, loose sata cables, etc - is much higher than expected. According to studies on CERN, Amazon, etc

But I do believe the BER error rates (which are lower than URE) are not a problem unless you have large enterprise storage halls. But, OTOH from time to time, we read threads here about ZFS reporting errors - and in the end when they switched SATA cables, power supply, RAM DIMM - the errors are gone. This happens quite frequently on this forum.

So silent errors are not only BER, but a whole bunch of other errors. And those problems are much more common than BER (just read the threads here). And ZFS catches all those problems. Sure, BER are not a frequent problem(only in large data halls you face them). But silent errors are frequent.

brutalizer · Mar 16, 2015

TeeJayHoward said:
This is correct, but in theory you COULD use the checksum to rebuild the data. I'll make up some numbers...

Original: 01101010 (Checksum: 1234)
Current: 01100010 (Checksum: 4221)

As you can see, only one bit is flipped. In theory, you could create a program which takes the existing data, flips one bit, then checksums. If the checksums match, you've fixed the issue. If not, move on to the next bit. This would be ridiculously slow, but entirely possible. I've never seen an application that could do this, though.

I know of one application where the developers experimented with this. Eventually, they removed this feature, because it did not work well. Guess which application? It is a three letter filesystem.

brutalizer · Mar 16, 2015

Zarathustra[H];1041483110 said:
There is plenty of literature pointing towards silent data corruption rates well in excess of URE rates. In some cases at the 10^-7 level.

Exactly. BER and URE are not the main problems. Far more often we read here threads about data corruption that ultimately boils down to faulty RAM DIMMs, power supplies, etc.

The disk error correcting codes are just one small piece where things can fail. There are far more common problems in other domains of the whole storage chain (cpu, ram, sata cable, disk, psu, bugs in OS, bugs in apps, etc). For instance, one large study concluded that 5-10% of all silent errors was because of disk firmware bugs. So in total, you will see errors at the rate of 10^-7, according to CERN, Amazon, etc. And they have a huge number of disks, so they might have a good average.

JoeComp · Mar 16, 2015

brutalizer said:
So in total, you will see errors at the rate of 10^-7, according to CERN, Amazon, etc.

No, "you" will not. The CERN study showed what errors CERN had for a certain set of computers and networks. The study made no claims about what "you" should see.

I see far, far less than a 10^-7 silent error rate on systems that have checksummed data, and other people I have spoken to who manage large amounts of checksummed data also see far less than 10^-7 rates. The CERN study should not be taken as representative of your own systems unless you have the exact same sorts of faulty setups that CERN has.

brutalizer · Mar 17, 2015

CERN has faulty setups? The people there are among the brightest on this planet, with huge resources, including enterprise storage gear. The particle experiments take years to plan and execute, involving many man years of scientists and other specialists - in total the data the extract is very very very expensive. What happens if they can not trust their data because of bit flips? They need to redo everything. There are old links that show CERN swtiched to zfs for long term tier 2 and 3 storage. I doubt CERN are a bunch of amateurs with faulty setups.

Btw, amazon reports similar error rates, they see errors everywhere, in the network cards, disks, etc. But maybe Amazon cloud are a bunch of amateurs too? And probably facebook too. I believe anyone serious with loads of data, see data corruption. That is why zfs was developed in the first place, ten years ago. Sun microsystems with their large server halls noticed it all the time, it was a problem. So, everyone with lot of data, most probably see data corruption. I find it doutbful that only cern's and amazon's researchers see data corruption. They both, have written on data corruption, the others (facebook, etc) have not written anything publicly - whih does not mean they are immune. You and your friends, do not see data corruption. What does that tell you? Either you are better than cern and amazon combined, or you dont have lot of data, or you dont know what to look for, or what?

I talked about data corruption and zfs five years ago, one raid guy said I was talking out of my ****, to prove me wrong, he started to do chkdsk or something similar (maybe he did a "dd" on one disk?) successfully, on all his hardware raids - concluding i, cern and amazon etc were wrong. He obviously did not understand what to look for. So, if big players report problems, but you do not - you should think about why.

There are links on data corruption at cern, amazon, etc at the zfs wikipedia article, if you want to read the research reports.

drescherjm · Mar 17, 2015

CERN has faulty setups?

One mention of the faulty hardware is it right here:

64k regions of corrupted data, one up to 4 blocks (large correlation with the
3ware-WD disk drop-out problem) (80% of all errors)

80% of all errors they saw was from a incompatible 3ware / WD storage combination. Although the remaining 20% is still a high number.

brutalizer · Mar 17, 2015

Yeah, i know that. One huge netapp study of 1.5 million disks or so, concluded 5-10% of all data corruption was due to bugs in disk firmware. Buggy hardware (more common than you think, read the netapp study) does not mean their setup was faulty. If windows is buggy, does that mean their setup was faulty? No.

JoeComp · Mar 17, 2015

brutalizer said:
You and your friends, do not see data corruption. What does that tell you? Either you are better than cern and amazon combined, or you dont have lot of data, or you dont know what to look for, or what?

I see data corruption in your posts. You corrupt what you read and then make false statements like the one I have quoted here.

I did not say that I (or people I have spoken to about it) "do not see data corruption". I said that the silent error rates are far less than 10^-7 for my own checksummed data and that of people I have spoken to about it.

Also, I was talking about errors from HDDs. Quoting articles about large installations with faulty hardware (other than HDDs) is irrelevant to what I was talking about.

Zarathustra[H] · Mar 17, 2015

JoeComp said:
Also, I was talking about errors from HDDs. Quoting articles about large installations with faulty hardware (other than HDDs) is irrelevant to what I was talking about.

That depends.

This conversation is about RAID-5 and RAID-5-like implementations (like ZFS RAIDz).

If a storage controller issue, firmware bug, memory problem, SATA/SAS cable issue, power fluctiuation, cosmic rays, etc. results in bad data on a drive, and that data is detectable/correctable using RAID-like parity, then I don't think it matters what the source of that error is.

It is out of scope of this conversation - however - if the corruption is not preventable using RAID-like parity, like if it happens in such a way that both parity and normal data are both corrupted.

Also keep in mind, that storage systems (and servers) have life spans. Just because a brand new system isn't exhibiting issues at a particularly high rate, doesn't mean that it won't before it is old enough to be retired.

I HAVE experienced silent corruption:

Occasionally (but not frequently) when running a scrub on my pool, ZFS repairs something, without a read, write or parity error being flagged.

This is on a system with server SAS storage controllers, Xeons, supermicro motherboard and ECC ram. The weakest component is probably the consumer grade WD RED's. (Enterprise drives were neither in the budget, nor in the sound tolerance or electric bill tolerance)

I haven't done any calculations on how often this happens, compared to the bit read rate though, but it is not infrequent enough that I am surprised when I see it.

Is my system perfect? No. But I have exercised due diligence in running week long memtests and drive tests before deployment and used (mostly) enterprise components. I would say it's not that far off in quality from a typical production deployed system.

(See copy and paste below from a scrub I ran in January)

Code:

  pool: zfshome
state: ONLINE
  scan: scrub in progress since Fri Jan 16 15:48:18 2015
        16.5T scanned out of 43.6T at 585M/s, 13h32m to go
        196K repaired, 37.68% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    zfshome                                        ONLINE       0     0     0
     raidz2-0                                      ONLINE       0     0     0
       gptid/85faf71f-2b00-11e4-bc04-d8d3855ce4bc  ONLINE       0     0     0
       gptid/86d3925a-2b00-11e4-bc04-d8d3855ce4bc  ONLINE       0     0     0
       gptid/87a4d43b-2b00-11e4-bc04-d8d3855ce4bc  ONLINE       0     0     0
       gptid/887d5e7f-2b00-11e4-bc04-d8d3855ce4bc  ONLINE       0     0     0  (repairing)
       gptid/eb518a3c-63d9-11e4-8721-000c29dbe1ad  ONLINE       0     0     0
       gptid/b62d0aa1-638a-11e4-8721-000c29dbe1ad  ONLINE       0     0     0
     raidz2-1                                      ONLINE       0     0     0
       gptid/56fb015b-2bfc-11e4-be49-001517168acc  ONLINE       0     0     0
       gptid/576cde68-2bfc-11e4-be49-001517168acc  ONLINE       0     0     0
       gptid/57dbbac1-2bfc-11e4-be49-001517168acc  ONLINE       0     0     0
       gptid/584a4dcc-2bfc-11e4-be49-001517168acc  ONLINE       0     0     0
       gptid/58f4ec2f-2bfc-11e4-be49-001517168acc  ONLINE       0     0     0
       gptid/abd7d2b7-63cf-11e4-8721-000c29dbe1ad  ONLINE       0     0     0
    logs
     mirror-2                                      ONLINE       0     0     0
       da14p1                                      ONLINE       0     0     0
       da7p1                                       ONLINE       0     0     0
    cache
     gptid/89f2024c-4010-11e4-bf9d-000c29dbe1ad    ONLINE       0     0     0
     gptid/8a137ec5-4010-11e4-bf9d-000c29dbe1ad    ONLINE       0     0     0

errors: No known data errors

JoeComp · Mar 17, 2015

Zarathustra[H];1041490722 said:
I haven't done any calculations on how often this happens...

Do you at least know whether it is higher or lower than 1 error in 10^7 ?

I hope you realize that an error rate of 10^-7 is crazy high.

Using RAID-5 Means The Sky Is Falling

[H]ard|Gawd

[H]ard|Gawd

Extremely [H]

Extremely [H]

2[H]4U

2[H]4U

Extremely [H]

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

2[H]4U

2[H]4U

Extremely [H]

2[H]4U

[H]ard|Gawd

[H]ard|Gawd

2[H]4U

[H]ard|Gawd

2[H]4U

[H]ard|Gawd

Limpness Supreme

Extremely [H]

2[H]4U

Limpness Supreme

[H]ard|Gawd

[H]ard|Gawd

2[H]4U

Extremely [H]

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

[H]ard|Gawd

[H]F Junkie

[H]ard|Gawd

[H]ard|Gawd

Extremely [H]

[H]ard|Gawd