I see that Backblaze is using triple-parity (their own implementation) with their latest design.
https://www.backblaze.com/blog/vault-cloud-storage-architecture/
EMC still recommends RAID 5. I trust them far more than any journalist or forum scrub.
Zarathustra[H];1041482398 said:So RAID1 mirrors are essentially just a full duplicate of the data on a separate drive.
One thing I have never understood about this: when the two drives eventually differ on a read or scrub (at some point they will), how does the system know which one is correct and heal the pair?
It doesn't, same with RAID 5.
You need checksums. This is why a RAID system like Linux kernel MD RAID is not immune to bitrot, and why filesystems like ZFS and BTRFS were invented in the first place.
Sure, there is a scrub function in something like MD RAID to reconcile the mismatched data, but it only has a 50% chance of keeping the right copy and "undoing" the bitrot.
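To make the difference concrete, here is a rough Python sketch (purely illustrative, and not how md or ZFS actually implement it): with only two mirrored copies there is no way to tell which side is right, but with an independently stored checksum you can keep whichever copy still verifies.

import hashlib

def resolve_mirror_mismatch(copy_a: bytes, copy_b: bytes, stored_checksum: str):
    # Pick the mirror copy whose checksum still matches the one recorded at write time.
    if hashlib.sha256(copy_a).hexdigest() == stored_checksum:
        return copy_a      # copy A is intact; rewrite copy B from it
    if hashlib.sha256(copy_b).hexdigest() == stored_checksum:
        return copy_b      # copy B is intact; rewrite copy A from it
    return None            # both copies fail the checksum; redundancy elsewhere is needed

# Without stored_checksum, all a scrub can do is notice that copy_a != copy_b
# and pick one side arbitrarily -- the coin flip described above.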
Zarathustra[H];1041480502 said:And by doing so you go explicitly against the ZFS recommendations, which generally state not to use more than 10 drives in any single vdev.
It might work fine, but I see violating the project's recommendations as an unacceptable risk when dealing with my data.
Zarathustra[H];1041482532 said:I know that distinction, but I was thinking in ZFS terms. How does a ZFS mirror handle this same concern? Or indeed, how does it handle it in RAIDZ1, RAIDZ2 or RAIDZ3?
If I understand it properly (and my understanding of the underlying math is admittedly pretty weak), a checksum against the parity will only tell you whether the block of data matches or not. It won't tell you what it should be.
That's why I've always considered ZFS to kind of be "black magic"
I have been using Linux mdraid for almost 15 years now and have kept external checksums (CRC32) for my media files since the beginning. I have never encountered a single corrupted file on a software RAID 5/6.
There have been many unreadable sectors on disks, but since the disk does not return data for such sectors and instead reports an error, the RAID layer can reliably restore them from parity.
Thus, the chance of getting the right content back in that case is not 50% but 100%.
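For what it's worth, that kind of external checksumming takes only a few lines of Python. This is just a minimal sketch using zlib.crc32 and a plain-text manifest; the manifest name and format are my own assumptions, not anything mdraid provides.

import os, zlib

def crc32_of(path, bufsize=1 << 20):
    # CRC32 of a file, read in 1 MiB chunks.
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def write_manifest(root, manifest="checksums.crc32"):
    # Record "crc32  relative/path" for every file under root.
    with open(manifest, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                out.write(f"{crc32_of(path):08x}  {os.path.relpath(path, root)}\n")

def verify_manifest(root, manifest="checksums.crc32"):
    # Re-read every file and report any whose CRC32 no longer matches.
    with open(manifest) as f:
        for line in f:
            expected, rel = line.rstrip("\n").split("  ", 1)
            actual = f"{crc32_of(os.path.join(root, rel)):08x}"
            if actual != expected:
                print(f"MISMATCH: {rel} expected {expected}, got {actual}")

ZFS effectively does that bookkeeping per record, automatically, and verifies it on every read.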
I now use ZFS, but the checksumming was a secondary reason. I have not seen a single case where ZFS detected a checksum error without the disk reporting a URE at the same time.
As long as you stay far below 10^16 bits read (roughly 1.25 PB), there is little point in having checksums. But when you regularly operate in that range, you will see data corruption, and that is when you should switch to a checksummed solution like ZFS. Sure, you can checksum files manually and update the checksums by hand every time you edit a file, or you can let ZFS do that for you and spend your valuable time on more fun things than brain-dead bookkeeping.
Zarathustra[H];1041482899 said:Not really how it works.
I'll use my WD Reds as an example.
They have a specced error rate of 1 in 10^14.
What this means is that, on average, you will see one error per 10^14 bits read.
10^14 bits is 12.5 TB (about 11.37 TiB).
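Taking the 1-in-10^14 figure at face value, the arithmetic looks like this (a quick sketch; as discussed below, real drives do considerably better than the spec):

# Spec taken at face value: one unrecoverable read error per 1e14 bits read.
URE_RATE = 1e-14              # probability of a URE per bit read
BITS_PER_TB = 8e12            # 1 TB = 10^12 bytes = 8 * 10^12 bits

print(1 / URE_RATE / BITS_PER_TB)    # 12.5 TB read per expected URE (about 11.37 TiB)

# Probability of at least one URE while reading a full 6 TB drive end to end:
p = 1 - (1 - URE_RATE) ** (6 * BITS_PER_TB)
print(p)                             # roughly 0.38 at the spec rate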
They actually have a specced error rate of < 1 in 10^14 according to the datasheet. Not 1 in 10^14 but "less than" 1 in 10^14. How much less? Well, we don't really know.
Do the spec sheets actually have it wrong, the way you have written it? Because less than 10^14 would be something like 10^13, which would actually be a worse error rate.
It should be written as 1 in >10^14, or as a rate of <10^-14.
http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf
Non-recoverable read errors per bits read <1 in 10^14
<1 in 10^14 makes sense, because of the "less than 1 error" part. I was still thinking of your post #20 where you wrote <10^14, which is incorrect.
But yes, I do not believe the URE rate of WD Reds is anywhere close to 1 in 10^14 (that figure must really be the BER, and WD even uses the term "bits" in its datasheet).
Zarathustra[H];1041482532 said:If I understand it properly (and my understanding of the underlying math is admittedly pretty weak), a checksum against the parity will only tell you whether the block of data matches or not. It won't tell you what it should be.
This is correct, but in theory you COULD use the checksum to rebuild the data. I'll make up some numbers...
Original: 01101010 (Checksum: 1234)
Current: 01100010 (Checksum: 4221)
The HDD spec sheets that I have seen all specify the unrecoverable (or non-recoverable) read error rate. Whether they specify it in bits, bytes, or sectors, it is still the same thing -- and it is NOT a silent error rate. It is the rate at which the HDD will report a read error. And my experience matches yours (and drescherjm's), in that my HDDs have reported read errors significantly less often than 1 in 10^14 bits.
But the point I wanted to make is that the spec sheets are NOT reporting the bit error rate (BER). There are at least two relevant BERs for HDDs. One would be the raw BER, which is the BER observed when the bits are first read from the platter. That is actually pretty bad, which is why HDDs have a lot of error detection and error correction coding. Another BER would be the one AFTER the HDD finishes applying its error detection and correction codes. In other words, how often does a low-level bit error slip through the HDD error detection coding and get returned as a supposedly valid read? This is what I was talking about for the BER of an HDD, which is also sometimes called the "silent" error rate. And as I said, I have never seen this specified on an HDD spec sheet, probably because it is so much lower than the URE rate that it is difficult to measure.
The problem I see with this is that checksums are on blocks of data that are significantly bigger, like 128K in ZFS. There is an unimaginably large number of possible 128K blocks.
I don't think that this concept would ever be workable at any usable scale.
Hrm... You're probably right. Now, if we could do sector-level checksumming...
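A quick back-of-the-envelope calculation (illustrative only) shows the scale: a 128K record has only about a million single-bit-flip candidates to try, but the number of possible 128K blocks is astronomically larger, so brute-forcing anything beyond a single flipped bit is hopeless.

import math

record_bits = 128 * 1024 * 8              # bits in a 128 KiB ZFS record
print(record_bits)                        # 1,048,576 possible single-bit flips

# The number of distinct 128 KiB blocks is 2**record_bits;
# just count its decimal digits rather than printing it.
print(int(record_bits * math.log10(2)) + 1)   # about 315,653 digits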
Of course, if most of the errors are more than a single bit flip, then this technique would no longer be feasible. Indeed, the biggest problem with the idea is that the most common HDD errors probably involve an entire LBA being unreadable, which is far more than a single bit error.
Zarathustra[H];1041483110 said:There is plenty of literature pointing towards silent data corruption rates well in excess of URE rates. In some cases at the 10^-7 level.
No, there is not. As I already explained, I am talking about the HDDs themselves. I am not sure why you referenced a study that I already explained was irrelevant to the question of HDD silent error rates.
Zarathustra[H];1041483256 said:I don't understand why you feel it is irrelevant. An error is an error regardless of its source. Whether it comes from the controller, the cables, etc., doesn't make a difference.
Zarathustra[H];1041482899 said:Either way, for my purposes RAIDZ (the ZFS RAID 5 equivalent) is insufficient. If I had a disk failure in a RAID 5 array, I would not be 100% sure that no data corruption, however small, exists in the array, even if I successfully resilvered it without losing another drive.
Edit: I've been looking at this all wrong. ZFS uses the checksums to repair in a much, much simpler manner. It looks at both copies of the data in a mirror, finds the one whose checksum is correct, and overwrites the bad data with the good. Same with RAIDZ. Instead of sorting through 16M possible combinations, it chooses between two.
The silent error rate - which includes BER, but also other things such as disk firmware bugs, loose SATA cables, etc. - is much higher than expected, according to studies from CERN, Amazon, etc.
Some ZFS people will claim the silent error rate is quite high, but their "evidence" for this claim is mostly a study of a large site that had some faulty RAID cards and/or network-connected storage. I have seen no evidence that the HDDs themselves have silent error rates worth worrying about (i.e., the HDD silent error rates are well below the URE rates).
This is correct, but in theory you COULD use the checksum to rebuild the data. I'll make up some numbers...
Original: 01101010 (Checksum: 1234)
Current: 01100010 (Checksum: 4221)
As you can see, only one bit is flipped. In theory, you could create a program which takes the existing data, flips one bit, then checksums it. If the checksums match, you've fixed the issue. If not, move on to the next bit. This would be ridiculously slow, but entirely possible. I've never seen an application that could do this, though.
I know of one application where the developers experimented with this. Eventually, they removed the feature because it did not work well. Guess which application? It is a three-letter filesystem.
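For the curious, that brute-force repair is easy to sketch in Python. This is a toy illustration only, using CRC32 as the checksum; no real filesystem repairs data this way, and as noted it only helps when exactly one bit has flipped.

import zlib

def repair_single_bit_flip(data: bytes, expected_crc: int):
    # Try every single-bit flip until the block's CRC32 matches the stored checksum.
    if zlib.crc32(data) == expected_crc:
        return data                     # nothing to fix
    buf = bytearray(data)
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit          # flip one bit
            if zlib.crc32(buf) == expected_crc:
                return bytes(buf)       # this flip restores the checksum
            buf[i] ^= 1 << bit          # undo and keep searching
    return None                         # corruption was more than a single bit flip

# Using the made-up block from above: 01101010 is good, 01100010 has one bit flipped.
good, bad = b"\x6a", b"\x62"
print(repair_single_bit_flip(bad, zlib.crc32(good)) == good)   # True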
Zarathustra[H];1041483110 said:There is plenty of literature pointing towards silent data corruption rates well in excess of URE rates. In some cases at the 10^-7 level.
Exactly. BER and URE are not the main problems. Far more often we read threads here about data corruption that ultimately boils down to faulty RAM DIMMs, power supplies, etc.
So in total, you will see errors at a rate of 10^-7, according to CERN, Amazon, etc.
CERN has faulty setups?
64k regions of corrupted data, one up to 4 blocks (large correlation with the 3ware-WD disk drop-out problem) (80% of all errors)
You and your friends do not see data corruption. What does that tell you? Either you are better than CERN and Amazon combined, or you don't have a lot of data, or you don't know what to look for, or what?
Also, I was talking about errors from HDDs. Quoting articles about large installations with faulty hardware (other than HDDs) is irrelevant to what I was talking about.
pool: zfshome
state: ONLINE
scan: scrub in progress since Fri Jan 16 15:48:18 2015
16.5T scanned out of 43.6T at 585M/s, 13h32m to go
196K repaired, 37.68% done
config:
NAME STATE READ WRITE CKSUM
zfshome ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/85faf71f-2b00-11e4-bc04-d8d3855ce4bc ONLINE 0 0 0
gptid/86d3925a-2b00-11e4-bc04-d8d3855ce4bc ONLINE 0 0 0
gptid/87a4d43b-2b00-11e4-bc04-d8d3855ce4bc ONLINE 0 0 0
gptid/887d5e7f-2b00-11e4-bc04-d8d3855ce4bc ONLINE 0 0 0 (repairing)
gptid/eb518a3c-63d9-11e4-8721-000c29dbe1ad ONLINE 0 0 0
gptid/b62d0aa1-638a-11e4-8721-000c29dbe1ad ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
gptid/56fb015b-2bfc-11e4-be49-001517168acc ONLINE 0 0 0
gptid/576cde68-2bfc-11e4-be49-001517168acc ONLINE 0 0 0
gptid/57dbbac1-2bfc-11e4-be49-001517168acc ONLINE 0 0 0
gptid/584a4dcc-2bfc-11e4-be49-001517168acc ONLINE 0 0 0
gptid/58f4ec2f-2bfc-11e4-be49-001517168acc ONLINE 0 0 0
gptid/abd7d2b7-63cf-11e4-8721-000c29dbe1ad ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
da14p1 ONLINE 0 0 0
da7p1 ONLINE 0 0 0
cache
gptid/89f2024c-4010-11e4-bf9d-000c29dbe1ad ONLINE 0 0 0
gptid/8a137ec5-4010-11e4-bf9d-000c29dbe1ad ONLINE 0 0 0
errors: No known data errors
Zarathustra[H];1041490722 said:I haven't done any calculations on how often this happens...