Data integrity: the risks of not using ECC with ZFS (a study)

ZFS "might" detect the effects of a bitflip though!

At the end of the day, no filesystem or RAID scheme can really protect against this - if it's a big worry then you'll need ECC memory (correctly implemented).

If it's a media server, you can protect against bitflips on data upload by using verified copies - this isn't practical though on data which changes frequently (e.g. a live, active database).
Is there a simple tool to implement this? Or would it be a case of manually checking the hashes?

Data will not be changing regularly at all. It's simply music and videos. Documents will go on SkyDrive.
 
No simple tool. But you can write a script that computes an MD5 hash of every file every week and compares it to your original MD5 hash. If there is an error, you need to recover the original file(s) from your backup. So you need a backup, and you should also store its MD5 hashes somewhere separate in case the backup itself corrupts.
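A minimal sketch of that approach with GNU md5sum (the paths /tank/media and /backup/media.md5 are just examples - adjust to your layout):

# one-off: record a baseline MD5 for every file (example paths)
find /tank/media -type f -exec md5sum {} + > /backup/media.md5

# weekly: re-hash everything and list only the files that no longer match
md5sum -c --quiet /backup/media.md5

Files added after the baseline was taken aren't covered until you regenerate the manifest.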

Or you could just use ZFS, which does all of this for you. Every week you start a scrub and ZFS will repair any errors it finds automatically, and also inform you. "zpool scrub" and that is all. And you can keep using your ZFS raid while it checks everything - no need to take it offline.
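For example (a sketch - the pool name "tank" and the path to zpool are just examples, adjust for your OS):

# crontab entry: kick off a scrub every Sunday at 03:00 (pool name "tank" is an example)
0 3 * * 0 /usr/sbin/zpool scrub tank

# afterwards, check progress and whether any errors were found or repaired
zpool status -v tank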
 
There may be a copy utility which can do it - I've not looked that hard, TBH - but most of the ones I did look at don't verify the on-disk copy.

A script (or batch file in the windows world) is probably the best way - md5 (and similar) utilities are standard on most unix OSes and are readily available for Windows - eg http://www.pc-tools.net/win32/md5sums/.

Something like this would do the job on Solaris (or OI) - this example would copy all the flac files from directory /zp1 to /zp2 (with /zp1 being the local directory on the system you are uploading from, and /zp2 the SMB/CIFS share on the server), then compute md5 checksums for both source and target files and then compare the results.

# remove any old checksum lists (-f so the first run doesn't complain)
rm -f /zp1/md5source
rm -f /zp2/md5target
# copy the flac files from the local directory to the share
rsync -av /zp1/*.flac /zp2
# checksum both the source files and the copies on the share
cd /zp1
for i in *.flac
do
    md5sum /zp1/"$i" | cut -d" " -f1 >> /zp1/md5source
    md5sum /zp2/"$i" | cut -d" " -f1 >> /zp2/md5target
done
# any output here means a file differs between source and target
diff /zp1/md5source /zp2/md5target



You could replace md5sum with sha256sum if you'd prefer 256-bit checksums over 128-bit...
There are lots of mods/additions you could make depending on what you want

There are probably utilities you can download to do the check in a GUI too, if that's your cup of tea!!


With small files there is a possibility that the check would run against the copy in the ARC on the server rather than what is actually on physical disk. That shouldn't be a problem though: if it's correct in the ARC then it should be correct on disk, as AFAIK what is in the ARC is what was written to disk (via the transaction group buffer).
 



Once the data is on disk on the server, then ZFS will take it from there - this ensures that what's on the server disk is the same as what left the original system and hasn't suffered any bitflips in memory on the way (which ZFS may not be able to detect).
 
The bottom line is that it will depend on when the bitflip happens.
On a write, if it happens after the checksum is calculated, then it would be detected on read as the checksum won't match. If it happens before the checksum is calculated, then it probably wouldn't be detected.
However, you can protect against this by doing verified copies to the server - there are various methods of doing this - arguably something we should all be doing anyway!
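One simple way to do a verified copy, if rsync is available on both ends, is a second checksum-only dry-run pass after the transfer - any file it lists has content that differs between source and target. A sketch, using the /zp1 and /zp2 paths from the earlier example:

rsync -av /zp1/ /zp2/       # the copy itself (/zp1 and /zp2 as in the example above)
rsync -avcn /zp1/ /zp2/     # -c forces a full checksum comparison, -n (dry run) just lists mismatches

As mentioned earlier in the thread, with small files the server-side read may come out of the ARC rather than off the physical disk.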

On a read, if a bitflip happens within some data after that data has been checksummed and is in memory, then ZFS won't detect this - this is where ECC memory comes in.
That said, it also depends on what the data is, as to how much effect such a bitflip would have.
On a home media server for instance, the effect might be negligible anyway.

Apologies for reviving a several-month-old thread, but after reading through everything, I'm still not clear as to what the specific scenarios are for data corruption on ZFS systems with non-ECC RAM.

ARC: From the perspective of ZFS, is RAM-based ARC the same as the rest of the filesystem, in that everything in ARC is checksummed? If an ARC-based file has a bitflip, then checksums will be different and the L2ARC or pool is consulted for the proper file, correct? If an ARC-based checksum has a bitflip, then the checksums will also be different on read, with similar L2ARC/pool-based healing? So is it safe to say that the use of ARC does not endanger data integrity? The only scenario for trouble I could foresee is if a bitflipped checksum gets written to disk before the file is read again; if this bad checksum propagates to the filesystem, then ZFS would report a 'false positive' of sorts whenever the file is read in the future. Does ARC behave this way (checksums resident in memory getting written to the filesystem after a period of time)?

Based on Billy_nnn's earlier post, the larger risk seems to be when files are actually open, as only then do they leave the 'jurisdiction' of ZFS. In what scenarios would bitflips in open files persist afterwards? Again, based on my limited understanding, these bitflips would have to be minor enough to avoid user detection (i.e. changed text, garbled sound, corrupted jpegs, or movie files that don't open would probably prompt an integrity check), and would have to be minor enough to keep the app using the affected file from crashing (I imagine this varies widely from program to program).

All said and done, would it be accurate to summarize by saying that the only significant (statistically speaking) integrity issue from using non-ECC RAM is for open files? Even then, block-based ZFS snapshots may help by keeping deeper archives of file versions (vs. something like Time Machine on OS X, which has to copy the entire file when one changes).
 
My understanding is that when a client makes a request for a file that happens to be stored in ARC, the file is returned without first checking the sum against the disk. So the client may receive the file with a bitflip, then if the file is modified and written back to the pool, the bitflip error will be committed to disk.

I could be wrong though.
 
If you care enough to be thinking about this, just get ECC RAM. I would guess that once data is in memory it is assumed to be good. Trying to handle unreliable memory seems pointless.

ARC is just a read cache though, not stuff that is going to be written back to disk... unless some program requests the bit-flipped data and ends up writing it back, or perhaps some other data is written but parity calculations are done on some of the data in the ARC. There are probably some ways for bit-flipped ARC data to make its way back to disk. Maybe there are some sanity-type checks that would catch that something is wrong, but I doubt it would know how to handle it gracefully.

Although one thing this makes me wonder... let's say we assume only single-bit errors are possible. So if we ever end up with a block that fails its checksum, and there's no good redundant data, because it was a bit-flip... ZFS blocks are 128 KB max, so there are only 131,072 × 8 ≈ 1 million single-bit variants of a block, and one of them should match the checksum. I guess you'd also want to check whether the calculated checksum is just one bit off the expected checksum as well. You can set ZFS to use SHA-256 hashes for data blocks, so you should be able to brute force that.
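For reference, switching a dataset to SHA-256 checksums is a one-liner (the dataset name is just an example):

zfs set checksum=sha256 tank/media   # dataset name is just an example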

Actually, lol, after writing this just now I tried Google to see if anyone had tried to use brute force to fix single-bit errors... and yes, and it was ZFS! LOL. They removed this however. I say they should add it back just for pure awesomeness. To brute-force check all single-bit changes to a 128 KB block you would need to hash roughly 128 GB of data (about a million 128 KB variants); a modern CPU can do almost 2 GB/s of SHA256, a GPU 9 GB/s or so. So ZFS finds a block that fails its checksum, there are no good redundant copies, and as a last-ditch effort it brute-forces through all possible one-bit errors to see if it can fix the block... it would take at most a minute... better than nothing, right?

Here read this, pretty interesting, I really want this feature back.. I also liked that it worked well enough to mask bugs:
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/016898.html
 
Thanks for the insight everyone. Unfortunately I built my new computer shortly before reading up on ECC memory and ZFS, and replacing the motherboard and RAM would run ~$500, so I'm stuck with non-ECC for the time being.

If the potential for bitflipping is limited to ARC and open files, there's a greater chance that corrupt files would be detected (since I'm opening them more often and ZFS puts stuff in ARC based on access patterns). In the meantime, all the seldom-changed, seldom-viewed files are protected from bitrot. Of course, it also means that often-used files are most at risk for in-memory corruption, but a properly-configured snapshot/backup system would go a long way towards mitigating permanent damage, no?

I'm just wondering if there's a net benefit to transitioning to ZFS before I switch to ECC memory.
 
I certainly wouldn't say ECC is a requirement to use ZFS. CERN did a study on data integrity... I have the PDF, I forget where I downloaded it from. For disk errors: on 3,000 nodes they put a 2 GB file and read it back every 2 hours for 5 weeks; in the end there were 500 errors. For memory errors: on 1,300 nodes over 3 months there were 44 memory errors, 41 single-bit and 3 double-bit. Even if you can't catch memory errors, that's no reason not to catch the errors that come from disk.
 
Cool... yeah migrating to ZFS is on my big to-do list, but I'm using the occasion to rework and optimize my folder structure, so it's become a big project. Better to do it sooner than later I suppose...

That brute-force checksum repair sounds pretty epic btw! I wish they still supported it as well.
 
Actually, lol, after writing this just now I tried google to see if anyone had tried to use brute force to fix single bit errors... and yes, and it was zfs! LOL.
:)

It's crazy how the ZFS creators went over the top with the safety in ZFS. I don't regret switching to ZFS. :)
 
That computers don't have ECC by default in 2012 is almost absurd.

I totally disagree. The chance of a bit flip caused by a cosmic ray is so low that I am not at all concerned with it. My ECC servers at work have their BIOS set to record every single bit flip, and these go years between having a single ECC correction. This, along with the 100 TB of checksums I run each month on machines here at work without ECC - and have for many years - tells me this is nothing for me to worry about, and certainly nothing for a home user with a media-center application to worry about. That is not to say that I have not seen faulty RAM cause bit flips. You must thoroughly test every DIMM you use in your systems for days before you can be sure it is not DOA. I have seen quite a few subtle RAM problems where a single bit on a single DIMM would flip only after several hours of memtest86+. After replacing the RAM, memtest86+ went over a week without a single memory error.
 

Given how small the process shrinks are getting, I think it's irresponsible to keep going without ECC, even for consumers, for much longer. Older RAM at larger process sizes was more robust, but we're at the 30nm level now and will no doubt head down to 20nm and then 10nm. It will require less and less charge to flip a bit. Cosmic rays and alpha particles don't get smaller just because your node size does.

The RAM you reference was no doubt older, on a larger process, and probably closer to sea level than, say, a laptop on an intercontinental flight. The odds of getting a flipped bit due to cosmic radiation go up exponentially with altitude, IIRC. I think I read an estimate somewhere implying that we're getting uncomfortably high probabilities of a cosmic-ray-induced bit flip on a typical laptop on a typical flight across the Atlantic, at present process sizes and typical memory configurations (say, 4-8 GB of RAM). Most of the time that error will be meaningless - maybe it changes the color of a pixel or something - but occasionally you may get more serious errors.

Btw, HCI Memtest is apparently better than Memtest86+ at detecting errors quickly, and I would use that in conjunction with Memtest86+. I use both and use HCI Memtest to 1000% coverage along with a day of Memtest86+ when stress-testing.
 
Hi!

If you read the field study "DRAM Errors in the Wild" the right way, you will see that they found errors on 8% of DIMMs, but they also report 25,000 to 70,000 FIT per megabit.
Some maths:
If 1 Mbit has 25,000 to 70,000 FIT, then 1 Gbit has 25 million to 70 million FIT.
And 1 GByte then has 200 million to 560 million FIT.

A "FIT" is a failure-in-time, meaning the number of failures per 1 billion device-hours.
In other words: 200 to 560 million FIT is 0.2 to 0.56 failures per hour, so 1 GByte of DRAM memory sees roughly one bit-flip every 2 to 5 hours.

Not every bit-flip is detected, nor does every one hurt. It could be in an unused memory area or in an area that is about to be overwritten anyway. Or it is just in a graphics area where nobody cares if a pixel is orange instead of red. Even in software it does not necessarily cause an immediate crash - just look at how much of the functionality of Microsoft Word or Excel you really use; I'd say less than 1% of it.

But when such bit-flip hits a critical area, you better have ECC.
The field study also says that soft errors - particles or rays that disturb memory cells - are less of an issue, at least at ground level. Most of the errors found are what they call hard errors. I would call them "weak cells", as they are not permanently damaged, but they flip under certain circumstances, for example with specific data patterns.
Finally, aging is a big issue: the number of DRAM errors increases with the age of the chips, and "high utilization" makes DRAM chips age more quickly!

To avoid trouble, all you can really do is use ECC to protect yourself.

What I really wonder about is that hard disk drives have a DRAM cache without ECC. Even hard disks for servers seem to work without ECC on that cache, although people pay a lot of money for server-grade drives. On our company server there are several files that won't open... maybe the data that passed through the cache was corrupted by a single-bit error... who knows?

Regards,
Thorsten
 
But when such bit-flip hits a critical area, you better have ECC.
Which is, for a file server that uses almost all RAM as cache, well.. almost all RAM.

maybe the data that passed through the cache was corrupted by a single-bit error... who knows?
ZFS would know - not where the error occurred, but that it occurred at all.

Summary: Use ZFS with ECC only.
 
I work with HUNDREDS of Linux servers all with ECC (all dual socket Xeons). We DO get memory errors from time to time but in the last 7 years, I think I can count the number of times I've been notified of a memory error on one hand.

YMMV of course.

Also: Most of these servers use 12-24 DIMMs each.
Also: ECC doesn't correct all memory errors. What do you do about the errors ECC doesn't catch/correct?
 
Also: ECC doesn't correct all memory errors. What do you do about the errors ECC doesn't catch/correct?

If you have requirements that aren't met by ECC alone, you can configure RAM mirroring. Beyond that, I don't know.
 
I work with HUNDREDS of Linux servers all with ECC (all dual socket Xeons). We DO get memory errors from time to time but in the last 7 years, I think I can count the number of times I've been notified of a memory error on one hand.

YMMV of course.
Well, I am not sure why you see errors so rarely. Are you being notified of every corrected error, or only of the double-bit errors that are uncorrectable?
And do you use memory scrubbing? If not, then ECC only checks a byte/word when it is read; for all the many gigabytes that have not been accessed, you do not see whether the content is still all right. With memory scrubbing activated, the servers will read and write all memory periodically.

On a hard disk, the cache of 64 MB or 128 MB (on the disk's controller - you can also call it a "buffer" instead of a cache) is permanently in action, always transferring data in and out. This is a potential risk without ECC, but I think no hard disk manufacturer uses ECC for this DRAM on the drive.
Regards,
Thorsten
 
Keep in mind that the paper states that there are huge differences between DIMMs and that only one third of the machines experience errors per year. Most of the errors observed are not random bit flips but related to faulty hardware.

In other words, test your RAM. Then test it some more. Get ECC if you can, but keep in mind that ECC will just reduce the frequency of noticeable errors.
 

NO, this is wrong. ECC will protect you from memory errors and will protect you against data corruption by halting the system on an uncorrectable double-bit error. ECC will save you from bad memory trashing your data.

Whoops, responded to a year-old thread.
 
NO, this is wrong. ECC will protect you from memory errors and will protect you against data corruption by halting the system on an uncorrectable double-bit error. ECC will save you from bad memory trashing your data.

In my experience this depends on the motherboard BIOS and how it deals with MCE correctable and uncorrectable errors. I've seen RAM failures lead to memory corruption on ECC systems in the past.
 