ZFS - checksum data in RAM:

brutalizer

[H]ard|Gawd
Joined
Oct 23, 2010
Messages
1,602
There is talk that broken non-ECC RAM sticks are terrible for your ZFS pool, because all your data might get corrupted since the data in RAM is not checksummed. Well, as we all know, this is not true. It is just ignorance.
http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/

But, it turns out that you can in fact checksum data in RAM. Has anyone tried it? ZFS creator Matt Ahrens explains:
http://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271
ZFS creator Matt Ahrens said:
There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.
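For anyone who wants to try this: on ZoL, `zfs_flags` is exposed as a kernel module parameter. A sketch of how the flag from the quote would be set (the sysfs path and modprobe.d location are the standard ZoL module-parameter mechanisms — and remember this is an unsupported debug flag, so expect a performance cost):

```shell
# Set at runtime (lost on reboot) via the zfs module's sysfs parameter:
echo 0x10 > /sys/module/zfs/parameters/zfs_flags

# Or persist it across reboots by passing the option at module load:
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_flags=0x10
EOF
```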
 
Are you talking about OpenZFS?

If yes: ECC indeed adds a second layer that detects memory errors early. Basically, the system will give an early pre-failure warning before the whole thing goes kaput.

On ZoL, the recommendation is to use ECC because not all OpenZFS functions are implemented, which overshadows the Linux implementation.

There are always pros and cons. In my experience, ECC has helped me detect a pre-failing situation.
That holds regardless of platform: Linux, OpenSolaris, FreeBSD, or anything else.

Basically, don't get burned when it happens.
 
Hmm. Not using ECC RAM and getting corruption is highly likely.

I have personally had this problem, and it is NOT limited to writes or scrubs. Any read that fails its checksum will trigger a repair write, and with bad memory this is likely.

The qualifier used was NOT MORE THAN ANY OTHER FILESYSTEM. Yes, not using ECC RAM will corrupt your filesystem if you are using ZFS, but NO WORSE than if you were not using ZFS. But corruption is still corruption, so if you care about your data enough that you're bothering to use ZFS, you should be using ECC RAM.
 
If you have non-ECC RAM and you get corruption, then your RAM was faulty. If your RAM is OK, you will not get corruption, ECC or not. But RAM does go faulty every now and then.

The point is, it is possible to checksum the data in RAM too with ZFS, catching faulty RAM sticks. Has anyone tried it?
 
Any read that happens, will trigger a write if the checksum fails, and with bad memory this is likely.
The linked article specifically debunks this claim. If the checksum fails on read, ZFS doesn't overwrite anything with the corrupted data. It reconstructs the block from parity and if the checksum matches after that, it overwrites the block but then we don't have any corruption. If the checksum fails again, nothing is overwritten.
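The logic being described can be sketched in a few lines. This is a toy model, not ZFS internals, and the use of SHA-256 and the function names here are purely illustrative:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def read_block(block: bytes, checksum: bytes, reconstruct):
    """Toy model of the read path described above: corrupted data is never
    written back unless a reconstructed copy passes the checksum."""
    if sha256(block) == checksum:
        return block, None                 # good read, nothing rewritten
    repaired = reconstruct()               # rebuild from mirror/parity
    if sha256(repaired) == checksum:
        return repaired, repaired          # self-heal: write good copy back
    return None, None                      # unrecoverable: report, write nothing

# A block whose on-disk copy got a flipped bit, but whose mirror is intact:
good = b"important data"
bad = b"importantIdata"
data, rewritten = read_block(bad, sha256(good), reconstruct=lambda: good)
assert data == good and rewritten == good  # healed from the mirror
data, rewritten = read_block(bad, sha256(good), reconstruct=lambda: bad)
assert data is None and rewritten is None  # both copies bad: nothing overwritten
```

The key property: a failed checksum never results in the corrupted block being written anywhere.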
 
But the checksum itself can also be corrupted and still PASS against the corrupted data. Not very likely, but it does happen. And considering that my server, at least, was having a crapload of checksum failures and repairs, this did happen, and data got corrupted.

It was only a test/development system that was never going to be in production use or hold real data, so I didn't care much, but a lot of data did get corrupted on it. Things would have been a lot worse if I hadn't been using ZFS, though.
 
The linked article specifically debunks this claim. If the checksum fails on read, ZFS doesn't overwrite anything with the corrupted data. It reconstructs the block from parity and if the checksum matches after that, it overwrites the block but then we don't have any corruption. If the checksum fails again, nothing is overwritten.

Please do not take that blindly.

Read up on the OpenZFS implementation in each distro...

I know ZoL and am currently using it...

As I stated, ZoL is recommended to be used with ECC RAM.

I do not know the FreeBSD implementation in detail, since ZFS must be tied into the kernel layers.

On ZoL, we are using DKMS, and LUKS for encryption.
 
But the checksum can be corrupted also, and PASS, with the corrupted data and checksum, not very likely, but it does happen.
Checksum matching to corrupted data means there is a hash collision. Since ZFS uses SHA256 hash, we need to calculate 340,282,366,920,938,463,463,374,607,431,768,211,456 (2^128) hashes before probability of matching it to corrupted data reaches 50%. "Not very likely" is an understatement.
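That figure checks out; it is the birthday bound for a 256-bit hash. A quick sanity check of the arithmetic:

```python
# The birthday bound for an n-bit hash: roughly sqrt(2^n) inputs are needed
# before the probability of *some* collision reaches ~50%.
n_bits = 256
birthday_bound = 2 ** (n_bits // 2)          # sqrt(2**256) == 2**128
assert birthday_bound == 340_282_366_920_938_463_463_374_607_431_768_211_456

# The chance that one specific corrupted block happens to match its stored
# 256-bit checksum is far smaller still:
p_single_match = 1 / 2 ** n_bits
print(f"birthday bound: {birthday_bound:,}")
print(f"single random corruption matching its checksum: {p_single_match:.2e}")
```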
 
ZFS does not default to sha256, due to its slowness.

It defaults to fletcher4. Yes, using sha256 would make this many times less likely than with the default.
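For reference, fletcher4 is a simple running-sum checksum. The sketch below follows the commonly documented ZFS formulation (four 64-bit accumulators over 32-bit little-endian words — treat the exact details as an assumption, not gospel). It reliably catches a random bit flip; its weakness is that pathological collisions are vastly easier to construct than sha256's 2^128 birthday bound suggests:

```python
import struct

def fletcher4(data: bytes):
    """Fletcher-4 in the style used by ZFS (assumed formulation):
    four 64-bit running sums over the data read as 32-bit LE words."""
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

block = bytes(range(256)) * 16          # 4 KiB of sample data
flipped = bytearray(block)
flipped[100] ^= 0x01                    # a single bit flip
assert fletcher4(block) != fletcher4(bytes(flipped))   # the flip is detected
```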
 
On ZoL, the recommendation is to use ECC because not all OpenZFS functions are implemented, which overshadows the Linux implementation.
Which missing OpenZFS feature might make the recommendation of ECC RAM particularly strong for ZoL?

If you have non ECC ram and you get corruption, then your RAM was faulty.
Your RAM doesn't have to be faulty for a bit flip to occur. And there are more ways than that for pools to get corrupted.
 
Which missing OpenZFS feature might make the recommendation of ECC RAM particularly strong for ZoL?

ZoL:
You can check the details of the ZoL implementation; some of it relies on Linux DKMS functionality.
There was a discussion about ECC-or-not on the ZoL list.

On Linux, ECC plays a main role in warning about a pre-failing RAM situation.

This is the reason I keep pointing to which system it is: Linux, FreeBSD, etc., since some functionality relies on kernel levels.

Check ZoL; they suggest using ECC:
1.17 Do I have to use ECC memory for ZFS?

Using ECC memory for ZFS is strongly recommended for enterprise environments where the strongest data integrity guarantees are required. Without ECC memory rare random bit flips caused by cosmic rays or by faulty memory can go undetected. If this were to occur ZFS (or any other filesystem) will write the damaged data to disk and be unable to automatically detect the corruption.
......

If you need to understand the reasoning further, dig into the ZoL mailing list and ask there. There was discussion about ECC too.

On encryption: there is a reason for using native Linux LUKS together with ZFS too...:p...
 
Good thing it is highly unlikely for a cosmic ray to cause a bit to flip.
https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creating_energetic_neutrons_and_protons
"...IBM estimated in 1996 that one error per month per 256 MiB of ram was expected for a desktop computer [by cosmic radiation]..."

But it does not tell us how likely cosmic rays are to flip bits on disks.
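Taking that 1996 estimate at face value and scaling it to a modern amount of RAM puts the sarcasm in perspective (back-of-the-envelope only; modern DRAM error rates differ):

```python
# Scale IBM's 1996 estimate (one soft error per month per 256 MiB) to a
# machine with 16 GiB of non-ECC RAM:
errors_per_mib_per_month = 1 / 256
ram_mib = 16 * 1024
expected_per_month = ram_mib * errors_per_mib_per_month
assert expected_per_month == 64.0        # roughly two expected bit flips a day
print(f"expected soft errors per month: {expected_per_month:.0f}")
```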


But the checksum can be corrupted also, and PASS, with the corrupted data and checksum, not very likely, but it does happen. And considering atleast on my server, it was having a crapload of checksum failures, and repairs, this did happen, and corrupted data.
I would not like to get corrupt data without knowing. So ZFS told you there were data corruption problems. That is comforting.

But, the corruption you got, was it due to faulty non ECC ram or could the culprit be something else?
 
The creators of ZFS strongly recommend using ECC RAM; it was designed with that in mind and as the typical use-case scenario. Most, if not all, distros/platforms on which ZFS works also make this recommendation. And as it's been said many, many, MANY times before: if you care about your data then only use ECC RAM with your ZFS setup. Now with that in mind, remember that this is a recommendation and not a requirement, so the choice is yours. BUT don't come back here or the FreeNAS forums when things turn ugly because you decided not to follow said recommendation.
 
Good thing it is highly unlikely for a cosmic ray to cause a bit to flip.
You don't need a cosmic ray to cause a bit flip. The ubiquitous electromagnetic interference will do. And it happens more often than you'd think. I occasionally see our servers logging bit flips being corrected by ECC. Without correction those errors would go unnoticed unless they caused some obvious problem. In this day and age, the lack of ECC in many devices is frankly insane.
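For the curious, the correction ECC hardware performs is conceptually simple. Real DIMMs use a SECDED code over 64-bit words, but a minimal Hamming(7,4) sketch (illustrative only) shows how a single flipped bit is located and fixed:

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]        # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword bit positions 1..7: p1 p2 d0 p3 d1 d2 d3
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(code: int) -> int:
    """Return the 4 data bits, fixing any single flipped bit."""
    bits = [(code >> i) & 1 for i in range(7)]       # positions 1..7
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)            # points at the bad position
    if syndrome:
        bits[syndrome - 1] ^= 1                      # flip it back
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

# Every single-bit flip in every codeword is corrected:
for nibble in range(16):
    code = hamming74_encode(nibble)
    assert hamming74_correct(code) == nibble         # clean read
    for pos in range(7):
        assert hamming74_correct(code ^ (1 << pos)) == nibble
```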

ZoL:
You can check the details of the ZoL implementation; some of it relies on Linux DKMS functionality.
There was a discussion about ECC-or-not on the ZoL list....
[more stuff]
I'm not questioning the value of ECC for ZFS, that is well established. I'm questioning it being more important on ZoL than other platforms. The quote you posted clearly says ZFS (or any other filesystem), not ZoL. Same goes for all the discussions that were had.
 
You don't need a cosmic ray to cause a bit flip. The ubiquitous electromagnetic interference will do. And it happens more often than you'd think. I occasionally see our servers logging bit flips being corrected by ECC.

This is actually part of the reason I say it is a very rare thing. From 18+ years of work experience with dozens of server-class systems that have ECC (and have hardware logging of every single ECC correction), I only see ECC errors in my machine check exceptions when a system has bad RAM. Replacing the bad RAM in every case ended the ECC correction errors. I have many of these systems that have gone 7+ years without a single recorded memory correction. Although, with that said, I sometimes wish I had kept some bad RAM to verify that the ECC is working...

On top of that, I have ~50TB of data on systems that do not have ECC. I run RAID scrubs on these systems every single week, and there is only a problem if I have a bad hard drive. I have also run memory tests on several systems, with and without ECC, that ran for an entire month without a single reported error.

In this day and age, the lack of ECC in many devices is frankly insane.

Even with everything I have said above, I think it is time for this. I mean, I wish it was at least an option on Intel's enthusiast platform without needing a Xeon.
 


I'm not questioning the value of ECC for ZFS, that is well established. I'm questioning it being more important on ZoL than other platforms. The quote you posted clearly says ZFS (or any other filesystem), not ZoL. Same goes for all the discussions that were had.

They put it in plain English so it's easy to understand for non-techie people.
As I said, there were discussions on the ZoL list.
Basically it's a hand brake: if you don't use ECC on ZoL, don't blame us when something gets corrupted.

Yes, ECC is recommended on Linux servers; this has been good practice.
As I said, ZoL still relies on Linux DKMS functionality to reach the kernel rings. ZoL is not fully fused into kernel functionality. You can dig up the DKMS white paper and the purpose for which it still exists.

The important thing is ECC. ZoL relies on DKMS, which has a layer API to get kernel access functionality.
If you follow ZoL development, they do many workarounds because of DKMS; these workarounds rely on DKMS functionality.

If you look at another OpenZFS implementation such as FreeBSD's, it indeed runs at the kernel level. ZoL runs on top, with DKMS as a middleman.

A simple illustration is that ZoL uses LUKS (Linux-native encryption) :p.
 
They put it in plain English so it's easy to understand for non-techie people.
As I said, there were discussions on the ZoL list.
Basically it's a hand brake: if you don't use ECC on ZoL, don't blame us when something gets corrupted.

Yes, ECC is recommended on Linux servers; this has been good practice.
As I said, ZoL still relies on Linux DKMS functionality to reach the kernel rings. ZoL is not fully fused into kernel functionality. You can dig up the DKMS white paper and the purpose for which it still exists.

The important thing is ECC. ZoL relies on DKMS, which has a layer API to get kernel access functionality.
If you follow ZoL development, they do many workarounds because of DKMS; these workarounds rely on DKMS functionality.

If you look at another OpenZFS implementation such as FreeBSD's, it indeed runs at the kernel level. ZoL runs on top, with DKMS as a middleman.

A simple illustration is that ZoL uses LUKS (Linux-native encryption) :p.
You are parroting things you have no clue about.
DKMS provides mechanisms to automatically compile and load kernel modules when you install new kernels. It has nothing to do with what a module does once loaded (you could just as well compile and load it by other means). ZoL will not be integrated into Linux kernel due to conflict between the GPL and the CDDL, not because of technical issues. And a kernel module is still native (as in: not userland).

And all of this is completely orthogonal to requirements of ECC or not. Linux doesn't require ECC more than other platforms. Stop with the noise.
 
I'm not sure what the argument is for here. Saving a few bucks by NOT buying ECC?

You can get faster memory for cheaper, but most people that build servers don't build them for raw performance. They're built for stability and data integrity. If the difference of data security between non-ECC and ECC implementation was 98% and 100%, why not just go for that extra 2%?

Also, it's a blog. I can't take someone that says:
I’ve constructed a doomsday scenario featuring RAM evil enough to kill my data after all! Mwahahaha!
...as seriously or consider it as informative as whitepapers written by both the developers of the standard and the developers that USE that standard (FreeNAS).
 
I'm not sure what the argument is for here. Saving a few bucks by NOT buying ECC?

You can get faster memory for cheaper, but most people that build servers don't build them for raw performance. They're built for stability and data integrity. If the difference of data security between non-ECC and ECC implementation was 98% and 100%, why not just go for that extra 2%?

Also, it's a blog. I can't take someone that says:

...as seriously or consider it as informative as whitepapers written by both the developers of the standard and the developers that USE that standard (FreeNAS).

The typical use case for not going with ECC memory is this.

I have an existing computer at my disposal that's just sitting there doing nothing. I want to re-purpose it and set up a home server to back up some of my files to.

Should the fact that my existing hardware is not / does not support ECC prevent me from choosing ZFS over another filesystem?

I say no, ZFS is still a perfectly valid option for this case, and choosing another filesystem won't necessarily be safer.


So it's not 2% more. It's infinity% more. I have to pay $0 to re-purpose this existing hardware but I have to buy a new CPU, motherboard, and RAM just to gain ECC support.

Obviously if you start with nothing then choosing to go with ECC makes a lot of sense, but that isn't the only scenario to consider.
 
Ahhh, OK. Seems like people were getting uptight about buying new equipment, but it makes sense with regards to old equipment. Either have a server with old parts being useful and functional with a small margin for data loss, or have nothing at all since you don't want to spend hundreds or thousands more for new equipment.
Seems like if you're just doing something like a Plex server on FreeNAS or WinMCE recorder/server then it's no biggie, but if you're keeping tax documents and baby photos you'd want something more resilient.
 
Absolutely agree. I just think I see too much of the "must have all enterprise gear and ECC and everything" being "yelled" at to simple home-server folk, especially on the FreeNAS forums.

They get turned away and are told to use unRAID (ReiserFS), Linux (MDRAID on EXT4), or SnapRAID or FlexRAID on NTFS/EXT4.

But the argument here is that ZFS is not really "more" dangerous on a non-ECC system than these other filesystem configurations are. I would argue that ZFS on non-ECC can actually do things that make you safer than the other mentioned systems on non-ECC.

For instance, if you did have RAM going bad, you would know pretty quickly on ZFS, as you will see CKSUM errors on scrub (but data is very likely not actually being overwritten/corrupted). And you would know to replace the RAM, whereas on a simpler non-checksumming FS you may be silently writing bad data for a while before you realize.
 