ZFS with non-ECC / Checksums / Backups

nry · May 26, 2013

I have built a remote backup server out of old components (AM2 with 4GB DDR2 non-ECC). I plan on using RAID-Z2 with 8x 1TB disks.
I know it's generally recommended to use ECC RAM with ZFS; I'm not here to discuss that at all. I don't want to spend £300-odd to upgrade this box.

These backups will contain:
Video files ranging from 1GB to 20GB
Documents: few KB to a few MB
ISO/System images 10GB to 40GB

Sources are from:
ESXi datastores
Hardware Raid 6 on main storage system

Now for my list of questions

What would be the best way of detecting any corruption?
I was thinking of checksums of the files being backed up. Is it best to store these alongside the files?
Then a simple script to compare after transfer?

I assume I could also have daily/weekly scripts which check the data for corruption using the checksums?

Should I be doing the above on my hardware RAID 6 to detect corruption?

What is the best way to transfer the backups?
My plan was an OpenVPN link between the two locations, then a simple rsync?
I'd be interested to know how others manage similar issues.

HobartTas · May 26, 2013

Greetings

nry said:
What would be the best way of detecting any corruption?
I was thinking of checksums of the files being backed up. Is it best to store these alongside the files?

None required once the data is actually on the ZFS pool. It's protected by checksums automatically from that point onwards, so you don't have to do anything at all.

nry said:
I assume I could also have daily/weekly scripts which check the data for corruption using the checksums?

A ZFS scrub will do that for you on the ZFS server and fix things on the fly. The data on your RAID 6 is a different story.
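
A scrub can be started and its result checked from a script. The following is a minimal sketch, assuming a pool named `backup` (a hypothetical name, not from this thread) and assuming `zpool status` prints its usual `errors: No known data errors` line for a healthy pool. The helper names are illustrative only.

```python
import subprocess

POOL = "backup"  # hypothetical pool name, used only for illustration


def scrub_command(pool):
    """Build the command line that starts a scrub on the given pool."""
    return ["zpool", "scrub", pool]


def status_is_clean(status_text):
    """Return True if `zpool status` output reports no known data errors."""
    return any(
        line.strip() == "errors: No known data errors"
        for line in status_text.splitlines()
    )


def check_pool(pool):
    """Run `zpool status` for the pool and report whether it looks clean."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return status_is_clean(result.stdout)
```

Something like `subprocess.run(scrub_command(POOL), check=True)` from a weekly cron job, followed later by `check_pool(POOL)`, would cover the "daily/weekly script" idea on the ZFS side.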

nry said:
My plan was an OpenVPN link between the two locations, then a simple rsync?

Yes, rsync will be best. It won't matter whether an Ethernet packet got corrupted or your non-ECC memory flipped a bit, because rsync will detect that the source and destination differ and will retransmit the differing chunk to fix it.
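
To illustrate the idea of finding and resending only the chunks that differ, here is a simplified sketch. It is not rsync's actual algorithm (rsync uses a rolling checksum together with a stronger hash); the fixed block size and the function names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def differing_blocks(source, dest, block_size=BLOCK_SIZE):
    """Return indexes of blocks that are missing or different at either end."""
    src = block_hashes(source, block_size)
    dst = block_hashes(dest, block_size)
    longest = max(len(src), len(dst))
    return [
        i for i in range(longest)
        if i >= len(src) or i >= len(dst) or src[i] != dst[i]
    ]


def repair(source, dest, block_size=BLOCK_SIZE):
    """Copy only the differing blocks from source into a copy of dest."""
    fixed = bytearray(dest[:len(source)].ljust(len(source), b"\0"))
    for i in differing_blocks(source, dest, block_size):
        start = i * block_size
        fixed[start:start + block_size] = source[start:start + block_size]
    return bytes(fixed)
```

With a source of `b"abcdefghijkl"` and a destination of `b"abcdXfghijkl"`, only block 1 is reported as different and only that block is copied again.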

nry said:
Should I be doing the above on my hardware RAID 6 to detect corruption?

Yes, you don't have checksum protection on a non-ZFS system. I presume you've heard the term "Garbage In/Garbage Out": if your source data is damaged, then copying it to the ZFS server isn't going to get it magically corrected. Re-run the checksums against your original files on your RAID 6 system after you have done the transfer. If everything is OK, then you know the identical duplicates on your ZFS system are also OK; if not, then you have a damaged file in both places.
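
The checksum-and-compare workflow could be scripted roughly as follows. This is only a sketch, assuming SHA-256 digests and a manifest held as a simple mapping of relative path to digest; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that 40GB images do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source, dest):
    """Return (paths missing at dest, paths whose digests differ)."""
    missing = sorted(set(source) - set(dest))
    changed = sorted(p for p in source if p in dest and source[p] != dest[p])
    return missing, changed
```

Building the manifest on the RAID 6 side before the transfer, and again on both sides afterwards, gives you something to compare against on later runs too.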

ZFS checksumming is done for EVERY block, regardless of whether the block is data or filesystem metadata. Also, copies=n defaults to n=1, which means there is one copy of your data and n+1 (i.e. 2) copies of the metadata. In addition, if there is more than one drive involved (in your case 8 drives), the two copies of the metadata are stored on separate drives. On top of all this you have two parity drives, so your ZFS data is well and truly safe (well, short of three drives dying at once anyway).
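
As a rough worked example of that layout, assuming the 8x 1TB RAID-Z2 pool from the original post and ignoring filesystem overhead and padding (the function name is illustrative):

```python
def raidz_summary(disks, disk_tb, parity):
    """Approximate capacity and failure tolerance of a single RAID-Z vdev."""
    return {
        "usable_tb": (disks - parity) * disk_tb,
        "failures_tolerated": parity,
        "failures_to_lose_pool": parity + 1,
    }


summary = raidz_summary(disks=8, disk_tb=1, parity=2)
# summary == {"usable_tb": 6, "failures_tolerated": 2, "failures_to_lose_pool": 3}
```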

Cheers

patrickdk · May 26, 2013

It's false that ECC RAM is not required once the data is on ZFS.

I had non-ECC memory, and one stick was bad; normal memory scanners didn't see it as bad though.
Every scrub would totally destroy the pool because of the bad memory. ZFS was unable to trust the checksums it produced when reading good data, because that good data would become bad.

So if you don't have ECC RAM you could easily enough destroy an otherwise perfectly healthy pool.

nry · May 26, 2013

HobartTas thanks, couldn't have asked for a better reply!

staticlag · May 26, 2013

Any RAM, ECC or not, can and will go bad over time.

That is why it is very important to realize that RAID is not a backup.

kzrussian · Dec 9, 2013

patrickdk said:
It's false that ECC RAM is not required once the data is on ZFS.

I had non-ECC memory, and one stick was bad; normal memory scanners didn't see it as bad though.
Every scrub would totally destroy the pool because of the bad memory. ZFS was unable to trust the checksums it produced when reading good data, because that good data would become bad.

So if you don't have ECC RAM you could easily enough destroy an otherwise perfectly healthy pool.

I realize this post is a couple of months old, but I was wondering whether disabling checksums on ZFS and not performing any scrubs will prevent completely destroying the pool?
I do realize that with bad non-ECC RAM I will be corrupting the newly written data until I realize I have bad RAM, but I'm thinking at least all my old data will be good.
Can anyone confirm this?

drescherjm · Dec 9, 2013

I would leave checksums on (otherwise you are throwing away the biggest benefit of ZFS). In my opinion it's pretty unlikely for RAM to go bad if you are not overclocking your RAM. In my small sample of a few hundred DIMMs over the last 17 years here at work, I have not had a single DIMM go bad if it was not DOA.

uOpt · Dec 9, 2013

kzrussian said:
I realize this post is a couple of months old, but I was wondering whether disabling checksums on ZFS and not performing any scrubs will prevent completely destroying the pool?
I do realize that with bad non-ECC RAM I will be corrupting the newly written data until I realize I have bad RAM, but I'm thinking at least all my old data will be good.
Can anyone confirm this?

Definitely not. If you have a bit error in a field holding a block location, that block gets written to a random location, which can be metadata. It can, for example, destroy directories, superblocks, or old data you didn't touch, or blast right through snapshots.
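
A tiny sketch of why a single flipped bit in a location field is so damaging. Treating a block location as one plain integer is a simplification for illustration (real ZFS block pointers are more involved), and the function names are made up here.

```python
def flip_bit(value, bit):
    """Return value with one bit inverted, as a RAM error might do."""
    return value ^ (1 << bit)


def misdirected_block(block_number, bit):
    """Return the wrong block number and how far it is from the intended one."""
    wrong = flip_bit(block_number, bit)
    return wrong, abs(wrong - block_number)


intended = 1000
wrong, distance = misdirected_block(intended, bit=20)
# A flip in bit 20 moves the write 2**20 blocks away from where it was meant to go.
```

The write still succeeds, just somewhere else, which is why data that was never touched can be overwritten.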

Aesma · Dec 10, 2013

Well, that could happen even with ECC RAM; it's not bulletproof. I guess, depending on usage (home use without high-availability demands), it could be a good idea to shut down the server when a scrub starts to find lots of errors, so as to test the RAM before continuing. At the moment a scrub takes about 16 hours on my server and I always do it during the weekend, when I can monitor most of it.

uOpt · Dec 10, 2013

Aesma said:
Well, that could happen even with ECC RAM; it's not bulletproof. I guess, depending on usage (home use without high-availability demands), it could be a good idea to shut down the server when a scrub starts to find lots of errors, so as to test the RAM before continuing. At the moment a scrub takes about 16 hours on my server and I always do it during the weekend, when I can monitor most of it.

How is your memory scrub supposed to find errors if it's not ECC?

danswartz · Dec 10, 2013

I think he's talking pool scrub.

uOpt · Dec 10, 2013

danswartz said:
I think he's talking pool scrub.

Oh.

I'm sorry, but that doesn't make any sense. The people who argue that ZFS scrubbing makes it more acceptable to run without ECC seem to oversimplify and think that bad data from RAM can only end up in blocks that are contents of files. But it can end up in metadata and even in the checksums themselves.

ZFS integrity checking protects against the disk, or the cable to the disk, flipping bits. It strictly requires an intact CPU + RAM chain.

danswartz · Dec 10, 2013

Having bits in memory flip undetected due to lack of ECC will circumvent ZFS checksums, sure. It doesn't make them useless, since they still detect bits flipped on the way from RAM to disk or wherever. That said, I am not a proponent of non-ECC ZFS servers, just saying that it isn't an absolute black-and-white argument like you seem to be advancing.
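
The distinction can be shown with a small model. This is a sketch only: SHA-256 stands in for whichever checksum the pool uses, and `write_block` / `verify_block` are hypothetical helpers, not ZFS functions.

```python
import hashlib


def checksum(data):
    """Stand-in checksum; assumed here to be SHA-256 for simplicity."""
    return hashlib.sha256(data).digest()


def flip_bit(data, byte_index, bit):
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def write_block(data):
    """Model a write: the checksum covers whatever is in RAM at that moment."""
    return data, checksum(data)


def verify_block(stored_data, stored_checksum):
    """Model a read or scrub: recompute and compare."""
    return checksum(stored_data) == stored_checksum


original = b"important backup data"

# Case 1: the bit flips after the checksum was computed (cable, disk).
data, csum = write_block(original)
on_disk = flip_bit(data, 0, 0)
detected_on_disk_path = not verify_block(on_disk, csum)

# Case 2: the bit flips in RAM before the checksum is computed.
in_ram = flip_bit(original, 0, 0)
data, csum = write_block(in_ram)
undetected_ram_flip = verify_block(data, csum) and data != original
```

Both flags come out true: the first kind of corruption is caught, while the second is stored with a checksum that matches the already-bad data.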

Silhouette · Dec 10, 2013

Some of my servers at home don't have ECC. I test for bad RAM and problems with cache/cpu/mobo at semi-regular intervals, usually before running a scrub.

Aesma · Dec 11, 2013

uOpt said:
Oh.

I'm sorry, but that doesn't make any sense. The people who argue that ZFS scrubbing makes it more acceptable to run without ECC seem to oversimplify and think that bad data from RAM can only end up in blocks that are contents of files. But it can end up in metadata and even in the checksums themselves.

ZFS integrity checking protects against the disk, or the cable to the disk, flipping bits. It strictly requires an intact CPU + RAM chain.

That was not my point. What I'm saying is that ECC is good and my ZFS build sports it, but that doesn't mean my RAM is good.

Conversely, without ECC you can have stable memory, or even memory that is slightly unstable, and ZFS should cope with this better than other filesystems (but yes, it can all go to hell fast).

mercnz · Dec 18, 2013

to my mind, if the memtest that comes with ubuntu's installer doesn't find errors in 6 hours, you're probably fine. if it doesn't find errors in a couple of minutes, you "may or may not be fine", but you don't have blatant memory issues. if it finds errors straight away, you're going to have massive issues and should try taking out one stick to see if it gets better or worse. there's a kind of soft area between two minutes and 6 hours where you may or may not notice problems and could be getting silent data corruption, which something like zfs may alert you to more easily.

if you think your ram may be questionable, i'd suggest just running a memory test overnight. fwiw, i have seen severe memory issues caused by one stick in dual-stick configurations before, and in-between memory issues with a desktop of mine (not running zfs; my server has zfs, and only ddr3-1600) when i was overclocking ram to ddr3-2133 with 4x4gb sticks. with 2x4gb sticks it was fine, and with overvolted memory it was also fine, but i went down to ddr3-2000 (which it's rated at).

an easy way to replicate memory errors is to overclock ram to marginal settings.

olavgg · Dec 18, 2013

ASUS AM2 motherboards support Unbuffered ECC memory.
