FreeNAS - ZFS Checksum Errors

BrandonB · Jun 30, 2015

Hello everyone,

So a weird issue has popped up that started a couple days ago. I'm running 4x WD RED 3TB drives in RAIDZ1. All of these disks are practically brand new.

I'm not suffering from any read or write errors on the disks.

Now, any time I add a large amount of data, I always end up with checksum errors. I can't figure out what's going on here. I've swapped in completely different drives and created a new zpool with them, and the exact same thing happens. I highly doubt it's related to my disks.

I'm using the onboard SATA controller (non-RAID), and have been for the past 3-4 years without a problem. I've replaced the SATA cables just to rule that out.

I'm a bit out of my depth here. I've been working with computers for most of my life but have only recently started dealing with *nix systems. So, what's my next step? I'm not really sure how to diagnose this.

Code:

brandonb@freenas:~ % zpool status
  pool: Tesla
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 0h12m with 0 errors on Tue Jun 30 02:19:14 2015
config:

	NAME        STATE     READ WRITE CKSUM
	Tesla       ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0p1  ONLINE       0     0     2
	    ada1p1  ONLINE       0     0     0
	    ada2p1  ONLINE       0     0     2
	    ada3p1  ONLINE       0     0     2

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  da0p2     ONLINE       0     0     0

errors: No known data errors
brandonb@freenas:~ %
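
For anyone who wants to watch for this kind of thing automatically, here is a minimal sketch that picks the devices with a nonzero CKSUM column out of `zpool status` text. It is only an illustration: the function name is made up, the sample is copied from the output above, and it assumes the counters are plain integers (real output can abbreviate large counts, which this regex would skip).

```python
import re

def devices_with_checksum_errors(status_text):
    """Return {device: cksum_count} for every row with a nonzero CKSUM column."""
    flagged = {}
    for line in status_text.splitlines():
        # Device rows look like: NAME STATE READ WRITE CKSUM (three integer counters).
        # The header row does not match because its last three fields are not digits.
        match = re.match(r"^\s+(\S+)\s+[A-Z]+\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if match and int(match.group(4)) > 0:
            flagged[match.group(1)] = int(match.group(4))
    return flagged

# Sample rows taken from the zpool status output in this post.
sample = """
    NAME        STATE     READ WRITE CKSUM
    Tesla       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        ada0p1  ONLINE       0     0     2
        ada1p1  ONLINE       0     0     0
        ada2p1  ONLINE       0     0     2
        ada3p1  ONLINE       0     0     2
"""

result = devices_with_checksum_errors(sample)
print(result)
```

With the sample above, three of the four data disks are flagged, which is part of why a single bad drive looks unlikely here.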

I replaced a WD Green drive with a WD Red so I'd finally have all NAS-type HDDs in there, and that's when the issues started. I don't think it has anything to do with the new disk, though. Like I said above, I'm experiencing the same issues regardless of the drives I use.

What prompted me to replace the WD Green in the first place was a few checksum errors that popped up on that drive, so it gave me an excuse. The resilvering process failed the first time, saying there was a corrupted file that needed to be addressed. So I deleted said file (it was an unimportant temp file) and restarted the resilvering. The second time around it was complaining about my Plex jail, so I just deleted that dataset entirely and started over for a third time. It finally worked after doing that plus a scrub right after.

Ever since, it's been a nightmare getting this thing to function properly. I went as far as recreating my primary pool entirely in order to migrate to ashift=12 (4K sectors) for maximum performance, since my old drives had 512-byte sectors. It didn't help at all, but at least I know there should be no issue there. In the 4 years of having this system, this is the first time I've run into this.

Oh and I have a fresh install of FreeNAS, just to rule any of that out. Put it on a brand new 16GB flash drive shortly after replacing the WD Green disk.

trackersoft · Jun 30, 2015

Hello,
I had a similar issue a few years ago. My problem was that the integrated ICH10R controller was corrupting the data. I disabled it in the BIOS and installed a separate HBA card.

Zedicus · Jun 30, 2015

I had this issue just a couple of months ago. (There is a rather lengthy post on here covering it somewhere.)

It turned out to be due to the controller and driver. You need to verify that the installed driver and the firmware that your controller thinks it's using match. (With an onboard controller it's a pain; heck, even with an HBA it's a pain.)

But the simplest fix is to use an LSI HBA and make sure it's on firmware P16. If you want to keep the onboard controller, you would need to boot into the BIOS and note the version number associated with the SATA host, then maybe get on the FreeNAS forum, find out what driver version FreeNAS is using for your SATA host, and make sure the numbers match. If they do not, you could see whether a mainboard BIOS is available that includes the SATA host firmware you need.

Also, have you verified that your onboard controller is set to AHCI and not IDE or RAID?

bmh.01 · Jun 30, 2015

Exact hardware used? When you say "whatever disks you use", do you mean you have swapped all 4 drives?

And as a side note, raidz1 with 3TB drives is a) a terrible idea and b) asking for trouble.

BrandonB · Jul 1, 2015

Zedicus said:
I had this issue just a couple of months ago. (There is a rather lengthy post on here covering it somewhere.)

It turned out to be due to the controller and driver. You need to verify that the installed driver and the firmware that your controller thinks it's using match. (With an onboard controller it's a pain; heck, even with an HBA it's a pain.)

But the simplest fix is to use an LSI HBA and make sure it's on firmware P16. If you want to keep the onboard controller, you would need to boot into the BIOS and note the version number associated with the SATA host, then maybe get on the FreeNAS forum, find out what driver version FreeNAS is using for your SATA host, and make sure the numbers match. If they do not, you could see whether a mainboard BIOS is available that includes the SATA host firmware you need.

Also, have you verified that your onboard controller is set to AHCI and not IDE or RAID?

Gotcha. I just ordered an IBM ServeRAID M1015, which I'll flash so it acts as a plain HBA.

bmh.01 said:
Exact hardware used? When you say "whatever disks you use", do you mean you have swapped all 4 drives?

And as a side note, raidz1 with 3TB drives is a) a terrible idea and b) asking for trouble.

Core i3, 16GB of RAM, generic Asus motherboard. Yes, I swapped all 4 drives.

What exactly makes it a "terrible idea" and asking for trouble? I've never heard of 3TB drives being a problem in raidz1.

mwroobel · Jul 1, 2015

BrandonB said:
What exactly makes it a "terrible idea" and asking for trouble? I've never heard of 3TB drives being a problem in raidz1.

Whenever you are dealing with multi-terabyte drives in a RAID or RAID-like environment, it is best practice to use multiple parity drives to ensure uptime without having to go to backup. This holds even in an enterprise environment, where you would have people to replace failed or failing hardware immediately and cold spares on hand to do so. With a home server, you are less likely to be able to do so immediately or to have the hardware needed. With one drive already failed in a single-parity array, another single failure would bring down your array. While you order another drive and wait for it, the chance of a second drive failing, either in the meantime or during the rebuild, makes the extra $120-$150 you would spend on the additional drive a prudent investment (even more so if you do not have a complete, separate backup).

BrandonB · Jul 2, 2015

mwroobel said:
Whenever you are dealing with multi-terabyte drives in a RAID or RAID-like environment, it is best practice to use multiple parity drives to ensure uptime without having to go to backup. This holds even in an enterprise environment, where you would have people to replace failed or failing hardware immediately and cold spares on hand to do so. With a home server, you are less likely to be able to do so immediately or to have the hardware needed. With one drive already failed in a single-parity array, another single failure would bring down your array. While you order another drive and wait for it, the chance of a second drive failing, either in the meantime or during the rebuild, makes the extra $120-$150 you would spend on the additional drive a prudent investment (even more so if you do not have a complete, separate backup).

I guess I could get 2 more 3TB drives and go raidz2. I'd just have 6 drives total, but quite a bit more storage space even with raidz2.

Aaand there, ordered 2 more. As soon as I get my HBA installed I'll get those drives in and rebuild the pool as raidz2. I pretty much lost everything a few days ago due to this very issue of using raidz1. I had 5TB+ of movies and TV shows, all gone. Working on rebuilding that now. I saved everything important though. So there's that.
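
The capacity side of that is easy to sanity-check with rough arithmetic. A minimal sketch, assuming usable space is simply (drives minus parity drives) times drive size, and ignoring ZFS metadata, padding, and the TB vs TiB difference; the function name is made up for illustration.

```python
def raidz_usable_tb(drive_count, drive_tb, parity):
    """Rough usable capacity in TB: data drives times drive size, no overhead."""
    if drive_count <= parity:
        raise ValueError("need more drives than parity devices")
    return (drive_count - parity) * drive_tb

current = raidz_usable_tb(4, 3, parity=1)  # 4x 3TB in raidz1
planned = raidz_usable_tb(6, 3, parity=2)  # 6x 3TB in raidz2
print(current, planned)  # prints: 9 12
```

So the 6-drive raidz2 layout gives roughly 3TB more raw space than the 4-drive raidz1 while tolerating two failed drives instead of one.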

mwroobel · Jul 2, 2015

BrandonB said:
I guess I could get 2 more 3TB drives and go raidz2. I'd just have 6 drives total, but quite a bit more storage space even with raidz2.

Aaand there, ordered 2 more. As soon as I get my HBA installed I'll get those drives in and rebuild the pool as raidz2. I pretty much lost everything a few days ago due to this very issue of using raidz1. I had 5TB+ of movies and TV shows, all gone. Working on rebuilding that now. I saved everything important though. So there's that.

Order yourself a 6TB or 8TB drive as well, put it in an external case, and use it as a backup target. That way you have a full (or almost full, if 6TB) copy even if your array fails completely.

BrandonB · Jul 2, 2015

mwroobel said:
Order yourself a 6TB or 8TB drive as well, put it in an external case, and use it as a backup target. That way you have a full (or almost full, if 6TB) copy even if your array fails completely.

Perhaps one day. I'm not too concerned about it right now. Everything on my NAS box is replaceable and the important stuff is backed up to my cloud storage.

Granted the time spent on recovering my data from other sources would not be ideal, so I may do that sooner rather than later.

bmh.01 · Jul 2, 2015

As all the drives have been replaced and it's happening on multiple drives, replace motherboard -> RAM -> CPU, in that order. If you have something compatible you can use without buying new, even better.

It may be worth a post on the FreeNAS forum. You'll be lined up and shot for using that hardware and raidz1 first, but just endure that.

Edit: What does a scrub produce? The same sort of errors? Be aware, though, that if a hardware problem is damaging data in flight, a scrub may corrupt some or all of your data.

TCM2 · Jul 2, 2015

No one has suggested a memtest by now?

BrandonB · Jul 2, 2015

bmh.01 said:
As all the drives have been replaced and it's happening on multiple drives, replace motherboard -> RAM -> CPU, in that order. If you have something compatible you can use without buying new, even better.

It may be worth a post on the FreeNAS forum. You'll be lined up and shot for using that hardware and raidz1 first, but just endure that.

Edit: What does a scrub produce? The same sort of errors? Be aware, though, that if a hardware problem is damaging data in flight, a scrub may corrupt some or all of your data.

Could you explain to me exactly what's wrong with my hardware choice? 6x 3TB WD Red drives, Core i3, 16GB of RAM, and an HBA replacing the onboard SATA controller.

I don't see anything wrong with that. It's just standard PC hardware, and I'm running ZFS exactly because I don't have a ton of money to invest in enterprise-level RAID hardware. It's not like I'm mixing drive sizes, mixing 512-byte with 4K sectors, or mixing non-NAS drives with NAS drives. It's a pretty clean setup, I thought. If it's the fact that I chose raidz1, well, that's a non-issue as well. I'm switching to raidz2 as soon as the other drives I ordered come in.

The only time the checksum errors happen is when I scrub the pool. If we can rule out the HDDs being the problem, I'm going to guess it's either the onboard SATA controller being weird or perhaps a RAM failure. I'm fixing the onboard SATA issue, so we'll see if that fixes it.

mwarps · Jul 2, 2015

Welcome to using ZFS with non-ECC memory.

You should not be running a "production" ZFS system with non-ECC memory.

You can get a 5-year-old 8-core server with 32GB ECC DDR2 and an HBA for $400 on eBay without breaking a sweat.

EDIT:

Just to clarify: your data needs to be verified. If you are storing important or irreplaceable data, good luck. If you're just fucking around, whatever, doesn't matter.

BrandonB · Jul 2, 2015

mwarps said:
Welcome to using ZFS with non-ECC memory.

You should not be running a "production" ZFS system with non-ECC memory.

You can get a 5-year-old 8-core server with 32GB ECC DDR2 and an HBA for $400 on eBay without breaking a sweat.

EDIT:

Just to clarify: your data needs to be verified. If you are storing important or irreplaceable data, good luck. If you're just fucking around, whatever, doesn't matter.

When I started using FreeNAS and ZFS, no one had ever recommended ECC RAM. If I could find an LGA 1156 server board with ECC support I'd switch over. Otherwise I'm looking at a new motherboard, CPU, and RAM (about $415 on Newegg) just to start running ECC RAM. It's just not in the budget right now.

I certainly understand there is a right way and a wrong way to do things, or at least, the standard and generally accepted way to do things, but I'm working with what I've got and honestly ZFS has done a fine job at keeping my data in good shape even without ECC RAM for the past 4 years.

Rest assured, I don't like doing things half assed. As soon as it's in the budget, I'll switch over to ECC RAM.

shetu · Jul 2, 2015

That was in the past. These days everyone recommends ECC RAM.

mwarps · Jul 2, 2015

BrandonB said:
When I started using FreeNAS and ZFS, no one had ever recommended ECC RAM. If I could find an LGA 1156 server board with ECC support I'd switch over. Otherwise I'm looking at a new motherboard, CPU, and RAM (about $415 on Newegg) just to start running ECC RAM. It's just not in the budget right now.

I certainly understand there is a right way and a wrong way to do things, or at least, the standard and generally accepted way to do things, but I'm working with what I've got and honestly ZFS has done a fine job at keeping my data in good shape even without ECC RAM for the past 4 years.

Rest assured, I don't like doing things half assed. As soon as it's in the budget, I'll switch over to ECC RAM.

ECC for ZFS has been strongly suggested since ZFS was invented. The FS was designed with integrity at the forefront, and the copy-on-write functionality is at the core of that integrity and of that strong suggestion.

Regardless, there is likely very little that can be done to fix this pool. The data on it is suspect. If you have backups from before you noticed the scrub failures, good. If not? Cool, too.

olavgg · Jul 3, 2015

Every system should use ECC memory, not only ZFS systems. That said, I hardly believe these issues are caused by the lack of ECC memory. Most likely it's a controller issue. It could also be a bad cable (SMART will report that) or a bad power supply causing the controller on the hard drive to reset.

TCM2 · Jul 3, 2015

Just do a memtest, ffs. How hard can it be?

BrandonB · Jul 3, 2015

Doesn't really matter now. I just ordered all new hardware with ECC support and the RAM to go with it.
