ZFS Scrub - many checksum errors towards end. Where should I look?

jamesponty

Learned friends,

I'm having an issue with my main storage pool. It generates a large number of checksum errors across multiple, seemingly random drives (10-disk raidz2) during every scrub. I'm also suddenly getting SFV and CRC errors on downloaded archives, so something is definitely not right.

The scrub always seems fine for 90% of the pool, with the last 10% generating the errors and eventually failing the drives to degraded. It seems it's the new data that is having the problem.

I've run iostat and no drive errors of any kind have been detected, so it doesn't look like the HBA or the expander is the issue to me. Is this a safe assumption?

How should I test from here? I've run 4 hours of memtest without incident, as I initially thought it would be the ECC memory. Would the power supply be the next thing to swap out? I'm not seeing any system instability at all.

Any thoughts? Checks I could do? I'm at a loss.

Thanks for any and all suggestions.
 
If you cannot trust the system anymore and the data is important, I would move the whole pool to a new system until you fix the error. A scrub on such a system may destroy even more data. Such errors are hard to track down if you do not have spare hardware. The PSU is a good start, however.

I would recommend looking at the drives' SMART data anyway; often you get at least a clue about what is going on. I've had more success finding memory instabilities using stress testing tools like Prime95 than with memtest, because memtest does not put a lot of stress on the memory controller, it just tests every cell.
 
Cheers Guys,

It's a Supermicro board of some description with 8GB ECC RAM and an on-board SAS controller connected to a Chenbro expander. I might buy another box and back it all up first before I start playing.

If it was the expander playing up, should I be getting read and write disk errors? I'll give Prime95 a go as well and see what happens. Is there a bootable equivalent? It's not a Windows box.
 
I would suspect the PSU, memory (even ECC RAM can go bad), cables (both I/O and power) and the expander.

Try to locate the fault using dummy data on other drives before messing around with your real data (get some old ones or spares to test with).
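
A rough sketch of what such a dummy-data test could look like, in Python. This is only an illustration under assumptions: the function names are made up, and you would point the path at a scratch pool or spare drive behind the suspect hardware. It writes random data, records a SHA-256 digest as it goes, then re-reads the file several times and reports any pass where the digest differs.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # read and write in 1 MiB blocks


def write_test_file(path, size_mib):
    """Write size_mib MiB of random data to path; return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            block = os.urandom(CHUNK)
            digest.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    return digest.hexdigest()


def read_digest(path):
    """Re-read the file from start to end and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected, passes=3):
    """Return the pass numbers (1-based) whose digest did not match expected."""
    return [n for n in range(1, passes + 1) if read_digest(path) != expected]
```

Something like verify(path, write_test_file(path, 1024)) returning an empty list means every pass matched. Bear in mind that re-reads may be served from cache rather than from the disks, so a test file larger than RAM gives a more honest result.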
 
Keep in mind that a scrub works "first in, first checked", sort of, so your theory about it being the latest data is probably correct.
 
If it was the expander playing up, should I be getting read and write disk errors? I'll give Prime95 a go as well and see what happens. Is there a bootable equivalent? It's not a Windows box.

It should be on a number of live CDs/USBs; see http://en.wikipedia.org/wiki/BartPE for example.
 
First thing: BACK EVERYTHING UP. A RAID throwing errors on a consistent basis like that is not to be trusted.
For one thing: which OS? Solaris? FreeBSD? Linux?
I'd suggest installing smartmontools (there should be versions for all of the above) and starting to check the drives. There's a decent chance that at least one is dying on you.
 
I am suspicious of bad memory. If you run the scrub 2-3 times in a row, are the checksum errors the same on the same drives? I had an ECC stick go completely bad - because it wasn't the lowest one, the OS wasn't seeing issues - that memory range was being used for I/O buffers and such and was causing massive corruption.
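
To answer the "same drives each time?" question without relying on memory, one option is to save the zpool status output after each scrub and compare which devices show a non-zero CKSUM count. A minimal Python sketch, assuming the usual NAME / STATE / READ / WRITE / CKSUM column layout with plain integer counts (abbreviated counts such as "1.2K" are not handled); the function names are hypothetical.

```python
import re

# Matches config rows such as "    c0t1d0  DEGRADED  0  0  57  too many errors".
# The header row is skipped because its READ/WRITE/CKSUM fields are not digits.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def cksum_errors(status_text):
    """Map device name -> CKSUM count for every row with a non-zero count."""
    errors = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(5)) > 0:
            errors[m.group(1)] = int(m.group(5))
    return errors


def compare_scrubs(first, second):
    """Compare two cksum_errors() results taken after separate scrubs."""
    a, b = set(first), set(second)
    return {
        "both": sorted(a & b),         # same devices failing twice
        "only_first": sorted(a - b),   # errors that did not repeat
        "only_second": sorted(b - a),  # devices that only failed later
    }
```

If "both" stays mostly empty while the other two lists keep changing between scrubs, the errors are wandering between drives, which points away from any single disk and towards something they share (memory, PSU, cables, expander).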
 
You mention SFV and CRC errors on downloads. Are these stored on the ZFS? Are they stored through the expander?

If they are on the ZFS, then I'd say you have mobo problems (or PSU).

If not on the ZFS, but on the expander, then it is PSU, SAS cables, or the expander itself.

I'm not positive, but I don't think you can rule out the expander or cables just because you don't get read or write errors.
 
Running Solaris. I've run the scrub twice now and the checksum errors appear on multiple random drives (different each time), with the other drives being fine. My thinking now is that if false checksum errors are being generated, I'll need two scrubs to see if the problem has been solved. Is this correct? It also means there definitely has been, or still is, data corruption going on.

I ran a Prime95 test overnight (10 hrs) without issue, for what it's worth. I also checked SMART data on all drives and they are clean.

Downloads are all going straight onto the ZFS pool, through the expander.

I think I'll try pulling the second stick of memory and fitting a new PSU. They are the easiest/cheapest to replace. I really want to back it all up, but it's 17TB of data so it would get a bit expensive.

What's confusing me is that the existing information seems rock solid (first 90% of the scrub). I would have thought RAM issues would have impacted the initial checks.

Thanks for the input guys.
 
Normally, when ZFS reports errors on different disks on every scrub, it means the error is not on the disks, but somewhere else. Maybe in the PSU or something:
https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta

Are the errors reported only when scrubbing? Because scrub is a ZFS-only thing, hardware RAID does not have scrub. I wonder if hardware RAID would have caught these problems.
 
Cheers for the link,

Yep, only on scrub. With anything non-ZFS I'd be blissfully unaware of the problem.

Pulled the high stick of RAM and am testing now. After that, the PSU. After that, work towards buying hard drives to make a backup.
 
Thanks Danswartz, bad memory seems to be exactly it. I ran both memtest and Prime95 without error, but after pulling the high-end stick of RAM I've completely scrubbed without any errors. Hopefully it stays that way.
 
Odd that neither test found it. I'd try one more experiment. Swap the stick with the one you pulled. If that still works, it's possible your mobo is flaking out (and the high slot is dodgy...)
 