Why you run periodic zfs pool scrubs :)

danswartz

My weekly pool scrub turned up 1 checksum error out of the 6 disks in a 3x2 raid10 pool. I did 'zpool clear tank' and re-ran the scrub. Again, 1 checksum error on the same drive. smartctl says the drive is healthy. I dumped out all the info on the 6 drives, looking for the grown defect list counts. Here:

Elements in grown defect list: 6
Elements in grown defect list: 187
Elements in grown defect list: 3
Elements in grown defect list: 40
Elements in grown defect list: 2
Elements in grown defect list: 7

The drive throwing the checksum error is the one with 187 defects! These are Seagate nearline 1TB SAS drives, about 2-1/2 years old. I have a rush order from Amazon for two more of them (one as an instant replacement and one as a spare). I'm not worried about the single-digit drives. Not sure whether to replace the 40-defect drive - I'm inclined to just watch it and see if the list grows or not.
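For anyone wanting to do the same, something like this pulls that line out of each drive in one go (the device names are just an example; these are SAS drives, so smartctl reports the grown defect list directly):

Code:
# adjust the device list for your own layout
for d in /dev/sd{a..f}; do
    printf '%s: ' "$d"
    smartctl -a "$d" | grep 'Elements in grown defect list'
done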
 
Mine was slightly more interesting :)

The day after I went on vacation, it reported 2 disks missing, and several days later, another 2 disks went missing.

Seems when my disks throw a write error, the HBA goes nuts and just drops the disk. (I moved the problem disks to the onboard ICH controller instead.)

After doing many SMART long scans, repairing each uncorrectable sector on each disk, and doing 3 scrubs (disks kept dropping due to too many checksum errors from the sector remapping), it has been stable again for 2 weeks.

This weekend I had another sector go bad on one of the disks. The normal scrub fixed it up in its weekly run.

I really need to look into some way to do SMART scans and auto-fix the uncorrectable sectors sometime (force remapping).
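If I ever get around to it, it would probably start as something dumb like this (untested sketch; the device glob is a placeholder, and actually forcing the remap still means overwriting the affected LBA by hand, which this leaves out):

Code:
# queue a long self-test on every disk (runs in the background on the drive itself)
for d in /dev/sd?; do
    smartctl -t long "$d"
done

# ...hours later, once the tests are done: show results and any still-pending sectors
for d in /dev/sd?; do
    echo "== $d =="
    smartctl -l selftest "$d" | head -n 10
    smartctl -A "$d" | grep Current_Pending_Sector
done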
 
I am amused at how useless the smart health detection is :) Pretty sure that one disk is dying. Just from redoing the scrub, the defect list has grown by 2.
 
I am amused at how useless the smart health detection is :) Pretty sure that one disk is dying. Just from redoing the scrub, the defect list has grown by 2.

Are you looking at individual SMART attributes or the overall drive PASS / FAIL? I ask because at work looking at 4 or so SMART attributes on each disk has predicted most drive failures however the full drive PASS / FAIL has only happened 2 times and I have sent in 75 to 100 RMAs over the years.
 
The overall status is really pointless; it will only fail if the disk is seriously worn out but still usable.
I have never seen a broken disk that is unusable fail SMART.

I have seen SMART tests take weeks on end and not finish.

Personally, the SMART values I watch most are:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       204
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

and note that this particular disk needs some serious help at the moment.
 
The overall status is really pointless; it will only fail if the disk is seriously worn out but still usable.
I have never seen a broken disk that is unusable fail SMART.

I have seen SMART tests take weeks on end and not finish.

Personally, the SMART values I watch most are:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       204
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

and note that this particular disk needs some serious help at the moment.

I have seen disks that are unusable fail SMART.

Reallocated and current pending are what I usually keep an eye on. If reallocated is over 25 and there are more pending, I RMA the drive. If there are pending sectors but none reallocated yet, I run VIVARD and see if I can get the drive to reallocate the sector. Basically, if it's less than 25 and the number doesn't keep growing after I do the reallocation, I'll give the drive a chance.
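If you want to automate that rule of thumb, a rough sketch (untested; the device glob and the 25-sector threshold are just my own numbers) could look like:

Code:
#!/bin/sh
# Flag drives with more than 25 reallocated sectors, or any pending sectors
THRESH=25
for d in /dev/sd?; do
    realloc=$(smartctl -A "$d" | awk '$2 == "Reallocated_Sector_Ct" {print $10}')
    pending=$(smartctl -A "$d" | awk '$2 == "Current_Pending_Sector" {print $10}')
    [ "${realloc:-0}" -gt "$THRESH" ] && echo "$d: $realloc reallocated - RMA candidate"
    [ "${pending:-0}" -gt 0 ] && echo "$d: $pending pending - try to force a reallocation"
done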
 
After seeing some troubling values in the SMART attributes, I will only RMA if a drive fails to pass a 4-pass badblocks run, 2 times in a row.


Code:
badblocks -wsv -p 2 /dev/disktotest
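Worth noting for anyone copying that: -w is the destructive write-mode test, so only run it on a drive that's already out of the pool. If you want a check that preserves data, badblocks also has a non-destructive read-write mode, -n:

Code:
badblocks -nsv -p 2 /dev/disktotest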
 
I know that Google said that SMART was not a good predictor of failures. However, in practice it has worked well for us at work, so I continue to use SMART and have Nagios monitor every single drive I have in all of the raid servers.
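The check itself doesn't have to be anything fancy; the usual Nagios plugin convention is just exit code 0 for OK, 1 for WARNING, 2 for CRITICAL with a one-line message. A stripped-down sketch of the idea (not our actual plugin; the attribute picks and thresholds are only illustrative):

Code:
#!/bin/sh
# check_smart_simple <device> - minimal Nagios-style SMART check (sketch)
DEV="$1"
pending=$(smartctl -A "$DEV" | awk '$2 == "Current_Pending_Sector" {print $10}')
realloc=$(smartctl -A "$DEV" | awk '$2 == "Reallocated_Sector_Ct" {print $10}')
if [ "${pending:-0}" -gt 0 ]; then
    echo "CRITICAL: $DEV has $pending pending sectors"; exit 2
elif [ "${realloc:-0}" -gt 0 ]; then
    echo "WARNING: $DEV has $realloc reallocated sectors"; exit 1
fi
echo "OK: $DEV looks healthy"; exit 0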
 
OP, do you have a script that interprets the zpool status after scrub and mails you if checksum errors were found?

If not, do you want one? (I need that thing, but have very little time to make one.)
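Something along these lines is what I'm picturing, just so we're talking about the same thing (completely untested sketch; the mail target is a placeholder, and a stricter version would parse the per-device CKSUM column instead of relying on -x):

Code:
#!/bin/sh
# Run after the scheduled scrub; mail the full status if anything looks off
MAILTO=root
if [ "$(zpool status -x)" != "all pools are healthy" ]; then
    zpool status -v | mail -s "zpool problems on $(hostname)" "$MAILTO"
fi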
 
I know that Google said that SMART was not a good predictor of failures. However, in practice it has worked well for us at work, so I continue to use SMART and have Nagios monitor every single drive I have in all of the raid servers.

From abstract:
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

SMART errors are a good predictor of drive failure, but no smart errors is not a good predictor of no failure. There is that large group of failures that occur with no smart errors.
 
OP, do you have a script that interprets the zpool status after scrub and mails you if checksum errors were found?

If not, do you want one? (I need that thing, but have very little time to make one.)

On a Linux VM, there is a plugin called check_zpool that queries the pool(s) on a ZFS box.
 
I just started using check_zpool last week, when I converted a server that holds Windows system images from btrfs on top of mdadm RAID6 over to ZFS.
 
Did you see any errors anywhere hinting that the disk delivered faulty data? Was the scrub the only indication of data corruption? What OS is this?
 
It's ZFS. It was detecting a checksum error, which means the checksum stored for a block of logical data didn't match what ZFS computed when it read the data back. That maps well with 189 defects (and growing). Yet 'smart health is OK'.
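For anyone following along, the per-device counters (and any files that couldn't be repaired) are visible with zpool status, and they only reset when you clear them:

Code:
zpool status -v tank    # per-vdev READ/WRITE/CKSUM counters, plus any affected files
zpool clear tank        # reset the counters after dealing with the flaky drive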
 
I know. It says so in the first post.

I wanted to know if you got any driver messages or syslog entries hinting at the corruption, other than the scrub result. And again, which OS is this? Just curious.
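(On Linux, for example, the sort of thing I'd grep for is along these lines; log paths differ by distro/OS, so treat it as a rough example:)

Code:
dmesg | grep -iE 'ata|scsi|sd[a-z]' | grep -i error
grep -iE 'i/o error|medium error' /var/log/messages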
 