Why you run periodic zfs pool scrubs :)

danswartz

My weekly pool scrub turned up 1 checksum error out of the 6 disks in a 3x2 raid10 pool. I did 'zpool clear tank' and re-ran the scrub. Again, 1 checksum error on the same drive. smartctl says the drive is healthy. I dumped out all the info on the 6 drives, looking for the grown defect list counts. Here:

Elements in grown defect list: 6
Elements in grown defect list: 187
Elements in grown defect list: 3
Elements in grown defect list: 40
Elements in grown defect list: 2
Elements in grown defect list: 7

The drive throwing the checksum error is the one with 187 defects! These are Seagate nearline 1TB SAS drives, about 2-1/2 years old. I have a rush order from Amazon for two more of them (one as an instant replacement and one as a spare). I'm not worried about the single-digit drives. Not sure whether to replace the 40-defect drive - I'm inclined to just watch it and see if the list grows or not.
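For anyone wanting to do the same, something like this pulls that line out of each drive in one go (the device names are just an example; these are SAS drives, so smartctl reports the grown defect list directly):

Code:
# adjust the device list for your own layout
for d in /dev/sd{a..f}; do
    printf '%s: ' "$d"
    smartctl -a "$d" | grep 'Elements in grown defect list'
done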
 
Mine was slightly more interesting :)

The day after I went on vacation, it reported 2 disks missing, and several days later, another 2 disks went missing.

Seems when my disks throw a write error, the HBA goes nuts and just drops the disk. (I moved the problem disks to the onboard ICH controller instead.)

After doing many SMART long scans, repairing each uncorrectable sector on each disk, and doing 3 scrubs (disks kept dropping due to too many checksum errors from the sector remapping), it has been stable again for 2 weeks.

This weekend I had another sector go bad on one of the disks. The normal scrub fixed it up in its weekly run.

I really need to look into some way to do SMART scans and auto-fix the uncorrectable sectors sometime (force remapping).
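If I ever get around to it, it would probably start as something dumb like this (untested sketch; the device glob is a placeholder, and actually forcing the remap still means overwriting the affected LBA by hand, which this leaves out):

Code:
# queue a long self-test on every disk (runs in the background on the drive itself)
for d in /dev/sd?; do
    smartctl -t long "$d"
done

# ...hours later, once the tests are done: show results and any still-pending sectors
for d in /dev/sd?; do
    echo "== $d =="
    smartctl -l selftest "$d" | head -n 10
    smartctl -A "$d" | grep Current_Pending_Sector
done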
 
I am amused at how useless the smart health detection is :) Pretty sure that one disk is dying. Just from redoing the scrub, the defect list has grown by 2.
 
I am amused at how useless the smart health detection is :) Pretty sure that one disk is dying. Just from redoing the scrub, the defect list has grown by 2.

Are you looking at individual SMART attributes or the overall drive PASS / FAIL? I ask because at work looking at 4 or so SMART attributes on each disk has predicted most drive failures however the full drive PASS / FAIL has only happened 2 times and I have sent in 75 to 100 RMAs over the years.
 
The overall status is really pointless; it will only fail if the disk is seriously worn out but still usable.
I have never seen a broken disk that is unusable fail SMART.

I have seen SMART tests take weeks on end and not finish.

Personally, the SMART values I watch most are:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       204
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

and note that this particular disk needs some serious help at the moment.
 
The overall status is really pointless; it will only fail if the disk is seriously worn out but still usable.
I have never seen a broken disk that is unusable fail SMART.

I have seen SMART tests take weeks on end and not finish.

Personally, the SMART values I watch most are:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       204
196 Reallocated_Event_Count 0x0032   188   188   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

and note that this particular disk needs some serious help at the moment.

I have seen disks that are unusable fail SMART.

Reallocated and current pending are what I usually keep an eye on. If reallocated is over 25 and there are more pending, I RMA the drive. If there are pending sectors but none reallocated yet, I run VIVARD and see if I can get the drive to reallocate the sector. Basically, if it's less than 25 and the number doesn't keep growing after I do the reallocation, I'll give the drive a chance.
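If you want to automate that rule of thumb, a rough sketch (untested; the device glob and the 25-sector threshold are just my own numbers) could look like:

Code:
#!/bin/sh
# Flag drives with more than 25 reallocated sectors, or any pending sectors
THRESH=25
for d in /dev/sd?; do
    realloc=$(smartctl -A "$d" | awk '$2 == "Reallocated_Sector_Ct" {print $10}')
    pending=$(smartctl -A "$d" | awk '$2 == "Current_Pending_Sector" {print $10}')
    [ "${realloc:-0}" -gt "$THRESH" ] && echo "$d: $realloc reallocated - RMA candidate"
    [ "${pending:-0}" -gt 0 ] && echo "$d: $pending pending - try to force a reallocation"
done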
 
After seeing some troubling values in the SMART attributes, I will only RMA if a drive fails to pass a 4-pass badblocks run, 2 times in a row.


Code:
badblocks -wsv -p 2 /dev/disktotest
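Worth noting for anyone copying that: -w is the destructive write-mode test, so only run it on a drive that's already out of the pool. If you want a check that preserves data, badblocks also has a non-destructive read-write mode, -n:

Code:
badblocks -nsv -p 2 /dev/disktotest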
 
I know that Google said that SMART was not a good predictor of failures. However, in practice it has worked well for us at work, so I continue to use SMART and have Nagios monitor every single drive I have in all of the raid servers.
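The check itself doesn't have to be anything fancy; the usual Nagios plugin convention is just exit code 0 for OK, 1 for WARNING, 2 for CRITICAL with a one-line message. A stripped-down sketch of the idea (not our actual plugin; the attribute picks and thresholds are only illustrative):

Code:
#!/bin/sh
# check_smart_simple <device> - minimal Nagios-style SMART check (sketch)
DEV="$1"
pending=$(smartctl -A "$DEV" | awk '$2 == "Current_Pending_Sector" {print $10}')
realloc=$(smartctl -A "$DEV" | awk '$2 == "Reallocated_Sector_Ct" {print $10}')
if [ "${pending:-0}" -gt 0 ]; then
    echo "CRITICAL: $DEV has $pending pending sectors"; exit 2
elif [ "${realloc:-0}" -gt 0 ]; then
    echo "WARNING: $DEV has $realloc reallocated sectors"; exit 1
fi
echo "OK: $DEV looks healthy"; exit 0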
 
OP, do you have a script that interprets the zpool status after scrub and mails you if checksum errors were found?

If not, do you want one? (I need that thing, but have very little time to make one.)
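Something along these lines is what I'm picturing, just so we're talking about the same thing (completely untested sketch; the mail target is a placeholder, and a stricter version would parse the per-device CKSUM column instead of relying on -x):

Code:
#!/bin/sh
# Run after the scheduled scrub; mail the full status if anything looks off
MAILTO=root
if [ "$(zpool status -x)" != "all pools are healthy" ]; then
    zpool status -v | mail -s "zpool problems on $(hostname)" "$MAILTO"
fi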
 
I know that Google said that SMART was not a good predictor of failures. However, in practice it has worked well for us at work, so I continue to use SMART and have Nagios monitor every single drive I have in all of the raid servers.

From abstract:
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

SMART errors are a good predictor of drive failure, but no smart errors is not a good predictor of no failure. There is that large group of failures that occur with no smart errors.
 
OP, do you have a script that interprets the zpool status after scrub and mails you if checksum errors were found?

If not, do you want one? (I need that thing, but have very little time to make one.)

On a Linux VM, there is a plugin called check_zpool that queries the pool(s) on a ZFS box.
 
I just started using check_zpool last week, when I converted a server that holds Windows system images from btrfs on top of mdadm RAID6 over to ZFS.
 
Did you see any errors anywhere hinting that the disk delivered faulty data? Was the scrub the only indication of data corruption? What OS is this?
 
It's ZFS. It was detecting a checksum error, which means the checksum stored for a block of logical data didn't match what ZFS computed when it read the data back. That maps well with 189 defects (and growing). Yet 'smart health is OK'.
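For anyone following along, the per-device counters (and any files that couldn't be repaired) are visible with zpool status, and they only reset when you clear them:

Code:
zpool status -v tank    # per-vdev READ/WRITE/CKSUM counters, plus any affected files
zpool clear tank        # reset the counters after dealing with the flaky drive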
 
I know. It says so in the first post.

I wanted to know if you got any driver messages or syslog entries hinting at the corruption, other than the scrub result. And again, which OS is this? Just curious.
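(On Linux, for example, the sort of thing I'd grep for is along these lines; log paths differ by distro/OS, so treat it as a rough example:)

Code:
dmesg | grep -iE 'ata|scsi|sd[a-z]' | grep -i error
grep -iE 'i/o error|medium error' /var/log/messages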
 