HDD failing in ZFS, couple questions

SirMaster

So I seem to have an HDD failing, and I have a couple of questions about it.

Here is the SMART log of the dying drive:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   083   006    Pre-fail  Always       -       6846404
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   089   089   036    Pre-fail  Always       -       14592
  7 Seek_Error_Rate         0x000f   054   054   030    Pre-fail  Always       -       60133321131
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9147
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       48
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       585
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   058   045    Old_age   Always       -       34 (Min/Max 33/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       252
194 Temperature_Celsius     0x0022   034   042   000    Old_age   Always       -       34 (0 23 0 0)
197 Current_Pending_Sector  0x0012   089   088   000    Old_age   Always       -       1880
198 Offline_Uncorrectable   0x0010   089   088   000    Old_age   Offline      -       1880
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       37404870189985
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       26626247731
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       43462540760

Here is the SMART output from a same-model drive that is not dying:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       48464472
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   059   058   030    Pre-fail  Always       -       12887756042
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5312
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   053   045    Old_age   Always       -       38 (Min/Max 38/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       39
193 Load_Cycle_Count        0x0032   093   093   000    Old_age   Always       -       14988
194 Temperature_Celsius     0x0022   038   047   000    Old_age   Always       -       38 (0 22 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       148442659689107
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7468891416
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       21496206345
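
(Both tables came from smartctl out of smartmontools; the dying drive shows up as /dev/sda in the dmesg output further down, so the invocation was along these lines:)

Code:
# attribute table only
smartctl -A /dev/sda

# full report, including the ATA error log and self-test history
smartctl -a /dev/sda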

So obviously the reallocated sector count is bad, as it keeps going up.

The weird thing, though, and the question I have with regard to ZFS, is this.

The drive says this:

Code:
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       585
197 Current_Pending_Sector  0x0012   089   088   000    Old_age   Always       -       1880
198 Offline_Uncorrectable   0x0010   089   088   000    Old_age   Offline      -       1880

Attribute 187 is "Reported Uncorrectable Errors": the count of errors that could not be recovered using hardware ECC.

Yet my zpool status says everything appears fine:

Code:
        NAME                                         STATE     READ WRITE CKSUM
        nickarray                                    ONLINE       0     0     0
          raidz2-0                                   ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317581       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317592       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202646       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202647       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJX0B900055       ONLINE       0     0     0
            ata-WDC_WD20EVDS-63T3B0_WD-WCAVY6715008  ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F299TA          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F29PE7          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F10V30          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2LWDL          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2PL9K          ONLINE       0     0     0
            ata-ST3000DM001-9YN166_W1F0SQF6          ONLINE       0     0     0
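
(That's zpool status; with the pool named explicitly and -v for a verbose error listing, it would be something like:)

Code:
# pool health, per-device READ/WRITE/CKSUM counters, and any files with known errors
zpool status -v nickarray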

dmesg says this over and over again:

Code:
[345578.245758] sd 1:0:0:0: [sda]
[345578.245759] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[345578.245760] sd 1:0:0:0: [sda]
[345578.245761] Sense Key : Medium Error [current] [descriptor]
[345578.245763] Descriptor sense data with sense descriptors (in hex):
[345578.245764] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[345578.245768] c4 44 35 c0
[345578.245770] sd 1:0:0:0: [sda]
[345578.245771] Add. Sense: Unrecovered read error - auto reallocate failed
[345578.245773] sd 1:0:0:0: [sda] CDB:
[345578.245773] Read(16): 88 00 00 00 00 00 c4 44 35 88 00 00 01 00 00 00
[345578.245778] end_request: I/O error, dev sda, sector 3292804544
[345578.245791] ata1: EH complete


If there really are 585 uncorrectable errors, why isn't ZFS telling me there are checksum errors or something? It doesn't seem like the data I wrote to the array yesterday and today is being stored successfully on that drive.


Also, smartctl takes a good 5-6 seconds to respond for this disk, while all my other disks respond instantly. Any idea why this is or how it may be related? Isn't it just reading data off the firmware? Is it perhaps because there are 585 ATA error counts in SMART and it takes a while to read that data?

Just curious.

I'm ordering a replacement now and should get it Monday to perform the resilver. I'm not too worried since it's a 6-disk vdev with double parity.
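
Roughly what I plan to run once the new disk arrives (the device names below are placeholders, not my real by-id paths):

Code:
# take the failing disk out of service so it stops stalling the pool
zpool offline nickarray ata-FAILING_DRIVE_ID

# swap in the new disk and let the raidz2 vdev resilver onto it
zpool replace nickarray ata-FAILING_DRIVE_ID /dev/disk/by-id/ata-NEW_DRIVE_ID

# watch the resilver progress
zpool status nickarray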
 
Isn't it just reading data off the firmware? Is it perhaps because there are over 500 ATA error counts in SMART and it takes a while to read that data?

The drive is probably working in the background to try to recover the 1880 unreadable sectors.
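
(If that's what's going on, the drive's internal ATA error log should be full of entries, and a long error log is also a plausible reason for smartctl being slow to answer. Assuming smartmontools, something like this would show it:)

Code:
# dump the drive's internal ATA error log
smartctl -l error /dev/sda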
 
Interesting, perhaps yeah.

And by now it's up to 14880 reallocated and 1928 pending and uncorrectable sectors...

Should I pull the drive out now before the replacement comes Monday? I don't suppose there is ANY chance this is a cabling issue? I can't do anything till I get home in 4 hours.

Looks pretty grim.
 
I don't suppose there is ANY chance this is a cabling issue?

No, it's a hardware problem with the drive itself (bad heads...).

Should I pull the drive out now before the replacement comes Monday?

I will have to leave that to the ZFS gurus. Any ZFS advice I could give you would be based on less experience than you have.
 
ZFS never gets incorrect blocks. The drive knows they are incorrect and returns an error on read (as opposed to wrong data), and on write it reallocates them.
 
So since ZFS says everything is fine, that means the drive is, for now, still successfully storing my bits? Even though it has ever-increasing pending and uncorrectable sector counts?

If it weren't, there should be checksum errors from ZFS, since the data coming back from the disk wouldn't match what was written to it?
 
The drive must be returning correct data, or you'd be seeing read errors on the pool for that drive. I wouldn't trust it any farther than I could throw it, though.
 
Oh yeah. I certainly don't trust it.

Just trying to make sense of what's happening in my array, heh.

Thanks for the insight so far guys.
 
No, the drive delivers errors instead of the block. You can see that in syslog.

But the RAID reacts to that by reconstructing the block from the other drives and then overwriting the location on disk, which lets the disk reallocate the bad sector internally; presumably the next time you read that sector, you get the good copy from the reallocated location.

So the array as such is healthy. The disk is not.
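
(A scrub does that same repair pass proactively over every allocated block, so if you want ZFS to hit all the bad sectors now instead of whenever they next get read, something like this would do it:)

Code:
# read and verify every allocated block, repairing anything the disk can't return
zpool scrub nickarray

# check progress and per-device error counters afterwards
zpool status -v nickarray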
 