When do YOU replace a drive?

Zarathustra[H]

So, everyone seems to have a different take on when it is time to RMA a drive.

This happened today in my pool.

[attached screenshot: upload_2019-8-4_21-28-24.png]


I wondered if the drive was really going bad, or if something else was going on, so I pulled it out of the backplane, and reseated it, cleared the errors, resilvered the pool, and am now running a pool scrub to see if I can tease the errors out again.

If the drive fails again, I'm going to RMA it, otherwise I'll consider it one of those random occurrences we keep redundancy for.

When do you guys typically decide to RMA yours?
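The reseat-and-retest sequence described above corresponds roughly to these ZFS commands (a sketch; the pool name "tank" and device name "da5" are placeholders, substitute your own):

```shell
zpool offline tank da5    # take the suspect drive offline before pulling it
# ...physically reseat the drive in the backplane...
zpool online tank da5     # bring it back; ZFS resilvers the missed writes
zpool clear tank da5      # reset the logged read/write/checksum error counters
zpool scrub tank          # full scrub to try to tease the errors out again
zpool status -v tank      # watch for new errors against that drive
```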
 
I've never had to RMA a drive. That said, I don't replace them until they throw SMART errors/alerts, start clicking or grinding, or have a delayed spin-up. Then they are replaced, or simply retired in favor of a newer, larger drive that was in the system anyway.
 
I've never had to RMA a drive. That said, I don't replace them until they throw SMART errors/alerts, start clicking or grinding, or have a delayed spin-up. Then they are replaced, or simply retired in favor of a newer, larger drive that was in the system anyway.

Hmm.

I originally just ran smartctl -A to display the summary attribute statistics on this drive, which looked OK, but when I re-ran it with a lowercase -a for the full report, these errors popped up, which are new to me. I may wind up having to RMA it after all.

Code:
SMART Error Log Version: 1
ATA Error Count: 2
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 15309 hours (637 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+10:02:03.371  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:02:03.371  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 15309 hours (637 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  26d+10:02:00.275  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  26d+10:02:00.275  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:02:00.275  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:01:58.594  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+10:01:58.533  READ FPDMA QUEUED


The funny part is, it seems to be behaving very nicely right now.
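Two numbers in that log are easy to sanity-check (a quick sketch; the 49.710-day wrap matching a 32-bit millisecond counter is my reading of the note, not something the log states):

```python
# Sanity-check two figures from the smartctl output above.

# 15309 power-on hours should equal "637 days + 21 hours":
days, hours = divmod(15309, 24)
assert (days, hours) == (637, 21)

# Powered_Up_Time "wraps after 49.710 days": that is exactly what a
# 32-bit counter of milliseconds would do.
wrap_days = 2**32 / (1000 * 60 * 60 * 24)
assert round(wrap_days, 3) == 49.710
```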
 
Can you run that same test on another system (I am questioning PSU issues), or at least with another power cable and SATA cable, to duplicate the results?
 
Can you run that same test on another system (I am questioning PSU issues), or at least with another power cable and SATA cable, to duplicate the results?

The drive is in a caddy attached to the backplane of my Norco 4216, so there are no power or SATA cables in use. I could pull the drive and test it separately, but I am not sure what that would buy me; testing it in the system right now, it is not failing. I will know more when the scrub I am currently running completes.

I am retiring this case/backplane/PSU in a couple of weeks, moving everything over to a Supermicro SC846, so if that is the source of the problem, it is a short-lived one.

You are right that the error occurring at "disk power on" is suspicious. This system hasn't rebooted in over 3 months, so it does kind of sound like it lost power for a bit and restarted.

I'm not sure I think the PSU is to blame, as none of the 11 other disks in the same backplane had the same issue. I'm leaning towards poor connection to the backplane right now. Hopefully I solved it by reseating the drive.
 
Yea that sounds good. I assume its data is on another drive and backed up so if it does act up again you can just pull it and send it off.
 
I had a WD Black 1TB start making some strange grinding sound when reading and writing, drive performed fine but sounded terrible.
Emailed WD support and they suggested to RMA the drive so I went with the Advance RMA so I could copy the data from the drive to the replacement.
A day or so later I get an email from WD Support saying that they didn't have any 1TB Blacks in stock and if I would be fine with a 2TB Black instead.
I read that and was like, HELL YA, I'LL TAKE A 2TB BLACK, lol. The 2TB Black was over $200 at the time.

[attached image: wd-support-upgrade-2tb.jpg]
 
Yea that sounds good. I assume its data is on another drive and backed up so if it does act up again you can just pull it and send it off.


Yup. This pool is the ZFS equivalent of a RAID-60: two striped RAIDZ2 vdevs, each with dual-disk redundancy, so I can pop that drive out if I need to. The system will tell me, but I can technically lose 1-3 more drives (depending on which ones) before I have local data loss, and if I do have local data loss, I can always restore from my remote backup server.

It would be a pain in the butt, but no data would be lost.
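The 1-3 range above can be checked with a toy model. This assumes a 12-disk pool split into two 6-wide RAIDZ2 vdevs (each vdev tolerates two failed disks), with the suspect drive counted as already failed; the exact layout is my assumption, not confirmed in the thread:

```python
from itertools import combinations

# Hypothetical layout: drives 0-5 in vdev 0, drives 6-11 in vdev 1.
vdev = {d: (0 if d < 6 else 1) for d in range(12)}
already_failed = {0}  # the drive that threw the errors

def pool_survives(failed):
    # Each RAIDZ2 vdev survives up to 2 failures; the pool needs every vdev.
    return all(sum(1 for d in failed if vdev[d] == v) <= 2 for v in (0, 1))

healthy = [d for d in range(12) if d not in already_failed]

# Largest k such that EVERY choice of k additional failures is survivable:
guaranteed = max(k for k in range(len(healthy) + 1)
                 if all(pool_survives(already_failed | set(c))
                        for c in combinations(healthy, k)))

# Largest k such that SOME choice of k additional failures is survivable:
best_case = max(k for k in range(len(healthy) + 1)
                if any(pool_survives(already_failed | set(c))
                       for c in combinations(healthy, k)))

print(guaranteed, best_case)  # -> 1 3
```

Any single additional failure is survivable, and up to three more are survivable if they land in the right places, which matches the 1-3 range stated above.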
 
Depends on the scenario. At work, I take no chances: at the slightest wiggle in SMART data, the drive gets yanked and swapped. That's the point of the 4-hour support we pay for; the vendor can figure out whether the drive was bad or not. At home? I am god awful. However, I have a backup process that backs up to 2 separate external drive enclosures on a regular basis. I'll use my drives until they murder themselves, and I replace my external backup drives every few years regardless of whether they are dying. I don't do anything fancy with my storage at home, as the total amount of data I care about is under 1TB. I do enough storage at work, so I hate doing it at home.
 
You are right that the error occurring at "disk power on" is suspicious. This system hasn't rebooted in over 3 months, so it does kind of sound like it lost power for a bit and restarted.

I didn't read that as happening at disk power-up, but as happening during the 15,309th hour that the disk has been in a powered-on state. Even looking at the command trace, it had been powered up for 26+ days (the counter wraps at 49.7 days).

Given that the two errors were less than two seconds apart on the same LBA (a write followed by a verify, it looks like), as long as no more errors pop up (including on a re-test of that same LBA), I'd say it's a safe bet that it was just a fluke.
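Out of curiosity, the reported LBA can be reconstructed from the raw registers. A quick sketch; the 28-bit layout (low nibble of DH, then CH, CL, SN) is my reading of how these legacy register dumps are laid out:

```python
# Decode a 28-bit ATA LBA from the SN/CL/CH/DH registers shown in the
# SMART error log.
def lba28(sn, cl, ch, dh):
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Registers from Error 2 above: ER=40 ST=51 SC=00 SN=ff CL=ff CH=ff DH=0f
assert lba28(0xFF, 0xFF, 0xFF, 0x0F) == 0x0FFFFFFF == 268435455

# ER = 0x40 has bit 6 set, which is the UNC (uncorrectable data) flag,
# matching the "Error: UNC" annotation in the log.
assert 0x40 & (1 << 6)
```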
 
I didn't read that as happening at disk power-up, but as happening during the 15,309th hour that the disk has been in a powered-on state. Even looking at the command trace, it had been powered up for 26+ days (the counter wraps at 49.7 days).

Given that the two errors were less than two seconds apart on the same LBA (a write followed by a verify, it looks like), as long as no more errors pop up (including on a re-test of that same LBA), I'd say it's a safe bet that it was just a fluke.

Thanks for that input.

I have no experience reading these error logs, so I am learning how to interpret them.

My previous drives just had read errors, but never logged detailed error entries like this in SMART before.

Maybe that's an Enterprise feature? These are the first enterprise drives I've ever owned.
 
This has been in SMART for a long time, even on regular desktop drives.

Here is a report from a Hitachi Deskstar with 68 thousand power-on hours; the recorded problem happened at 24 thousand power-on hours.

Code:
fileserver1 Administrator # smartctl --all /dev/sde
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.5-gentoo-20181129-1450-fileserver1] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K3000
Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F31NWX7D
LU WWN Device Id: 5 000cca 369d797b3
Firmware Version: MN6OA5C0
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Aug  5 13:57:08 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (18523) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 309) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86
  3 Spin_Up_Time            0x0007   161   161   024    Pre-fail  Always       -       431 (Average 283)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       91
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   138   138   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   091   091   000    Old_age   Always       -       68676
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       116
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       116
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 18/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 24483 hours (1020 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 f0 88 0c 02  Error: UNC at LBA = 0x020c88f0 = 34375920
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 00 00 97 0c 40 00   5d+18:01:50.367  READ FPDMA QUEUED
  60 80 b0 80 96 0c 40 00   5d+18:01:50.366  READ FPDMA QUEUED
  60 80 a8 00 96 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
  60 80 a0 80 95 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
  60 80 98 00 95 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
The only drives I ever had to RMA were the old 150GB WD Raptors I ran in RAID. I RMA'd them twice each before the warranty was up. Besides that, I've never had to RMA a drive within a warranty period; I just tossed them once I started getting errors.
 
At work I have done over 75 RMAs. However, since all of our Seagate 7200.X drives are out of service, the RMAs have dropped to 1 or 2 a year instead of a few each month.
 
This has been in SMART for a long time, even on regular desktop drives.

Here is a report from a Hitachi Deskstar with 68 thousand power-on hours; the recorded problem happened at 24 thousand power-on hours.

(SMART output snipped; duplicate of the full report quoted above.)

Huh. Prior to my Seagates, I used WD Reds and never saw anything of the sort in them, despite having to RMA a couple and an additional two dying from old age.
 
I think older Seagates had this as well. I don't have many WD Red drives. At work I purchase only 7200 RPM drives for RAID arrays.
 
I run a few small RAIDs at home and in the office. I usually "retire" drives after about 3-4 years, usually to increase the storage density of the array. I've been lucky, as I've only had three failures in the past decade or so. If I'm getting bad-sector errors, that's when I replace the drive, even if it's still dishing out data. I only use WD and Hitachi NAS drives and Reds, and for the last 2 years only WD Golds.
 
I haven't actually had a drive die on me in years, which is crazy because my 3TB drive I just removed has been active basically 24/7 since 2010.

Simply replacing drives due to lack of space has been the main reason why. (Such as replacing that previously mentioned 3TB drive with a shucked 10TB)
 
Hmm.

Same error on a different drive in my pool now. Just one error, ZFS recovered gracefully and the drive is still online.

Thing is, since the last one I went through a major server upgrade. I changed EVERYTHING.

Different motherboard, different CPUs, different power supplies, different SAS controllers, different backplane; I even did a major OS upgrade.

The drive that originally had the error has been fine ever since.

This is kind of strange.
 
Hmm.

Same error on a different drive in my pool now. Just one error, ZFS recovered gracefully and the drive is still online.

Thing is, since the last one I went through a major server upgrade. I changed EVERYTHING.

Different motherboard, different CPUs, different power supplies, different SAS controllers, different backplane; I even did a major OS upgrade.

The drive that originally had the error has been fine ever since.

This is kind of strange.

I'm starting to wonder if maybe this is a symptom of setting TLER too low.

The Seagate Enterprise drives I have offer a configurable error recovery time limit, which I have set to 2 seconds using smartctl's scterc option.

Is 2 seconds too short? What is typical in a high load enterprise ZFS implementation?
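For reference, a sketch of reading and changing the SCT Error Recovery Control (TLER) timers with smartctl. The values are in tenths of a second, the device name is a placeholder, and 7.0 s is a commonly suggested value for redundant pools rather than a confirmed enterprise standard:

```shell
# Read the current SCT ERC read/write timers (tenths of a second, so 20 = 2.0 s):
smartctl -l scterc /dev/da5

# Set read and write recovery limits to 7.0 s each:
smartctl -l scterc,70,70 /dev/da5

# On many drives the setting is lost on power cycle, so reapply it at boot.
```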
 