When do YOU replace a drive?

Discussion in 'SSDs & Data Storage' started by Zarathustra[H], Aug 4, 2019.

  1. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000
    So, everyone seems to have a different take on when it is time to RMA a drive.

    This happened today in my pool.

    upload_2019-8-4_21-28-24.png

    I wondered whether the drive was really going bad or something else was going on, so I pulled it out of the backplane and reseated it, cleared the errors, resilvered the pool, and am now running a scrub to see if I can tease the errors out again.

    If the drive fails again, I'm going to RMA it, otherwise I'll consider it one of those random occurrences we keep redundancy for.
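
    For anyone following along at home, that reseat-and-retest cycle boils down to something like this (a sketch only; the pool and device names are hypothetical, substitute your own from zpool status):

```shell
# Hypothetical pool/device names -- substitute your own.
POOL=tank
DISK=da5

zpool clear "$POOL" "$DISK"    # reset the read/write/checksum error counters
zpool online "$POOL" "$DISK"   # bring the reseated disk back; kicks off a resilver
zpool scrub "$POOL"            # walk every block to try to tease the errors out again
zpool status -v "$POOL"        # watch progress and see if the counters climb back up
```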

    When do you guys typically decide to RMA yours?
     
  2. DrLobotomy

    DrLobotomy [H]ardness Supreme

    Messages:
    5,886
    Joined:
    May 19, 2016
    I've never had to RMA a drive. That said, I don't replace them until they throw up SMART errors/alerts, start clicking or grinding, or have a delayed spin-up. Then they are replaced, or just retired because a newer, larger drive was going into the system anyway.
     
  3. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000
    Hmm.

    I originally just ran smartctl -A to display the error summary statistics on this drive, which looked OK, but when I re-ran it with a lowercase -a for the full report, these errors popped up, which are new to me. I may wind up having to RMA it after all.

    Code:
    SMART Error Log Version: 1
    ATA Error Count: 2
       CR = Command Register [HEX]
       FR = Features Register [HEX]
       SC = Sector Count Register [HEX]
       SN = Sector Number Register [HEX]
       CL = Cylinder Low Register [HEX]
       CH = Cylinder High Register [HEX]
       DH = Device/Head Register [HEX]
       DC = Device Command Register [HEX]
       ER = Error register [HEX]
       ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 2 occurred at disk power-on lifetime: 15309 hours (637 days + 21 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 00 ff ff ff 4f 00  26d+10:02:03.371  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:02:03.371  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:02:01.773  READ FPDMA QUEUED
    
    Error 1 occurred at disk power-on lifetime: 15309 hours (637 days + 21 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 08 ff ff ff 4f 00  26d+10:02:00.275  WRITE FPDMA QUEUED
      61 00 08 ff ff ff 4f 00  26d+10:02:00.275  WRITE FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:02:00.275  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:01:58.594  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00  26d+10:01:58.533  READ FPDMA QUEUED

    The funny part is, it seems to be behaving very nicely right now.
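
    In case it saves someone else the same confusion: the capital and lowercase flags are different reports, and the error log can also be pulled on its own. A quick sketch (device name hypothetical):

```shell
smartctl -A /dev/da5        # vendor attribute table only -- this is the part that looked clean
smartctl -a /dev/da5        # full report, including the ATA error log shown above
smartctl -l error /dev/da5  # just the ATA error log, if that's all you're after
```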
     
  4. DrLobotomy

    DrLobotomy [H]ardness Supreme

    Messages:
    5,886
    Joined:
    May 19, 2016
    Can you run that same test on another system (I am questioning PSU issues), or at least with another power cable and SATA cable, to duplicate the results?
     
  5. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000
    The drive is in a caddy attached to the backplane of my Norco 4216, so there are no power or SATA cables in use. I could pull the drive and test it separately, but I am not sure what that would buy me; testing it in the system right now, it is not failing. I will know more when the scrub currently running completes.

    I am retiring this case/backplane/PSU in a couple of weeks, moving everything over to a Supermicro SC846, so if that is the source of the problem, it is a short-lived one.

    You are right that the error occurring at "disk power on" is suspicious. This system hasn't rebooted in over 3 months, so it does kind of sound like it lost power for a bit and restarted.

    I'm not sure the PSU is to blame, as none of the 11 other disks in the same backplane had the same issue. I'm leaning towards a poor connection to the backplane right now. Hopefully I solved it by reseating the drive.
     
  6. DrLobotomy

    DrLobotomy [H]ardness Supreme

    Messages:
    5,886
    Joined:
    May 19, 2016
    Yea, that sounds good. I assume its data is on another drive and backed up, so if it does act up again you can just pull it and send it off.
     
  7. Zepher

    Zepher [H]ipster Replacement

    Messages:
    16,790
    Joined:
    Sep 29, 2001
    I had a WD Black 1TB start making some strange grinding sound when reading and writing; the drive performed fine but sounded terrible.
    I emailed WD support and they suggested I RMA the drive, so I went with the Advance RMA so I could copy the data from the old drive to the replacement.
    A day or so later I got an email from WD Support saying they didn't have any 1TB Blacks in stock and asking if I would be fine with a 2TB Black instead.
    I read that and was like, HELL YA, I'LL TAKE A 2TB BLACK, lol. The 2TB Black was over $200 at the time.

    wd-support-upgrade-2tb.jpg
     
    daglesj and Zarathustra[H] like this.
  8. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000

    Yup. This pool is the ZFS equivalent of a RAID-60 (two double-parity vdevs striped together), so I can pop that drive out if I need to. The system will tell me, but technically I can lose 1-3 more drives (depending on which ones) before I have local data loss, and if I do have local data loss, I can always restore from my remote backup server.

    It would be a pain in the butt, but no data would be lost.
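
    For the curious, a pool like that is built by striping two double-parity raidz2 vdevs. Roughly (device names hypothetical; assuming two six-disk vdevs across the twelve bays mentioned above):

```shell
zpool create tank \
    raidz2 da0 da1 da2 da3  da4  da5 \
    raidz2 da6 da7 da8 da9  da10 da11
# Each raidz2 vdev survives two failed disks. With one disk already suspect,
# that's why the survivable count above is "1-3 more, depending on which ones".
```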
     
  9. kdh

    kdh Gawd

    Messages:
    737
    Joined:
    Mar 16, 2005
    Depends on the scenario. At work, I take no chances: at the slightest wiggle in SMART data, the drive gets yanked and swapped. That's the point of the 4-hour support we pay for; the vendor can figure out whether the drive was bad or not. At home? I am god awful. However, I have a backup process that backs up to 2 separate external drive enclosures on a regular basis. I'll use my drives until they murder themselves, and I replace my external backup drives every few years regardless of whether they are dying. I don't do anything fancy with my storage at home, as the total amount of data I care about is sub-1TB. I do enough storage at work, so I hate doing it at home.
     
    Exercate likes this.
  10. ryan_975

    ryan_975 [H]ardForum Junkie

    Messages:
    14,150
    Joined:
    Feb 6, 2006
    I didn't read that as happening at disk power-up, but as happening during the 15,309th hour of the disk's powered-on life. Even looking at the command stack trace, it had been in a powered-up state for 26+ days (the counter wraps at 49.7 days).

    Given that the two errors were less than two seconds apart on the same LBA (a write followed by a verify, it looks like), and as long as no more errors pop up (including when that same LBA is tested), I'd say it's a safe bet that it was just a fluke.
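
    (That 49.710-day wrap, by the way, falls straight out of the timestamp being a 32-bit millisecond counter. Quick sanity check:)

```shell
# 2^32 milliseconds expressed in days (86,400,000 ms per day)
awk 'BEGIN { printf "%.3f\n", 2^32 / 86400000 }'
# -> 49.710
```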
     
  11. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000
    Thanks for that input.

    I have no experience reading these error logs, so I am learning how to interpret them.

    My previous drives just had read errors, but never flagged detailed error logs like this in SMART before.

    Maybe that's an Enterprise feature? These are the first enterprise drives I've ever owned.
     
  12. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    14,424
    Joined:
    Nov 19, 2008
    This has been in SMART for a long time, even in regular desktop drives.

    Here is a report from a Hitachi Deskstar with 68 thousand power-on hours. The recorded problem happened at 24 thousand power-on hours.

    Code:
    fileserver1 Administrator # smartctl --all /dev/sde
    smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.5-gentoo-20181129-1450-fileserver1] (local build)
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Model Family:     Hitachi Deskstar 7K3000
    Device Model:     Hitachi HDS723020BLA642
    Serial Number:    MN1220F31NWX7D
    LU WWN Device Id: 5 000cca 369d797b3
    Firmware Version: MN6OA5C0
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Mon Aug  5 13:57:08 2019 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (18523) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 309) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86
      3 Spin_Up_Time            0x0007   161   161   024    Pre-fail  Always       -       431 (Average 283)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       91
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   138   138   020    Pre-fail  Offline      -       25
      9 Power_On_Hours          0x0012   091   091   000    Old_age   Always       -       68676
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       116
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       116
    194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 18/42)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
    SMART Error Log Version: 1
    ATA Error Count: 1
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    Error 1 occurred at disk power-on lifetime: 24483 hours (1020 days + 3 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 10 f0 88 0c 02  Error: UNC at LBA = 0x020c88f0 = 34375920
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 80 00 00 97 0c 40 00   5d+18:01:50.367  READ FPDMA QUEUED
      60 80 b0 80 96 0c 40 00   5d+18:01:50.366  READ FPDMA QUEUED
      60 80 a8 00 96 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
      60 80 a0 80 95 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
      60 80 98 00 95 0c 40 00   5d+18:01:50.365  READ FPDMA QUEUED
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
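
    Worth noting from that last section: no self-tests have ever been logged on this drive. Kicking one off is cheap (device name as in the report above):

```shell
smartctl -t long /dev/sde      # start the extended (full-surface) self-test in the background
smartctl -l selftest /dev/sde  # ~309 minutes later (per the polling time above), read the result
```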
    
     
    Last edited: Aug 5, 2019
  13. vegeta535

    vegeta535 2[H]4U

    Messages:
    3,051
    Joined:
    Jul 19, 2013
    The only drives I ever had to RMA were the old 150GB WD Raptors I ran in RAID. I RMA'd them 2 times each before the warranty was up. Besides that, I never had to RMA a drive within a warranty period. I just tossed them once I started getting errors.
     
  14. DrLobotomy

    DrLobotomy [H]ardness Supreme

    Messages:
    5,886
    Joined:
    May 19, 2016
  15. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    14,424
    Joined:
    Nov 19, 2008
    At work I did over 75 RMAs. However, since all of our Seagate 7200.X drives are out of service, the RMAs have dropped to 1 or 2 a year instead of a few each month or so.
     
  16. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,248
    Joined:
    Oct 29, 2000
    Huh. Prior to my Seagates, I used WD Reds and never saw anything of the sort in them, despite having to RMA a couple and having an additional two die of old age.
     
  17. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    14,424
    Joined:
    Nov 19, 2008
    I think older Seagates had this as well. I don't have many WDC Red drives. At work I purchase only 7200 RPM drives for RAID arrays.
     
  18. arri650

    arri650 n00b

    Messages:
    3
    Joined:
    Nov 27, 2017
    I run a few small RAIDs at home and in the office. I usually "retire" drives after about 3-4 years, usually to up the storage density of the array. I've been lucky, as I've only had three failures in the past decade or so. If I'm getting bad-sector errors, that's when I replace the drive, even if it's still dishing out data. I only use WD and Hitachi NAS drives and Reds, and as of two years ago only WD Golds.
     
  19. TheSlySyl

    TheSlySyl [H]Lite

    Messages:
    90
    Joined:
    May 30, 2018
    I haven't actually had a drive die on me in years, which is crazy because the 3TB drive I just removed has been active basically 24/7 since 2010.

    Simply replacing drives due to lack of space has been the main reason why (such as replacing that previously mentioned 3TB drive with a shucked 10TB).