Stress Test New Hard Drives - Suggestions?

raiderj

I've picked up a bunch of new hard drives for my file server and I'm looking for suggestions on how to do a full stress test on them before putting them into my server. I'm not confined by any time constraints.
 
I'd completely fill the drive with data and copy it all back off.
Do it multiple times if you want to be sure.
If it goes ok, there are no immediate problems.
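
For what it's worth, a rough sketch of that fill-and-verify idea on a mounted drive (the mount point /mnt/newdrive is just a placeholder, and the last partial chunk is discarded before verifying):

Code:
dd if=/dev/urandom of=/tmp/chunk bs=1M count=1024          # 1 GiB of random test data
i=0
while cp /tmp/chunk /mnt/newdrive/chunk_$i 2>/dev/null; do i=$((i+1)); done
rm -f /mnt/newdrive/chunk_$i                               # drop the last, possibly partial, copy
for f in /mnt/newdrive/chunk_*; do
    cmp /tmp/chunk "$f" || echo "mismatch: $f"
done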

Listen for odd noises or loud whining.

Check the temperature of the whole PCB with your finger after a drive has been running for say an hour.
It should be mildly warm, nowhere near hot.
If it's near hot when running in open air, it can suffer more rapid heat damage and may one day go into thermal overrun when placed in a warmer environment, even worse on a hot day.
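
If you want a number to go with the finger test, the drive's own sensor is easy to read (a sketch assuming smartmontools is installed; /dev/sdX is a placeholder):

Code:
# report the current drive temperature from SMART (attribute 194 / 190 on most drives)
smartctl -A /dev/sdX | grep -i temperature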
 
I do a 4 pass badblocks read / write test on every single disk I receive at home or at work. At the moment I actually have 5 drives testing here at work.

This one is on the 4th pass so almost done:

Code:
datastore3 ~ # badblocks -wsv /dev/sdc
Checking for bad blocks in read-write mode
From block 0 to 1953514583
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing:  90.25% done, 29:05:59 elapsed. (0/0/0 errors)

Before, during and after the test I look at the SMART:

Code:
datastore3 shell-scripts # smartctl --all /dev/sdc
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.12.4-gentoo-datastore3-build1] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000NM0033-9ZM175
Serial Number:    Z1X1F9DB
LU WWN Device Id: 5 000c50 065889fad
Firmware Version: SN03
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Feb  4 18:33:31 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   89) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 238) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   065   044    Pre-fail  Always       -       104631756
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       11678163
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       29
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   065   045    Old_age   Always       -       33 (Min/Max 23/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   035   030   000    Old_age   Always       -       104631756
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Looks fine. I am going to put this into a 24TB zfs array in an hour..
 
25C and above, up to ~50C, is where you want most hard drives to be. I understand the point of making sure it works, but stress testing mechanical drives doesn't amount to much really; they can last longer than the warranty, or die within a month.
 
but stress testing mechanical drives doesn't amount to much really

For me this helps weed out the DOA drives. And I do receive drives that fail the test: new ones, ones back from an RMA, and even ones I repurposed from a different system.
 
I do the same thing as drescherjm. Well, almost: I only do 2 passes with badblocks, not 4.

It is quite a useful test and has occasionally found some borderline HDDs which I returned. No reason to bother putting iffy drives into production when I can just get another drive that passes the acceptance test with no issues.
 
Wow, thanks for all the great information! I have 8x4TB drives that I'll be testing.

drescherjm - Do you have a shell script that you run to do these tests? I see the smartmontools and badblocks tools that you utilize. Seems like it wouldn't be terribly difficult to build a script that you could kick off and then spit out a log file at the end.
 
Since you asked for a good stress test regimen, here is the actual workflow we follow for new drive tests. We generally order drives a dozen at a time from multiple vendors, and in the case of 11+1 arrays we take 2 drives each from 6 separate orders, so as to minimize the chance of linear production-line failures. When we receive new drives, we follow these steps. First, we have a box set up just for drive verification testing (a Frankensteined older Supermicro with 8 LFF and 8 SFF slots, as well as 12 3TB SAS drives (11 + hot spare in RAID 6) which live in it full time). We run DBAN in DoD 5220.22-M mode, which does 7 complete passes. (We used to use Gutmann (35 passes), but found we had better success breaking it up between cold starts.) Once the 7 passes complete, the box pauses for anywhere between 1 and 2 hours, shuts down cold and then starts back up (network-connected PDU), and we start the 7-pass regime again.
The drives will continue in this pattern for 3 days: start, DBAN, shutdown, lather, rinse, repeat.
Once the drives have survived this initial shakeout, they are placed into a test array (generally 11 drives in the array and 1 spare). Data is then copied from a test array attached to an Areca 1882/HP Expander (the previously mentioned ~27TB array) to the new array. Data is laid out on the array as follows:

22TB - BluRay Rips (~25-50GB Each)
1TB - 100MB Files
3TB - 1GB Files
.5TB - 1MB Files
Misc - 10,000 Empty Folders

Once the copy has completed, we pull one drive out of the array and let it rebuild. If all drives pass this, they are OK'd to be moved into production.
 
Wow, thanks for all the great information! I have 8x4TB drives that I'll be testing.

drescherjm - Do you have a shell script that you run to do these tests? I see the smartmontools and badblocks tools that you utilize. Seems like it wouldn't be terribly difficult to build a script that you could kick off and then spit out a log file at the end.

I do not have a shell script for this. I use screen inside of ssh (usually PuTTY from my Windows development box at work) to make multiple shell sessions and run 1 instance of badblocks for each drive directly.

Also, while the test is running I look at iotop to see the current transfer rate of each drive. Here is the output of that while testing 3 of the 5 2TB Constellation ES.3 drives that I added yesterday. This must have been 20+ minutes into the test, since the drives started out around 185 MB/s or so.

Code:
6296 be/4 root      161.32 M/s    0.00 B/s  0.00 % 98.01 % badblocks -wsv /dev/sdn
26310 be/4 root      160.11 M/s    0.00 B/s  0.00 % 97.81 % badblocks -wsv /dev/sdp
26304 be/4 root      166.90 M/s    0.00 B/s  0.00 % 97.71 % badblocks -wsv /dev/sdo
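
A rough sketch of that screen-per-drive approach, for anyone who wants to copy it (the drive names are placeholders, and badblocks -w destroys everything on the target):

Code:
# start one detached screen session per drive, each running its own destructive badblocks pass
for d in sdn sdo sdp; do
    screen -dmS "bb_$d" badblocks -wsv "/dev/$d"
done
screen -ls          # list the running sessions
screen -r bb_sdn    # attach to one to watch its progress (Ctrl-A D to detach again)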
 
I've picked up a bunch of new hard drives for my file server and I'm looking for suggestions on how to do a full stress test on them before putting them into my server. I'm not confined by any time constraints.

Since there are no time constraints, run tests until the drive fails and then go back in time.

Failures are so infrequent that it is reasonable to believe that recovering from a failure is more cost effective than doing any testing.
 
In the past I have tried all the possible combinations of tests under FreeBSD and/or Linux, but now I use Spinrite from Steve Gibson:

https://www.grc.com/sr/spinrite.htm

It catches all errors (and repairs them too, no matter what filesystem is on the disk, even encrypted ones).
 
Since there are no time constraints, run tests until the drive fails and then go back in time.

Failures are so infrequent that it is reasonable to believe that recovering from a failure is more cost effective than doing any testing.

That would be the most effective way to test a drive's lifespan :)

I'm building a RAIDz2 array with the disks, so when a drive crashes at some point in the future I'll hopefully have enough time to complete a snapshot and drive replacement before the array completely fails.

Excellent info in this thread for some initial testing!
 
In the past I have tried all the possible combinations of tests under FreeBSD and/or Linux, but now I use Spinrite from Steve Gibson:

https://www.grc.com/sr/spinrite.htm

It catches all errors (and repairs them too, no matter what filesystem is on the disk, even encrypted ones).

Looks like some of the other options described in this thread would be better since they are free. Is there a reason why you're using Spinrite instead of those? Maybe usability, since you can just boot off a disc and go?
 
I'm about to buy a Synology NAS and the HDDs should be formatted and checked by the NAS. Do you think it would be worth testing with badblocks before installing? I'm caught between trying to be safe and trying to avoid spending the days it will take to test a 4TB HDD :D.
 
I'm about to buy a Synology NAS and the HDDs should be formatted and checked by the NAS. Do you think it would be worth testing with badblocks before installing? I'm caught between trying to be safe and trying to avoid spending the days it will take to test a 4TB HDD :D.

I think it would be prudent. If you test your drives and they're ok, then you've lost some usable time. However, if a drive is bad, then you've potentially saved yourself a lot of time and headache by not having to fix a drive later and deal with potential data issues.

I'm running ZFS, so it should help identify errors before they become catastrophic. Still, I'd rather take the time to do the testing before spending even more time moving all my data over to the new array.
 
BTW, I got my ZFS array up..

Code:
datastore4 ~ # zpool list
NAME         SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
bad          448G   441G  7.07G    98%  1.00x  ONLINE  -
zfs_data_0  25.2T  3.35T  21.9T    13%  1.00x  ONLINE  -

  pool: zfs_data_0
 state: ONLINE
  scan: scrub in progress since Wed Feb  5 08:02:19 2014
    1.53T scanned out of 3.35T at 328M/s, 1h36m to go
    0 repaired, 45.69% done
config:

        NAME             STATE     READ WRITE CKSUM
        zfs_data_0       ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            a0_d0-part3  ONLINE       0     0     0
            a0_d1-part3  ONLINE       0     0     0
            a0_d2-part3  ONLINE       0     0     0
            a0_d3-part3  ONLINE       0     0     0
            a0_d4-part3  ONLINE       0     0     0
            a0_d5-part3  ONLINE       0     0     0
            a0_d6-part3  ONLINE       0     0     0
          raidz2-1       ONLINE       0     0     0
            a1_d0-part3  ONLINE       0     0     0
            a1_d1-part3  ONLINE       0     0     0
            a1_d2-part3  ONLINE       0     0     0
            a1_d3-part3  ONLINE       0     0     0
            a1_d4-part3  ONLINE       0     0     0
            a1_d5-part3  ONLINE       0     0     0
            a1_d6-part3  ONLINE       0     0     0

errors: No known data errors

bad is a raidz2 test pool built from a single physical disk (one that is about to die) split into 12 or so partitions, each used as part of the vdev. I use this to test ZFS fixing corrupt data.
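
For anyone who wants a similar sandbox without sacrificing a real disk, a throwaway raidz2 pool can be built from sparse files instead (just a sketch; the pool name, paths and sizes are arbitrary):

Code:
# build a 12-"disk" raidz2 pool out of 1 GiB sparse files, then deliberately corrupt one member
mkdir -p /var/tmp/zfstest
for i in $(seq 0 11); do truncate -s 1G /var/tmp/zfstest/d$i; done
zpool create testbad raidz2 /var/tmp/zfstest/d{0..11}
dd if=/dev/urandom of=/var/tmp/zfstest/d3 bs=1M count=64 seek=100 conv=notrunc
zpool scrub testbad      # once the scrub finishes, zpool status should show the damage found and repaired
zpool status testbad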
 
drescherjm - Any particular reason you do a 7 drive RAIDz2? Just due to how your drives are arranged?
 
Just due to how your drives are arranged?

Due to the budget and other factors. I ordered 10 x 2TB enterprise SATA drives at $175 US each and already had 6 2TB spares; I used 4 of those spares for this project. I was considering a single raidz3 vdev, but then it would have been more difficult to move the data off the server and back (I needed to minimize the downtime). Before the upgrade the server had 10 1TB drives in btrfs on top of Linux software raid6, plus 8 500GB drives in software raid6. The downtime here was less than 30 minutes (mostly to rsync the root filesystem when it was not live) because I hot-removed the array of 500GB drives, added the first raidz2 array, and used rsync to move the data, all while the system was up.
 
Looks like some of the other options described in this thread would be better since they are free. Is there a reason why you're using Spinrite instead of those? Maybe usability, since you can just boot off a disc and go?

My experience: I've sent a lot of "broken" disks to Apple, IBM or HP (under warranty); they told me the HDs were 100% "gone" and data not recoverable. I had a bunch of them rusting in my closet when a friend of mine brought in a CD with Spinrite.

That tiny piece of software was able – after 15 days of working (24/7) – to recover the unrecoverable :eek:

After that experience this is the only paid software in my toolbox. It was able to find defective HDs (badly shipped by Amazon) that other testing methods missed, saving me time and frustration.
 
My experience: I've sent a lot of "broken" disks to Apple, IBM or HP (under warranty); they told me the HDs were 100% "gone" and data not recoverable. I had a bunch of them rusting in my closet when a friend of mine brought in a CD with Spinrite.

That tiny piece of software was able – after 15 days of working (24/7) – to recover the unrecoverable :eek:

After that experience this is the only paid software in my toolbox. It was able to find defective HDs (badly shipped by Amazon) that other testing methods missed, saving me time and frustration.

Spinrite and HDD Regenerator are great for recovering hard drives with problems.
But they are not tools for identifying faulty drives; they fix problems on drives you already know are faulty.
Tools like this aren't relevant to the topic.
 
Spinrite and HDD Regenerator are great for recovering hard drives with problems.
But they are not tools for identifying faulty drives; they fix problems on drives you already know are faulty.
Tools like this aren't relevant to the topic.

Steve Gibson says that this tool is great to test new hard drives, and I mainly use it for this purpose.

A level 2 test is able to discover surface defects or mechanical problems caused by bad shipping.

I'm using it now on 6 Seagate VN 3000 drives to test them before putting them into a FreeNAS box.

Spinrite correctly identified some problems with two brand new Seagate ST2000 drives that, after months, turned out to be faulty. No other tool was able to do that, so for now I trust this software as my main test suite for hard drives.
 
Yep, Spinrite running level 4 will read and write the entire drive, then invert the data and do it again, and it shows the SMART data so you know if a drive is getting too many errors or issues.
 
Oops, my apologies, it's been an age since I used Spinrite, as I have HDD Regenerator.
 
Spinrite and HDD Regenerator are great for recovering hard drives with problems.
But they are not tools for identifying faulty drives; they fix problems on drives you already know are faulty.
Tools like this aren't relevant to the topic.

They are relevant. The presence of this type of tool removes the need to run hours of tests on new drives.

---

I like the guy who has a job and stress tests drives for 3 days before putting them into production. (I appreciate his test scheme.)

He left out 2 important pieces of information.

1) What percentage of drives fail during the test phase.

2) What percentage of drives that pass the stress test survive production use.

There are statistics that indicate that 5% of drives fail in the first year. So I would expect that his total one year failure rate is 5%. (If it is higher, perhaps his testing is simply destroying drives that would have a longer life span.)

So the question is does testing find enough bad drives to reduce the 5% first year failure rate of drives that pass testing?

---

I don't plan on my drives failing so I don't test them. I don't plan on my drives lasting so I keep backups.
 
I do not have good enough tracking of this. And my sample size is way too small with only 200 drives.
 
I like the guy who has a job and stress tests drives for 3 days before putting them into production. (I appreciate his test scheme.)

He left out 2 important pieces of information.

1) What percentage of drives fail during the test phase.

2) What percentage of drives that pass the stress test survive production use.

We install about 1000 drives a year. According to my RMA list, 4.5% of the drives fail our certification and are returned to the initial shipper as SIDS fatalities (and we get new, non-reconditioned, replacements). Of the drives that make it into production, we see about a 2% failure rate. One note on the post-production failures: they tend to die in quick, fatal drops rather than start-to-die-and-hang-around problems.
 
So, what I surmise from your 1,000-drive data point is that, for the drives that pass, your 2% failure rate is very likely lower than what it would be if you did no initial testing, since you see a 4.5% failure rate at the certification stage.

Right now I'm running "badblocks" on an older WD hard drive I have that currently passes a "smartctl" assessment. Also running it on a cheapo 4GB thumbdrive for fun.
 
Steve Gibson says that this tool is great to test new hard drives, and I mainly use it for this purpose.

A level 2 test is able to discover surface defects or mechanical problems caused by bad shipping.

I'm using it now on 6 Seagate VN 3000 drives to test them before putting them into a FreeNAS box.

Spinrite correctly identified some problems with two brand new Seagate ST2000 drives that, after months, turned out to be faulty. No other tool was able to do that, so for now I trust this software as my main test suite for hard drives.

Have you run a "badblocks" test on a drive that Spinrite showed to have faults? I'd be curious to see how those tools compared.
 
In the past I have tried all the possible combinations of tests under FreeBSD and/or Linux, but now I use Spinrite from Steve Gibson:

https://www.grc.com/sr/spinrite.htm

It catches all errors (and repairs them too, no matter what filesystem is on the disk, even encrypted ones).

I'm entirely open to buying utilities even when there are free alternatives, e.g. for file/directory comparison, duplicate file detection, etc. But $89 for home use, that's too much, considering how frequently I would use this utility and the total number of hard drives in my LAN, about 12 including some spares and "scratch" drives.

I'm guessing here that for true home use, Gibson would sell a lot more copies at say $30. Steve: are you reading this?
 
Have you run a "badblocks" test on a drive that Spinrite showed to have faults? I'd be curious to see how those tools compared.

I run Spinrite for one main reason: it's true "low level", so if you have encrypted disks (using TrueCrypt or FreeBSD GELI encryption, for example) you don't need software that "understands" the underlying filesystem.
 
I run Spinrite for one main reason: it's true "low level", so if you have encrypted disks (using TrueCrypt or FreeBSD GELI encryption, for example) you don't need software that "understands" the underlying filesystem.

"true 'low level'" :D

What do you think that means? Don't you realize that the lowest possible level to access a HDD is at the block level? Many programs are capable of reading and writing to HDDs at the block level, including badblocks.
 
Gibson would sell a lot more copies at say $30. Steve: are you reading this?

In addition to a large price reduction, a trial would be nice. I mean, I would like to compare it to other utilities myself.
 
I'm sure Spinrite has its uses, although for the price it doesn't seem worth it, at least when it just comes to testing a brand-new drive before adding it to a machine for active use. Data recovery is another issue: a $90 tool that can save important data is relatively cheap compared to other options.

So far I've had good luck with just using badblocks. Currently I'm running it on 4 3TB hard drives. If they all pass then I'll build a RAIDz1 out of them and copy my data over. I'm comfortable with a single run of badblocks on each drive. Seems prudent to do some type of testing beforehand.
 
Evidently you do not understand your own link.

So, let me explain this a bit more in depth. My apologies if I did not express myself correctly because, as you can read, English is not my mother language.

For HD testing, right before putting drives into production, I always do some "burn-in" and verification testing.

I start with a SMART conveyance test, then a SMART long / extended test.

Then I DD write and read the whole drive, right before an IOZONE test.
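
In command form, that sequence looks roughly like this (just a sketch; /dev/sdX and the mount point are placeholders, each self-test must finish before the next command, and the dd pass wipes the drive):

Code:
smartctl -t conveyance /dev/sdX                  # short transport-damage self-test
smartctl -t long /dev/sdX                        # extended self-test (takes hours); run after the conveyance test finishes
smartctl -l selftest /dev/sdX                    # check the self-test results afterwards
dd if=/dev/zero of=/dev/sdX bs=1M                # write the whole surface...
dd if=/dev/sdX of=/dev/null bs=1M                # ...and read it all back
iozone -a -g 4g -f /mnt/testdrive/iozone.tmp     # iozone in automatic mode, files up to 4 GB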

Sometimes these tests give me the answer I need: immediately RMA the drive.

Sometimes not, so I give Spinrite a chance.

A lot of times Spinrite tells me that the drive has problems (even when all the above tests just say it's fine).

After about 6 months, that drive (remember: the HD was perfectly fine according to all the above tests, but not for Spinrite) dies.

This is my experience.
 
Question - When Spinrite or whatever tools anyone uses show errors, how have you found the RMA process?

Do they acknowledge the output of these tools, or do you have issues with the vendor/manufacturer saying the drive is good when you know otherwise?
 
As a person who RMAs 10 to 20 disks each year, I know that both Seagate and WDC will take a drive if you just say something like "I know the drive is bad." Neither needs a code to do an RMA. You will not get back the drive you send in.
 
I start with a SMART conveyance test, then a SMART long / extended test.

Then I DD write and read the whole drive, right before an IOZONE test.

Sometimes these tests give me the answer I need: immediately RMA the drive.

Sometimes not, so I give Spinrite a chance.

A lot of times Spinrite tells me that the drive has problems (even when all the above tests just say it's fine).

After about 6 months, that drive (remember: the HD was perfectly fine according to all the above tests, but not for Spinrite) dies.

That is not because Spinrite is doing something magic. For testing HDDs, badblocks can do the same thing as Spinrite. There is no magic occurring. It is simply writing and reading the sectors.

If you replaced Spinrite with badblocks in your sequence, the result would be the same. It is just a matter of order of tests -- you have Spinrite last. If you put badblocks last, it would perform similarly.

The only thing that might be different is whether Spinrite checks the SMART parameters before and after the test and notifies you of problems based on changes. badblocks does not do that, but of course it is easy to do that yourself (i.e., use smartctl to read the SMART parms, run badblocks, then use smartctl to read the SMART parms afterwards).
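
For anyone who wanted the script version asked about earlier in the thread, a minimal sketch of exactly that sequence (the log path is arbitrary, and badblocks -w wipes the drive):

Code:
#!/bin/bash
# Usage: ./burnin.sh sdX  --  destructive badblocks pass bracketed by SMART snapshots
d="$1"
log="/var/log/burnin-${d}-$(date +%Y%m%d).log"
smartctl --all "/dev/$d" >> "$log"              # SMART before
badblocks -wsv "/dev/$d" >> "$log" 2>&1         # 4-pass destructive write/read test
smartctl --all "/dev/$d" >> "$log"              # SMART after
echo "Done. Compare the two SMART dumps in $log for new pending or reallocated sectors."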
 