Stress Testing Disks for New Array

BecauseScience

I have a bunch of WD Reds on the way. They're destined for a new raid6 array. I'd like to know what the resident storage heavy hitters [looking at you mwroobel! :D] do to stress test disks when building new arrays. iometer? bonnie? badblocks? Do you stress test the disks individually before creating the array, after, both? What's your testing protocol?
 
Never heard of stress testing HDDs before.

A certain number of new disks are destined to die very soon after being put into use. The point of the stress test is to locate these disks before you put your array into production.

The Reds seem to have a high rate of infant mortality, according to reports in the Reds thread. I don't want to put my array into production and then have 2-3 disks go belly up shortly after.

I've read most of the references on the web and in the bigger forums. I'm interested specifically in what the [H] storage guys do. You know, the ones who post here with loaded Norcos. :D There isn't much in the way of older posts, just a few threads recommending running badblocks or bonnie for some indeterminate length of time.
 

Wow, called out by name in the post :D Well, here is the actual workflow we follow for new drives. We generally order drives a dozen at a time from multiple vendors, and in the case of 11+1 arrays we take 2 drives each from 6 separate orders, to minimize the chance of linear production-line failures. When we receive new drives, the process goes like this.

First, we have a box set up just for drive verification testing (a Frankensteined older Supermicro with 8 LFF and 8 SFF slots, as well as 12 3TB SAS drives (11 + hot spare in RAID6) which live in it full-time). We run DBAN in DoD 5220.22-M mode, which does 7 complete passes. (We used to use Gutmann (35 passes), but found we had better success breaking it up between cold starts.) Once the 7 passes complete, the box shuts down for anywhere between 1 and 2 hours (depending on how fast people get back to the box after the DVT email), and then we start the 7-pass regime again.
The drives will continue in this pattern for 3 days.. Start.. DBAN.. Shutdown.. Lather.. Rinse.. Repeat..
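
For anyone wanting to approximate that write-every-sector regime on a single drive without booting DBAN, here is a rough Python sketch of the idea. It is an approximation only, not the actual DBAN/DoD 5220.22-M setup described above: the device path, pass counts, and cool-down are placeholders, the pattern set is simplified, and a sleep stands in for the real cold shutdown. It destroys everything on the target drive.

Code:
#!/usr/bin/env python3
"""Rough single-drive approximation of a repeated full-surface write/verify
burn-in. DESTRUCTIVE: wipes the target drive. Device path, pass counts, and
cool-down are placeholders, not the DBAN/DoD settings from the post."""
import os
import time

DEVICE = "/dev/sdX"          # placeholder -- point at the drive under test, not your OS disk
PASSES = 7                   # full-surface write/verify passes per cycle
CYCLES = 3                   # repeat the whole regime this many times
COOLDOWN_SECONDS = 3600      # pause between cycles (a stand-in for the cold shutdown)
CHUNK = 4 * 1024 * 1024      # 4 MiB I/O size

def full_pass(device: str, pattern: bytes) -> None:
    """Write a one-byte pattern across the whole device, then read it back and verify."""
    buf = pattern * CHUNK
    written = 0
    with open(device, "r+b", buffering=0) as dev:
        while True:
            try:
                n = dev.write(buf)
            except OSError:          # ran off the end of the device
                break
            if not n:
                break
            written += n
            if n < len(buf):         # partial write at the end of the device
                break
        os.fsync(dev.fileno())
    with open(device, "rb", buffering=0) as dev:
        verified = 0
        while verified < written:
            data = dev.read(min(CHUNK, written - verified))
            if not data:
                break
            if data != buf[:len(data)]:
                raise RuntimeError(f"verify mismatch at byte {verified} on {device}")
            verified += len(data)

if __name__ == "__main__":
    patterns = [b"\x00", b"\xff", b"\xaa", b"\x55"]   # simple pattern set, not DoD 5220.22-M
    for cycle in range(CYCLES):
        for p in range(PASSES):
            full_pass(DEVICE, patterns[p % len(patterns)])
            print(f"cycle {cycle + 1}: pass {p + 1}/{PASSES} complete")
        print(f"cycle {cycle + 1} done, cooling down")
        time.sleep(COOLDOWN_SECONDS)  # a real cold start would power the box off here instead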
Once the drives have survived this initial shakeout, they are placed into a test array (generally 11 drives in the array and 1 spare). Data is then copied onto the new array from a source array attached to an Areca 1882/HP Expander (the previously mentioned ~27TB array). Data is laid out on the new array as follows:

22TB - BluRay Rips (~25-50GB Each)
1TB - 100MB Files
3TB - 1GB Files
.5TB - 1MB Files
Misc - 10,000 Empty Folders
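
If you don't have a multi-TB source array to copy from, a similar mix of file sizes can be generated directly on the new array. Here is a rough sketch with a hypothetical mount point and scaled-down counts; adjust the layout table to match the sizes above.

Code:
#!/usr/bin/env python3
"""Hypothetical stand-in for the fill step: instead of copying real data from
a source array, generate a comparable mix of file sizes on a test mount.
The target path and the scaled-down totals are placeholders."""
import os

TARGET = "/mnt/testarray"            # hypothetical mount point of the new array
GIB = 1024 ** 3
MIB = 1024 ** 2

# (subdirectory, file size in bytes, number of files) -- scale to taste
LAYOUT = [
    ("bluray",   40 * GIB, 8),       # stand-in for the large sequential rips
    ("med_100m", 100 * MIB, 100),
    ("big_1g",   1 * GIB, 30),
    ("small_1m", 1 * MIB, 5000),
]
EMPTY_FOLDERS = 10_000

def write_file(path: str, size: int, chunk: int = 8 * MIB) -> None:
    """Fill the file with pseudo-random data so nothing can cheat on all-zero blocks."""
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n

if __name__ == "__main__":
    for subdir, size, count in LAYOUT:
        base = os.path.join(TARGET, subdir)
        os.makedirs(base, exist_ok=True)
        for i in range(count):
            write_file(os.path.join(base, f"file_{i:05d}.bin"), size)
    for i in range(EMPTY_FOLDERS):
        os.makedirs(os.path.join(TARGET, "empty", f"dir_{i:05d}"), exist_ok=True)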

Once the data copy has completed, we pull one drive out of the array and let it rebuild. If all drives pass this, they are OK'd to be moved into production.
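
The pull-and-rebuild step above is done on an Areca controller, where the drive is physically pulled or failed from the controller's utility. For anyone running the same shakeout on a Linux md RAID6 instead, a rough mdadm-based equivalent could look like this (array and member device names are hypothetical):

Code:
#!/usr/bin/env python3
"""Rough mdadm equivalent of the pull-a-drive-and-rebuild check, for Linux
software RAID rather than an Areca controller. Device names are placeholders."""
import subprocess
import time

ARRAY = "/dev/md0"       # hypothetical md array
MEMBER = "/dev/sdb1"     # hypothetical member to fail and re-add

def mdstat() -> str:
    with open("/proc/mdstat") as f:
        return f.read()

if __name__ == "__main__":
    # Mark the member failed, remove it, then add it back to force a rebuild.
    subprocess.run(["mdadm", "--manage", ARRAY, "--fail", MEMBER], check=True)
    subprocess.run(["mdadm", "--manage", ARRAY, "--remove", MEMBER], check=True)
    subprocess.run(["mdadm", "--manage", ARRAY, "--add", MEMBER], check=True)

    time.sleep(5)  # give md a moment to start the recovery
    while True:
        progress = [ln.strip() for ln in mdstat().splitlines() if "recovery" in ln]
        if not progress:
            break
        print(progress[0])   # e.g. "[=>...]  recovery = 12.6% ..."
        time.sleep(60)
    print("rebuild complete")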
 
Thanks for taking the time to share! This is exactly what I was looking for. I'll feel good knowing my new array has undergone the mwroobel torture test. :D

I never would have thought to hit them with DBAN.
 
We used DBAN because it is a reliable write-every-sector-of-the-drive-over-and-over test. We stopped using the Gutmann scheme (35 full writes) because we found that there were drives that would pass it and still SIDS out within a few days. We found that doing a DoD run, doing a complete shutdown, letting them sit for an hour, and then doing it all again and again for a few days did a better job of weeding out bad drives. In fact, in the past year that we have been doing it this way, we haven't had a drive fail in the first 90 days after passing. Prior to that, we would still get a SIDS drive here and there (or more).
 

I just did a run of badblocks.. How does it compare to DBAN in terms of activity on the disk?
 