Random drives booted out of array

Guldan

I have a storage server with two RAID 60 volumes, and both are having random drives disappear, turn foreign, or fail completely. If I reboot, the disks pop back in and rebuild.

The failed drives will throw alerts "A block on the physical disk has been punctured by the controller: Physical Disk 0:2:12 Controller 0 Connector 0"

Two minutes before: "There was an unrecoverable disk media error during the rebuild or recovery operation: Physical Disk 0:2:2 Controller 0 Connector 0"

Both of my volumes on this controller are having these issues. I've already replaced the controller and one of the MD1000 arrays, and I'm still having problems. Any ideas? These WD Reds are fine on an identical server, so I'm not pointing my finger at the drives.


Setup:

Windows 2012 R2
Dell R410 Server
Perc H810 Controller
3x MD1000
45x 4TB WD Reds WD40EFRX
 
Out of warranty, "uncertified" drives and bought from another vendor.

We've tried the cables. There are also redundant paths, so both would have to be bad, no?
 
My vote is for the PSU. I've had similar issues and after spending hours swapping in spares it turned out to be a dodgy power supply.
 
Dell/LSI "puncture" errors are UREs (unrecoverable read errors) where the parity data cannot properly rebuild the stripe (same bad sectors on multiple drives). It is possible that you have some bad drives (or a bad power supply). Unfortunately, it is also possible that the array is (at this point) corrupt, with at least SOME unrecoverable data. Can you post the storage log from OMSA? Do you have a valid backup of the array?
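
To illustrate why a stripe ends up punctured, here is a minimal sketch of a parity rebuild. It is deliberately simplified to single XOR parity (RAID 5 style); RAID 60 carries two parity blocks per stripe and the PERC firmware's actual algorithm is not shown here. The function names and the use of `None` for an unreadable block are my own assumptions for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_stripe(data_blocks, parity):
    """Rebuild one stripe; unreadable blocks are marked as None (assumption).

    Returns (blocks, punctured). With single parity, one unreadable block
    can be recomputed; two or more cannot, so the stripe is punctured.
    """
    missing = [i for i, block in enumerate(data_blocks) if block is None]
    if not missing:
        return list(data_blocks), False
    if len(missing) > 1:
        # More unreadable blocks than the parity can cover: data is lost.
        return list(data_blocks), True
    survivors = [block for block in data_blocks if block is not None]
    rebuilt = list(data_blocks)
    rebuilt[missing[0]] = xor_blocks(survivors + [parity])
    return rebuilt, False

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# One unreadable block: recoverable from the survivors plus parity.
one_lost, punctured_one = rebuild_stripe([b"AAAA", None, b"CCCC"], parity)

# Two unreadable blocks in the same stripe: not recoverable with one parity.
two_lost, punctured_two = rebuild_stripe([None, None, b"CCCC"], parity)
```

The same idea carries over to dual parity: the stripe survives as long as the number of unreadable blocks does not exceed the number of parity blocks.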
 
Power supplies on the MD1000s? All three are kicking drives out; could one PSU cause all three to have issues? I don't have a full backup of the array, but there is nothing super critical on there. I would have to get the data off, and that's going to take at least a week.

The drives getting kicked out are completely random, almost never the same disk. Out of the 45 disks, I've seen probably half of them booted at one point.

As for the error logs, here are a few examples of errors:

http://pastebin.com/3bW2abrp
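
A quick way to check whether the failures really are random, or cluster on one enclosure, is to tally the alerts by disk ID. This is a rough sketch that assumes the alert text follows the format quoted above and that the three numbers in "Physical Disk 0:2:12" are connector, enclosure, and slot; `tally_alerts` is a hypothetical helper, not part of any Dell tool.

```python
import re
from collections import Counter

# Matches IDs such as "Physical Disk 0:2:12" in the alert text.
# Assumed meaning of the three fields: connector, enclosure, slot.
DISK_RE = re.compile(r"Physical Disk (\d+):(\d+):(\d+)")

def tally_alerts(lines):
    """Count alerts per disk and per enclosure from raw alert lines."""
    per_disk = Counter()
    per_enclosure = Counter()
    for line in lines:
        match = DISK_RE.search(line)
        if match:
            connector, enclosure, slot = map(int, match.groups())
            per_disk[(connector, enclosure, slot)] += 1
            per_enclosure[(connector, enclosure)] += 1
    return per_disk, per_enclosure

# The two alerts quoted earlier in this thread.
sample = [
    "A block on the physical disk has been punctured by the controller: "
    "Physical Disk 0:2:12 Controller 0 Connector 0",
    "There was an unrecoverable disk media error during the rebuild or "
    "recovery operation: Physical Disk 0:2:2 Controller 0 Connector 0",
]

per_disk, per_enclosure = tally_alerts(sample)
for enclosure, count in per_enclosure.most_common():
    print(enclosure, count)
```

If one enclosure dominates the counts, that points at the enclosure, its PSU, or its cabling rather than at individual drives.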
 
This sounds more like a motherboard problem in that case. When you replaced the controller, did you put the new one in the same slot the first one came out of? Can you bring these chassis up on another server and see if the problem resolves?
 
Yeah could try another PCI slot or an entirely different server, perhaps I'll do that next.

So I started a manual patrol read and left it alone (didn't clear any bad blocks)

I came in today and one virtual disk is lit up green with no errors, but it's doing a background init. Why would it do that a second time? The second virtual disk is offline with 4 disks failed and one totally gone from the span. When I reboot, those 4 disks pop back in and start rebuilding again.

Strange behavior indeed.
 
Latest firmware on the controllers? Sometimes firmware updates to SAS controllers can clear up issues with certain drives.
 
Yep, all latest, but get this: after the reboot I started transferring data, and about 10 minutes later I came back and OpenManage showed all the disks as "removed". Then the server blue screened with 0xc000021a.

Now, on reboot, I log in to the controller and all disks are labelled as foreign.
 