RAID5 array "dumped" 2 drives?

PS-RagE

Last night, for no good reason that I could see, my server crashed hard. On reboot, it told me the RAID5 array had failed (Promise SX6000). I went into the array setup and it said two of the six drives were failed/disconnected. This sounded unlikely to me, so as a test I disconnected another one - just to see. Now I had three failed/missing drives, and even when I reconnected that third drive the array would not acknowledge it (doh).

In full-out WTF mode, I started reseating things, and then the machine would not POST at all with the SuperTrak in the bus. At this point, I figured either my PSU was crapping out or the SuperTrak had committed suicide.

So... this morning I dragged the server with me to work, and once again it would power up but give the failed-array message. Following my first hunch, I went out and bought a huge-assed 680W PSU but still got the same error. Fully confuserfied, I called Promise tech support. Apparently I was an idiot for disconnecting the third drive as a test <shrug>, and my only chance of recovering the data was to delete the array and then recreate/define it - he gave that a 50/50 chance of working.

Anyhow, it worked and my 800+ GB of data is intact <phew>, so my question is: what do you think might have caused the controller to dump two drives in the first place? Could it have been the PSU? Maybe the array is getting too close to being full?
 
The array being full shouldn't have anything to do with disks dropping - as far as the controller is concerned, it can't even tell how full the drives are.

As for what it might be, I'd tend to point the finger at Promise; they have some pretty shaky drivers for the older stuff, and the SX6000 falls into that category. You should really have a backup for that array. As you've no doubt learned, RAID is not a backup.

Nice lookin' box, btw!

 
This is definitely a Promise issue. We used to put SX6000 cards in all the whitebox servers we built, and they all ended up being replaced by SCSI because of issues like this. Either drives would randomly fall off the array and show as free in the BIOS, or the servers would just act flaky and crash frequently. As soon as we replaced the Promise card with a SCSI RAID 1 or 5 array, all the issues went away.

I actually just had a client's server do exactly what you mentioned last week, and it had happened with that server about six months ago as well. The server was running slowly, so they rebooted it, and then it gave them the "array not functional" error at boot. Two of the three drives were showing as free in the card's BIOS. I did the same thing you did: I just deleted and recreated the array without initializing it.

Unfortunately, this is our last client running that POS card, and they can't afford to replace it right now. If you can afford it, I would highly recommend getting a SCSI setup, or at least make damn sure you're getting solid backups, because those Promise cards can blow up at any minute.
 
Lemme guess Rage, they are your WD2000JBs. If so, you may have been a victim of the WD error recovery bug.
 
DougLite said:
Lemme guess Rage, they are your WD2000JBs. If so, you may have been a victim of the WD error recovery bug.
Is this why they introduced the RE line of disks?

Don't really have anything else to add to this thread, sorry for the small hijack.
 
^ Well, that in addition to the tougher testing used to justify marketing them as nearline drives, yes.

WD desktop drives being dropped by RAID BIOSes is a well-documented, though difficult-to-reproduce, problem. When a desktop Caviar detects an error, it performs exhaustive procedures to recover from and correct the error. During this time, the drive does not respond to commands from the RAID BIOS, which may cause the RAID BIOS to time out the drive and presume it has failed.

RAID Edition Caviars limit their error recovery time to eight seconds (hence Time Limited Error Recovery, or TLER). After that time expires, the drive returns to normal operation and relies on the RAID controller to correct the error. This is also why RAID Edition Caviars should not be used in regular desktops - with no RAID controller behind it to finish the job, a drive that gives up after eight seconds, success or fail, may let corruption propagate to the physical disk media.

You can build a RAID array using desktop Caviars, and as long as all errors are detected and corrected within the RAID BIOS's timeout period, you will never notice it. If, however, a drive encounters a particularly thorny error that takes a long time to correct (or that it can't correct), the RAID BIOS leaves the drive for dead and panics because it can't communicate with it. Since it is hard to induce errors like this, the bug is extremely difficult to reproduce, and many people think I'm nuts for saying so - but those people have simply never had an error that took longer than the RAID BIOS timeout.
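For anyone who wants the failure mode spelled out, here's a toy Python sketch of that timeout race - the numbers are made up for illustration, not taken from any drive or controller spec:

```python
# Toy model of the timeout race described above: the controller waits only so long
# for a drive to answer before declaring it failed. All numbers are illustrative;
# real controller timeouts and drive recovery times vary.

CONTROLLER_TIMEOUT_S = 10   # hypothetical RAID BIOS patience before dropping a drive
DESKTOP_RECOVERY_S = 60     # desktop Caviar: open-ended, exhaustive error recovery
TLER_RECOVERY_S = 8         # RAID Edition Caviar: gives up after ~8 s (TLER)

def stays_in_array(recovery_time_s, controller_timeout_s):
    """The drive survives only if it answers before the controller gives up on it."""
    return recovery_time_s <= controller_timeout_s

for label, recovery in [("desktop Caviar", DESKTOP_RECOVERY_S),
                        ("RE Caviar (TLER)", TLER_RECOVERY_S)]:
    verdict = "stays in array" if stays_in_array(recovery, CONTROLLER_TIMEOUT_S) else "dropped as failed"
    print(f"{label}: {recovery}s recovery vs {CONTROLLER_TIMEOUT_S}s timeout -> {verdict}")
```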
 
We've had problems on the Promise SX6000 cards with Maxtor, WD, IBM, and Seagate drives. The server I mentioned in my last post had three 60GB Seagate drives in it, and two fell off the array.
 
DougLite said:
^ Well, that in addition to the tougher testing used to justify marketing them as nearline drives, yes.

WD desktop drives being dropped by RAID BIOSes is a well-documented, though difficult-to-reproduce, problem. [...]
That is a somewhat scary post.

We have five SATA disks, all 250GB WDs - two REs and three normal BBs - all on a HighPoint RR1820A controller. It's not bad and we haven't encountered any errors, but this is why I asked what I did.

And for the record, I do recall you mentioning TLER in a post a few months ago. Very informative that time, and this time as well.
 
DougLite said:
Lemme guess Rage, they are your WD2000JBs. If so, you may have been a victim of the WD error recovery bug.


Yep, it was my WD2000JBs.

Looks like I am going to be rebuilding the entire server. I'm now eyeing a 3ware 9000-series controller and some WD4000YRs. My supplier is going to loan me a terabyte of external storage drives so I can back up my data.
 
DougLite said:
^ Well, that in addition to the tougher testing used to justify marketing them as nearline drives, yes.

WD desktop drives being dropped by RAID BIOSes is a well-documented, though difficult-to-reproduce, problem. [...]

Sorry to resurrect a thread that's a bit dusty, but I just read this and it made me curious about my own storage array, which consists of WD 320GB JB drives. Thing is, I don't RAID them, because I like being able to swap drives quickly and easily out of the server, and it just doesn't seem like RAID's advantages are big enough to counter that.

But still, since there's obviously a lot of data stored there, and the drives are so big that if just one failed I'd lose a lot of data, I've been nervous that I'm running JB instead of RE drives, because I have the impression that JB drives aren't as reliable as RE drives (and are more prone to catastrophic failure).

Is the main benefit of the RE drives TLER? If I'm running a storage server as I am now (without RAID), does it still make sense to use REs just for the increased robustness of an individual drive, or am I worried about nothing?
 
Regardless of warranty, design, or manufacturing quality, any hard disk is vulnerable to failure at any time. If you are not backing up, you are asking to lose data, whether you're running brand-new Seagate Cheetahs or three-year-old Deskstar 60GXPs.

Yes, RAID Edition Caviars are less vulnerable to software/firmware glitches that can break RAID arrays, and yes, they (allegedly) enjoy tighter tolerances and higher-quality components during assembly, but even they cannot be trusted to safeguard data by themselves. External drives, DVDs, and tape are all viable backup methods, ideally with offsite storage of the media.
 
Hey Doug,

Also curious: if the error recovery bug is the likely culprit for the drops, why did two drives drop instead of just one? If just one drive had been dropped, since it's RAID 5, wouldn't the array have been able to rebuild itself from the remaining drives and carry on as normal?
 
We're a small department upgrading to WD's RE disks - does anyone know if Linux software RAID (md) supports the TLER feature of these drives?
 
John G. said:
Hey Doug,

Also curious: if the error recovery bug is the likely culprit for the drops, why did two drives drop instead of just one? If just one drive had been dropped, since it's RAID 5, wouldn't the array have been able to rebuild itself from the remaining drives and carry on as normal?
A common problem with RAID - bad blocks build up unnoticed, and then while the array is trying to rebuild, another drive dies. RAID is not a backup.
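To spell out why the second drop is fatal in RAID 5: the single parity block lets the controller reconstruct exactly one missing member of a stripe, so two simultaneous failures leave one unknown too many. A minimal Python sketch of the arithmetic (toy byte strings, not any controller's real on-disk layout):

```python
# Toy illustration of why RAID 5 survives one missing drive but not two.
# Parity is a byte-wise XOR across the data "drives" in a stripe.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data drives
parity = xor_blocks(data)            # what gets written to the parity drive

# Lose one drive: XOR of the survivors plus parity rebuilds it exactly.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Lose two drives: the single parity equation can't solve for two unknowns.
# xor_blocks([data[0], parity]) yields data[1] ^ data[2], not either drive's contents.
print("single-drive rebuild OK:", rebuilt == data[1])
```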
 
Tilde said:
We're a small department upgrading to WD's RE disks - does anyone know if Linux software RAID (md) supports the TLER feature of these drives?
I don't believe it does directly, but when a TLER situation happens it will still help. Suppose a read is in flight: the stalled disk goes into error recovery, and while md is waiting, the other disks complete the read and it finishes. On a write, the write is delayed for 8 seconds instead of 30+, and then everything moves along normally.

However, I haven't tried it, and I don't know how quickly Linux software RAID considers a disk dead. One could look through the code, but the fastest way to find out is to ask LKML. Someone asked to no avail in Nov 2004, but perhaps someone can answer the question better now.
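For what it's worth, on a reasonably modern Linux box you can inspect the relevant knobs yourself: the drive's SCT Error Recovery Control (the TLER setting) via smartmontools, and the kernel's per-device command timeout in sysfs. A rough Python sketch, run as root, assuming smartctl is installed and the drive actually supports SCT ERC (the desktop JB-class drives discussed above generally won't):

```python
# Rough sketch: report each drive's kernel command timeout alongside its SCT
# Error Recovery Control (the TLER setting), so the OS doesn't give up on a drive
# before the drive gives up on a bad sector. Assumes a reasonably modern
# smartmontools; drives without SCT ERC will simply report it as unsupported.
import glob
import subprocess

for dev in sorted(glob.glob("/sys/block/sd*")):
    name = dev.rsplit("/", 1)[-1]
    with open(f"{dev}/device/timeout") as f:
        kernel_timeout_s = int(f.read().strip())   # Linux SCSI layer default is 30 s
    erc = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{name}"],
                         capture_output=True, text=True)
    print(f"/dev/{name}: kernel timeout {kernel_timeout_s}s")
    print(erc.stdout)

# To cap recovery at 7 seconds on a drive that supports it (values are in 0.1 s units):
#   smartctl -l scterc,70,70 /dev/sdX
```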

 