how to replace failed drive on my Dell server? (scsi raid 5)

pookguy88

Hi there,

I'm kind of a n00b at this and I'm doing it for a friend, so I don't want to screw up. He's got a server (Dell PowerEdge 1800) with a 4-drive (4x Seagate Cheetah 10k 73GB) SCSI RAID-5 setup running MS SBS Server 2003. The controller card in it, a PERC 4/DC, started beeping today, so I went in, looked at Dell OpenManage and found that one of the drives had failed. I'm pretty sure it's the bottom-most one, because when I took the front bezel off the server the top 3 drives had green lights and the bottom one was orange. Anyways, I called up Dell to order a new drive. Turns out they don't have the exact same drive anymore and they'll send me a 15k drive... is this ok? The CSR said it would be, but I just want to make sure.

My question is, how do I replace the drive once it comes? Do I need to shut the server down, take out the old busted drive, put the new one in, and rebuild it in Dell OpenManage? I've got my enthusiast hardware down (built several computers) but have never worked with SCSI before (my bad). Just wondering if there are any jumpers/terminators to set on the new drive?

Any help is greatly appreciated, thanks!
 
Hi there,

If you're still under warranty the Dell guys can talk you through replacing the drive - I found them VERY good when one of our drives failed.

From this page it looks like hot-swap may have been standard on the 1800, so in that case you can just whip the old drive out & whack the new one in. If in doubt, yeah, shut it down first - shutting down a machine with hot-swap will do no harm when replacing the drive, but yanking the drive of a non-hotswap machine may let the smoke out. After replacing the drive go into the OpenManage and tell it to rebuild.

I'm sure the orange light does indeed indicate the failed drive - I think on our one (it's a 2600 or 2800 or so, I think) the lights of the working drives are blue, but they do change colour when the drive fails. There is also an option in OpenManage to "identify" - select the failed drive and then choose "identify" from the drop-down (I think that's how it works) and you'll see the drive's lights flash like a funfair.

I have absolutely no hesitation when doing this on our machine because I definitely specified hot-swap when I ordered it, so when it arrived I installed Windows and for a couple of days pulled the hard-drives and whacked them back in again, just to see the arrays rebuild. But I do have a bit of hesitation saying "yeah, definitely do this" to someone over the internet, so like I said, just call Dell's support back tomorrow when the drive arrives (assuming you're under warranty).

I found Dell's business support (which is the department that covers their servers) to be very good indeed. The guy spent an hour on the phone with me doing diagnostics - those turned out to be superfluous, I could have just replaced the drive and it would have fixed it, but he was very thorough and got me to run an extra test utility to make sure and check the controller had no faults. He also sent me a link to another utility that checked driver / firmware versions, then sent me a link to a CD that took all the drivers on the machine (including the DRAC card & all that) up to the latest versions.

Stroller.
 
That helped a lot, thanks Stroller. The thing is, my warranty is up so I don't think I'll get that support. And up here in Canada I'm sure Dell's support line is hit or miss, usually miss.

edit: so it's ok the new drive is 15K? (the ones now are 10k)
 
... the thing is my warranty is up so I don't think I'll get that support.
Ouch! I dread to think what you paid for that drive!!

I'm sure you'll be fine with the replacement. I can understand you being nervous, but I find everything in the RAID admin GUI to be pretty clearly labelled (if a bit "webby" for this kind of tool) and straightforward.

You might want to check out the Dell forums - they're pretty good. I think they call 'em the Dell "community forums" (??) - if you're after more detailed instructions you'll maybe find someone can give you the exact steps, or a link to them.

...edit: so it's ok the new drive is 15K? (the ones now are 10k)
Yeah, that's perfectly fine. It's just the speed the drive spins at, but as far as the computer is concerned it's just somewhere to shove data. If you used a 10k drive & a 15k drive side by side as JBoD (just disks - no RAID) then you'd see the 15k one to be faster, but in a RAID array the controller will get a block of the data from that drive and then just have to wait until the other drives supply the corresponding blocks at their usual speed.
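
To put rough numbers on that, here is a minimal sketch. It assumes the only difference between the drives is average rotational latency (half a revolution) and ignores seek time, cache and controller behaviour, so treat it as an illustration rather than a benchmark. The function names are made up for this example.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000 / rpm


def stripe_wait_ms(drive_rpms):
    """A full-stripe read is only complete once the slowest member answers."""
    return max(avg_rotational_latency_ms(rpm) for rpm in drive_rpms)


all_10k = [10_000, 10_000, 10_000, 10_000]
mixed = [10_000, 10_000, 10_000, 15_000]  # one 15k replacement drive

print(avg_rotational_latency_ms(10_000))  # 3.0
print(avg_rotational_latency_ms(15_000))  # 2.0
print(stripe_wait_ms(all_10k), stripe_wait_ms(mixed))  # 3.0 3.0
```

Under those assumptions the mixed array waits just as long per stripe as the all-10k one, so the 15k drive does no harm and brings no real speed-up.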

Oh - if this is the first time you've used RAID then read some introductory articles about it first, just to get yourself some more confidence. The Wikipedia article is a bit long, but if you just read from the beginning up to and including the "Principles" section, then about levels 1 & 5, which are the most commonly used, then that is a good overview.

Stroller.
 
Ouch! I dread to think what you paid for that drive!!

Yeah.. it was absurd. I told my friend I could probably find it cheaper on buy.com but they'd have to wait longer to get it cause we're in Canada and I've to first ship the drive to my Amerifriend before we got it. Secondly, the Dell replacement would come with the bracket so no hassle there.

You might want to check out the Dell forums - they're pretty good. I think they call 'em the Dell "community forums" (??) - if you're after more detailed instructions you'll maybe find someone can give you the exact steps, or a link to them.

I think I'm going to try that too actually.


Oh - if this is the first time you've used RAID then read some introductory articles about it first, just to get yourself some more confidence. The Wikipedia article is a bit long, but if you just read from the beginning up to and including the "Principles" section, then about levels 1 & 5, which are the most commonly used, then that is a good overview.

Sadly, I learned all this in school... *sigh*, should've remembered it haha

thanks again Stroller.
 
All the Dell PERC controllers will let you just replace the drive hot... you should see it do a quick test of the replacement drive after it's inserted, and then it will begin the rebuild. If you have the Dell OpenManage utilities installed, you can watch the rebuild progress.

As a side note, often you can just pull the "bad" drive, let it spin down, and put it back in and it will rebuild and function normally. Sometimes the controller will mark a drive offline even if it just times out on one request, which is often a bad sector remap operation. After a rebuild, the drive should be good to go. This takes care of about half the drive failures I see at work, about 8-10 a year in a population of about 500 or so enterprise class drives.
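
As a back-of-the-envelope check on those figures, the sketch below reads "8-10 a year" as the total number of drives flagged as failed, takes "about 500" as exactly 500 and "about half" as exactly one half. Those readings are assumptions, and the names are made up for this example.

```python
def annual_failure_rate(failures_per_year, population):
    """Fraction of the drive population flagged as failed in one year."""
    return failures_per_year / population


POPULATION = 500           # "about 500 or so" drives, taken as exactly 500
RESEAT_FIX_FRACTION = 0.5  # "about half", taken as exactly one half

for failures in (8, 10):
    rate = annual_failure_rate(failures, POPULATION)
    still_bad = failures * (1 - RESEAT_FIX_FRACTION)
    print(f"{failures} flagged/year: {rate:.1%} of drives, "
          f"about {still_bad:.0f} needing a real replacement")
```

That works out to roughly 1.6% to 2% of drives flagged per year, with something like 4 or 5 of them actually needing to be replaced.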
 
Hmm that's good to know, maybe I should've tried that before I bought a new drive.. Oh well. Do I have to do anything with the new drive in terms of jumpers/terminators? I've never worked with internal scsi hds before.

edit: also, I guess I could ask this now; what's the best way to upgrade the drive space with this RAID-5 setup? Like someday I'm pretty sure they'll need more space.
 
I've been reading the PERC 4/DC user guide and I have a question in regards to this procedure, do I need a hot spare in order to do all of this?

Right now there are 4 drives in RAID-5 and none of them are designated as a spare (that is they are all in use). The one that failed is one of those 4, its ok to hot swap it with a new one? The manual isn't really clear as to whether I need a spare online or not.
 
Yes - it's ok to swap it.

A hot spare is there in case a second drive dies whilst you're waiting for a replacement to be delivered. But the RAID has redundancy without it.

I came back to suggest you might take a look on eBay & grab some drives to use as hot spares, as insurance against future disasters. I'm not sure how expensive 73gig SCSI drives will be these days, but the caddies should be pretty reasonable. Certainly buying 3rd-party will be cheaper than going to Dell, which is what your hurry has imposed upon you.

If you have a hot spare in place then it will automatically be used to replace a failed drive. It takes a while to copy the data on to it (from the still-working drives) but it's quite beneficial against the risk of an office in which no-one knows what the red light on the server means and ignores it for months. You can have a spare drive sitting in a box waiting for disaster to happen, but it's no use in that case. If you buy an extra caddy all you have to do is mark the extra drive as a hot spare in the admin interface.
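
For anyone wondering why the array can rebuild at all, here is a minimal sketch of the XOR parity idea behind RAID 5. It is a simplification: it shows a single stripe with made-up 4-byte blocks and a fixed parity position, whereas a real controller rotates parity across the drives and works on much larger blocks. The helper name is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 4-drive array: three data blocks plus one parity block.
data = [b"ACCT", b"MAIL", b"DOCS"]
parity = xor_blocks(data)
stripe = data + [parity]

# One drive fails, so its block is gone. XOR of the survivors restores it.
failed = 1
survivors = [blk for i, blk in enumerate(stripe) if i != failed]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[failed])  # True

# With two drives gone, the survivors only give the XOR of the two missing
# blocks, which is not enough to recover either one on its own.
two_left = [stripe[0], stripe[3]]
print(xor_blocks(two_left) == xor_blocks([stripe[1], stripe[2]]))  # True
```

The rebuild onto a replacement drive or a hot spare is this same calculation repeated for every stripe, which is why it takes a while and why a second failure before it finishes loses the array.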
 
Yeah I think I'm going to do that.. how do I assign a hot spare? Just plug the drive in and use Open Manage to assign it as a spare? Will OpenManage/PERC controller detect the 'hot plug in' of the drive? Or do I have to refresh the controller somehow?
 
Yes, you can do it from the OpenManage interface. It should pick it up after a minute or two. Just assign it as a global hot-spare and it'll take over immediately after a failure.

No need to mess with any jumpers on SCA drives like these. The slot you put the drive in takes care of setting the SCSI ID for you.
 
Sounds good, one more thing. Before I physically remove the failed drive do I have to do anything in open manage first? Like 'unlink' it? Or can I just pull it with no ill effects to the array?

Thanks for all the advice so far guys
 
Another vote for just hot pull it and swap it. I've done hundreds of these and it's as simple as pie.
 
I'd also try pulling the current drive, letting it sit for a few seconds, then reinserting it. It's very possible it was a hiccup and it will rebuild. Right now the array is degraded; if another drive fails, the data is gone. At least if it does rebuild with the current drive then you have some fault tolerance back. If it will not, then I'd suggest an immediate backup of all data just in case.... you should do that regardless.
 
Just wanted to chime in with a little tidbit of info since you seem kind of nervous. Without a hot spare your RAID array can't start rebuilding by itself when a drive fails; it stays degraded until you put a replacement in. This means that if you pull out the wrong drive, or another one fails in the meantime, all your data is lost. If the server can take it, I would try and put another drive in with the others for that little bit of protection. With all our new SANs we've gone a step further with RAID 6 and two hot spares per RAID array. It's a little overkill, but sleeping at night is just that much easier when I can have 4 drives fail in an array and not have to worry about losing all my data :) ..albeit not all at the same time.
 
Wow, I can get HP 73GB 10k drives for $40 locally. New ones are harder to find and more expensive, but you should really invest in some cheap used spares.
 
thanks for all the tips guys, I'm going in tomorrow to fix the drive, hopefully I don't pull the wrong drive.. :)
 
A failed hard drive in a Dell server does not necessarily mean a bad drive. It means that it's failed off the array.

Just pull the amber drive, wait 10 secs and reseat it in the bay. About 10 secs later it should start a rebuild. If it goes back to amber and the rebuild fails, it's a bad drive.

You can also download the Dell online diags and test the drive while it's in its amber state.

Dell Online Diagnostics

http://support.dell.com/support/dow...1&impid=-1&formatcnt=2&libid=13&fileid=271079
 
Alright, I've just put the new disk in and it's rebuilding :) Easy as pie

Couple of questions, I tried re-activating the alarm on the PERC 4/DC RAID card and it still goes off. Is this because the virtual disk is still degraded until the rebuild is complete?

Can I put in a global hot spare while the RAID is rebuilding? I don't see an option to assign a global hotspare with any of the current 4 disks, I assume this is because they are all being used in the RAID-5 config, when I put in a 5th disk will I get this option? Does the 5th disk have to be formatted/partitioned before I assign it as a hot spare?

Thanks for all the help!
 
Yes, the alarm will keep sounding until you "silence" it, or it finishes rebuilding.

You can put the spare in too, no problem. And yes, once it's in you just select it as the global hot spare. No need to partition or format it, and the OS won't even see it anyway. Since the other drives are in use, that option doesn't show up for them.
 
I'm glad you got the new drive in ok. I don't think it was mentioned - back up your data before doing any operation like this!
 
As far as I know, Zardoz is right. "Failed" really kind of means "dropped", as in dropped out of the array. Also, every PowerEdge server I've been around has had free, lifetime Dell phone support for hardware issues. They are the best people to call before you touch anything when a RAID fails, and the price is right! There are a variety of controller types in use, and terms such as "initialize", "container" and the names of logs can be unique to a controller; the same word can even mean different things in a different controller BIOS, series or revision (because of different manufacturers). When messing with a degraded (or sometimes even a healthy) array, each move should be chosen carefully and weighed for risk, and the number of moves kept to a minimum so there is less chance of inviting corruption.

A hot spare should take over automatically, on the fly, once it's set up. However, you can't assign a hot spare to a RAID 10 or 50 that is using partial/shared drives.
 
This was an 8 month old dead thread noob.... ;)
 
Hi Experts

I need your URGENT help.

I have a Dell PowerVault NX3100 with RAID 5 on 5 hard disks and no hot spare. 2 of my hard disks have failed, and now I cannot access my data, which is very important. I don't have the same model of hard disk, but I do have a 15k one.
Please advise.
 
Although the original thread was from 2009, I will suggest you search Google for what to do with 2 failed drives in a RAID 5 set for your device/controller card. If they have truly failed, you will have real problems. If they are just reporting as failed but still work, you may be able to reseat them. The last time I had issues, I was able to reseat the drive and the controller simply recognized it as a working drive that belonged to the set, and it worked fine.

I hope you have a backup of your data.
 
In this scenario, turn everything off, NOW! With 2 failed drives, you will not get any usable information from the array. If it is truly important, contact a data recovery facility for a quote on recovering the data from the drives. It WILL be very expensive, but if you do not have any backups, you may be stuck. If nothing else, this event will teach you the value of backups.
 