Brand new SCSI drives...yay!!!...or not?

jyi786

Supreme [H]ardness
Joined
Jun 13, 2002
Messages
5,757
:rolleyes:

So I got 6 new 146GB 15K SCSI drives for our server here at work, a Dell PowerEdge 1800. I wanted to upgrade from the 6 x 73GB drives that are getting long in the tooth, plus we're running out of space. Figured I'd work on it over Thanksgiving. Backups done, everything prepared, smooth sailing.

So Wednesday night, I start on the server. Basically, I get the drives ghosted over properly using Acronis: 2 x 146GB in RAID 1, 4 x 146GB in RAID 5. Everything works perfectly! Nice and fast, plus tons more space to boot!
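
For anyone checking the space math, it works out like this (a quick Python sketch; I'm using the marketing GB figures and assuming the old 73GB drives were split the same way):

Code:
def raid1_usable(size_gb):
    # Two-drive mirror: usable space is one drive's capacity.
    return size_gb

def raid5_usable(size_gb, drives):
    # n-drive RAID 5: one drive's worth of space goes to parity.
    return (drives - 1) * size_gb

new_gb = raid1_usable(146) + raid5_usable(146, 4)  # 146 + 438
old_gb = raid1_usable(73) + raid5_usable(73, 4)    # 73 + 219
print(new_gb, old_gb)  # 584 292 -- exactly double the space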

Then 3 drives fail. :mad::rolleyes:

Now I don't know how the f*** this happens. I already had one of these new drives fail on my last attempt, and got it replaced. This is the second time, and now 3 drives fail in one night.

What the hell am I doing wrong? The drives got nice, cool, fresh air (I pointed a table fan at the enclosure, and ambient temps did not exceed 40F since I was doing it in a cold room), and they were installed properly (anti-static strap, the works).

Now, I've read somewhere that the PERC controller in the server is responsible for these drives dying. Here's the link.

http://nerhood.wordpress.com/2005/09/14/dell-powervault-220-are-junk-and-im-tired/

Help? Suggestions? Condolences? Thanks. :(
 
Now, I've read somewhere that the PERC controller in the server is responsible for these drives dying. Here's the link.

http://nerhood.wordpress.com/2005/09/14/dell-powervault-220-are-junk-and-im-tired/

I read through some of those, and I'm quite skeptical that a PERC controller is causing the drives to fail in any physical manner. Sure, I could accept that they had data corruption or the configuration of the array itself went bad, but there's no way it's actually physically damaging the drives.

That said, you didn't mention how the drives failed in your case. Do they work in a different system or when not configured as part of a RAID array?
 
That said, you didn't mention how the drives failed in your case. Do they work in a different system or when not configured as part of a RAID array?

They work, but sporadically. Dell's OpenManage server software reports the drives as being close to failing. In addition, you can actually hear the drives malfunctioning; they click and sound like they are "trying" to spin up, then they spin down and try spinning up again. Eventually, they will start working, and you can actually boot and run the system. This doesn't, however, get rid of the predictive failure errors I get on startup, not to mention that checking the drives' consistency in the controller's BIOS nets me 32 or more errors per physical drive that is reported as failing.

I already tried clearing and rebuilding the arrays, but this did not help in the end. I was able to rebuild the array, but the drives still act quirky.
 
The drives might be failing because you're operating BELOW their rated operating temperature. Most drives don't like being that cold.
 
The drives might be failing because you're operating BELOW their rated operating temperature. Most drives don't like being that cold.

Well, 40F would be like 5C or something, which is rather cold. But I have never heard of a drive not working due to being too cold. I mean, I've done the freezer trick to help a bad drive, and it's not like it stopped working after being at freezing temps. I highly doubt this is the guy's issue.
 
Drives can survive storage at much lower temps than they can during operation; that's why drives have two ratings for environmental conditions.
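
To put numbers on it (a rough Python sketch; the 5-55C operating and -40-70C storage windows below are typical 15K SCSI figures I'm assuming, not the actual Fujitsu datasheet values):

Code:
def f_to_c(f):
    # C = (F - 32) * 5 / 9
    return (f - 32) * 5 / 9

ambient = f_to_c(40)     # about 4.4 C
operating = (5, 55)      # assumed typical operating range, deg C
storage = (-40, 70)      # assumed typical non-operating range, deg C

print(round(ambient, 1))                        # 4.4
print(operating[0] <= ambient <= operating[1])  # False: below the operating floor
print(storage[0] <= ambient <= storage[1])      # True: fine for a powered-off drive

So 40F is fine for a drive sitting on a shelf, but it lands right below the floor of a typical operating window.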
 
I think that was a typo... I think he meant 40C. But that's neither here nor there, because it shouldn't affect it that much.

Have you run HDD diagnostics on the drives? You might just have gotten a bad batch or they were damaged in shipping.
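
If you can get the drives into a box where the OS sees them directly, smartmontools can talk to SCSI disks. Something like this rough sketch would pull the health status (/dev/sda is just a placeholder, and drives still sitting behind a PERC may need smartctl's megaraid device type instead):

Code:
import subprocess

def scsi_health(device):
    # Ask smartctl for the drive's self-reported health. SCSI drives
    # answer with a "SMART Health Status:" line (OK or a failure reason).
    result = subprocess.run(
        ["smartctl", "-H", "-d", "scsi", device],
        capture_output=True, text=True,
    )
    for line in result.stdout.splitlines():
        if "Health Status" in line:
            return line.strip()
    return "no health status reported"

print(scsi_health("/dev/sda"))  # placeholder device node; substitute your own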
 
If the drives are clicking, the PERC almost certainly did not cause it.
My guess is a bad drive or batch of drives, or shipping damage. Do you know that all 3 drives are clicking, or is it just a single drive? One drive experiencing an issue on a SCSI channel can cause multiple drives to go offline.
Also, the word "failed" is a bit of a misnomer when dealing with LSI-based RAID controllers; it only means that the drives are offline to the array. Whether or not they have physically failed is another story.
In this case, physical clicking sounds from a drive likely indicate a head crash; it's just a matter of which drives are clicking.
 
The drives might be failing because you're operating BELOW their rated operating temperature. Most drives don't like being that cold.

Well, 40F would be like 5C or something, which is rather cold. But I have never heard of a drive not working due to being too cold. I mean, I've done the freezer trick to help a bad drive, and it's not like it stopped working after being at freezing temps. I highly doubt this is the guy's issue.

I think that was a typo... I think he meant 40C.

No guys, that was not a typo. It was not more than 40F in the environment. That's right, FAHRENHEIT, not Celsius.

Now, just FYI, these SCSI drives run freaking hot. I mean so hot that if you don't have active cooling, one will literally burn your hand if you touch it for a split second. So it was really good that I was running them in the environment I was.

In addition, I was doing this in the same room my personal rig is in. There is no heating in this room, hence the cold temps. And my rig has 4 hard drives in it, and it works perfectly whether the temps are cold or not.

Where were the drives purchased from?

Newegg. They seem to have a penchant for sending out stuff that fails frequently. Basically everything I buy from them fails within a couple of years. :rolleyes:

Now, adding to the confusing mix, I verified that one of the drives that failed is the replacement drive I received from Fujitsu themselves. So the issue can't be a shipping problem or a bad batch of drives. Something else is killing the drives, and I can't tell what. At least I know that when I put them in this specific server, that's what happens.

best [486] said:
I thought that was the drive realigning the heads...

No. The clicking means the heads crashed. Basically it looks like I'll have to replace the drives, which I will, but that still doesn't account for why the drives are failing.
 
One of the comments you made is that the OpenManage software said some of the drives are near failure. Have you updated the RAID driver, the firmware, and OpenManage Server Administrator? Also, have you checked the firmware on the hard drives? If any of that is outdated, it could cause false errors. That doesn't explain the clicking drives, but it can explain some of the other issues.

Clicking drives could be shitty UPS service.
 
One of the comments you made is that the OpenManage software said some of the drives are near failure. Have you updated the RAID driver, the firmware, and OpenManage Server Administrator? Also, have you checked the firmware on the hard drives? If any of that is outdated, it could cause false errors. That doesn't explain the clicking drives, but it can explain some of the other issues.

Clicking drives could be shitty UPS service.

I would definitely do that, but I've read in plenty of places about people with these problems for whom updating the software/firmware made no difference. It also doesn't explain why it's running 3 of the drives flawlessly while the others click like mad, which means those drives are bad for sure. The firmware on all the drives is exactly the same.

I've tried the drives on 2 totally different UPSes at different locations, with the same results. Even if updating the firmware/drivers made the issues go away (which is impossible for the clicking; believe me, if you heard these drives, you'd say they're defective for sure; they sound like jet engines now when spinning up and down, while the other drives don't sound like that), it would mean the PERC controller is the culprit killing the drives.
 
Well temperature might still be a problem: http://arstechnica.com/news.ars/post/20070225-8917.html

"The researchers also found that drive failures did not increase with high temperatures or CPU utilization. In fact, they say, lower average temperatures actually correlate more strongly with failure. Only at "very high temperatures" does this change."

Though I'd more likely put it down to a bad refurb...
 
Well temperature might still be a problem: http://arstechnica.com/news.ars/post/20070225-8917.html

"The researchers also found that drive failures did not increase with high temperatures or CPU utilization. In fact, they say, lower average temperatures actually correlate more strongly with failure. Only at "very high temperatures" does this change."

Though I'd more likely put it down to a bad refurb...

The first drive that failed, the one I mentioned earlier that I sent back to Fujitsu for replacement, was installed in a room at room temperature (about 73F), and it still failed. That's what led me to believe they were running in an environment that was too hot. Now I don't know what to believe. Should I try running them in a room that's baking hot, like 80F? :eek:
 
The first drive that failed, the one I mentioned earlier that I sent back to Fujitsu for replacement, was installed in a room at room temperature (about 73F), and it still failed. That's what led me to believe they were running in an environment that was too hot. Now I don't know what to believe. Should I try running them in a room that's baking hot, like 80F? :eek:

No clue. Drives should be more tolerant than that... Temperature shouldn't be that big of a deal, as it only affects percentages, not specific individual drives.
 
No clue. Drives should be more tolerant than that... Temperature shouldn't be that big of a deal, as it only affects percentages, not specific individual drives.

Yeah, that's right in line with what I was thinking. If anything, ATA drives should be more susceptible to failure than these.

I'm going to send ALL the drives back to Fujitsu for replacement on Monday. First, I'm going to talk to advanced technical support to see what they have to say about this.
 
Have you checked how clean your power is? This could be a dirty-power problem. I know you said you checked the drives on another UPS, but that was after they had already started to show symptoms of malfunction, correct? Could it be a faulty backplane? How are the drives connected to the server? Just looking at this logically, there is a strong probability of an environmental cause here, rather than the bad luck of getting that many faulty drives...
 
Personally, for work, when it comes to server drives, I buy the drives the manufacturer uses... that is, Dell drives for Dell servers, even though they are rebadged something else. This way, if I have a problem, Dell (or HP) is the one who has to fix it. I have yet to have (knock on wood) any issues like this. But worst case, they ship new drives to me on a 4-hour turnaround.
 
Have you checked how clean your power is? This could be a dirty-power problem. I know you said you checked the drives on another UPS, but that was after they had already started to show symptoms of malfunction, correct? Could it be a faulty backplane? How are the drives connected to the server? Just looking at this logically, there is a strong probability of an environmental cause here, rather than the bad luck of getting that many faulty drives...

I agree, it sounds like a power issue. Either the power supply is overloaded, doing something weird on the 12V rail, or something is wrong with a connection in between.

Do the drives begin to click automatically when they power up or only when you try to access them? Any way you can hook them up to an alternate power source to test them?
 
Have you checked how clean your power is? This could be a dirty-power problem. I know you said you checked the drives on another UPS, but that was after they had already started to show symptoms of malfunction, correct? Could it be a faulty backplane? How are the drives connected to the server? Just looking at this logically, there is a strong probability of an environmental cause here, rather than the bad luck of getting that many faulty drives...

I checked the drives BEFORE they started to malfunction. They were all working fine, I think. I definitely didn't hear anything weird (spinning up/down unexpectedly, clicking) until I started to access the drives.

Personally, for work, when it comes to server drives, I buy the drives the manufacturer uses... that is, Dell drives for Dell servers, even though they are rebadged something else. This way, if I have a problem, Dell (or HP) is the one who has to fix it. I have yet to have (knock on wood) any issues like this. But worst case, they ship new drives to me on a 4-hour turnaround.

The boss didn't want to renew the warranty on the server no matter what I tried. Prior to that, I replaced 2 failed drives through Dell.

I agree, it sounds like a power issue. Either the power supply is overloaded, doing something weird on the 12V rail, or something is wrong with a connection in between.

Do the drives begin to click automatically when they power up or only when you try to access them? Any way you can hook them up to an alternate power source to test them?

The drives don't click automatically when they power up. They seem to work fine, and then after a random period of time start to behave badly. I unfortunately don't have another power source to try them on, other than moving the server from its physical location.

Now, I know the drives are larger than their 73GB counterparts, but they shouldn't suck much more power than the 73GB drives, should they? I mean, the server has been running for the better part of 5+ years with no problems using 6 x 73GB hard drives, with 2 drive failures in between.
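
Rough numbers, just to sanity-check myself (a quick sketch; the per-drive figures are ballpark assumptions for 15K SCSI, not actual datasheet values):

Code:
DRIVES = 6
SPINUP_AMPS_12V = 2.5  # assumed peak 12 V draw per 15K drive during spin-up
RUNNING_WATTS = 15     # assumed per-drive draw once spinning

peak_amps = DRIVES * SPINUP_AMPS_12V
print(f"{peak_amps:.0f} A @ 12 V = {peak_amps * 12:.0f} W if all six spin up at once")
print(f"{DRIVES * RUNNING_WATTS} W steady state")
# Capacity barely matters: a 146GB and a 73GB drive at the same RPM draw
# about the same, since it's the spindle motor doing most of the work.
# The controller/backplane normally staggers spin-up to blunt that peak.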
 