Is this RAID card dying? Help an old woman :).

silkshadow

The card in question is a Promise Ex8350. This card is old; I believe I bought it in 2004. I've since moved it between systems as I bought better RAID cards. It's been doing duty on an MCE array for recorded TV and junk my cousins and I upload to it for my grandmother to watch.

My grandmother called and complained that her shows were not playing, so I went over there this weekend and pulled up the logs. I saw that the array was going on and offline, yikes! I apologized to my grandmother and pulled the system back to my house to test it.

Here's the core specs:

Gigabyte GA-P35 board
Intel Core 2 Duo, 2.4GHz
2GB DDR2 memory
Nvidia 8600 (passive)
Promise Ex8350
8x 1TB Western Digital Green EACS (in RAID 5 on the RAID card)
200GB Samsung SpinPoint OS drive

What I desperately need help with is deciding how to spend the little time and money I have to get this back up and fixed correctly. If the card is bad, I will work on getting the array back up without it. If the card is OK, I will spend the time cleaning the machine, adding fans, and replacing disks. I just can't figure out if the card is good or bad :(. Over the weekend I was able to get the data off the array (phew); it took freaking forever, but thankfully that is not a consideration.

Symptoms:

-The array goes on and offline. In the logs, the array would go offline for periods, then come back up in critical condition with the Ex8350 rebuilding it. The rebuild never completes; the array goes back offline before it finishes.

-Heat seems to be a problem. While I was getting data off the array, it would go offline and I would have to shut the system down for about 5 minutes before powering it back up. I discovered I could cut the downtime if I opened the case and turned on the aircon in my work room. The heatsink on the XOR chip is very hot, but this is not abnormal; this card has always run hot. I had another one that fully kicked the bucket last year, and it was also always too hot to touch.

I was only able to start troubleshooting yesterday, as it took me 3 days to get the data off the array.

Troubleshooting steps so far:

-Tested all the disks with the WD util. All come up OK, but Promise's on-card util (WebPAM's Media Patrol) finds bad sectors on all the disks. Is there a definitive disk-checking program I can use to reconcile the different results?

-When the array went down, it seemed like 2 disks might be the culprits. Those disks checked out fine (as above), so I am confused.

-I was able to complete a rebuild of the array (the first time since the logs reported the array going offline) with the case open and the aircon on. However, it went critical again a few hours later, case still open.

Given the time testing these disks takes, that is all I have accomplished :(. Is this enough to say the card is the culprit? I am not sure because of a few things:

-Decreasing heat allowed the array to rebuild. Maybe the massive buildup of dust (the case and internals were/are covered with a thick layer of it) raised the heat to unworkable levels? A meticulous cleaning, more fans, and so forth might fix this.

-Conflicting disk status reports. I have no clue how reliable the WD util is. The disks seem to work, but it could still be the disks; 2 bad ones would cause this. I have no spare drives to test with, and I really don't want to buy two 1TB disks just for a test because afterward they'd be useless to me (I am all full up with dozens of 2TBs).

-I am wishing for this not to be the card, because I can't replace it. I have no spares, they do not sell this kind of equipment locally, and importing replacements takes a lot of time.

My grandmother is very old; the system is operated by her caregiver, and watching it is one of the few things she enjoys these days. I can't put into words how happy it made her when I installed it a couple of years ago and she took to using it within a few weeks. It was a surprise to the whole family, as she was not a TV/movie person before. So getting this back on its feet sooner rather than later would be very nice. In other words, I am under familial pressure (sucks being the family geek). So big thanks for any help here!
 
I would honestly get a couple of fans blowing cold (or room-temp) air over that card and its heatsink, and go from there. It sounds to me like the card is continually overheating.
 
What I desperately need help with is deciding how to spend the little time and money I have to get this back up and fixed correctly.
Grab a spare box, set up Linux mdadm, and use it as a NAS for the MCE box. You can probably do it after work tomorrow. Use 2 x 4-disk RAID 6 arrays.
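
A minimal sketch of the mdadm side, assuming a fresh Linux install where the eight drives show up as /dev/sdb through /dev/sdi (device names are examples; check yours with fdisk -l):

    # Two 4-disk RAID 6 arrays; each survives two simultaneous drive failures.
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[b-e]
    mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sd[f-i]

    # Put a filesystem on each and watch the initial build.
    mkfs.ext4 /dev/md0
    mkfs.ext4 /dev/md1
    cat /proc/mdstat

    # Save the array definitions so they assemble on boot
    # (config path varies by distro; some use /etc/mdadm.conf).
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Then export the two volumes over the network (Samba is the usual route for a Windows MCE box) and point MCE at the shares.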

I have no spare drives to test with, and I really don't want to buy two 1TB disks just for a test because afterward they'd be useless to me (I am all full up with dozens of 2TBs).
Why would you need 1TB disks just to test the card? Use 2TBs.

In other words, I am under familial pressure (sucks being the family geek)
Yeah, right - tell them that they are welcome to pay a professional to set her up with a new system tomorrow. The reality is that they don't give a shit and would never pay for a system like this much less spend extra on labor.
 
Remove the dust from the card (be careful of static discharge) and point a big blower at it. Then pull the data again / stress-test it and determine whether it is overheating. With a big blower it should no longer overheat, which might help you narrow down the cause.
If heat is the cause, you may want to remove the heatsink if possible, apply new thermal grease, and, depending on the size of the heatsink (if no fan is present), screw a 40mm or larger fan onto it. Diagnosing things takes a little time, but you do need to eliminate heat as a cause.

Perhaps someone can advise on how to deal with the oddly behaving drives?

Disclaimer: the above assumes your data is already safe :)

Another thing to check is the power supply. Many strange problems turn out to be related to a flaky PSU. I would not go that route before ruling out heat, though; swapping a PSU can be time consuming if you have gone to a lot of trouble for neat cable management.
 
I'm not a RAID guy, but I see these threads all the time and wonder why you guys don't do it in software. Forget about the speed of the RAID (which your grandma's not going to care about) and just think about the usability: when a RAID 5 fails on IRST, Intel pops up a little balloon saying "you're boned"... and since the needed computation is done at the software level, the heat is put out on the CPU, which means you're never going to be totally fucked by a RAID card dying.

So yeah, move your grandma over to software RAID, SpinRite the drives, add some fans, and send the computer on its way; that's what I'd do.
 
Thanks for all the replies!

I would honestly get a couple of fans blowing cold (or room-temp) air over that card and its heatsink, and go from there. It sounds to me like the card is continually overheating.

Remove the dust from the card (be careful of static discharge) and point a big blower at it. Then pull the data again / stress-test it and determine whether it is overheating. With a big blower it should no longer overheat, which might help you narrow down the cause.
If heat is the cause, you may want to remove the heatsink if possible, apply new thermal grease, and, depending on the size of the heatsink (if no fan is present), screw a 40mm or larger fan onto it. Diagnosing things takes a little time, but you do need to eliminate heat as a cause.

Thanks! Stress testing with a big old industrial fan blowing on it is on the troubleshooting checklist. I will give this one more night (tomorrow night); after that I have no more time to work on it this week :(. If it fails under those conditions, then it's dead for sure, right? Would that be a definitive test?

Grab a spare box, set up Linux mdadm, and use it as a NAS for the MCE box. You can probably do it after work tomorrow. Use 2 x 4-disk RAID 6 arrays.

I definitely appreciate the efficiency of this option, but I am afraid deploying a NAS plus the MCE box at her place will not work.

Why would you need 1TB disks just to test the card? Use 2TBs.

Sorry, I forgot to post this, but the 2TBs I have are the piece-of-crap EARS Advanced Format drives, so they don't work for a swap-out. That is another PITA saga, but I am stuck with those.

Yeah, right - tell them that they are welcome to pay a professional to set her up with a new system tomorrow. The reality is that they don't give a shit and would never pay for a system like this much less spend extra on labor.

Lol, I appreciate the sentiment and you are right, but unfortunately, that doesn't change the case for me. I am my own worst enemy here.

Another thing to check is the power supply. Many strange problems turn out to be related to a flaky PSU. I would not go that route before ruling out heat, though; swapping a PSU can be time consuming if you have gone to a lot of trouble for neat cable management.

This is interesting. The PSU is aging (2 years old) and I have not checked the base components, since everything else seems to be working fine and my time is short. It is a good PSU though, a Corsair, I think 400W or maybe 500W. It wasn't on my radar since none of the usual PSU-failure symptoms occurred while I was getting the data off.

How likely is this possibility? You are right, swapping out the PSU would be a big pain, and I have no spare that I know to be working (I have a couple of old Enermax PSUs that are iffy).

The WD test tool does say SMART is OK even though the drive has bad sectors.

I was afraid of that. Is there any util I can use to say 100% if a drive is good or bad?

I'm not a RAID guy, but I see these threads all the time and wonder why you guys don't do it in software.

Truth is, it was easier and more reliable. With hardware RAID, only a bad card or bad drives can be a problem (in this case, I can't figure out which is the culprit because both are still working, kind of). With software RAID, it's both of those (the mobo or SATA chip takes the place of the RAID card as a failure point) plus a dozen potential software, driver, OS, etc. issues. Besides, I had the card and it was just collecting dust. :)

If I could run MCE on WHS, I would do that if this card is dead (though reinstalling the OS and such is a huge time sink), but that doesn't work. Other than that, Windows soft-RAID is very problematic (I have used it), and FlexRAID is an issue for a PVR array given that it is constantly changing. So what else is there for Windows with more than 4 drives? I am asking because if this card is dead, I have to find something or go without redundancy for this.

In fact, I am currently waiting for Drive Bender to go stable so I can use it for my own RAID arrays on my Windows domain controller, because I got stuck with EARS drives that don't work on my hardware RAID cards :(. I've been waiting for FlexRAID Live, still in beta. I was waiting for MS to release Aurora, but they pulled Drive Extender. So now I have to wait for Drive Bender while a couple dozen 2TB drives sit on my shelf (I don't need the storage yet, but my existing arrays are getting close to full; getting the data off my grandmother's box has pretty much filled 3 of them). Yeah, NASes are fine, but putting the storage in the same 24/7 computer that does Exchange/AD, PVR, VPN, My Movies, etc., rather than running yet another computer at home 24/7, is what keeps my electricity bill from making me cry. :) Lol, I strayed OT there a bit, sorry.

So thanks to your replies my process is streamlining:

-Need to do the stress test with aggressive cooling. This will be a definitive test, right?

-I still need to find a util that can tell me 100% these drives are all OK. WD's util is not conclusive? Any suggestions?

-I need to take a multimeter to the PSU.

Edit: Just clicked that SpinRite link. Am I right that it's for data recovery? If so, I don't need it; I have all the data. However, it seems it might also be a disk-checking util? Can it tell me 100% if my drives are OK?
 
I use CrystalDiskInfo. It's under ID 05: if the raw value is anything other than 0, then the drive has bad sectors.
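
If you prefer a command line, smartmontools reads the same attribute; a minimal sketch, assuming smartctl is installed and the drive is /dev/sda (adjust the device name for your system):

    # Dump the SMART attribute table; ID 05 is Reallocated_Sector_Ct.
    # A non-zero RAW_VALUE means the drive has already remapped bad sectors.
    smartctl -A /dev/sda

    # Or just the one line you care about.
    smartctl -A /dev/sda | grep -i reallocated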
 
So thanks to your replies my process is streamlining:

-Need to do the stress test with aggressive cooling. This will be a definitive test, right?

It should, yes. Your average table fan, the kind you'd use on a hot summer day, will do though.

-I still need to find a util that can tell me 100% these drives are all OK. WD's util is not conclusive? Any suggestions?
Any tool like HD Tune can check the SMART data for errors; they should be listed if there are any. But you want to scan the full disk? HD Tune should also be able to do that (needs confirmation though).
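
If you end up testing the drives outside Windows, badblocks can do the full-surface read as well; a sketch, assuming a Linux boot disc and that the disk under test is /dev/sdb:

    # Read-only surface scan (non-destructive by default); -s shows
    # progress, -v is verbose. Any block numbers printed are unreadable.
    badblocks -sv /dev/sdb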

-I need to take a multimeter to the PSU.

Edit: Just clicked that SpinRite link. Am I right that it's for data recovery? If so, I don't need it; I have all the data. However, it seems it might also be a disk-checking util? Can it tell me 100% if my drives are OK?

Every time I wanted to use SpinRite, it would have taken ages to complete, so I do not recommend it. I never had the patience to let it run all the way.

As for the PSU concern: if there is a lot of heat inside the case, a PSU will degrade faster. How likely a cause this is I do not know, but I do know that very dodgy behaviour can be power related. I've only experienced it twice, and after extensively looking for a solution, on both occasions a faulty PSU was the culprit.

It is also a good idea (when suspecting hardware problems) to run with minimal hardware attached; no nonessential stuff like a DVD-ROM/rewriter. I assume you have checked all the power and data connections thoroughly to see whether anything is loose?
 
I would begin with a proper SMART test of the disks: http://gsmartcontrol.berlios.de/home/index.php/en/Home
Choose the full, long test that initiates the disk's own self-check procedure. After about 2-4 hours, you can check the result and the SMART values, which will reflect the test's outcome.

If the raw value of bad sectors etc. is non-zero, there might be a problem with corrupted disks. CRC errors mean bad cables; a non-zero reallocated sector count means the drive detected bad sectors and has already worked around them, so the disk is not in a corrupted state, but the sector faults will likely keep increasing from here.
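
gsmartcontrol is a front end for smartctl, so if you would rather work from a terminal, the same long test looks like this; a sketch, assuming the drive under test is /dev/sda:

    # Start the drive's own extended self-test (it runs inside the drive;
    # expect roughly 2-4 hours on a 1TB disk).
    smartctl -t long /dev/sda

    # When it is done, read the self-test log and the attribute table.
    smartctl -l selftest /dev/sda
    smartctl -A /dev/sda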

But really, this sounds like a heat-cooked RAID controller ;)
 
Thanks everyone, and sorry for the late reply; it's been a hectic day. I will be working on this tonight; hopefully I can set aside a couple of hours after dinner before I have to go back into the office. Time zones suck. Will update after I go through the process suggested here.
 
If dust is a major issue, consider taking the cover off the PSU and cleaning that.

The Promise card you mention: is it an actual RAID-on-Chip card, or does it offload the XOR calculation to the CPU?

Try setting the disks to passthrough mode (if that's not an option, put each in a single-drive RAID 0), then run the disk-testing tools on them. It sounds like a lot of the dropouts were due to heat, but make sure the controller chip hasn't gotten toasted.
 
If dust is a major issue, consider taking the cover off the PSU and cleaning that.

Thanks for the replies! I'll check out the PSU. Dust is definitely an issue, though how much of one I don't know yet. Cleaning the PSU couldn't hurt. :)

The Promise card you mention: is it an actual RAID-on-Chip card, or does it offload the XOR calculation to the CPU?

I actually don't know. What are the ramifications of this? I can get in touch with Promise support and find out. Their support is really good for info.

Try setting the disks to passthrough mode (if that's not an option, put each in a single-drive RAID 0), then run the disk-testing tools on them. It sounds like a lot of the dropouts were due to heat, but make sure the controller chip hasn't gotten toasted.

I haven't been able to go back and futz with it yet, but so far I haven't been able to keep a stable array up for more than a short period before it goes offline.

So I haven't had the time to work on this, but I'm heading home to a nice Friday night of troubleshooting this crud, oh well. At least I've had the time this week to start some of these ridiculously long drive-test processes and let them run. HD Tune's takes upwards of 6 hours per disk!

I have discovered that there are 3 disks with bad sectors. The lists of bad sectors (via chkdsk) on these disks are quite long. The other disks check out fine.

Western Digital, well, sucks, but my assistant was eventually able to get through to someone who would at least talk to her on their US line. They said to write 0s to the disks and then do the extended test; it will find the bad sectors and fix them. I have completed the process on one disk, and the WD util errored out when it tried to fix them.
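
For what it's worth, outside the WD tool a zero-fill is just a raw write of zeros across the whole device; a sketch, assuming a Linux boot disc and that the bad disk is /dev/sdb (triple-check the device name, this wipes the disk):

    # Writing to a pending bad sector forces the drive to remap it from its
    # spare pool, which is what WD's zero-then-retest procedure relies on.
    dd if=/dev/zero of=/dev/sdb bs=1M

    # Afterwards, re-check the reallocated (05) and pending (C5/197) counts.
    smartctl -A /dev/sdb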

From what I understand about disks, this means the drive's spare-sector list is probably full, is that correct? I still have to do the rest of the disks HD Tune flagged, but either way, I think these disks are likely toast.

Tonight, the first thing I plan on trying is creating an array with just the working disks. Possibly it ends up being the disks' fault (I'm hoping). Well, the disks and stupid Western Digital's util, which originally said the disks were fine :(. Thankfully HD Tune is a bit more reliable (thanks for the app suggestion!).
 
Someone earlier suggested SpinRite. That is the only program I am aware of that can correct a bad sector and verify a drive's integrity to the greatest extent possible without special hardware.
 
It is crazy late; I spent a couple of hours messing with it, and I think I can say the card is bad. I created an array with just 4 HD Tune-certified disks, aircon on full blast (I had to put on a jacket to go into my workroom) and the industrial fan (steel blades, loud as an airplane) blowing into the open case. I built the array at around 1:30am (quick build, no time for a full initialization; anyway, I know the drives are good) and started copying data over. At around 3am the array went critical but the file copy kept going; it's now 3:50am and the array just went offline. :(

So: 3 bad disks in an 8-drive array and a bad card. Amazing that there was no data loss! :cool:

OT, but this is why I like hardware arrays. I don't think an 8-drive software array would let you pull 7TB of data while 3 drives are bad and the motherboard/SATA chip has been in the process of dying for at least a week.

But now I have to figure out how to make this all work again without a RAID card :(. I have no idea how I am going to get this level of reliability. It's been nice; I haven't had to do any work on this HTPC since I installed it a couple of years ago, and my grandmother is definitely not kind to it. I think she doesn't even realize it's a computer. Anyway, that's a challenge for tomorrow.

Thanks for all the advice and help!
 