LSI 9260-4i Raid Issues

DarkDiamond · Jul 25, 2012

Hi,

I've been having a number of hard drives drop out of my raid adapter in the past two weeks.

The most recent error logged in the MegaRaid Storage Manager is this:

251 [Critical, 2] 2012-07-25, 09:30:54 Controller ID: 0 VD is now DEGRADED VD 0 10196
81 [Information, 0] 2012-07-25, 09:30:54 Controller ID: 0 State change on VD: 0 Previous = Optimal Current = Degraded 10195
114 [Information, 0] 2012-07-25, 09:30:54 Controller ID: 0 State change: PD = Port 0 - 3:1:12 Previous = Online Current = Failed 10194
87 [Warning, 1] 2012-07-25, 09:30:54 Controller ID: 0 Error: Port 0 - 3:1:12 (Error 2) 10193
113 [Information, 0] 2012-07-25, 09:30:54 Controller ID: 0 Unexpected sense: PD = Port 0 - 3:1:12-Logical block address out of range, CDB = 0x2a 0x00 0x00 0x36 0x45 0xe8 0x00 0x00 0x04 0x00, Sense = 0x70 0x00 0x05 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x21 0x00 0x00 0x00 0x00 0x00 10192

I've seen other errors such as:
113 [Information, 0] 2012-07-24, 19:48:40 Controller ID: 0 Unexpected sense: PD = Port 0 - 3:1:22-Power on, reset, or bus device reset occurred, CDB = 0x4d 0x00 0x4d 0x00 0x00 0x00 0x00 0x00 0x20 0x00, Sense = 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00 8547
268 [Warning, 1] 2012-07-24, 19:48:38 Controller ID: 0 PD Reset: PD = Port 0 - 3:1:22, Error = 3, Path = 0x5001E6739EA4AFF6 8546
267 [Warning, 1] 2012-07-24, 19:48:38 Controller ID: 0 Command timeout on PD: PD = Port 0 - 3:1:22-No addtional sense information, CDB = 0x12 0x00 0x00 0x00 0xff 0x00, Sense = null, Path = 0x5001E6739EA4AFF6 8545
102 [Critical, 2] 2012-07-24, 19:48:27 Controller ID: 0 Rebuild failed due to target drive error: PD Port 0 - 3:1:22 8544

and...

251 [Critical, 2] 2012-07-22, 14:39:45 Controller ID: 0 VD is now DEGRADED VD 1 2050
81 [Information, 0] 2012-07-22, 14:39:45 Controller ID: 0 State change on VD: 1 Previous = Optimal Current = Degraded 2049
114 [Information, 0] 2012-07-22, 14:39:45 Controller ID: 0 State change: PD = Port 0 - 3:1:22 Previous = Online Current = Failed 2048
87 [Warning, 1] 2012-07-22, 14:39:45 Controller ID: 0 Error: Port 0 - 3:1:22 ( Error 2) 2047
113 [Information, 0] 2012-07-22, 14:39:45 Controller ID: 0 Unexpected sense: PD = Port 0 - 3:1:22Power on, reset, or bus device reset occurred, CDB = 0x03 0x00 0x00 0x00 0x40 0x00 , Sense = 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00 2046
113 [Information, 0] 2012-07-22, 14:39:45 Controller ID: 0 Unexpected sense: PD = Port 0 - 3:1:22Power on, reset, or bus device reset occurred, CDB = 0x2a 0x00 0x08 0x5a 0x8d 0x00 0x00 0x00 0x38 0x00 , Sense = 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00 2045

Can anyone lend some advice as to what's happening here? I've lost about 4 drives (1 Hitachi 2TB, 2 Western Digital RE3's and 1 Western Digital Velociraptor 140GB). It seems weird that these are all failing at the same time when I've had them in service for years. These are all residing in a Norco 4224 case using an Intel RES2SV240 expander.

I'm especially concerned about the "Power on, reset, or bus device reset occurred" and "Logical block address out of range" errors. The rebuild error I assume is due to a faulty replacement drive from NewEgg.

Is there a website out there that lists LSI errors and what they mean?

Thanks,
Dark Diamond

Old Hippie · Jul 26, 2012

LSI has always been responsive to my email questions.

I suggest you email them.

odditory · Jul 26, 2012

You don't give any information about your system besides having an expander. What are the specs -- specifically the power supply make and model? What changes or upgrades did you do, if any, before these problems started happening?

"I've lost about 4 drives" is meaningless without a timeframe. How close together did they fail?

These errors could be caused by a dozen different things besides the drive itself -- loose or improperly seated power connector to backplane, crap or underpowered PSU, bad miniSAS or SATA cable, bad motherboard, loose power connector to motherboard, flaky LSI card, etc. You have to check and re-seat everything one by one to begin isolating the cause.

DarkDiamond · Jul 26, 2012

Motherboard: Supermicro X9SCL+-F
CPU: Core i3-2100
RAM: 8GB DDR3-1333 ECC Unbuffered
Raid: LSI MegaRaid 9260-4i with 512MB Onboard memory
Power Supply: SeaSonic X750 Gold 750W Full Modular Active PFC Power Supply
Case: Norco 4224

The system has been built and stable since 6/18/2010 with no changes since (not even patches to Windows). It serves as an iSCSI target for two ESXi boxes. I ran memory tests for more than 12 hours prior to moving data to this system. I've owned the Norco for about 2 years. The failures started on 7/9/2012 with one of my Hitachi 7K3000 2TB drives. This drive failed with a "Power on, reset, or bus device reset occurred" error message. On 7/14/2012 one of my Western Digital RE3 1TB drives failed with the same message as the Hitachi. Then on 7/22/2012 a second Western Digital RE3 1TB failed with the same error message. Then on 7/25/2012 a Western Digital Velociraptor 140GB failed with a "Logical block address out of range" error.

All four failures have come from different backplanes in the Norco case.

I've found no other errors in the event log on the server.

What do these errors mean?
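
For reference, the Sense bytes in the log lines quoted earlier in the thread follow the standard SCSI fixed-format layout, so they can be decoded by hand. Below is a minimal sketch in Python, assuming fixed-format sense data (response code 0x70 or 0x71): the sense key is the low nibble of byte 2, and the additional sense code and qualifier are bytes 12 and 13. The lookup tables are a small hand-picked subset covering only the codes that appear in this thread, and the function names are made up for illustration.

```python
# Hand-picked subset of SCSI codes, limited to what appears in this thread.
SENSE_KEYS = {0x05: "ILLEGAL REQUEST", 0x06: "UNIT ATTENTION"}
ASC_ASCQ = {
    (0x21, 0x00): "Logical block address out of range",
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
}
OPCODES = {0x03: "REQUEST SENSE", 0x12: "INQUIRY",
           0x2A: "WRITE(10)", 0x4D: "LOG SENSE"}


def parse_hex_bytes(text):
    """Turn '0x70 0x00 0x05 ...' into a list of integers."""
    return [int(tok, 16) for tok in text.split()]


def decode_sense(text):
    """Decode fixed-format sense data (hypothetical helper)."""
    data = parse_hex_bytes(text)
    if (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F           # sense key: low nibble of byte 2
    asc, ascq = data[12], data[13]  # additional sense code and qualifier
    return {
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "meaning": ASC_ASCQ.get((asc, ascq), "unknown (not in this subset)"),
    }


def decode_cdb(text):
    """Name the command; for WRITE(10) also pull out LBA and block count."""
    cdb = parse_hex_bytes(text)
    info = {"opcode": OPCODES.get(cdb[0], hex(cdb[0]))}
    if cdb[0] == 0x2A and len(cdb) >= 10:
        info["lba"] = int.from_bytes(bytes(cdb[2:6]), "big")
        info["blocks"] = int.from_bytes(bytes(cdb[7:9]), "big")
    return info


# Bytes copied from the 2012-07-25 Velociraptor event.
sense = decode_sense("0x70 0x00 0x05 0x00 0x00 0x00 0x00 0x0a 0x00 "
                     "0x00 0x00 0x00 0x21 0x00 0x00 0x00 0x00 0x00")
cdb = decode_cdb("0x2a 0x00 0x00 0x36 0x45 0xe8 0x00 0x00 0x04 0x00")
print(sense, cdb)
```

Read this way, the 7/25 event is a WRITE(10) that the drive rejected as an illegal request, while the earlier events are unit attentions reporting that the drive had been reset or power-cycled. A reset notice of that kind says the drive restarted, not why it restarted.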

gjs278 · Jul 26, 2012

it means your raid card is probably broken or overheating or who knows what.

Email LSI. they're going to give you a diagnostic utility that will fetch all of your raid controller logs. you run it, you attach the log back to them, and they tell you what to change or to RMA the card.

Old Hippie · Jul 27, 2012

gjs278 said: "they're going to give you a diagnostic utility that will fetch all of your raid controller logs. you run it, you attach the log back to them, and they tell you what to change or to RMA the card."

That's the ticket and it works well.

DarkDiamond · Jul 30, 2012

I called LSI on Friday. Was pleasantly surprised that a human being answered their 800 number.

I ran their utility to export the raid logs. The support technician said it looked like I had a failing disk in the Velociraptor array. I wasn't too convinced because I was getting those PD reset messages on various drives in the Velociraptor, RE3 and Hitachi arrays. I started a foreground initialization of the Velociraptor and RE3 arrays simultaneously. That caused a SLEW of PD reset error messages from random disks throughout all of my arrays.

I bought a new power supply. Unfortunately that didn't do anything to solve the problem.

Thankfully I had spare hardware and moved the 8 RE3 drives into their own machine and connected them to an LSI 8888ELP raid adapter. Created a raid 10 of those 8 drives and did a full foreground initialization. No errors at all.

That's telling me it's either a problem with the Norco backplanes, the raid adapter or expander.

Once the storage vmotion is complete and all data is moved off the remaining Hitachi array, I'm going to delete it and start a simultaneous initialization of Hitachi and Velociraptor and a few other drives I have lying around. If I see additional PD error messages, I'll hook them up directly to the expander with a reverse breakout cable. If no errors, I guess that means most of my Norco 4224 backplanes are shot.

Any other ideas for troubleshooting or am I on the right path?
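
One way to check whether the errors cluster on a single disk or are spread across slots is to tally the exported events per physical drive. The following is a rough sketch in Python, assuming the log lines look like the MegaRAID Storage Manager entries quoted earlier in this thread (event code, then `[Severity, n]`, and a `Port 0 - x:y:z` drive identifier). The regular expressions and the `tally_by_drive` name are illustrative, not part of any LSI tool.

```python
import re
from collections import Counter, defaultdict

# Matches the leading "251 [Critical, 2]" part of an event line.
HEADER = re.compile(r"^(\d+)\s+\[(\w+),\s*\d+\]")
# Matches the drive identifier as printed, e.g. "Port 0 - 3:1:22".
PD_PATTERN = re.compile(r"Port\s+\d+\s+-\s+(\d+:\d+:\d+)")


def tally_by_drive(lines):
    """Count events per drive identifier, broken down by severity."""
    counts = defaultdict(Counter)
    for line in lines:
        head = HEADER.match(line)
        pd = PD_PATTERN.search(line)
        if not head or not pd:
            continue  # skip lines with no drive (e.g. VD-level events)
        counts[pd.group(1)][head.group(2)] += 1
    return dict(counts)


# A few event lines copied from earlier in the thread.
SAMPLE = [
    "251 [Critical, 2] 2012-07-25, 09:30:54 Controller ID: 0 VD is now DEGRADED VD 0 10196",
    "114 [Information, 0] 2012-07-25, 09:30:54 Controller ID: 0 State change: PD = Port 0 - 3:1:12 Previous = Online Current = Failed 10194",
    "87 [Warning, 1] 2012-07-25, 09:30:54 Controller ID: 0 Error: Port 0 - 3:1:12 (Error 2) 10193",
    "268 [Warning, 1] 2012-07-24, 19:48:38 Controller ID: 0 PD Reset: PD = Port 0 - 3:1:22, Error = 3, Path = 0x5001E6739EA4AFF6 8546",
    "102 [Critical, 2] 2012-07-24, 19:48:27 Controller ID: 0 Rebuild failed due to target drive error: PD Port 0 - 3:1:22 8544",
]

for drive, by_severity in sorted(tally_by_drive(SAMPLE).items()):
    print(drive, dict(by_severity))
```

If the warnings pile up on one identifier, a single disk or slot is the likelier suspect. If they are spread evenly across drives on different backplanes, that points toward something shared, such as the adapter, expander, or cabling.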

gjs278 · Jul 30, 2012

keep going until you know whether it's the card or not. I would not be surprised if the card had a problem. it's not unheard of at least, especially when you're having multiple drive errors like that.
