Compellent failure

adoch · Aug 2, 2012

So last year we decided to upgrade our HDS AMS500 SAN (totalling about 50TB in a mix of SATA and SAS) to what we thought would be the next version of their product, the AMS2100. We also had Dell knocking on our door about Compellent. It sounded great: they came in at a price point well under what Hitachi could offer, and with features (automated tiering) that the Hitachi guys couldn't offer without going to the USP range, which was well above us.

The upgrade went in well. We migrated a few things over and performance was great. Watching the automated tiering was a bit scary at first, but after a couple of months of letting it do its thing, I had no issues letting it do what it does best. We slowly migrated the ESXi cluster over to the Compellent, and it's been running like that for the last few months without issue.

Fast forward to last night: at about 10:15, VMware sent me some emails about lost redundancy (which I failed to look at). When I arrived the next morning, I found that the shit had hit the fan.
Both controllers were down; I couldn't ping either and there was no web interface. Looking at the box, the fibre channel was down and the network had some lights.
I contacted Co-Pilot, and from what I could gather, both controllers went into a reboot loop complaining about failed controllers and then sat in SafeMode until the problem was resolved.

The problem? Well, that's still to come. I have uploaded the core dumps and will hopefully get an answer tomorrow.

It's quite frustrating: we made sure the solution we accepted was redundant in theory, and then this happens to both controllers at the same time. Not really sure what else to say.
Is this a rant? Probably, sorry for that. I'm not asking for help on what to do, I'm sure it will be handled by Dell. I just felt the need to type this one out.

dedobot · Aug 2, 2012

Sorry to hear that, hope the Dell guys solve it quickly.
I was in the same situation, but on a smaller scale. The G-Speed FC XL 16 RAID is supposed to have redundant PSUs, at least all the documentation claims that, BUT in the real world they're not redundant: when one fried a year ago, the whole storage went into safe mode and was unreachable. It cost me a few days of downtime.

adoch · Aug 2, 2012

Well, it's all "solved" now, just no root cause has been provided yet. The part that really annoys me, and will cost me in the future, is that this will shake my managers' (and other team members') confidence in virtualisation. I've been saying for years to anyone who listens that VMware is safe and there's no more risk than with a normal server.

msitpro · Aug 2, 2012

adoch said:
Well, it's all "solved" now, just no root cause has been provided yet. The part that really annoys me, and will cost me in the future, is that this will shake my managers' (and other team members') confidence in virtualisation. I've been saying for years to anyone who listens that VMware is safe and there's no more risk than with a normal server.

SAN storage has been around much longer than the recent drive towards virtualisation.

The virtualisation itself didn't fail.

adoch · Aug 2, 2012

I know, virtualisation did no more or less than expected. I guess what I mean is that it's hard to explain the difference to people and have them understand what actually failed. This has not affected my confidence in virtualisation, because it had nothing to do with it! It sounds so stupid, me saying that it will hurt virtualisation, but I know it will (for my workplace).

blinkenflamingo · Aug 2, 2012

My first question is, why didn't the Compellent dial home/you? This could have been solved before the day started.

adoch · Aug 2, 2012

It tried, but an unrelated incident (backhoe > fiber) left the mail queued up.
Edit: actually no, it didn't. VMware did alert me, but the Compellent didn't say anything. I guess it can't communicate in safe mode? I'd have to double check.

blinkenflamingo · Aug 2, 2012

Depending on the type of failure, the system should have dialed home. If that didn't happen, then you'd have to rely on other warnings (VMware alerts, SAN alerts, in the EMC + ESRS case: ESRS device offline, ...).

msitpro · Aug 2, 2012

adoch said:
I know, virtualisation did no more or less than expected. I guess what I mean is that it's hard to explain the difference to people and have them understand what actually failed. This has not affected my confidence in virtualisation, because it had nothing to do with it! It sounds so stupid, me saying that it will hurt virtualisation, but I know it will (for my workplace).

I know. Perception is key.

adoch · Aug 3, 2012

Got the root cause back; they were pretty quick to diagnose it. The phone-home process was running through a Squid proxy. While our internet connection was down, Squid would respond with its standard DNS-failure error page. This page was 1135 bytes in size.
Our version of the firmware had a bug that would crash the controller if it received a page over 1024 bytes, sending it into an endless reboot loop.

Scary shit! If anyone has 5.5.3 or earlier and uses a proxy, don't.
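
For anyone wondering what guarding against this class of bug might look like: the post only says that a reply over 1024 bytes crashed the controller, so the following is a minimal Python sketch of the general defensive pattern, not Dell's actual fix or firmware code. The names `MAX_REPLY_BYTES` and `read_capped` are hypothetical, and the 1024 limit is simply the figure quoted above. The idea is to read at most a fixed number of bytes from whatever answers the request and to flag anything larger, rather than assuming the reply fits.

```python
import io

# Assumed limit, borrowed from the 1024-byte figure quoted in the post.
MAX_REPLY_BYTES = 1024

def read_capped(stream, limit=MAX_REPLY_BYTES):
    """Read at most `limit` bytes from a binary stream.

    Returns (data, oversized): `oversized` is True when the reply
    held more than `limit` bytes and was cut short.
    """
    # Ask for one extra byte so an oversized reply can be detected.
    data = stream.read(limit + 1)
    oversized = len(data) > limit
    return data[:limit], oversized

# Stand-in for the 1135-byte proxy error page mentioned in the post.
fake_error_page = io.BytesIO(b"x" * 1135)
body, oversized = read_capped(fake_error_page)
print(len(body), oversized)
```

With the 1135-byte stand-in, the caller gets back 1024 bytes plus a flag saying the reply was too big, and can log or discard it instead of falling over.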

Dell_Sam_L · May 9, 2013

Hello,
Sorry about your issue, and glad that we were able to resolve it and let you know what the root cause was. If you have any other issues in the future, you can also reach us on the Dell forums and we can assist there as well. http://en.community.dell.com/support-forums/storage/f/4427.aspx