
Compellent failure

adoch (Weaksauce, joined Oct 19, 2010, 77 messages)
So we decided to upgrade our HDS AMS500 SAN last year (about 50TB in total, in a mix of SATA and SAS) to what we thought would be the next version of their product, the AMS2100. We also had Dell knocking on our door about Compellent. It sounded great: they came in at a price point well under what Hitachi could offer, and with features (automated tiering) that the Hitachi guys couldn't offer without going to the USP range, which was well above us.

The upgrade went in well. We migrated a few things over and performance was great. Watching the automated tiering was a bit scary at first, but after a couple of months of letting it do its thing, I had no issues letting it do what it does best. We slowly migrated the ESXi cluster over to the Compellent, and it's been running like that for the last few months without issue.



Fast forward to last night: at about 10:15, VMware sent me some emails about lost redundancy (which I failed to look at). When I arrived the next morning, I found that the shit had hit the fan.
Both controllers were down. I couldn't ping either one and there was no web interface. Looking at the box, the fibre channel was down and the network had some lights.
I contacted Co-Pilot, and from what I could gather both controllers went into a reboot loop complaining about failed controllers, and then sat in SafeMode until the problem was resolved.

The problem? Well, that's still to come. I have uploaded the core dumps and will hopefully get an answer tomorrow.

It's quite frustrating. We made sure that the solution we accepted was redundant in theory, but for this to happen to both controllers at the same time... not really sure what else to say.
Is this a rant? Probably, sorry for that. I'm not asking for help on what to do, I'm sure it will be handled by Dell. I just felt the need to type this one out.
 
Sorry to hear that, hope the Dell guys will solve it quickly.
I was in the same situation, but on a smaller scale. The G-Speed FC XL 16 RAID is supposed to have redundant PSUs, at least all the papers claim that, BUT in the real world they're not redundant: when one fried a year ago, the whole storage went into safe mode and was unreachable. Cost me a few days of downtime.
 
Well, it's all "solved" now, just no root cause has been provided yet. The part that really annoys me, and will be costing me in the future, is that this will shake my managers' (and other team members') confidence in virtualisation. I've been saying for years to anyone who listens: VMware is safe, there's no more risk than a normal server.
 
Well, it's all "solved" now, just no root cause has been provided yet. The part that really annoys me, and will be costing me in the future, is that this will shake my managers' (and other team members') confidence in virtualisation. I've been saying for years to anyone who listens: VMware is safe, there's no more risk than a normal server.

SAN storage has been around much longer than the recent drive towards virtualisation.

The virtualisation itself didn't fail.
 
I know, virtualisation did no more or less than expected. I guess what I mean is that it's hard to explain the difference to people and have them understand what actually failed. This has not affected my confidence in virtualisation because it had nothing to do with it! It sounds so stupid, me saying that it will hurt virtualisation, but I know it will (for my workplace).
 
My first question is, why didn't the Compellent dial home/you? This could have been solved before the day started.
 
It tried, but an unrelated incident (backhoe>fiber) left the mail queued up.
Edit: actually no, it didn't. VMware did alert me, but the Compellent didn't say anything. I guess it can't communicate in safe mode? I'd have to double check.
 
Depending on the type of failure, the system should have dialed home. If that didn't happen then you'd have to rely on other warnings (VMware alerts, SAN alerts, in the EMC + ESRS case: ESRS device offline, ...).
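
As a rough illustration of "other warnings": an out-of-band check that runs on a separate host and simply tests whether each controller's management port still accepts TCP connections. This is a minimal sketch, not anything Compellent or EMC ships; the function names `is_reachable` and `down_targets` are made up for the example, and the addresses in the usage note are placeholders.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def down_targets(targets, timeout=3.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [t for t in targets if not is_reachable(t[0], t[1], timeout)]
```

Called from cron or a monitoring agent with something like `down_targets([("192.0.2.10", 443), ("192.0.2.11", 443)])` (placeholder addresses), a non-empty result means a controller stopped answering. The alert then needs a path that does not depend on the SAN or on the same internet link.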
 
I know, virtualisation did no more or less than expected. I guess what I mean is that it's hard to explain the difference to people and have them understand what actually failed. This has not affected my confidence in virtualisation because it had nothing to do with it! It sounds so stupid, me saying that it will hurt virtualisation, but I know it will (for my workplace).

I know. Perception is key. :)
 
Got the root cause back; it was pretty quick for them to diagnose. The phone home process was running through a Squid proxy. While our internet connection was down, Squid would respond with its standard DNS failure error page. This page was 1135 bytes in size.
Our version of the firmware had a bug in it that would crash the controller if it received a page over 1024 bytes, sending it into an endless reboot loop.

Scary shit! If anyone has 5.5.3 or earlier and uses a proxy, don't.
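
For anyone wondering what the defensive version of this looks like: below is a minimal Python sketch of a bounded read, where an oversized reply counts as a failed attempt rather than something that takes the process down. It is only an illustration of the general technique, not the Compellent firmware or its fix. The names `read_bounded`, `phone_home` and `ResponseTooLarge` are invented for the example, and the 1024 limit is just the figure from the root cause above.

```python
import io

MAX_BODY = 1024  # assumed limit, taken from the 1024-byte figure above


class ResponseTooLarge(Exception):
    """Raised when a reply is bigger than the caller is prepared to handle."""


def read_bounded(stream, limit=MAX_BODY):
    """Read at most `limit` bytes from a file-like object, or raise."""
    # Ask for one byte more than the limit so an oversized body is detectable.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise ResponseTooLarge(f"reply exceeds {limit} bytes")
    return data


def phone_home(stream):
    """Hypothetical phone-home step: an oversized reply is a failed attempt."""
    try:
        return read_bounded(stream)
    except ResponseTooLarge:
        # Give up on this attempt and let the caller retry later.
        return None
```

With this, `phone_home(io.BytesIO(b"x" * 1135))` returns `None` and the caller carries on, while a reply of 1024 bytes or less comes back unchanged. A real socket can return short reads, so production code would loop until the limit or end of stream is reached.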
 