Help with ESX 5.5 and dying HDDs

Karandras

[H]ard|Gawd
Joined
Feb 16, 2001
Messages
1,873
Hey,

So at my work we have a 4-node ESX 5.5 cluster on R710s. Each node has local drives (this one has 4x 1TB SATA in RAID 10) and dual GigE to a 3220i with 24x 500GB drives.
Yes, I know it's all old, but it's still running well, and we're hoping to move onto a 4-node C6220 later this year.
Anyway, the short story is that the BMC or Lifecycle Controller or DRAC (whichever controls the front panel) has stopped reporting errors to VMware. It also doesn't show errors via the system light or front panel.
I had the front bezel on, which obscured my view of the HDD lights. While troubleshooting why the ESX hardware tab wasn't showing everything, I pulled the bezel off to try the keys on the control panel and noticed that two HDDs were in a failed state.
I'm assuming it's one failed drive from each mirrored pair of the RAID 10; if both failures were on the same mirror, the array would already be gone, and this would be a much less happy situation.
Yes, we have nightly backups of the VMs that are on the local drives, but a restore would probably take 7-8 hours over GigE.

I've Storage vMotioned all but two of the VMs off the local drives to the 3220i, but the two that are left are large.
The failing drives cause latency spikes, up to 20,000-120,000ms, about every 30 minutes. When a spike hits, the Storage vMotion fails with the error "A general system error occurred: The source detected that the destination failed to resume."
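(Side note: you can watch those spikes live from the ESXi shell with esxtop. Press u for the per-device disk view; the DAVG/cmd column shows device latency in ms, so the spikes are hard to miss.)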
One of the VMs I could move manually through SSH; that route seems to handle the latency spikes much more gracefully (rough commands sketched after the list):
-SSH to the host.
-Stop the VM.
-Remove the VM from inventory.
-Go to the local storage directory.
-Move the VM directory to the 3220i.
-Wait forever.
-Once moved, re-add it to inventory.
-Start the VM again.
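For anyone who wants the actual commands, this is roughly what that looks like from the ESXi shell (the datastore names, VM name, and VM ID below are all made up; check your own with getallvms):

vim-cmd vmsvc/getallvms                                        # find the VM's ID
vim-cmd vmsvc/power.off 42                                     # stop the VM (42 = its ID from above)
vim-cmd vmsvc/unregister 42                                    # remove it from inventory
mv /vmfs/volumes/local-ds/bigvm /vmfs/volumes/san-ds/bigvm     # the "wait forever" step
vim-cmd solo/registervm /vmfs/volumes/san-ds/bigvm/bigvm.vmx   # re-add it to inventory (it gets a new ID)
vim-cmd vmsvc/power.on 43                                      # start it again using the new ID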

That process took 45 minutes for a 60GB VM. The VMs I still need to move are 350GB and 700GB. The 350GB VM I can shut down and move using the above method; it'll take about 4 hours.
The other VM is a client VM, and I'm guessing the client doesn't want it shut down for 8+ hours while I move it via copy/paste.
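(Back-of-envelope from the 60GB test: 60GB in 45 minutes is roughly 1.3GB/min, so about 4.5 hours for the 350GB VM and close to 9 hours for the 700GB one at the same rate, which is where those estimates come from.)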

Now you know the back story.
Is there a way to let vMotion move the VM without timing out during the spike? Similar to the SSH method but keeping the VM running?
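For what it's worth, searching on that exact "destination failed to resume" error mostly turns up suggestions to raise the fast suspend/resume switchover timeout in the VM's .vmx, something like this (the value is just an example, not a recommendation):

fsr.maxSwitchoverSeconds = "900"

I'll try cranking that up.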

Thanks in advance for all your help and your time to read my problem.
 
Bah, nope. No matter how high I set those numbers it times out at that same point...
 
Do you not have replacement drives to swap in for at least one of the failed ones and try a rebuild? RAID 10 only has to copy the existing data from the surviving mirror, unlike RAID 5, where it has to literally read every sector of the remaining drives to rebuild parity.
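(Rough math on that, assuming those SATA drives can sustain somewhere around 100MB/s: copying a 1TB mirror is about 1,000,000MB / 100MB/s = 10,000 seconds, call it 3 hours per drive under ideal conditions. A failing source drive would stretch that out, but it's still far quicker than a parity rebuild.)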

Are these enterprise-grade drives or off-the-shelf drives?

Also, any reason why you're not at least on ESXi 6 or 6.5? (I think those are supported on R710s.)
 
We do have replacement drives. My fear is that as soon as the rebuild starts onto a newly inserted drive, the other failing drive will die and take the whole array with it, so I'm trying to get all the VMs off those drives and onto shared storage first.

We use Dell branded drives in these servers.

A cluster rebuild/upgrade has been on my to-do list for a long time, but the cluster has been running strong, so it keeps getting pushed back. :( I'm not sure a newer version would help in this situation anyway; it's the hardware failures causing all this grief.
 
Have you tried an internal vMotion to another drive within the same system?
If it were me, I would crack the box open and put in an NVMe drive. An M.2 NVMe to PCIe 3.0 x4 adapter is about $12 on Amazon.
I would pair it with a Samsung 960 or 970 Pro drive. I've had no issues with those; with a non-Samsung drive on ESXi 6.7, I had to downgrade the NVMe driver.
Then move the VM to that however you can, either vMotion or a manual move.
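Assuming the driver actually loads, the card should show up from the host shell with the stock esxcli commands:

esxcli storage core adapter list   # the NVMe controller should appear as an adapter
esxcli storage core device list    # and the drive itself as a device

(Fair warning: NVMe support on 5.5 is a lot dicier than on 6.x, so treat that as a maybe.)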

There is also software called Veeam; it has Instant Recovery, but it's expensive, and to work well it needs SSDs. I use it when I'm in a pinch.
Veeam method:
Do a full backup (the system runs like normal during this, and hopefully doesn't time out here; you can use the free version to test this stage before buying the expensive one).
Do an incremental.
Shut down.
Do another incremental (should be really fast, since it only grabs whatever changed since the last incremental).
Do an Instant Recovery.
Downtime should be under 20 minutes.
vMotion the Instant Recovery VM somewhere permanent.
*Obviously Veeam would not be installed on the system having issues.
 
Hey,
So, solved this with Veeam. Since we have a valid VMware license, Veeam was able to do a Quick Migration from one datastore to another. Even with the I/O errors every 30 minutes, it was still able to migrate.
Thanks everyone for the suggestions!
 