Reconfigure virtual machine taking forever

Riccochet

ESXi 5.5, Cisco blade hosts, K2 all-flash storage. Just recently, doing something as mundane as adding space to a vdisk is now taking 34-50 minutes to complete. This used to take seconds, if that. The progress bar does move, but damn. It shouldn't take this long, and it never has before.

Anyone see this before, and is there a fix?
 
I sort of feel it's because of ESXi 5.5.

I stopped using it because my overall performance on that version was slow, and when I went back to a Windows server hosting my VMs, they sped up considerably.
 
Could be a ton of things - are you eager zeroing the disks? Thin extension? Lazy zeroed? FC or iSCSI?
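If you're not sure how a disk is provisioned, something like this from the host shell should tell you - the paths are just placeholders, and as far as I remember thin disks carry a thinProvisioned flag in the descriptor:

    # substitute your real datastore/VM paths
    grep -iE "createType|thinProvisioned" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmdk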
 
I think you need to chop the problem up a bit because it's too broad.

Try the following to narrow down:
- how's reconfigure performance when no domains are launched except the hypervisor?
- does adding more running VMs produce a disproportionate amount of lag?
- does a newly created vdisk and domain exhibit the same slowness?
- is there anything in the ESXi logs? In particular, turn your domains off, reboot the hypervisor, create just one example VM and run it, then look at the log (example commands below).
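For that last one, roughly something like this - the log paths are standard, everything else is just an example:

    # in one SSH session, with only the single test VM powered on:
    tail -f /var/log/vmkernel.log
    # in a second session (hostd handles the reconfigure task itself):
    tail -f /var/log/hostd.log
    # then kick off the extend from the client and watch where it stalls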
 
It's a Kaminario K2 on FC. Disks are thin provisioned.
This is only metadata updates then.

That leaves buffer cache issues (how large are the volumes and how many VMs?), ATS/reservation failures, or bad heartbeat regions.

Grep through /var/log/vmkernel.log for anything with "error" or "warning" right after trying it. Anything turn up? Also, run vmkfstools -V and see if anything shows up (or whether that command takes forever - it should be a second or two at most).
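Something along these lines, right after a slow extend attempt - the grep terms are just a starting point:

    # look for storage errors, ATS/reservation complaints, heartbeat trouble
    grep -iE "error|warn|ats|reservation|heartbeat" /var/log/vmkernel.log | tail -n 50
    # rescan/refresh VMFS volumes - should come back in a second or two
    date; vmkfstools -V; date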
 
Also could be lost frames I guess, but that should be absurdly hard on FC. What are the hosts/FC cards?
 
Or last, hostd/vpxa never reporting back. Rare but possible. See again: hosts.
 
Cisco UCS B260 hosts. Not sure on the FC cards. Currently have roughly 67 VMs. Volumes on the K2 are all 15 TB, with one 5 TB volume.

I'll have to dig through logs. It's strange that this just started happening.
 
With that few VMs you're not hitting buffer cache unless they're all linked clones from the same base, and even then they'd have to be doing insane amounts of metadata updates to have any effect. That's probably out. It's most likely a communication problem of some kind - which FNIC driver version are you on, and which of the VIC cards are you on - can you find out? Also, you're not doing FCoE all the way through, are you?
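If you can get shell access to one of the hosts, something like this should pull the HBA model and fnic driver version - vmhba numbers and exact VIB names will vary:

    # list the storage adapters - Cisco VIC FC HBAs show up bound to the fnic driver
    esxcli storage core adapter list
    # installed fnic driver package
    esxcli software vib list | grep -i fnic
    # loaded module version
    vmkload_mod -s fnic | grep -i version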
 
I honestly don't know. We don't manage or own any of the equipment other than the K2.

Now that you mention clones, yes, the VMs having this issue are clones of another VM. I can't see that being such an issue since maybe 6 of the VMs are clones of a parent.
 
Linked clone != clone. A linked clone is something unique.

Need to know what's in the logs to do more.
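Quick way to tell which kind you have, assuming shell access (paths are placeholders): a base disk or full clone shows parentCID=ffffffff and no parentFileNameHint, while a linked clone or snapshot delta points back at its parent disk.

    grep -iE "parentCID|parentFileNameHint" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmdk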
 
I've seen something like this on a VM with a history of failed snapshot backups that badly needed consolidation.
 
Also a possibility, but it should have freaked at extending a drive with snaps on it - invalidates the tree.
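Easy enough to rule out from the host shell - the VM ID and paths below are made up:

    # find the VM's ID, then dump its snapshot tree
    vim-cmd vmsvc/getallvms
    vim-cmd vmsvc/snapshot.get 42
    # leftover delta files in the VM folder are another sign consolidation is needed
    ls /vmfs/volumes/YourDatastore/YourVM/*-delta.vmdk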
 
It turned out to be a config file issue. Earlier in the month we decommissioned some 2012 R2 Standard servers and attached their data vmdks to 2012 R2 DCE servers. All was good in the hood, or so we thought, since we could see and access those vmdks. But the config file for the VM wasn't pointing to those vmdks properly. Add space to one of the drives that was created with the VM and no issue; add space to one of the vmdks we attached and it would bomb out for over an hour. Which is odd since we decom'd 9 Standard servers, created 5 new DCE servers and attached vmdks to 13 different servers. Only one was having this issue.

Props to the guys at Greencloud for figuring that one out. Finally got it all sorted out around 4:00 am this morning.
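For anyone hitting the same thing, a quick sanity check on how the vmx references its disks (VM path here is a placeholder) - disks sitting on another datastore should show up as full /vmfs/volumes/<UUID>/... paths, and a stale reference there is the sort of thing that can hang a reconfigure task:

    # list every disk reference in the VM's config file
    grep -i "fileName" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmx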
 
Huh, that shouldn't be an issue...? I'm curious how it referenced them... unless you did it by hand, hostd would have used the VMFS UUID in a complete path to attach the drives?
 
I didn't delve too deeply last night since we were all exhausted and happy for things to be running smoothly. 4 hours past the end of our maintenance window. lol Waiting on their official incident report.
 
Well, is the disk set to thick eager, rather than thick lazy or thin like the previous disks? Thick eager is going to take a lot longer.
 
Not on any modern VAAI equipped system, especially a deduplicating one. That'll all be in CPU in 64MB chunks with 64-128 in the queue... yeah, it'll go fast.
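If you want to confirm the offload is actually enabled on the K2 LUNs, this shows the VAAI primitive status per device - device IDs will obviously differ:

    # ATS / Clone / Zero / Delete support per LUN
    esxcli storage core device vaai status get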
 