Reconfigure virtual machine taking forever

Riccochet

ESXi 5.5, Cisco blade hosts, K2 all-flash storage. Just recently, doing something as mundane as adding space to a vdisk is now taking 34-50 minutes to complete. This used to take seconds, if that. The progress bar does move, but damn. It shouldn't take this long, and it never has before.

Anyone see this before, and is there a fix?
 
I sort of feel it's because of ESXi 5.5.

I stopped using it because my overall performance on that version was slow, and when I went back to a Windows server hosting my VMs, they sped up considerably.
 
Could be a ton of things - are you eager zeroing the disks? Thin extension? Lazy zeroed? FC or iSCSI?
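If you're not sure how a disk is provisioned, something like this from the host shell should tell you - the paths are just placeholders, and as far as I remember thin disks carry a thinProvisioned flag in the descriptor:

    # substitute your real datastore/VM paths
    grep -iE "createType|thinProvisioned" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmdk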
 
I think you need to chop the problem up a bit because it's too broad.

Try the following to narrow down:
- how's reconfigure performance when no domains are launched except the hypervisor?
- does adding more running VMs produce a disproportionate amount of lag?
- does a newly created vdisk and domain exhibit the same slowness?
- is there anything in the ESXi logs? In particular, turn your domains off, reboot the hypervisor, create just one example VM and run it, then look at the log (example commands below).
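For that last one, roughly something like this - the log paths are standard, everything else is just an example:

    # in one SSH session, with only the single test VM powered on:
    tail -f /var/log/vmkernel.log
    # in a second session (hostd handles the reconfigure task itself):
    tail -f /var/log/hostd.log
    # then kick off the extend from the client and watch where it stalls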
 
It's a Kaminario K2 on FC. Disks are thin provisioned.
This is only metadata updates then.

That leaves buffer cache issues (how large are the volumes and how many VMs?), ATS/reservation failures, or bad heartbeat regions.

Grep through /var/log/vmkernel.log for anything with "error" or "warning" right after trying it. Anything turn up? Also, run vmkfstools -V and see if anything shows up (or whether that command takes forever - it should be a second or two at most).
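Something along these lines, right after a slow extend attempt - the grep terms are just a starting point:

    # look for storage errors, ATS/reservation complaints, heartbeat trouble
    grep -iE "error|warn|ats|reservation|heartbeat" /var/log/vmkernel.log | tail -n 50
    # rescan/refresh VMFS volumes - should come back in a second or two
    date; vmkfstools -V; date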
 
Also could be lost frames I guess, but that should be absurdly hard on FC. What are the hosts/FC cards?
 
Or last, hostd/vpxa never reporting back. Rare but possible. See again: hosts.
 
Cisco UCS B260 hosts. Not sure on the FC cards. Currently have roughly 67 VMs. Volumes on the K2 are all 15 TB, with one 5 TB volume.

I'll have to dig through logs. It's strange that this just started happening.
 
With that few VMs you're not hitting buffer cache unless they're all linked clones from the same base, and even then they'd have to be doing insane amounts of metadata updates to have any effect. That's probably out. It's most likely a communication problem of some kind - which FNIC driver version are you on, and which of the VIC cards are you on - can you find out? Also, you're not doing FCoE all the way through, are you?
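If you can get shell access to one of the hosts, something like this should pull the HBA model and fnic driver version - vmhba numbers and exact VIB names will vary:

    # list the storage adapters - Cisco VIC FC HBAs show up bound to the fnic driver
    esxcli storage core adapter list
    # installed fnic driver package
    esxcli software vib list | grep -i fnic
    # loaded module version
    vmkload_mod -s fnic | grep -i version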
 
I honestly don't know. We don't manage or own any of the equipment other than the K2.

Now that you mention clones, yes, the VMs having this issue are clones of another VM. I can't see that being such an issue since maybe 6 of the VMs are clones of a parent.
 
Linked clone != clone. A linked clone is something unique.

Need to know what's in the logs to do more.
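Quick way to tell which kind you have, assuming shell access (paths are placeholders): a base disk or full clone shows parentCID=ffffffff and no parentFileNameHint, while a linked clone or snapshot delta points back at its parent disk.

    grep -iE "parentCID|parentFileNameHint" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmdk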
 
I've seen something like this on a VM with a history of failed snapshot backups that badly needed consolidation.
 
Also a possibility, but it should have freaked at extending a drive with snaps on it - invalidates the tree.
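Easy enough to rule out from the host shell - the VM ID and paths below are made up:

    # find the VM's ID, then dump its snapshot tree
    vim-cmd vmsvc/getallvms
    vim-cmd vmsvc/snapshot.get 42
    # leftover delta files in the VM folder are another sign consolidation is needed
    ls /vmfs/volumes/YourDatastore/YourVM/*-delta.vmdk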
 
It turned out to be a config file issue. Earlier in the month we decommissioned some 2012 R2 Standard servers and attached their data vmdks to 2012 R2 DCE servers. All was good in the hood, or so we thought, since we could see and access those vmdks. But the config file for the VM wasn't pointing to those vmdks properly. Add space to one of the drives that was created with the VM and no issue; add space to one of the vmdks we attached and it would bomb out for over an hour. Which is odd since we decom'd 9 Standard servers, created 5 new DCE servers and attached vmdks to 13 different servers. Only one was having this issue.

Props to the guys at Greencloud for figuring that one out. Finally got it all sorted out around 4:00 am this morning.
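For anyone hitting the same thing, a quick sanity check on how the vmx references its disks (VM path here is a placeholder) - disks sitting on another datastore should show up as full /vmfs/volumes/<UUID>/... paths, and a stale reference there is the sort of thing that can hang a reconfigure task:

    # list every disk reference in the VM's config file
    grep -i "fileName" /vmfs/volumes/YourDatastore/YourVM/YourVM.vmx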
 
Huh, that shouldn't be an issue...? I'm curious how it referenced them... unless you did it by hand, hostd would have used the VMFS UUID in a complete path to attach the drives?
 
I didn't delve too deeply last night since we were all exhausted and happy for things to be running smoothly. 4 hours past the end of our maintenance window. lol Waiting on their official incident report.
 
Well, is the disk set to thick eager, rather than thick lazy or thin like the previous disks? Thick eager is going to take a lot longer.
 
Not on any modern VAAI equipped system, especially a deduplicating one. That'll all be in CPU in 64MB chunks with 64-128 in the queue... yeah, it'll go fast.
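If you want to confirm the offload is actually enabled on the K2 LUNs, this shows the VAAI primitive status per device - device IDs will obviously differ:

    # ATS / Clone / Zero / Delete support per LUN
    esxcli storage core device vaai status get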
 