Migrated VM to new host, now bad performance

Riccochet

As the title says. Last night we migrated a VM from one host to another. Both hosts are HP 580 G7's. The only difference is that the old host is using E7-4850's and the new host is using E7-4830's. Allocated the same 4 vCPU's and bumped the memory up from 24 to 32 GB. Now that VM is showing near 100% CPU usage in Windows Server 2008 R2 performance monitor, where on the old host it would sit in the 40-60% range during the day/peak usage times. vCenter is showing all 8,524 MHz available, but only 28% max usage. Something, somewhere, is not jibing. I can't see why migrating the VM to another host would have caused this. Completely stumped, and complaints about performance are rolling in. lol :(

Help me [H]ard ones!
 
The 4830 has two fewer cores and four fewer threads. Maybe you need to lower the number of vCPU's.
 
Fire up esxtop and look for %CSTP and %RDY. If both are high, then you have too many vCPUs.
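
If you're pulling the number from vCenter's real-time chart instead of esxtop, it reports CPU Ready as a summation in milliseconds over a 20-second sample, so a back-of-the-envelope conversion is:

CPU Ready % = (summation in ms / 20,000) x 100

e.g. a 1,000 ms summation works out to 5%. A commonly quoted rule of thumb is to start worrying somewhere around 5% per vCPU.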
 
Confirm those calculations. I'd say it's a %RDY issue.

I calculated at 19ms. That jumped a little, but not enough to be of concern. Unless I'm looking at the wrong metric.
 
How much RAM do you have "per physical processor" between the two?
You could have a NUMA issue if you have less than 32 GB of RAM on a single physical CPU.
I don't think it's a CPU issue unless your overcommit ratios are really high.

Nick
Http://www.pcli.me
 
I have 32 GB allocated to the VM. SQL consumes most of it. This is a database server.

It was fine running with 24 GB of RAM on the other host. Bumped it up because, why not. We plan on adding a few more databases to this VM. Maybe not now unless we can get this situated.

I'm going to drop it to 2 vCPU's tonight and see what happens.
 
As nicholasfarmer said, there could be a performance drop if your host has e.g. 2 sockets with 24GB of RAM each (48GB total) and a single VM's RAM exceeds the amount a single socket can provide. In this example the additional 8GB for your VM would come from the second CPU socket, "further" away from the VM's process running on the other socket. Since you mentioned your hosts are HP 580 G7's, I suppose they're better equipped, but NUMA could still be an issue depending on your overall workload.

Just out of curiosity, did you check the size of your Windows page file or the free space on your (OS) partition? It once happened to me that I changed the RAM on a VM and Windows adjusted the pagefile so that the OS partition became clogged and some services weren't able to start up. But the database on another partition started fine. ;)

Can you find out which processes cause the 100% CPU usage? Is it your SQL processes that are responsible for the normal 40-60% that cranked up to 100% or are there other processes or applications involved?
 
NUMA was going to be my next place to look as well. If you only have 24GB per processor and you have a 32GB VM, you're busting NUMA nodes. You can check that in esxtop as well.
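
Going from memory, so double-check the exact field names, but roughly: in esxtop press m for the memory screen, then f and toggle on the NUMA STATS field. NHN shows the VM's home node(s) and N%L shows what percentage of its memory is node-local; if N%L is well under 100, the VM is pulling remote memory.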
 
The host machine has 384 GB of RAM total. It's mainly the SQL process that's hitting 100% CPU usage. Checked the paging file size and it's not causing the OS partition to become full. Still about 50GB free on the partition.
 
Did your VM end up in a (different) resource pool with limits set?
Or were there any changes to your vCPU settings, like going from 2 sockets/2 cores to 1 socket/4 cores, so SQL might behave differently? Perhaps some SQL and/or Windows digging can reveal a hint about where the additional load is coming from. That might help to figure out whom to blame: VMware, Windows, or SQL.
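
If you want a quick sanity check on what SQL itself sees after the move, this standard DMV query (just an example) shows the logical CPU and scheduler counts:

SELECT cpu_count, hyperthread_ratio, scheduler_count
FROM sys.dm_os_sys_info;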
 
You can also open up the SQL activity monitor and see what in SQL is causing the load. Out of curiosity, does anything change if you change it back to 24GB RAM?
 
It did move from a 4socket/40core host to a 2socket/16core host. Aside from an Acronis appliance, the database server is the only VM on this new host.

I did change it to 2 vCPU last night and it was horrible. Usage hit 100% and stayed there. I'm stumped. And it's only the SQL service that's causing the load.
 
What type of storage are you using? iSCSI, Fibre Channel, or NFS?

There's the possibility that you have bad firmware, drivers, or cabling, which would lead to I/O contention that mimics high CPU usage and bad performance. If the CPU has to wait for I/O, it spikes!

What is the latency on the storage you are running, from the host's perspective?

Also, if you migrate it back to the old host, does the issue stop?
 
Just a general troubleshooting suggestion, as I don't know what your entire configuration is.

I'm assuming you have many applications accessing the database server. Do you have the ability to disable said apps and bring them up one at a time to see if one particular app is causing CPU usage to become pegged?

Based on your specs, it's not NUMA considering you're at 192GB per CPU. The CPUs have >4 cores so it's not that.

While it's unlikely it's the Acronis VM, have you tried stopping it?
 
Also, if you migrate it back to the old host, does the issue stop?

What he said. Can you at least open up the Activity Monitor in SQL and see what in SQL is causing the hung process? If you bring the server up in safe mode, does that change anything? Are you running a different version of ESXi on the new host? Perhaps VMware Tools needs to be updated?
 
Can't move it back to the old host. The datastore it was sitting on was damn near full, which is why we moved it off of that host.

Starting to think it's a disk I/O issue. Gathering some data now.
 
If it's SQL processing, have you had your DBA check on the SQL instance(s)? They should be able to tell you what types of waits they're experiencing, which will help you determine whether it's a compute or storage issue.
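
If you want to pull it yourself while you wait on the DBA, the cumulative waits are sitting in a standard DMV (the exclusion list here is just a minimal example, trim or extend it as needed):

SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;

Lots of PAGEIOLATCH_* time points at storage; lots of SOS_SCHEDULER_YIELD or a high signal wait share points at CPU pressure.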
 
Well dang! There was a vMotion and a svMotion too!? No telling what it could be then without looking into the SQL activity monitor in the VM.
 
The SSMS activity monitor is showing buffer I/O waits of 100-800 ms with a cumulative of 24000, and lock waits of 0-100 ms with a cumulative of 12000. Don't know if that's bad or what. Everything else looks good.
 
What is the difference between the two storage devices?
I'm assuming the datastores you are using are local only?
Also assuming, since you referred to size, that they are spinning disk?

Throw us the numbers and we can help calc the difference.

For a test, you can SSH into the ESX host, run esxtop, and check out GAVG for the VM.
SSH into the host and type esxtop (enter)
Hold Shift and press V - this will show you the CPU stats for each VM.
> s 2 (enter) to update the view interval
> v (enter) to see the disk stats per VM (IOPS, MBps read and write, latency)
> u (enter) to see the DAVG/KAVG/GAVG stats per LUN
> d (enter) to see the adapter stats, to check whether your local adapter is having issues.

You can match the NAA numbers with your LUN in the VIC.


I hope this helps!
Nick
 
Easiest way to address I/O issues in SQL is to add RAM. I'm assuming you're running SQL Standard, so give that VM 72GB of RAM, set SQL's max server memory to 64GB (the max for Standard edition), and see where you're at.
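
Something like this, if you go that route (65536 MB is just 64 GB expressed in MB; adjust to whatever you settle on):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 65536;
RECONFIGURE;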

Also, are those numbers massively different from before your move? A baseline will help you determine what's gone out of whack.
 
Renixing: That might have an effect, but it might only mask the real issue. He had no issues on the old box; he migrated to a new box with new storage and new hardware, and it shouldn't have the problems it's having unless there's some sort of constraint.

Please elaborate on what Nick was asking: what's the difference between storage A and storage B? RAID levels? SAN enclosures? Fibre Channel vs. iSCSI vs. NFS? Need more details...
 
Oh it's absolutely not addressing the reason for the degraded performance. I suggested that as a DBA trying to get SQL to not blow up. Still need to troubleshoot the rest of the problem. I'm with you on needing more info on storage.
 
The datastore on both hosts where this VM resided consists of 4x600GB 15K SAS drives in RAID10. We have another host with 4x300GB 10K SAS drives hosting a VM with more databases and no performance issues.

We just got two HP 4530 SANs in today to expand our SAN storage. Regardless, I shouldn't have to put this VM on the SAN. You'd think that after moving the VM from one host to another, with an identical datastore, another 8GB of RAM, and slightly faster cores, it wouldn't be having these problems. I'm about fucking stumped and running out of time due to client complaints. Thinking that if I can't get this resolved by tomorrow I'll be working through the weekend building a new VM with a fresh install of SQL and restoring all these databases to it. A lot of work I really wasn't anticipating. Talk about "the suck". lol
 
Can you take the VM down for maintenance at all? Try taking the VM down, accessing it from the host (the VM console), and then disabling the NIC in the VM to see if you can isolate what is going on. Just a random thought from lack of sleep. Keep us posted, please.

Update: I just had another thought. Are Host A and Datastore A on the same switch? It's late, so I'm hoping this comes out of my brain making sense. I'm wondering if Host B (the new host) and Datastore B (the new datastore) are on different switches, and those switches are cascaded together using a 1GigE link rather than a backplane cable. That would certainly explain the high I/O waits. Just a thought.
 
I know it's frustrating, but I don't feel like you've really figured out what the problem is, whether it's CPU pressure or storage performance. Looking at the SQL activity monitor isn't going to tell you a whole lot by itself, and VMware's stats aren't going to tell you why CPU is being used. I'm sure you're working hard on the problem, but you haven't really shared the full story here, so it's difficult to figure out the issue. When I'm troubleshooting a performance issue with SQL, I first make sure it's SQL (check Task Manager), then I move into the wait stats (currently evaluating SQL Sentry, and it makes it SO much easier than using your own scripts), evaluate the wait stats to try to find issues and at the very least gain an understanding of how SQL is utilizing the resources, and then I layer in the VMware stats, starting with the VM (CPU ready, memory usage, IOPS, I/O throughput, network throughput, etc.) and then checking the host (memory usage, memory ballooning, memory swapping, storage queue length, etc.).

With that said, are you sure your disks are aligned properly? That can easily kill performance. Additionally, have you ensured there is no additional workload being sent to the VM that wasn't there prior to the move? A hung process that is looping within an application, etc. Good luck
 
I'm finding the OS partition for that VM is not aligned. Which would tell me the ESX host partition is out of alignment as well.....fuckity fuck. Will be correcting that this weekend hopefully
 
I know it's frustrating, but I don't feel like you've really figured out what the problem is, whether it's CPU pressure or storage performance. Looking at the SQL activity monitor isn't going to tell you a whole lot by itself, and VMware's stats aren't going to tell you why CPU is being used. I'm sure you're working hard on the problem, but you haven't really shared the full story here, so it's difficult to figure out the issue. When I'm troubleshooting a performance issue with SQL, I first make sure it's SQL (check Task Manager), then I move into the wait stats (currently evaluating SQL Sentry, and it makes it SO much easier than using your own scripts), evaluate the wait stats to try to find issues and at the very least gain an understanding of how SQL is utilizing the resources, and then I layer in the VMware stats, starting with the VM (CPU ready, memory usage, IOPS, I/O throughput, network throughput, etc.) and then checking the host (memory usage, memory ballooning, memory swapping, storage queue length, etc.).

With that said, are you sure your disks are aligned properly? That can easily kill performance. Additionally, have you ensured there is no additional workload being sent to the VM that wasn't there prior to the move? A hung process that is looping within an application, etc. Good luck

I agree with most of what you are saying. The part I disagree with is that the SQL activity monitor will tell you where the I/O waits (for SQL) are occurring. We also have to presume that Task Manager has been checked, but it's nevertheless definitely worth mentioning.

I am super curious how the disks ended up misaligned with a svMotion. (Unless the issue existed before and something now is exacerbating it.)
 
I'm guessing they've always been misaligned. Still doesn't make much sense.
 
I'm not sure. I just checked all our other VM's and every other one is aligned properly. Even the VM's remaining on the old host. We'll see how it goes after this weekend. Bringing the new host down to do all this work.
 
Glad you found the issue. Sorry it's a shitty issue though. Let us know how the fix goes.
 
The datastore on both hosts where this VM resided consists of 4x600GB 15K SAS drives in RAID10. We have another host with 4x300GB 10K SAS drives hosting a VM with more databases and no performance issues.

We just got two HP 4530 SANs in today to expand our SAN storage. Regardless, I shouldn't have to put this VM on the SAN. You'd think that after moving the VM from one host to another, with an identical datastore, another 8GB of RAM, and slightly faster cores, it wouldn't be having these problems. I'm about fucking stumped and running out of time due to client complaints. Thinking that if I can't get this resolved by tomorrow I'll be working through the weekend building a new VM with a fresh install of SQL and restoring all these databases to it. A lot of work I really wasn't anticipating. Talk about "the suck". lol

Your BBWC is dead on the second host. I'd bet money on it. Or not present.
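
If the HP tools are installed on that host, you can check from the ESXi shell; treat the path as an assumption, since it varies with the version of the HP bundle:

/opt/hp/hpacucli/bin/hpacucli ctrl all show config detail

Look for the cache and battery/capacitor status on the Smart Array controller. If the cache is disabled because the battery failed, writes to that RAID10 go straight to disk and latency falls off a cliff.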
 
I'm finding the OS partition for that VM is not aligned. Which would tell me the ESX host partition is out of alignment as well.....fuckity fuck. Will be correcting that this weekend hopefully

Shens.

Unless you manually created the VMFS datastore and partition from the command line, it's aligned. It's also Windows 2k8, so it absolutely should be aligned as well. Even then, on local storage, that shouldn't cause that massive a performance problem (local storage != NetApp). I'd be really surprised if that's it.
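
For what it's worth, a quick way to sanity-check from inside the guest (stock Windows command, nothing VMware-specific):

wmic partition get Name, StartingOffset

Offsets evenly divisible by 4096 are fine; 1048576 (1MB) is the normal 2008-era default. If it starts at 32256 you've got the old XP/2003-style offset.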
 
I have 32 GB allocated to the VM. SQL consumes most of it. This is a database server.

It was fine running with 24 GB of RAM on the other host. Bumped it up because, why not. We plan on adding a few more databases to this VM. Maybe not now unless we can get this situated.

I'm going to drop it to 2 vCPU's tonight and see what happens.

Bumping up RAM on DB servers does odd things - it rebalances the SQL VM's (the SQL process itself) memory levels, which can make things do "oddball" stuff. More != always better.
 