Problem with vCenter Connectivity

KapsZ28

So this started out as a Veeam issue. It was working fine, then all of a sudden backups were failing because of a connection problem with vCenter. The exact error from Veeam when trying to rescan vCenter is below.

6/19/2014 12:51:27 PM Error Disks and volumes discovery for VMware vCenter 'lga1vcs01' failed Error: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
6/19/2014 12:51:27 PM Error Host discovery failed Error: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
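
For anyone who wants to tally these failures across a longer log, here is a minimal Python sketch. It assumes every line has the same shape as the two above (date, time, AM/PM, severity, message). The names `parse_line` and `count_keepalive_drops` and the regular expression are illustrative only and are not part of Veeam.

```python
import re
from datetime import datetime

# Assumed line shape, inferred only from the two lines quoted above:
# "<m/d/yyyy> <h:mm:ss> <AM|PM> <Severity> <message>"
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\w+) (.*)$"
)

SAMPLE = [
    "6/19/2014 12:51:27 PM Error Disks and volumes discovery for VMware vCenter "
    "'lga1vcs01' failed Error: The underlying connection was closed: A connection "
    "that was expected to be kept alive was closed by the server.",
    "6/19/2014 12:51:27 PM Error Host discovery failed Error: The underlying "
    "connection was closed: A connection that was expected to be kept alive was "
    "closed by the server.",
]


def parse_line(line):
    """Return (timestamp, severity, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")
    return stamp, match.group(2), match.group(3)


def count_keepalive_drops(lines):
    """Count Error lines whose message reports a dropped keep-alive connection."""
    count = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, severity, message = parsed
        if severity == "Error" and "expected to be kept alive" in message:
            count += 1
    return count


if __name__ == "__main__":
    print(count_keepalive_drops(SAMPLE))
```

Grouping the counts by hour would show whether the drops cluster around the backup window.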

Even when Veeam does connect and start the backup, it sometimes fails at hot-adding the VMDK to the Veeam proxy.

Here are a couple of screenshots from the vSphere Client. As you can see, it is trying to load the information, but it often takes a while to populate or doesn't populate at all.

Based on that, I would think vCenter is having trouble communicating with the ESXi hosts. But I also have issues with Update Manager getting stuck at checking for updates, and I've noticed the Web Client hanging when suppressing alarms. Again, that sounds like a communication problem with the hosts.

When I manage the ESXi servers directly through the vSphere Client, everything seems to work fine.
 
Go check Task Manager on the VC - is Java using some absurd amount of RAM?
 
Are vCenter and the database on the same server? As lopoetve said, is something pegged using resources?
 
Java was high. I had rebooted the VC VM multiple times and even moved it to another ESXi host with more resources allocated to the VM. No, the DB is a separate VM. Ultimately it did end up being resources. The odd part is that I had it up to 8 vCPU and it was still an issue, so I lowered it back down to 4 vCPU. When troubleshooting again, I moved it back up to 8 vCPU and it still had trouble loading the datastores until "Rescan" was clicked. Then it loaded all of them, and when I tested Veeam it was connecting again. I guess I will keep the resources higher and do a rescan of the datastores if it happens again.

I know Veeam is popular and supposed to be great, but I haven't been that impressed. Maybe part of it is our environment. Luckily we have NetApp, and Veeam 8 will support volume snapshots. Right now, with the individual VM snapshots, it can be a bit slow and affect the performance of the VMs. I had issues with the Exchange DAG failing over when the snapshot was removed. I found out it is better to use Network Mode with a DAG, and that did solve the problem. But the RDS farm we host is constantly using a ton of resources during the day, and taking and removing snapshots has a huge effect if done then. Right now backups run only at night so nobody is affected, but I can't see how you could use Veeam to do hourly snapshots without a big impact on performance.

It also seems that the more jobs are running, the harder vCenter gets hit, and we only have about 25 backup jobs right now using 3 proxy servers. How does a company use Veeam to take individual snapshots of hundreds of VMs a night? I guess they have some serious money invested in their storage.

Is there any magic formula for how many resources should be allocated to vCenter for a given number of Veeam backup jobs?
 
One thing I've noticed with Veeam is that the vCenter connections stick around for roughly 30 minutes even after they complete their portion of the job. It also seems to spawn a new connection for each component of the backup job; I've seen as many as four connections per guest as it backs them up. One weekend I actually hit the SSO connection limit with vSphere 5.1 and it basically locked up vCenter. I had to restart the SSO service to get things back to normal. I haven't had a chance to use it with 5.5 though, so I'm not sure of its behavior there.
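
To get a feel for how quickly those lingering connections pile up, here is a rough back-of-the-envelope sketch in Python. The four connections per guest and the 30-minute linger come from the observation above. The backup rate of 60 guests per hour is a made-up input, and `lingering_sessions` is a hypothetical helper, not anything Veeam or vCenter provides.

```python
import math


def lingering_sessions(guests_per_hour, sessions_per_guest=4, linger_minutes=30):
    """Estimate how many finished-but-still-open sessions exist at one moment.

    Uses arrival rate x residence time: guests completed during the linger
    window, times the sessions each guest leaves behind. Sessions held by
    backups that are still running are NOT counted, so the real total is higher.
    """
    guests_in_window = guests_per_hour * linger_minutes / 60
    return math.ceil(guests_in_window * sessions_per_guest)


if __name__ == "__main__":
    # Hypothetical rate: 60 guests finish per hour during the backup window.
    print(lingering_sessions(60))  # 120
```

Comparing that figure with whatever session limit applies to your SSO version gives an idea of how much headroom is left.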
 
Just a suggestion, but with vCenter 5.5 we have moved to using the vCSA appliance rather than a Windows install with SQL. The limits are very nice now and it is much easier to deal with in our experience to date. You still need the Windows server for Update Manager and other things (for instance, we use Zerto), but so far it has been a winner for us, even with the largest oil and gas client that I manage. Veeam works perfectly with it.
 
How many VMs are in your environment? 8 vCPUs seems excessive and is probably detrimental if you are significantly oversubscribed on your physical CPU cores.

How much memory is allocated? Are there any limits, share changes, or resource pools in play here?
 
How much memory did you assign to the vCenter Java process? If you selected Medium (2 GB), then you will experience this issue. Give it Large (3 GB) and try again.
 
Yeah, it has always been 4 vCPU in the past and that was never an issue. But according to the performance data, CPU usage was between 75% and 80% all day when 4 vCPU were assigned to vCenter. Once it was increased to 8 vCPU, usage dropped to 4%. So clearly it is using the CPU.
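
As a quick sanity check on those numbers, the sketch below works out what utilization you would expect after doubling the vCPU count if the amount of CPU work stayed exactly the same. That constant-workload assumption is mine, and `expected_utilization` is just an illustrative helper.

```python
def expected_utilization(util_before, vcpus_before, vcpus_after):
    """Utilization (%) after a vCPU change, assuming total CPU work is unchanged."""
    return util_before * vcpus_before / vcpus_after


if __name__ == "__main__":
    low = expected_utilization(75, 4, 8)
    high = expected_utilization(80, 4, 8)
    print(low, high)  # 37.5 40.0
```

Under that assumption 8 vCPU would sit at roughly 37.5% to 40%. The observed 4% is about a tenth of that, which suggests the load itself changed and not only the core count. That is an inference from the figures in this post, not a measurement.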

This is our newer environment and there are only 60 VMs running and it is nowhere near oversubscribed. No CPU ready, memory swapping, or any contention for that matter.

Currently there is 12 GB of memory assigned to the vCenter VM and it is working fine. I had it all the way up to 18 GB when there was an issue to see if more memory made a difference.

I just checked and all the Veeam backups were successful tonight. Done pretty quickly too. The CPU usage stayed low the entire time. I guess I will have to wait and see if the issue comes up again. It does seem like a waste of CPU resources to have 8 assigned, but if it is working, I really don't want to lower it and cause a problem again.
 
I used Small, since Small is good for 100 hosts and 1000 VMs, I believe. It didn't seem necessary to use Medium or Large since those are for much, much larger environments.
 
I considered it, but it doesn't support Linked-Mode. Although as of today we are no longer using Linked-Mode. :(
 
It's not resources precisely - it's the JVM, and you're "cheating" around it by going nuts on things.

1. Set VC to 4 vCPU, 8 GB of RAM.
2. Increase Java heap for Inventory Service, VC, and maybe the webclient to their second levels (WC is a text file to edit). Assume defaults are useless. I can almost guarantee this is your issue, and no one runs hourly backups with a VDDK product during production - you use silent array snaps for that stuff.
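
To see which services are still on a small heap before editing anything, a minimal Python sketch like the one below can scan a config file's text for the JVM `-Xmx` (maximum heap) flag. `-Xmx` is a standard JVM option. The sample fragment, the helper names, and the 2048 MB target (the Medium figure mentioned earlier in the thread) are assumptions for illustration, so check the real file locations and values against VMware's KB for your version.

```python
import re

# -Xmx<number>[k|m|g] is the standard JVM flag for maximum heap size.
XMX_RE = re.compile(r"-Xmx(\d+)([kKmMgG]?)")

# Multipliers to convert each unit suffix to megabytes (no suffix means bytes).
_UNIT_TO_MB = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}

# Hypothetical fragment in the style of a Java service wrapper config.
SAMPLE_CONF = '''
wrapper.java.additional.1="-Xss1024K"
wrapper.java.additional.2="-Xmx1024M"
'''


def max_heap_mb(config_text):
    """Return every -Xmx value found in the text, converted to MB."""
    return [
        int(number) * _UNIT_TO_MB[unit.lower()]
        for number, unit in XMX_RE.findall(config_text)
    ]


def below_target(config_text, target_mb):
    """Return the heap sizes (MB) that are smaller than target_mb."""
    return [mb for mb in max_heap_mb(config_text) if mb < target_mb]


if __name__ == "__main__":
    print(below_target(SAMPLE_CONF, 2048))  # [1024]
```

The script only reads text and reports. Changing the heap still means editing the service's own config and restarting that service.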
 
Yeah, I moved away from that as well for the time being. Supposedly version 6 will be bringing that functionality to the vCSA. It was more of a convenience thing for me anyways, no big deal to let it go for now.
 
Here's the KB on how to increase the Java heap sizes:
http://kb.vmware.com/kb/2021302

If this is a new build-out and you still have the option, I'd switch to the vCSA.
 
The best part about this is that VMware support tells you to keep it at the defaults. I was on the phone with them all day Monday rebuilding a vCenter server after the JRE folder screwed me when trying to upgrade from 5.1u1 to 5.1u2, and I ended up having to do a full reinstall. As we were doing it, support told me to install them with the defaults and that you shouldn't have to increase them at all unless you have an environment of that size. Needless to say, I didn't listen to that advice.
 
Thanks lopoetve and 0V3RC10CK3D for the information about Java! I'd like to add that increasing the CPU to 8 was VMware Support's recommendation. They didn't mention anything about Java. :rolleyes:
 
It was more like an "Are you really sure about that?" This was a couple of hours after he had me fully uninstall SSO and reboot the box, effectively tanking it because the Inventory Service that was still installed went crazy on boot.
 
Hourly was just an exaggeration. I guess my point is that prior to Veeam we were using a CDP backup where an agent is installed on the guest OS, and backups ran every 4 hours. So we were doing backups during production with no noticeable impact. Since moving to Veeam I am only doing nightly backups because of the performance impact of snapshots. Based on what you are saying, that no one runs this type of backup during production, it sounds like I am doing it right. I am looking forward to Veeam 8 with NetApp support. :D
 
Please let me know which support engineer told you that so I can correct that immediately.

Defaults barely work for PoCs - especially if you install any extra products (Veeam, NSX, VDP, etc etc etc).
 
Again, let me know what engineer that was - that destroys the install since you didn't unroll and the app users are now missing. :p
 
It has crossed my mind, but unfortunately with all the datacenters in this vCenter already, there are 12 ESXi hosts and just over 100 VMs. The only part that is really a pain in the ass is migrating the networking when joining a new vCenter. Correct me if I am wrong, but we are using Distributed Switches, and I believe the safest way is to migrate everything to a Standard Switch prior to removing the host from vCenter, and then migrating everything back to a Distributed Switch in the new vCenter.

I was considering going with a Management Cluster and Resource Cluster. Basically have two vCenters. I believe that was best practice when using vCloud Director. Not sure if it applies to this setup where we are not using vCloud.
 
Shot you a PM, I had a fun time trying to get that physical box back up to the point where I could actually do something with it.
 
Correct - you will want to remove the vCenter dependencies prior to migration, which means going back to standard switches temporarily. Honestly though, I think it is worth the effort in the long run. The vCSA is the future (i.e. I think the Windows version may be going away sooner rather than later).
 
You don't happen to have vflash cache configured on the hosts, do you?
 
Actually, I do. Why, is there a known issue with Flash Cache?

The exact same thing happened to us with Veeam. The problem was that vFlash was messing up the inventory database, which made it impossible for Veeam to determine where VMs were stored.

There are known issues with vFlash and the inventory database, but I don't know if it's been documented as a Veeam issue yet.

We decided that Veeam was more important, because vFlash is more of a "wow, that's cool" feature than a "holy crap, we need this" feature.
We reinitialized the inventory database, reinstalled Veeam, all kinds of crap. But the instant I disabled vFlash, Veeam started working without issue.
 
See KB I just posted - we have a fix for the IS issue.
 
Well, now the Java issue definitely makes more sense. I have other vCenters set up the same way but without vFlash, and they have never run into a problem with resources and Java. Since we now have a NetApp mixed shelf with Flash Pool, I'll probably just move away from vFlash altogether.
 