VMware ESXi 6.7 U3 - vswitch dies randomly - help

cyclone3d

[H]F Junkie
Joined
Aug 16, 2004
Messages
13,794
Sooooo... we have a couple of identical Dell PowerEdge R7515s.

At one location, the LAN vswitch basically just dies on one of the hosts. It happens on both, but never at the same time.

None of the VMs on that host using that vswitch are able to be pinged.

If we power off the VMs and migrate to the other host they work fine again. If we migrate them without powering them off first, they do not regain a proper network connection.

To get the vswitch working on the host that messed up, we have to reboot it. Then it works again for a few days to a few weeks and then dies again.

I already had a 3+ hour support call with VMware, and they thought it had to do with VLANs, so we changed that and it started working without rebooting the host, so we thought we had fixed it.

Then today, one of the hosts messed up again.

Currently these hosts have network teaming set to failover only, not load sharing. I will be changing that tonight, since I upgraded another location with the exact same hosts and we have not had a single problem with them yet... and they have network teaming set up differently: failover and load balancing.

Has anybody ever seen this type of behavior before or have any ideas besides what I am going to try?

Edit: Noticed another very, very weird thing that happens when the vswitch dies - I am unable to PXE boot... systems say that the media is connected but it never gets past the MAC address screen. Power off or reboot the messed up host and PXE booting starts working again. @#$@#%#$$%%^^# is going on?
 

Eulogy

2[H]4U
Joined
Nov 9, 2005
Messages
2,229
Sounds like a MAC table issue at very first blush with no real information :). Have you looked through the logs yet? When you say they can't be pinged, from where? The TOR?
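
Since the failure only shows up every few days to a few weeks, it may help to timestamp exactly when the VMs stop answering, so the time can be matched against the host and switch logs. A minimal sketch of such a watchdog, assuming a Linux machine outside the host with the standard `ping` command; the function names and the addresses in the usage comment are placeholders, not anything from the actual setup:

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if one ICMP echo succeeds (assumes Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_hosts(hosts, last_state, probe=ping_once):
    """Probe each host once and return (host, is_up) for every state change.

    last_state is updated in place; the first pass reports each host's
    initial state because nothing has been recorded for it yet.
    """
    changes = []
    for host in hosts:
        is_up = probe(host)
        if last_state.get(host) != is_up:
            changes.append((host, is_up))
            last_state[host] = is_up
    return changes


def watch(hosts, interval_s=30):
    """Loop forever, printing a timestamped line whenever a host flips state."""
    state = {}
    while True:
        for host, is_up in check_hosts(hosts, state):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} {'UP' if is_up else 'DOWN'}")
        time.sleep(interval_s)


# Example (placeholder addresses; replace with the VM IPs to monitor):
# watch(["192.0.2.10", "192.0.2.11"])
```

Running it from a physical machine on the switch stack and, separately, from a VM on the affected host would also show whether the outage is only visible from outside the host.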
 

cyclone3d

[H]F Junkie
Joined
Aug 16, 2004
Messages
13,794
Firmware is up to date.

I think I have it figured out... "spanning-tree portfast" was set on the ports. Pretty sure the Cisco switch was crapping on this and making it not work, though the management port, which was on the same vswitch, was working... so maybe it was the port group that was crapping out and not the vswitch.

Not sure why it wasn't messing up with the old hosts.... I really don't like taking over setups from other people when there is absolutely 0 documentation on why they set stuff up the way they did.

I have since reconfigured the ports on the Cisco switch for the LAN vswitch to be in an EtherChannel, reconfigured the vswitch settings on the host, and reconfigured the port group for the management port on that host.

Going to do the other one this evening.

As for not being able to ping any of the VMs... it was from anywhere outside of the host, as in not able to ping from a physical computer hooked up to the switch stack... so yeah, from the TOR.

Ping was working from one VM to another on the host that was messed up.
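
The three ping observations in this thread (VM to VM on the same host works, outside to VM fails, outside to the management port works) can be written down as a rough triage rule. This is only a sketch of that reasoning with hypothetical names and labels, not a VMware or Cisco diagnostic:

```python
def likely_fault_domain(vm_to_vm_same_host, outside_to_vm, outside_to_mgmt):
    """Map three ping results to a rough guess at where traffic is dropped.

    The labels are hypothetical and the rule is a heuristic, not a vendor procedure.
    """
    if outside_to_vm:
        return "no_fault_observed"
    if not vm_to_vm_same_host:
        # Traffic does not even pass between VMs on the same vswitch.
        return "guest_or_vswitch"
    if outside_to_mgmt:
        # The uplinks still carry management traffic, so suspect the VM
        # port group or how the physical switch ports treat that traffic.
        return "port_group_or_switch_port"
    # Nothing on this vswitch is reachable from outside the host.
    return "uplink_or_physical_switch"


# The combination reported in this thread:
print(likely_fault_domain(True, False, True))
```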
 