ESXI 5.5 U1 Host non-responsive, some VMs still running

Hey guys, I'm in need of some advice on troubleshooting my ESXi host. I'm running the following config:

Single ESXi host configured with PCI passthrough for FreeNAS, multiple NICs, and various Linux & Windows VMs.

This issue has happened several times now:

- vSwitch0, which carries my primary VM network and Management, stops responding.
- The NIC lights are flashing, so link is up.
- vSwitch1 & 2 still function properly, and so does the console to the host.

When this happened a few weeks ago I thought Veeam was to blame, so I stopped using it for a while. The issue has now happened twice today.

vSwitch0 was using a PCI Intel NIC (82574L), which is supported. It's connected directly to a Netgear desktop switch, along with other NICs that work fine.

If a bad cable dropped and then re-established link, would ESXi eventually force the link down?

I have since moved that network to a different physical interface to isolate the NIC in question.

Would love some pointers! Thanks guys.
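Not the OP's procedure, but a sketch of what I'd run from the ESXi shell (or over SSH) to check whether the host is seeing link flaps or errors on the suspect uplink. `vmnic3` is an assumed name here; substitute whichever vmnic backs your vSwitch0.

```shell
# Per-vmnic link status, driver, and speed (works on ESXi 5.x):
esxcli network nic list

# RX/TX error and drop counters for the suspect uplink
# (vmnic3 is a placeholder -- use your actual uplink name):
esxcli network nic stats get -n vmnic3

# Link up/down transitions are normally logged by the vmkernel,
# so a flapping cable should leave a trail here:
grep -i 'vmnic3.*link' /var/log/vmkernel.log | tail -20
```

If the counters climb or the log shows repeated link transitions on just that vmnic, that points at the cable, switch port, or NIC rather than the vSwitch config.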
 
Depends on how the ports are configured on the vSwitch.
 
So without knowing exactly what I'm looking for, or which log might have the relevant data, I did find this error message throughout vmkernel.log:

2014-04-19T08:12:07.099Z cpu5:35277)WARNING: LinNet: map_pkt_to_skb:2069: This message has repeated 93 times: vmnic3: dropping packet due to parsing failure

And since the problem occurred I've moved the management network to its own NIC and the VM network to another NIC, removing the Intel PCI NIC from the equation. Since then nothing unusual has occurred.
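For anyone hitting the same warning, a quick sketch of how to see whether those drops are confined to one adapter: count the "dropping packet" warnings per vmnic in vmkernel.log. The log path is the ESXi 5.x default; adjust if your logs are redirected to a datastore or syslog.

```shell
# Tally "dropping packet due to parsing failure" warnings by vmnic,
# most-affected adapter first. A single vmnic dominating the output
# suggests a bad NIC rather than a vSwitch-wide problem.
LOG=/var/log/vmkernel.log
grep -o 'vmnic[0-9]*: dropping packet due to parsing failure' "$LOG" \
  | sort | uniq -c | sort -rn
```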
 
Sounds like a bad NIC doing something stupid. Firmware on them can crash after too many errors; I've seen that on bad Intel and Broadcom NICs when they're dying.
 
Agreed. Since it's been days since I moved to the other NICs, I'm satisfied the problem was related to that one NIC.
 
Now that this system has been up for 8 days without incident I'm going to consider the issue solved due to a bad NIC. Time to toss it in the parts bin.
 