Does vsphere support any kind of hardware watchdog?

danswartz

Running a 6.5 host, Build 6765664. Running fine since the update to this build on 12/10. This morning around 10AM, it became unresponsive. I was out of town, so I was unable to check anything until just now. The IPMI console showed everything apparently okay, except for the host being unresponsive (not even to pings). I rebooted it, and it came up fine, but I'd kinda like to avoid hangs in the future. I've been searching via Google, but I don't see any kind of hardware watchdog support. Is there such a thing? It's a single host, so HA won't help me here. Thanks!
 
There are for some IPMI, iDRAC, and iLO setups, but it's typically a vendor-specific ISO you have to install.

That said, your issue is likely not hardware. Do a fresh install of ESXi on the host.
 
This is a whitebox in my home lab, so there is no vendor involved. I'm not saying this is a HW issue - I'd just like insurance that it won't lock up again when no-one is here to push the reset button :)
 
That said, your issue is likely not hardware. Do a fresh install of ESXi on the host.

This is bad advice. You have no idea what the issue is until you review the logs and perform some level of diagnostics.
 
This is bad advice. You have no idea what the issue is until you review the logs and perform some level of diagnostics.

It's not bad advice if your alerting is set up correctly.

However, the OP's issue was the hypervisor locking up; unless you have outside monitoring of that host, it's not going to alert you in any fashion.
 
It's not bad advice if your alerting is set up correctly.

However, the OP's issue was the hypervisor locking up; unless you have outside monitoring of that host, it's not going to alert you in any fashion.

His host locked up one time and you're telling him to re-install ESXi. He should at least spend a few minutes taking a look at the server logs to see if he can find out what occurred.
 
His host locked up one time and you're telling him to re-install ESXi. He should at least spend a few minutes taking a look at the server logs to see if he can find out what occurred.

I'm assuming troubleshooting has already been done. Besides, reinstalling ESXi can easily be done without affecting the VMFS volumes. Then you just re-import your machines and move on.
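
Re-registering isn't much work either. A rough sketch of doing it in bulk over SSH (assuming SSH is enabled on the host and paramiko is installed on another box; the hostname, credentials, and datastore name are placeholders, and vim-cmd solo/registervm just adds an existing .vmx back to the inventory):

Code:
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("esxi.example.lan", username="root", password="changeme")  # placeholders

# Find every .vmx on the datastore and register it back into the inventory.
_, stdout, _ = client.exec_command("find /vmfs/volumes/datastore1 -name '*.vmx'")
for vmx in stdout.read().decode().splitlines():
    _, out, _ = client.exec_command(f"vim-cmd solo/registervm {vmx}")
    out.channel.recv_exit_status()  # wait for each registration to finish

client.close()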
 
I'm not sure which logs to look at. I noticed a new build was available, so since I was suspicious something might have gotten corrupted, I installed that build. In case this happens again, where do you suggest looking? e.g. which logfiles? Thanks!
 
/var/log/vmkernel.log. If your logs are stored on reliable storage, that will still be there. See what the last few messages were. /var/log/vmkwarning.log is a warn/error-only version of the same log.
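
If you'd rather pull them from another box than poke around on the host by hand, something like this works (a minimal sketch, assuming SSH is enabled on the host and paramiko is installed; hostname and credentials are placeholders):

Code:
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("esxi.example.lan", username="root", password="changeme")  # placeholders

# Pull the tail of the kernel log and the warn/error-only log.
for logfile in ("/var/log/vmkernel.log", "/var/log/vmkwarning.log"):
    _, stdout, _ = client.exec_command(f"tail -n 50 {logfile}")
    print(f"===== {logfile} =====")
    print(stdout.read().decode())

client.close()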
 
Thanks! vSphere is installed on a small (8GB) DOM, so the logs should still be available. Hasn't happened again, so far (fingers crossed...)
 
Eh, it may be big enough and stable enough for ESXi to have tagged it as stable (persistent) log storage.
 
Turns out it wasn't, so that's why I saw nothing useful in the logs :( I changed the syslog global settings to stash the logs on one of the NFS datastores.
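
For reference, the change boils down to two esxcli commands. A rough sketch of running them over SSH (again assuming paramiko; the hostname, credentials, and datastore name are placeholders):

Code:
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("esxi.example.lan", username="root", password="changeme")  # placeholders

# Point the persistent log directory at an NFS datastore, then reload syslog.
for cmd in (
    "esxcli system syslog config set --logdir=/vmfs/volumes/nfs-datastore1/esxi-logs",
    "esxcli system syslog reload",
):
    _, stdout, _ = client.exec_command(cmd)
    stdout.channel.recv_exit_status()  # wait for the command to finish

client.close()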
 
I had so many bugs and errors with ESXi 6.5 that I went back down to 5.5. Less support for newer stuff, but it's at least stable.
 
I do not see how you can tell a locked-up VMware host to reboot :)

If anything, another system would run API commands via your IPMI to reboot the host.
 
There are hardware management solutions that work with devices like iDRACs in Dell servers to allow remote management and reboots and such. I had to use one today because my stupid server suddenly couldn't see its onboard flash drives. Took a hard reboot. I did NOT want to fly to NY for something like that.
 
And a hardware watchdog, if one exists for your platform, will heartbeat to that IPMI solution and trigger a reboot if it goes away for long enough (generally several minutes, for obvious reasons). Not always a GOOD idea, mind you, but possible.
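
Lacking a real one, you can approximate it from another box: ping the host and power-reset it through IPMI after enough consecutive failures. A rough sketch, assuming ipmitool is installed and the addresses/credentials below are placeholders (it's a blunt instrument, not a true hardware watchdog):

Code:
import subprocess
import time

ESXI_HOST = "esxi.example.lan"      # placeholder: host management IP
BMC_HOST = "esxi-ipmi.example.lan"  # placeholder: IPMI/BMC address
BMC_USER, BMC_PASS = "admin", "changeme"
FAILURES_BEFORE_RESET = 10          # ~10 minutes at a 60-second interval

def host_is_up():
    """One ICMP ping; True if the host answered within 2 seconds."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", ESXI_HOST],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

failures = 0
while True:
    failures = 0 if host_is_up() else failures + 1
    if failures >= FAILURES_BEFORE_RESET:
        # Hard power-reset through the BMC, same as pushing the reset button.
        subprocess.call(["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
                         "-U", BMC_USER, "-P", BMC_PASS,
                         "chassis", "power", "reset"])
        failures = 0
    time.sleep(60)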
 
I had so many bugs and errors with ESXi 6.5 that I went back down to 5.5. Less support for newer stuff, but it's at least stable.

I have never tried 6.5 but have been getting EOL advisements from VMware that we need to update. Only comment to this is we have been running 5.5+ at our office with MS guests: DC/fileserver, Exchange with 40 mailboxes, SQL/ERP, RDP server with 15 users, and one other Win7 guest. It has NEVER locked up or given any issues at all; we have logged over 1 year of uptime and only needed to restart the host because of the Spectre/Meltdown patches for ESXi a couple months ago.

This is my big 2 cents
 
The later 6.5 releases are great (U1 included). The early releases are always a bit iffy for any software company.
 
I do not like not having thick client access to my hosts. That feature alone has bailed me out of mistakes my director has made in the past. "I shut down this VM... what was it for?" Me: "Oh, just the vSphere appliance." Director: "Oh... well, shit." Me: "It's fine, I'm already booting it back up."

Now I'll have to do the command line dance... and I dislike that.
 
It's not bad advice if your alerting is set up correctly.

However, the OP's issue was the hypervisor locking up; unless you have outside monitoring of that host, it's not going to alert you in any fashion.

This. You need some type of third party monitoring tool that would alert you when a server becomes unresponsive.
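
Even a cron'd ping check that emails you covers the basic case. A minimal sketch, assuming an SMTP relay you can reach; every hostname and address below is a placeholder:

Code:
import smtplib
import subprocess
from email.message import EmailMessage

ESXI_HOST = "esxi.example.lan"    # placeholder: host to watch
SMTP_RELAY = "mail.example.lan"   # placeholder: local mail relay
ALERT_TO = "admin@example.lan"    # placeholder: where to send alerts

def ping_ok(host):
    """Three ICMP pings; True if the host answered."""
    return subprocess.call(["ping", "-c", "3", "-W", "2", host],
                           stdout=subprocess.DEVNULL) == 0

# Run this from cron every few minutes on a box that isn't the ESXi host.
if not ping_ok(ESXI_HOST):
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {ESXI_HOST} is not answering pings"
    msg["From"] = "monitor@example.lan"
    msg["To"] = ALERT_TO
    msg.set_content("Host stopped responding; check the IPMI console.")
    with smtplib.SMTP(SMTP_RELAY) as relay:
        relay.send_message(msg)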
 
Oh that statement isn't exactly heartwarming.

Try adding a second management port. And then try removing it. Locks up the whole web interface until you do a hard reboot. This is on the latest build.

And the only way I have found to get rid of the second management port is to tell it to reset everything to default settings via the local interface (start over from scratch).

Removing the vswitch fails, removing the management port fails, basically nothing you can do once you add a second management port.
 
Try adding a second management port. And then try removing it. Locks up the whole web interface until you do a hard reboot. This is on the latest build.

And the only way I have found to get rid of the second management port is to tell it to reset everything to default settings via the local interface (start over from scratch).

Removing the vswitch fails, removing the management port fails, basically nothing you can do once you add a second management port.

Ah, that explains it... so adding the second management port and then trying to remove it breaks it. Understandable.

I have noticed some odd behavior with the flash console as well.
 