ESXi 5.1 Host Freezing Issues

cinjun

Greetings,

I recently built an ESXi whitebox with an M1015 passed through to a FreeNAS 8.3 VM. Unfortunately the host has started freezing on me a lot, usually after anywhere from a few hours to a few days. The logs don't give me any obvious warnings, so at this point I'm unsure how to go about investigating the issue.

The host machine itself becomes unresponsive, with no PSOD or anything. The machine has plenty of RAM (32GB) and only runs 4 VMs, nothing too demanding.

Do these freezes most likely point to a hardware issue, such as bad RAM or an unresponsive hard drive? Where are the standard places to look for problems?
 
Memtest86+ came back clean, so the memory is fine.

Could hyper-threading cause these issues? I have an Intel i7-3770.
 
He said it wasn't PSoDing. Just locking. Nothing in /var/log/vmkernel.log or vmkwarning.log before it locks?
 
I assume the VMs (except the FreeNAS one) are running from a datastore that lives on the FreeNAS VM. If that is the case, do you see anything in the ESXi logs about disconnects from that datastore?
 
All of the VMs are running from a single SSD. FreeNAS just provides NFS and Samba shares to the VMs.

vmkwarning.log:

Code:
0:00:00:04.193 cpu0:4096)WARNING: Cpu: 2165: Cache latency measurement may be inaccurate min= 180 max= 284 avg= 210
0:00:00:04.216 cpu0:4096)WARNING: CacheSched: 803: Already disabled : Cache aware scheduling already disabled
0:00:00:04.354 cpu0:4096)WARNING: VMKAcpi: 495: No IPMI PNP id found
2013-07-06T21:51:17.230Z cpu0:4527)WARNING: LinuxSignal: 761: ignored unexpected signal flags 0x2 (sig 17)
2013-07-06T21:51:19.774Z cpu3:4542)WARNING: VMK_PCI: 1170: device 00:00:1f.3 has no legacy interrupt(s)
2013-07-06T21:51:19.774Z cpu3:4542)WARNING: LinPCI: LinuxPCILegacyIntrVectorSet:80:Could not allocate legacy PCI interrupt for device 0000:00:1f.3
2013-07-06T21:51:20.087Z cpu0:4473)WARNING: Team.etherswitch: TeamES_Activate:309:Failed to initialize beaconing on portset 'pps': Not implemented.
2013-07-06T21:51:22.816Z cpu4:4624)WARNING: LinScsiLLD: scsi_add_host:601:vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 255
2013-07-06T21:51:27.831Z cpu3:4473)WARNING: Uplink: 3075: releasing cap 0x0!
2013-07-06T21:51:27.831Z cpu3:4473)WARNING: Uplink: 3075: releasing cap 0x0!
2013-07-06T21:51:27.925Z cpu3:4473)WARNING: Tcpip_Vmk: 717: Failed to set default gateway (51): Network unreachable
2013-07-06T21:51:27.938Z cpu3:4473)WARNING: Tcpip_Vmk: 717: Failed to set default gateway (51): Network unreachable
2013-07-06T21:51:27.957Z cpu3:4473)WARNING: NetPortset: 740: vSwitch0: already exists
2013-07-06T21:51:27.957Z cpu3:4473)WARNING: Net: 1588: can't create portset: Already exists
2013-07-06T21:51:28.046Z cpu0:4650)WARNING: Tcpip: 803: Failed to unset the ip address (error = 49)
2013-07-06T21:51:29.011Z cpu2:4669)WARNING: LinuxSignal: 761: ignored unexpected signal flags 0x2 (sig 17)
2013-07-06T21:52:04.622Z cpu3:4099)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 1 time
2013-07-06T21:56:45.176Z cpu6:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle
0:00:00:04.192 cpu0:4096)WARNING: Cpu: 2165: Cache latency measurement may be inaccurate min= 180 max= 1148 avg= 220
0:00:00:04.215 cpu0:4096)WARNING: CacheSched: 803: Already disabled : Cache aware scheduling already disabled
0:00:00:04.353 cpu0:4096)WARNING: VMKAcpi: 495: No IPMI PNP id found
2013-07-06T22:01:10.231Z cpu7:4523)WARNING: LinuxSignal: 761: ignored unexpected signal flags 0x2 (sig 17)
2013-07-06T22:01:12.887Z cpu7:4538)WARNING: VMK_PCI: 1170: device 00:00:1f.3 has no legacy interrupt(s)
2013-07-06T22:01:12.887Z cpu7:4538)WARNING: LinPCI: LinuxPCILegacyIntrVectorSet:80:Could not allocate legacy PCI interrupt for device 0000:00:1f.3
2013-07-06T22:01:13.206Z cpu2:4469)WARNING: Team.etherswitch: TeamES_Activate:309:Failed to initialize beaconing on portset 'pps': Not implemented.
2013-07-06T22:01:13.281Z cpu2:4469)WARNING: PCI: 3771: 00:02:00.0: Bypassing non-ACS capable device in hierarchy
2013-07-06T22:01:13.281Z cpu2:4469)WARNING: PCI: 4265: 00:02:00.0 is nameless
2013-07-06T22:01:15.466Z cpu3:4620)WARNING: LinScsiLLD: scsi_add_host:601:vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 255
2013-07-06T22:01:18.148Z cpu3:4469)WARNING: Uplink: 3075: releasing cap 0x0!
2013-07-06T22:01:18.148Z cpu3:4469)WARNING: Uplink: 3075: releasing cap 0x0!
2013-07-06T22:01:18.238Z cpu3:4469)WARNING: Tcpip_Vmk: 717: Failed to set default gateway (51): Network unreachable
2013-07-06T22:01:18.249Z cpu3:4469)WARNING: Tcpip_Vmk: 717: Failed to set default gateway (51): Network unreachable
2013-07-06T22:01:18.338Z cpu1:4653)WARNING: LinuxSignal: 761: ignored unexpected signal flags 0x2 (sig 17)
2013-07-06T22:01:18.356Z cpu0:4645)WARNING: Tcpip: 803: Failed to unset the ip address (error = 49)
2013-07-06T22:05:47.257Z cpu3:5942)WARNING: MemSched: 6105: Psharing is disabled but balooning is not.
2013-07-06T22:06:25.785Z cpu7:5942)WARNING: MemSched: 6105: Psharing is disabled but balooning is not.
2013-07-06T23:31:49.146Z cpu0:5182)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 1 time
2013-07-07T00:01:02.250Z cpu4:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle
2013-07-07T00:01:49.629Z cpu1:4187)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 22 times
2013-07-07T02:01:03.128Z cpu6:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle
2013-07-07T02:01:50.520Z cpu1:4097)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 1 time
2013-07-07T03:01:53.065Z cpu6:5182)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 1 time
2013-07-07T04:01:55.875Z cpu3:4099)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 203 times
2013-07-07T05:31:56.461Z cpu4:4100)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 1 time
2013-07-07T08:01:57.507Z cpu4:17675)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 4 times
2013-07-07T09:31:58.236Z cpu3:17675)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 4 times
2013-07-07T10:01:58.297Z cpu5:4101)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 6 times
2013-07-07T12:31:59.358Z cpu1:4097)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 3 times
2013-07-07T14:02:00.034Z cpu1:4165)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 6 times
2013-07-07T15:32:00.693Z cpu1:4097)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 2 times
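
The repeated "queuecommand failed" warnings are the only entries that keep recurring after boot, so it may help to tally them per adapter. Below is a minimal Python sketch, not an official VMware tool. The function name `count_queue_failures` is made up for illustration, and it assumes "Message repeated N times" means N occurrences, which I have not confirmed. The sample lines are copied from the log above.

```python
import re
from collections import Counter

# Captures the adapter name (e.g. vmhba36) from a "queuecommand failed" warning.
ADAPTER = re.compile(r"queuecommand failed.*?(vmhba\d+):")
# Captures N from the optional "Message repeated N time(s)" suffix.
REPEAT = re.compile(r"Message repeated (\d+) times?")


def count_queue_failures(lines):
    """Tally queuecommand failures per adapter (hypothetical helper).

    Assumption: "Message repeated N times" is counted as N occurrences;
    a line without that suffix is counted once.
    """
    totals = Counter()
    for line in lines:
        adapter = ADAPTER.search(line)
        if not adapter:
            continue  # skip unrelated warnings (VFAT, Tcpip, ...)
        repeat = REPEAT.search(line)
        totals[adapter.group(1)] += int(repeat.group(1)) if repeat else 1
    return totals


# Four lines taken verbatim from the vmkwarning.log excerpt above.
sample = [
    "2013-07-07T00:01:49.629Z cpu1:4187)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 22 times",
    "2013-07-07T02:01:03.128Z cpu6:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle",
    "2013-07-07T02:01:50.520Z cpu1:4097)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba37:0:0:0 (driver name: ahci) - Message repeated 1 time",
    "2013-07-07T04:01:55.875Z cpu3:4099)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba36:0:0:0 (driver name: ahci) - Message repeated 203 times",
]

totals = count_queue_failures(sample)
print(totals.most_common())  # [('vmhba36', 225), ('vmhba37', 1)]
```

To run it on the full log, copy vmkwarning.log off the host and pass the open file object to `count_queue_failures` in place of `sample`. A count that is heavily skewed toward one adapter would at least suggest which AHCI port to look at first.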
 
Pretty sure this is a memory issue. When I run just two VMs the machine stays up fine, but once I increase the workload (and with it the RAM usage) by starting more VMs, the machine eventually crashes.

Originally I just did a single pass with MemTest so I'll be doing three passes later in the week.
 
Yeah bad RAM is my suspicion as well.

You might try more than one memory test tool.

I had a whole long dragged-out hardware nightmare last summer that was all down to bad RAM. Microsoft's own memory-test tool showed no problems, but continual crashes, failures to boot and resume from sleep, and the motherboard's own diagnostic error code (a wonderful feature of the Z77 Extreme 4!) made me doubt the RAM.

I then used a couple of different memory test programs, in addition to the MS one from the Windows 7 install disc. One of them (might have been Memtest86) booted from a CD and ran in a simple text-mode environment. It was this one which eventually showed some errors, after I let it run for 12 or so hours.

I RMA'd both pairs of RAM sticks and have had no issues since.
 
Those AHCI errors trouble me. I saw similar errors with AACRAID and my RAID card was dropping out from under me due to a rather strange hardware issue. I resolved the hardware issue and the box was fine.

Is your box completely unresponsive? If you go to the console, does the F2 key work? Can you put in your password and get to the menu?
 
Yes it's completely unresponsive, none of the keys respond, the screen remains dim.
 
I had a similar issue where ESXi would stop responding for no apparent reason. Sometimes it would come back from the dead after 10-15 minutes, sometimes not. Installing 5.1 Update 1 fixed the problem for me.
 
Are there any other solutions besides ESXi for virtualization where I can use my existing volumes on the M1015 (I don't wanna lose the data)? Like SmartOS?
 
Absolutely. I'm using SmartOS myself, and I ported the drives over from OpenIndiana with no problem. I personally find SmartOS great and prefer it to ESXi.
 
Ran Memtest86+ for 12 hours and no errors :(

Do a 48-hour test. You can try GoldMemory as well. Many people report that GoldMemory found errors for them that Memtest86+ didn't.

Are there any other solutions besides ESXi for virtualization where I can use my existing volumes on the M1015 (I don't wanna lose the data)? Like SmartOS?

What pool version are you using?
 
Version 28

Then yeah, SmartOS should work. If you had a newer pool version, it likely wouldn't. FYI, I think SmartOS doesn't support IOMMU passthrough. You won't need it for the RAID controller, but if you, like me, use passthrough for other things as well, then I think SmartOS isn't up to the task.
 
I've been playing around with SmartOS, but the documentation leaves a lot to be desired.
 
That's correct about no IOMMU, but you don't need it for storage. You will miss it if you have some sort of server and HTPC all in one.
 
Also check out Project FiFo if you find the SmartOS learning curve too steep. It gives SmartOS a web GUI control interface, kind of like what FreeNAS does for storage, but for virtualization control instead.
 
I want to set up my current passthrough storage as just an NFS share for all the VMs. I know it's possible, but I would like a little guidance from some docs :(
 
ESXi just doesn't wanna work for me. So disappointing. I really doubt a 48-hour memory test will help, but at this point I really don't know what else to test. I've run it without the M1015 and it still reboots or freezes randomly.
 
Have you tried removing half the memory to see whether the problem still occurs? You might be running into a (hopefully) BIOS issue with all the RAM slots populated.

If it works with half the sticks, try swapping in the other sticks (keeping RAM at 16 GB), just to rule out a fault in the memory that memtest did not uncover. If it still works, check whether an updated BIOS is available that may help.

Or try a different brand or model of RAM sticks.
 
Played musical slots with the DRAM, found the bad sticks. Been running without reboots for 4 days.

Without Dedup enabled, do I really need 1GB/1TB of memory for ZFS?
 
Not at all. Without dedup, I recommend 2GB+ RAM no matter how much storage. 1GB could work but I wouldn't bother trying <2 since 2 is a very reasonable amount.
 