ESXI 5.5 host randomly freezes on new XEON 2011-3 server, please help!

x3r0z

Hello! I'm having a really annoying problem with my new server: the ESXi host
randomly freezes after 14 days, 1 day, 2 days, 5 days, etc.
It becomes unresponsive to any keystrokes, whether on the server console directly
or via the IPMI Java interface. All VMs die, and the only thing I can do is
restart it via IPMI, after which it runs for a couple of days again...

I was never able to reproduce the crash by stressing the box in any way... it just seems to die
while idling.

I'm reusing the parts from my old server except for the CPU, RAM and motherboard. The old build was an E3-1245v3 + ASRock Z87 Extreme 4 (Intel Z87) with 16GB of Kingston HyperX Black memory, which worked
flawlessly for 1½ years with the same PSU, IBM M1015 and chassis.

Now it seems I've reached a dead end trying to find the problem, and I badly need your help.


Hardware:
CPU: Xeon E5 2.2GHz (2.5GHz Turbo) 10-core, socket 2011-3 (engineering sample)
RAM: KINGSTON KVR21R15S4K4/32 KIT (4x8GB)32GB 2133MHz DDR4 ECC Reg CL15 DIMM SR x4 w/TS
MB: Supermicro X10SRH-CLN4F
HDD's connected to mainboard: Kingston V300 120gb, Samsung 830 256gb
PSU: Tagan 2-Force II Series 600W, ATX12V
PCI-E: 3x IBM M1015 in IT mode with 15x 3TB drives (I have not tested the built-in SAS controller yet).
Chassis: Norco RPC-4224 4U
Booting from a 16GB USB stick.



Here is my "what I've tested so far" log:

* 28h of memtest86, no problems detected (memtest probably doesn't work that well with ECC memory..)
* 6h of CPU + memory stress with an x264 encode, works like a charm.. the box has been under
full load at random times and never crashed while loaded.

* # esxcli system settings kernel set --setting=iovDisableIR -v TRUE
* Upgraded to 5.5 U2.
* Disabled the CoD (Cluster-on-Die) CPU setting in the BIOS, changing it to Home Snoop.
* Changed all VM NICs that were set to e1000 to vmxnet3 instead.
* Ran the standard ESXi 5.5 version + VMs from the SSD used in my old E3-1245v3 setup instead of the Kingston 16GB G4 USB stick.
* Upgraded to the latest patches:
VMware_bootbank_esx-base_5.5.0-2.62.2718055.vib
VMware_bootbank_net-ixgbe_3.7.13.7.14iov-12vmw.550.2.62.2718055.vib
VMware_bootbank_misc-drivers_5.5.0-2.62.2718055.vib
VMware_locker_tools-light_5.5.0-2.62.2718055.vib
-- ran for 13 days, crash.
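
For anyone repeating the iovDisableIR step from the list above, a quick read-only way to confirm the setting actually took is to list it back. This is only a sketch, assuming the `list` subcommand on your build shows the configured value next to the runtime value; the function name check_iov_setting is made up for this example. Kernel settings like this one normally need a host reboot before the runtime value changes.

```shell
#!/bin/sh
# Read-only check of the iovDisableIR kernel setting (nothing is changed).
# check_iov_setting is a hypothetical helper name used only in this sketch.
check_iov_setting() {
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found: run this in the ESXi shell or over SSH"
        return 0
    fi
    # Lists the setting with its configured and runtime (active) values.
    esxcli system settings kernel list -o iovDisableIR
}

check_iov_setting
```

If the configured and runtime values differ, the change has most likely not been picked up by a reboot yet.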

18:36 2015-06-09 Removed 2x 8GB RAM.
Changed from a shielded to an unshielded network cable (desperate).
Disconnected the USB reader (desperate).
Disconnected the monitor's DVI cable (desperate).
* Ran for 5 days until crash.

13:07 2015-06-14 Removed the RAM and switched to the other 2x 8GB DDR4 sticks.
Moved one of the three IBM M1015 cards from PCI-E slot 8 to slot 2.
* Ran for 12 days until crash.

18:31 2015-06-26 Just saw that a new BIOS firmware had finally been released, updated to BIOS File Name:
X10SRH-CLN4F X10SRH5_518.zi BIOS Revision: R 1.0b
Note: I just noticed that my KINGSTON KVR21R15S4K4/32 KIT (4x8GB)32GB 2133MHz DDR4 ECC Reg CL15 DIMM SR x4 w/TS
sticks are recognized as SAMSUNG in the BIOS. Compatibility issue?
* Also re-inserted all 4 RAM sticks.

19:12 2015-06-29 3 days... crash.... no more suggestions?...



I'm running out of ideas, could anyone please help me out?
Which log files do I need to attach?
 
I actually hadn't found this specific topic on the VMware site before, thank you :)

* Set Misc.NMILint1IntAction to 2, so hopefully it will generate a PSOD next time.
 
The other thing you could try is upgrading to ESXi 6.

In any case it shouldn't be happening. Could be something as simple as a BIOS setting.
 
@dasaint Thank you, very much appreciated! This is my very first "server" mainboard,
so I guess I have a lot to learn :)
This is the first time I'm looking forward to the next crash :)
 
"Finally"...the box ran for 12 days this time.. I shorted the NMI-pins,
but there was no reaction from the shell that was running in "Alt-F12"/debug mode, and
that log didn't mention anything suspicious before the crash either. I connected the reset
button to the NMI pins for the upcoming crash..


edit: ah, never mind.. guess I need to wait for the next crash... "You must reboot the ESX/ESXi host for the change to take effect."
 
I would look to see if there are any errors thrown in the BIOS/system event log. I had an unstable ESXi host that came down to a bad RAM stick that tested good with memtest but would occasionally throw errors. The system event log showed which slot was bad.
 
@TType85 Nothing other than "AC Power On" and "iKVM login/out" in those logs :/
 
Yesterday the box crashed.. again... this time after only a few days.

Nothing happens when I trigger the NMI button. Am I missing some settings/drivers?
* Changed nmiaction to 1 (debugger mode) instead of PSOD.

I also noticed something odd.. I used "Shutdown" via vSphere and the box froze.
The Alt-F12 debug view was running, and this showed up before it halted:

hwsleep-0216 [4294967282] HwLegacySleep : 2015-07-23T21:33:00.065Z cpu0:32848)Entering sleep state [S5]

So instead of shutting down, ESXi tried to go into a sleep state. ACPI bug?

* I loaded the optimized BIOS defaults and changed the ESXi power setting to High performance instead of Balanced.
 
Disable all power states in BIOS. Lock it to full power, in other words, see if it's stable now.
 
Still no crash since my last post *knock on wood*
Uptime: 3wks 1day 8hrs 13mins 23secs.
One more week and I could probably consider it stable, and conclude that the Balanced power setting
in ESXi has compatibility issues with my system/current ACPI settings in the BIOS (default).

@lopoetve: I haven't disabled it in the BIOS yet. I thought it would have crashed by now, and that would have been my next move... but for the first time in 3½ months it's still up and running :)
 
Performance would have done the same thing as what I suggested, only in software instead of BIOS (which I don't trust as much) :)
 
Uptime: 4wks 1day 6hrs 53mins 26secs
It seems that the problem is finally solved after ~3½ months...
Thanks for all your help:)


This did the trick:
(vSphere) -> Configuration -> Power Management -> High performance instead of Balanced.
The cause is probably some ACPI power state bug with the mainboard/CPU.
Or, as @lopoetve said, disable the power states in the BIOS :)
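
For anyone who prefers the ESXi shell to the vSphere client, the same power policy can probably be inspected, and changed, with esxcli. Treat this as a rough sketch rather than a verified procedure: it assumes your build exposes the /Power/CpuPolicy advanced option and accepts "High Performance" as its value, so check the output of the list command first. The APPLY and POLICY variables and the power_policy function are names invented for this example.

```shell
#!/bin/sh
# Show the host power policy; only change it when APPLY=1 is set.
# APPLY, POLICY and power_policy are hypothetical names for this sketch.
POLICY="${POLICY:-High Performance}"

power_policy() {
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found: run this in the ESXi shell or over SSH"
        return 0
    fi
    # Current value, default value and description of the option.
    esxcli system settings advanced list -o /Power/CpuPolicy
    if [ "${APPLY:-0}" = "1" ]; then
        # Assumption: the option takes the policy name as a string value.
        esxcli system settings advanced set -o /Power/CpuPolicy -s "$POLICY"
        # List again to confirm what is now configured.
        esxcli system settings advanced list -o /Power/CpuPolicy
    fi
}

power_policy
```

Run as-is it only prints the current policy; saving it to a file and running it with APPLY=1 set in the environment would attempt the switch.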

Note that I also loaded the BIOS defaults for the first time after upgrading to the new BIOS firmware.
Until then it had been using the defaults from the original firmware, so unless they made some
minor changes to some settings, my bet is still on the ACPI power states.

And as it finally works, I won't investigate it any further :)
 
Not to revive an old thread but I had the same problem on my X9SRL-F motherboard with ESXi 6. Looks like this may affect multiple Supermicro boards and this is one of the only sources I found leading up to the fix.
 