Server locking up... Logs show voltage issues... help?

So recently I've been experiencing random lock ups.... my server (ESXi) will just lock up and become unresponsive, except for the IPMI.

Checking the server thresholds and logs through the IPMI after a lock-up shows all motherboard voltages reading 0V... CPU, memory, etc., all sensors are showing 0 as well at that time.

The logs seem to show the voltages going either critical high or low. This time (7:31 EST this morning) the logs show high voltage, high CPU, system and memory temps, etc. It seems like every sensor on the MB just goes crazy. VBATT too (and I changed the CMOS battery some time last year, as a VBATT error had randomly shown up at that time).

Resetting the machine doesn't clear the errors either: if I reset, the system will not boot and I just have the MB beeper screaming at me. I have to power the server down completely and then on again, and it boots up as if nothing happened.

The failures (3rd time now) are completely random. The first two were back to back on the same day. The last one took 2 weeks to show up; I was browsing the web on my phone when all of a sudden I noticed my phone had dropped off my WiFi AP. No sounds/alarms from the server at all. No response to pings on any of the subnets related to my VMs/ESXi. I'm only able to ping my hardware devices (switch, IPMI, etc.).

I certainly don't believe it is my PSU (EVGA 1300W) as in my experience PSU's either fail or work. Input voltage is stable (and running through a Cyberpower UPS).

I believe it might be my Supermicro MB that is starting to fail, but I'm not sure. Hoping someone might have had something similar happen to them and can point me in a direction.

Log from after failure:

Code:
86 09/22/2018 07:32:53 VBAT Voltage Upper Non-Recoverable - Going High - Asserted
85 09/22/2018 07:32:53 VBAT Voltage Upper Critical - Going High - Asserted
84 09/22/2018 07:32:52 VBAT Voltage Upper Non-Critical - Going High - Asserted
83 09/22/2018 07:32:50 +3.3VSB Voltage Upper Non-Recoverable - Going High - Asserted
82 09/22/2018 07:32:50 +3.3VSB Voltage Upper Critical - Going High - Asserted
81 09/22/2018 07:32:49 +3.3VSB Voltage Upper Non-Critical - Going High - Asserted
80 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Recoverable - Going High - Asserted
79 09/22/2018 07:32:47 +3.3V Voltage Upper Critical - Going High - Asserted
78 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Critical - Going High - Asserted
77 09/22/2018 07:32:44 +12V Voltage Upper Non-Recoverable - Going High - Asserted
76 09/22/2018 07:32:44 +12V Voltage Upper Critical - Going High - Asserted
75 09/22/2018 07:32:44 +12V Voltage Upper Non-Critical - Going High - Asserted
74 09/22/2018 07:32:41 +5V Voltage Upper Non-Recoverable - Going High - Asserted
73 09/22/2018 07:32:41 +5V Voltage Upper Critical - Going High - Asserted
72 09/22/2018 07:32:41 +5V Voltage Upper Non-Critical - Going High - Asserted
71 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Recoverable - Going High - Asserted
70 09/22/2018 07:32:38 +1.5V Voltage Upper Critical - Going High - Asserted
69 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Critical - Going High - Asserted
68 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
67 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Critical - Going High - Asserted
66 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Critical - Going High - Asserted
65 09/22/2018 07:32:33 CPU1 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
64 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Critical - Going High - Asserted
63 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Non-Critical - Going High - Asserted
62 09/22/2018 07:32:30 CPU2 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
61 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Critical - Going High - Asserted
60 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Non-Critical - Going High - Asserted
59 09/22/2018 07:32:27 CPU1 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
58 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Critical - Going High - Asserted
57 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Non-Critical - Going High - Asserted
56 09/22/2018 07:32:24 System Temp Temperature Upper Non-Recoverable - Going High - Asserted
55 09/22/2018 07:32:23 System Temp Temperature Upper Critical - Going High - Asserted
54 09/22/2018 07:32:23 System Temp Temperature Upper Non-Critical - Going High - Asserted
53 09/22/2018 07:32:11 P2-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
52 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
51 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
50 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
49 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
48 09/22/2018 07:32:07 P2-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
47 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
46 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
45 09/22/2018 07:32:04 P2-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
44 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
43 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
42 09/22/2018 07:32:01 P2-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
41 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
40 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
39 09/22/2018 07:31:58 P2-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
38 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
37 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
36 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
35 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
34 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
33 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
32 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
31 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
30 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
29 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
28 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
27 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
26 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
25 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
24 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
23 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
22 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
21 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
20 09/22/2018 07:31:39 P1-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
19 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
18 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
17 09/22/2018 07:30:18 Fan5 Fan Lower Non-Recoverable - Going Low - Asserted
16 09/22/2018 07:30:17 Fan5 Fan Lower Critical - Going Low - Asserted
15 09/22/2018 07:30:17 Fan5 Fan Lower Non-Critical - Going Low - Asserted
14 09/22/2018 07:29:09 Fan7 Fan Lower Non-Recoverable - Going Low - Asserted
13 09/22/2018 07:29:08 Fan7 Fan Lower Critical - Going Low - Asserted
12 09/22/2018 07:29:08 Fan7 Fan Lower Non-Critical - Going Low - Asserted
11 09/22/2018 07:29:06 Fan6 Fan Lower Non-Recoverable - Going Low - Asserted
10 09/22/2018 07:29:05 Fan6 Fan Lower Critical - Going Low - Asserted

Readings after a power off and on (and system is working as normal):

Code:
CPU1 Temp Normal Low
CPU2 Temp Normal Low
System Temp Normal  43 degrees C
CPU1 Vcore Normal  0.92 Volts
CPU2 Vcore Normal  0.952 Volts
CPU1 DIMM Normal  1.56 Volts
CPU2 DIMM Normal  1.56 Volts
+1.5V Normal  1.504 Volts
+5V Normal  5.056 Volts
+12V Normal  12.084 Volts
+3.3V Normal  3.24 Volts
+3.3VSB Normal  3.216 Volts
VBAT Normal  3.216 Volts
Fan1 Not Available No Reading
Fan2 Not Available No Reading
Fan3 Not Available No Reading
Fan4 Not Available No Reading
Fan5 Normal  1080 RPM
Fan6 Lower Critical  675 RPM
Fan7 Normal  1620 RPM
Fan8 Not Available No Reading
Intrusion    Detected
PS Status    OK
P1-DIMM1A-TEMP Normal  54 degrees C
P1-DIMM1B-TEMP Normal  50 degrees C
P1-DIMM2A-TEMP Normal  55 degrees C
P1-DIMM2B-TEMP Not Available No Reading
P1-DIMM3A-TEMP Normal  52 degrees C
P1-DIMM3B-TEMP Not Available No Reading
P2-DIMM1A-TEMP Normal  51 degrees C
P2-DIMM1B-TEMP Normal  53 degrees C
P2-DIMM2A-TEMP Normal  54 degrees C
P2-DIMM2B-TEMP Not Available No Reading
P2-DIMM3A-TEMP Normal  52 degrees C
P2-DIMM3B-TEMP Not Available No Reading
 
The voltages and temps in the 2nd log are normal, with only minor fluctuation. The first log gives no information on the actual readings, so it is useless without actual values. Check your OS logs. Based on the limited info posted, this doesn't appear to be hardware. "Not available" troubles me and points toward a busy system - doing what? Investigate the performance logging as well.

Inventory your installed software too.

edit: the monitoring software thresholds appear to be too narrow.
 
Yes, as I said, the 2nd log snippet is from after the reboot, with everything working normally.

The first log is the event history, not actual readings. The actual readings when the system fails (which I didn't copy/paste during the failure) look like the second log, except that everything shows 0 volts, 0 RPM, zero for every ACTUAL value (even though the fans are still spinning, etc.).
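
Since the event history only records which threshold tripped and when, one way to get more out of it is to look at the order and timing of the events. Below is a minimal Python sketch, assuming the log is saved as plain text in exactly the layout pasted above (ID, date, time, event text). The function names `parse_event` and `summarize` are made up for illustration and are not part of any IPMI tool.

```python
import re
from datetime import datetime

# One event line as posted: "<id> <MM/DD/YYYY> <HH:MM:SS> <event text>"
LINE_RE = re.compile(r"^(\d+)\s+(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2}:\d{2})\s+(.+)$")

# Threshold levels that appear in the posted log
LEVELS = (
    "Upper Non-Recoverable", "Upper Critical", "Upper Non-Critical",
    "Lower Non-Recoverable", "Lower Critical", "Lower Non-Critical",
)


def parse_event(line):
    """Parse one event-log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event_id, date, clock, text = match.groups()
    parts = text.split(" - ")
    if len(parts) != 3:
        return None
    head, direction, state = parts
    for level in LEVELS:
        if head.endswith(" " + level):
            # What remains is "<sensor name> <sensor type>", e.g. "Fan5 Fan"
            sensor, _, kind = head[: -len(level) - 1].rpartition(" ")
            return {
                "id": int(event_id),
                "time": datetime.strptime(date + " " + clock, "%m/%d/%Y %H:%M:%S"),
                "sensor": sensor,
                "kind": kind,
                "level": level,
                "direction": direction,
                "state": state,
            }
    return None


def summarize(lines):
    """Return (sensor names in order of first event, seconds from first to last event)."""
    events = [e for e in map(parse_event, lines) if e is not None]
    events.sort(key=lambda e: (e["time"], e["id"]))
    order = []
    for event in events:
        if event["sensor"] not in order:
            order.append(event["sensor"])
    span = (events[-1]["time"] - events[0]["time"]).total_seconds() if events else 0.0
    return order, span
```

Fed the full log above, `summarize` lists the sensors in the order they first tripped and the number of seconds between the first and last event, which makes it easier to see whether everything failed in one short sweep.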

This isn't a software issue; the system (including the IPMI console) is completely unavailable/unresponsive as well. If it were software, a soft reboot would also work IMO, but it doesn't. I have to do a hard power off and power on to get the system to POST/boot (so it doesn't get even close to loading ESXi).

If you have any other suggestions for info I should be posting to give a better idea let me know.
 
So let's go with the heat warnings and get some actual CPU temps, DIMM temps and MB temps when critical. Can you unrack it / take the cover off to make sure heat is not the issue? Check and increase the fan setting in the BIOS. What are your ambient temps?
 
I don't have issues with temperature, so I won't be able to get you that; the system has never seen critical temps, ever. Also, none of the data points to a temp issue specifically from what I can see, just a general failure of ALL MB sensors at the same time: temp, voltage, VBATT. The first log is the event log after the fact (no actual numbers are recorded in that log). When the system halts/fails, ALL sensors read 0, as I've said before, so it is impossible for me to get you what you're asking for. All sensors remain at 0 even after a soft reset (where the system will continue not to boot); after a hard power off/on the sensors start to work again and report actual numbers, as shown below and in my OP.

Ambient temps are about 24 deg outside of the closet and 35 inside. As you can see the system/case temp is 43 deg... I never really see it higher than 45-46 deg even when it's warmer out.

The system has been running in this config for about 3 years now.

Here are ACTUAL sensor readings taken just now:

Code:
Name   Status  Reading   Low NR   Low CT   Low NC   High NC   High CT   High NR  
CPU1 Temp Normal Low N/A N/A N/A N/A N/A N/A
CPU2 Temp Normal Low N/A N/A N/A N/A N/A N/A
System Temp Normal 43 degrees C 0 degrees C 0 degrees C 0 degrees C 81 degrees C 82 degrees C 83 degrees C
CPU1 Vcore Normal 1.024 Volts 0.808 Volts 0.816 Volts 0.824 Volts 1.352 Volts 1.36 Volts 1.368 Volts
CPU2 Vcore Normal 0.96 Volts 0.808 Volts 0.816 Volts 0.824 Volts 1.352 Volts 1.36 Volts 1.368 Volts
CPU1 DIMM Normal 1.56 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
CPU2 DIMM Normal 1.56 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
+1.5V Normal 1.504 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
+5V Normal 5.056 Volts 4.416 Volts 4.448 Volts 4.48 Volts 5.536 Volts 5.568 Volts 5.6 Volts
+12V Normal 12.084 Volts 10.6 Volts 10.653 Volts 10.706 Volts 13.25 Volts 13.303 Volts 13.356 Volts
+3.3V Normal 3.24 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
+3.3VSB Normal 3.216 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
VBAT Normal 3.216 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
Fan1 Not Available No Reading 405 540 675 34155 34290 34425
Fan2 Not Available No Reading 405 540 675 34155 34290 34425
Fan3 Not Available No Reading 405 540 675 34155 34290 34425
Fan4 Not Available No Reading 405 540 675 34155 34290 34425
Fan5 Normal 1080 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
Fan6 Normal 945 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
Fan7 Normal 1620 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
Fan8 Not Available No Reading 405 540 675 34155 34290 34425
Intrusion   Detected N/A N/A N/A N/A N/A N/A
PS Status   OK N/A N/A N/A N/A N/A N/A
P1-DIMM1A-TEMP Normal 54 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P1-DIMM1B-TEMP Normal 49 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P1-DIMM2A-TEMP Normal 54 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P1-DIMM2B-TEMP Not Available No Reading 0 0 0 75 80 85
P1-DIMM3A-TEMP Normal 51 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P1-DIMM3B-TEMP Not Available No Reading 0 0 0 75 80 85
P2-DIMM1A-TEMP Normal 51 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P2-DIMM1B-TEMP Normal 53 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P2-DIMM2A-TEMP Normal 55 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P2-DIMM2B-TEMP Not Available No Reading 0 0 0 75 80 85
P2-DIMM3A-TEMP Normal 53 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
P2-DIMM3B-TEMP Not Available No Reading 0 0 0 75 80 85
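
To make the threshold columns easier to reason about, here is a small Python sketch that classifies a reading against the six limits in a row. It assumes a lower threshold trips at `reading <= threshold` and an upper one at `reading >= threshold`; the board's firmware may compare differently. `classify` and `headroom` are hypothetical helper names, and the numbers are copied from the +12V row above.

```python
def classify(reading, low_nr, low_ct, low_nc, high_nc, high_ct, high_nr):
    """Return the most severe threshold state a reading falls into.

    Assumption: lower thresholds trip at reading <= threshold and upper
    thresholds at reading >= threshold; real firmware may differ.
    """
    if reading <= low_nr:
        return "Lower Non-Recoverable"
    if reading <= low_ct:
        return "Lower Critical"
    if reading <= low_nc:
        return "Lower Non-Critical"
    if reading >= high_nr:
        return "Upper Non-Recoverable"
    if reading >= high_ct:
        return "Upper Critical"
    if reading >= high_nc:
        return "Upper Non-Critical"
    return "Normal"


def headroom(reading, low_nc, high_nc):
    """Distance from the reading to the nearest non-critical threshold."""
    return min(reading - low_nc, high_nc - reading)


# Thresholds copied from the +12V row of the table above
V12 = dict(low_nr=10.6, low_ct=10.653, low_nc=10.706,
           high_nc=13.25, high_ct=13.303, high_nr=13.356)

if __name__ == "__main__":
    print(classify(12.084, **V12))  # the healthy reading from the table
    print(classify(0.0, **V12))     # the 0 V seen during a lock-up
    print(round(headroom(12.084, V12["low_nc"], V12["high_nc"]), 3))
```

Under that assumption a genuine 0 V reading would land in the lower states, whereas the event log shows upper ("Going High") assertions for the voltage rails. That mismatch suggests the values seen during a lock-up may not be real measurements.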

For another idea of the case temps, here is what the 13 HDD's report:

Code:
Smartinfos (buffered info max s old. Health-check can increase the counter for Soft-Errors!! Click on smart_sn to display Smart details)
 id   diskcap   pool   vdev   state   error   smart_model   smart_type   smart_health   temp   smart_sn   smart_selftest   smart_check
 c2t0d0       rpool   basic   ONLINE    S:0 H:0 T:0   -   -   -   -      -   -
 c5t50014EE20A467FA6d0       SCRATCH_TMP   mirror-0   ONLINE    S:0 H:0 T:0   WDC WD10EFRX-68PJCN0   sat,12   PASSED   31 °C   WDWCC4J4385481   --   short long abort log
 c5t50014EE20C8A235Cd0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   34 °C   WDWCC4E0TRF0NK   --   short long abort log
 c5t50014EE20C8A61A8d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E0RLC2XL   --   short long abort log
 c5t50014EE20C8A802Ad0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E6PVPYN7   --   short long abort log
 c5t50014EE25F9BC588d0       SCRATCH_TMP   mirror-0   ONLINE    S:0 H:0 T:0   WDC WD10EFRX-68PJCN0   sat,12   PASSED   29 °C   WDWCC4J4549648   --   short long abort log
 c5t50014EE261DFBF31d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E6PVPACX   --   short long abort log
 c5t50014EE261DFCCC6d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E7FZ2SL8   --   short long abort log
 c5t50014EE2B7350224d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E1SANR1T   --   short long abort log
 c5t50014EE2B7353608d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   40 °C   WDWCC4E0TRFL0A   --   short long abort log
 c5t50014EE2B7355807d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E4SZ275E   --   short long abort log
 c5t50014EE2B735BA1Ad0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E6PVPZUE   --   short long abort log
 c5t50014EE2B737F8D9d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E7JLA4TL   --   short long abort log


 id   diskcap   pool   vdev   state   error   smart_model   smart_type   health   temp   smart_sn   smart_selftest   smart_check
 c5t50014EE2B734C691d0       -   -   -    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   -   38 °C   WDWCC4E0TRFDFK   without error   short long abort log
 
Run Memtest and/or open a case with the motherboard manufacturer. Run any diagnostics you have available. Without being in front of it, it's hard to get a better idea.
 
Understandable, not a lot of info for someone over the internet to assess... Since the last crash nothing has happened.

I think I will start to plan my next upgrade path just in case as this is most likely a failure of the mainboard somehow.
 
Sounds like a bad PSU to me.
It's very common for them to have marginal and intermittent failures.

They don't always fail in a totally dead condition. I had a very nice and expensive PSU fail after a couple of years. The 5V would drop under load and cause reboots.

Also, clearly the 0 volts reading on everything with fans spinning is not possible.
Those readings are incorrect, don't rely on them.

If the failure is still present with a different PSU, then next likely item is the mobo.
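
For a rough sanity check of the main rails, the Python sketch below compares the readings posted after the power cycle with their nominal values. The 5% limit is an assumption (the tolerance commonly quoted for the +3.3V, +5V and +12V ATX rails), and `deviation_pct` and `out_of_tolerance` are hypothetical helper names.

```python
# Nominal rail voltage and the reading posted after the power cycle
READINGS = {
    "+3.3V": (3.3, 3.24),
    "+5V": (5.0, 5.056),
    "+12V": (12.0, 12.084),
}

TOLERANCE_PCT = 5.0  # assumption: +/-5% as commonly quoted for these rails


def deviation_pct(nominal, measured):
    """Signed deviation of the measured voltage from nominal, in percent."""
    return (measured - nominal) / nominal * 100.0


def out_of_tolerance(readings, tolerance_pct=TOLERANCE_PCT):
    """Return the names of rails whose deviation exceeds the tolerance."""
    return [
        name
        for name, (nominal, measured) in readings.items()
        if abs(deviation_pct(nominal, measured)) > tolerance_pct
    ]


if __name__ == "__main__":
    for name, (nominal, measured) in READINGS.items():
        print(name, round(deviation_pct(nominal, measured), 2), "%")
    print("out of tolerance:", out_of_tolerance(READINGS))
```

A single snapshot like this only describes the rails while the system is running normally. A sag under load, as described above, would only show up in readings logged over time or measured at the moment of failure.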
