NOTORIOUS VR
n00b
- Joined
- Aug 10, 2012
- Messages
- 31
So recently I've been experiencing random lock-ups: my server (ESXi) will just lock up and become unresponsive, except for the IPMI.
Checking the server thresholds and logs through the IPMI after a lock-up shows all motherboard voltages reading 0V. CPU, memory, etc.: all sensors show 0 at that time as well.
The logs seem to show the voltages going either critical high or critical low. This time (7:31 EST this morning) the logs show high voltage, and all the CPU, system and memory temps are high, etc. It seems like every sensor on the MB just goes crazy, VBATT too (I changed the CMOS battery some time last year because a VBATT error randomly showed up back then).
Resetting the machine doesn't clear the errors either. If I reset, the system will not boot and the MB beeper just screams at me. I have to power the server down completely and then on again, and it boots up as if nothing happened.
The failures (3rd time now) are completely random. The first two were back to back on the same day. The last one took 2 weeks to show up: I was browsing the web from my phone when all of a sudden I noticed my phone had dropped off my WiFi AP. No sounds/alarms from the server at all. No response to pings on any of the subnets related to my VMs/ESXi; I'm only able to ping my hardware devices (switch, IPMI, etc.).
I certainly don't believe it is my PSU (EVGA 1300W), as in my experience PSUs either fail or work. Input voltage is stable (and running through a CyberPower UPS).
I believe it might be my Supermicro MB that is starting to fail, but I'm not sure. Hoping someone has had something similar happen to them and can point me in a direction.
Log from after failure:
Code:
86 09/22/2018 07:32:53 VBAT Voltage Upper Non-Recoverable - Going High - Asserted
85 09/22/2018 07:32:53 VBAT Voltage Upper Critical - Going High - Asserted
84 09/22/2018 07:32:52 VBAT Voltage Upper Non-Critical - Going High - Asserted
83 09/22/2018 07:32:50 +3.3VSB Voltage Upper Non-Recoverable - Going High - Asserted
82 09/22/2018 07:32:50 +3.3VSB Voltage Upper Critical - Going High - Asserted
81 09/22/2018 07:32:49 +3.3VSB Voltage Upper Non-Critical - Going High - Asserted
80 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Recoverable - Going High - Asserted
79 09/22/2018 07:32:47 +3.3V Voltage Upper Critical - Going High - Asserted
78 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Critical - Going High - Asserted
77 09/22/2018 07:32:44 +12V Voltage Upper Non-Recoverable - Going High - Asserted
76 09/22/2018 07:32:44 +12V Voltage Upper Critical - Going High - Asserted
75 09/22/2018 07:32:44 +12V Voltage Upper Non-Critical - Going High - Asserted
74 09/22/2018 07:32:41 +5V Voltage Upper Non-Recoverable - Going High - Asserted
73 09/22/2018 07:32:41 +5V Voltage Upper Critical - Going High - Asserted
72 09/22/2018 07:32:41 +5V Voltage Upper Non-Critical - Going High - Asserted
71 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Recoverable - Going High - Asserted
70 09/22/2018 07:32:38 +1.5V Voltage Upper Critical - Going High - Asserted
69 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Critical - Going High - Asserted
68 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
67 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Critical - Going High - Asserted
66 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Critical - Going High - Asserted
65 09/22/2018 07:32:33 CPU1 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
64 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Critical - Going High - Asserted
63 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Non-Critical - Going High - Asserted
62 09/22/2018 07:32:30 CPU2 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
61 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Critical - Going High - Asserted
60 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Non-Critical - Going High - Asserted
59 09/22/2018 07:32:27 CPU1 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
58 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Critical - Going High - Asserted
57 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Non-Critical - Going High - Asserted
56 09/22/2018 07:32:24 System Temp Temperature Upper Non-Recoverable - Going High - Asserted
55 09/22/2018 07:32:23 System Temp Temperature Upper Critical - Going High - Asserted
54 09/22/2018 07:32:23 System Temp Temperature Upper Non-Critical - Going High - Asserted
53 09/22/2018 07:32:11 P2-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
52 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
51 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
50 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
49 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
48 09/22/2018 07:32:07 P2-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
47 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
46 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
45 09/22/2018 07:32:04 P2-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
44 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
43 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
42 09/22/2018 07:32:01 P2-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
41 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
40 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
39 09/22/2018 07:31:58 P2-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
38 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
37 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
36 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
35 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
34 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
33 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
32 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
31 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
30 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
29 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
28 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
27 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
26 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
25 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
24 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
23 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
22 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
21 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
20 09/22/2018 07:31:39 P1-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
19 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
18 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
17 09/22/2018 07:30:18 Fan5 Fan Lower Non-Recoverable - Going Low - Asserted
16 09/22/2018 07:30:17 Fan5 Fan Lower Critical - Going Low - Asserted
15 09/22/2018 07:30:17 Fan5 Fan Lower Non-Critical - Going Low - Asserted
14 09/22/2018 07:29:09 Fan7 Fan Lower Non-Recoverable - Going Low - Asserted
13 09/22/2018 07:29:08 Fan7 Fan Lower Critical - Going Low - Asserted
12 09/22/2018 07:29:08 Fan7 Fan Lower Non-Critical - Going Low - Asserted
11 09/22/2018 07:29:06 Fan6 Fan Lower Non-Recoverable - Going Low - Asserted
10 09/22/2018 07:29:05 Fan6 Fan Lower Critical - Going Low - Asserted
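For anyone who wants to slice an event log like the one above instead of eyeballing it, here is a rough Python sketch that groups the entries by sensor type and reports when each group started and stopped tripping. The regex is inferred only from the lines pasted in this post, so other BMC firmware may format events differently; `parse_sel` and `summarize` are made-up helper names, and sorting by entry ID assumes the IDs increase with time, as they do here.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern inferred from the log lines in this post only (an assumption);
# other firmware versions may lay out the fields differently.
SEL_LINE = re.compile(
    r"^(?P<id>\d+) (?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<sensor>.+?) (?P<kind>Voltage|Temperature|Fan) "
    r"(?P<bound>Upper|Lower) (?P<level>Non-Recoverable|Critical|Non-Critical) "
    r"- Going (?P<direction>High|Low) - (?P<state>Asserted|Deasserted)$"
)


def parse_sel(text):
    """Return the recognized events as dicts, lowest entry ID first."""
    events = []
    for line in text.splitlines():
        m = SEL_LINE.match(line.strip())
        if m is None:
            continue  # skip blanks, labels, and anything unrecognized
        event = m.groupdict()
        event["id"] = int(event["id"])
        event["ts"] = datetime.strptime(event["ts"], "%m/%d/%Y %H:%M:%S")
        events.append(event)
    return sorted(events, key=lambda e: e["id"])


def summarize(events):
    """Map sensor kind -> (distinct sensor count, first time, last time)."""
    sensors = defaultdict(set)
    times = defaultdict(list)
    for e in events:
        sensors[e["kind"]].add(e["sensor"])
        times[e["kind"]].append(e["ts"])
    return {k: (len(sensors[k]), min(times[k]), max(times[k])) for k in sensors}


# A few lines copied from the log above.
SAMPLE = """\
17 09/22/2018 07:30:18 Fan5 Fan Lower Non-Recoverable - Going Low - Asserted
18 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
56 09/22/2018 07:32:24 System Temp Temperature Upper Non-Recoverable - Going High - Asserted
86 09/22/2018 07:32:53 VBAT Voltage Upper Non-Recoverable - Going High - Asserted
"""

if __name__ == "__main__":
    for kind, (count, first, last) in summarize(parse_sel(SAMPLE)).items():
        print(f"{kind}: {count} sensor(s), {first:%H:%M:%S} to {last:%H:%M:%S}")
```

Fed the full log, this should make the ordering easier to see: fans trip first, then the DIMM temps, then system temp, then the voltage rails, with each sensor crossing all three thresholds within about a second.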
Readings after a power off and on (and system is working as normal):
Code:
CPU1 Temp Normal Low
CPU2 Temp Normal Low
System Temp Normal 43 degrees C
CPU1 Vcore Normal 0.92 Volts
CPU2 Vcore Normal 0.952 Volts
CPU1 DIMM Normal 1.56 Volts
CPU2 DIMM Normal 1.56 Volts
+1.5V Normal 1.504 Volts
+5V Normal 5.056 Volts
+12V Normal 12.084 Volts
+3.3V Normal 3.24 Volts
+3.3VSB Normal 3.216 Volts
VBAT Normal 3.216 Volts
Fan1 Not Available No Reading
Fan2 Not Available No Reading
Fan3 Not Available No Reading
Fan4 Not Available No Reading
Fan5 Normal 1080 RPM
Fan6 Lower Critical 675 RPM
Fan7 Normal 1620 RPM
Fan8 Not Available No Reading
Intrusion Detected
PS Status OK
P1-DIMM1A-TEMP Normal 54 degrees C
P1-DIMM1B-TEMP Normal 50 degrees C
P1-DIMM2A-TEMP Normal 55 degrees C
P1-DIMM2B-TEMP Not Available No Reading
P1-DIMM3A-TEMP Normal 52 degrees C
P1-DIMM3B-TEMP Not Available No Reading
P2-DIMM1A-TEMP Normal 51 degrees C
P2-DIMM1B-TEMP Normal 53 degrees C
P2-DIMM2A-TEMP Normal 54 degrees C
P2-DIMM2B-TEMP Not Available No Reading
P2-DIMM3A-TEMP Normal 52 degrees C
P2-DIMM3B-TEMP Not Available No Reading
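In the same spirit, a small Python sketch for picking out whatever is not reporting "Normal" in a readings dump like the one above. The status strings in the regex are limited to the three that appear in this post (an assumption; other boards or firmware may report more), `flagged` is a made-up helper name, and lines that do not fit the "sensor, status, reading" shape, such as "Intrusion Detected" and "PS Status OK", are simply skipped.

```python
import re

# Status strings limited to those that appear in this dump (an assumption).
READING = re.compile(
    r"^(?P<sensor>.+?) (?P<status>Normal|Not Available|Lower Critical) "
    r"(?P<reading>.+)$"
)

# Statuses that are not worth flagging: healthy, or simply nothing plugged in.
QUIET = {"Normal", "Not Available"}


def flagged(text):
    """Return (sensor, status, reading) for every line with a non-quiet status."""
    out = []
    for line in text.splitlines():
        m = READING.match(line.strip())
        if m and m["status"] not in QUIET:
            out.append((m["sensor"], m["status"], m["reading"]))
    return out


# A few lines copied from the readings above.
SAMPLE = """\
System Temp Normal 43 degrees C
Fan1 Not Available No Reading
Fan5 Normal 1080 RPM
Fan6 Lower Critical 675 RPM
PS Status OK
"""

if __name__ == "__main__":
    for sensor, status, reading in flagged(SAMPLE):
        print(f"{sensor}: {status} ({reading})")
```

On the readings above, the only thing this flags is Fan6 at 675 RPM, which sits at Lower Critical even while the system is otherwise running normally.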