Server locking up... Logs show voltage issues... help?

Discussion in 'Motherboards' started by NOTORIOUS VR, Sep 22, 2018.

  1. NOTORIOUS VR

    NOTORIOUS VR n00bie

    Messages:
    31
    Joined:
    Aug 10, 2012
    So recently I've been experiencing random lock ups.... my server (ESXi) will just lock up and become unresponsive, except for the IPMI.

    Checking the server thresholds and logs through the IPMI after the lock up shows all motherboard voltages reading 0V... CPU, memory, etc.; all other sensors are showing 0 at that time as well.

    The logs seem to show the voltages going either critical high or critical low. This time (7:31 EST this morning) the logs show high voltage, and all the CPU, system and memory temps are high, etc. It seems like every sensor on the MB just goes crazy. VBATT too (and I changed the CMOS battery some time last year, as I had a VBATT error randomly show up at that time).

    Resetting the machine doesn't clear the errors either. If I reset, the system will not boot and I just have the MB beeper screaming at me. I have to power the server down completely and then back on, and it boots up as if nothing happened.

    The failures (3rd time now) are completely random. The first two were back to back on the same day. The last one took 2 weeks to show up: I was browsing the web from my phone when all of a sudden I noticed my phone dropped my Wifi AP. No sounds/alarms from the server at all. No response to pings on any of the subnets related to my VMs/ESXi. I was only able to ping my hardware devices (switch, IPMI, etc).

    I certainly don't believe it is my PSU (EVGA 1300W), as in my experience PSUs either fail or work. Input voltage is stable (and running through a Cyberpower UPS).

    I believe it might be my Supermicro MB that is starting to fail, but I'm not sure. Hoping someone might have had something similar happen to them and can point me in a direction.

    Log from after failure:

    Code:
    86 09/22/2018 07:32:53 VBAT Voltage Upper Non-Recoverable - Going High - Asserted
    85 09/22/2018 07:32:53 VBAT Voltage Upper Critical - Going High - Asserted
    84 09/22/2018 07:32:52 VBAT Voltage Upper Non-Critical - Going High - Asserted
    83 09/22/2018 07:32:50 +3.3VSB Voltage Upper Non-Recoverable - Going High - Asserted
    82 09/22/2018 07:32:50 +3.3VSB Voltage Upper Critical - Going High - Asserted
    81 09/22/2018 07:32:49 +3.3VSB Voltage Upper Non-Critical - Going High - Asserted
    80 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Recoverable - Going High - Asserted
    79 09/22/2018 07:32:47 +3.3V Voltage Upper Critical - Going High - Asserted
    78 09/22/2018 07:32:47 +3.3V Voltage Upper Non-Critical - Going High - Asserted
    77 09/22/2018 07:32:44 +12V Voltage Upper Non-Recoverable - Going High - Asserted
    76 09/22/2018 07:32:44 +12V Voltage Upper Critical - Going High - Asserted
    75 09/22/2018 07:32:44 +12V Voltage Upper Non-Critical - Going High - Asserted
    74 09/22/2018 07:32:41 +5V Voltage Upper Non-Recoverable - Going High - Asserted
    73 09/22/2018 07:32:41 +5V Voltage Upper Critical - Going High - Asserted
    72 09/22/2018 07:32:41 +5V Voltage Upper Non-Critical - Going High - Asserted
    71 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Recoverable - Going High - Asserted
    70 09/22/2018 07:32:38 +1.5V Voltage Upper Critical - Going High - Asserted
    69 09/22/2018 07:32:38 +1.5V Voltage Upper Non-Critical - Going High - Asserted
    68 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
    67 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Critical - Going High - Asserted
    66 09/22/2018 07:32:35 CPU2 DIMM Voltage Upper Non-Critical - Going High - Asserted
    65 09/22/2018 07:32:33 CPU1 DIMM Voltage Upper Non-Recoverable - Going High - Asserted
    64 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Critical - Going High - Asserted
    63 09/22/2018 07:32:32 CPU1 DIMM Voltage Upper Non-Critical - Going High - Asserted
    62 09/22/2018 07:32:30 CPU2 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
    61 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Critical - Going High - Asserted
    60 09/22/2018 07:32:29 CPU2 Vcore Voltage Upper Non-Critical - Going High - Asserted
    59 09/22/2018 07:32:27 CPU1 Vcore Voltage Upper Non-Recoverable - Going High - Asserted
    58 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Critical - Going High - Asserted
    57 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Non-Critical - Going High - Asserted
    56 09/22/2018 07:32:24 System Temp Temperature Upper Non-Recoverable - Going High - Asserted
    55 09/22/2018 07:32:23 System Temp Temperature Upper Critical - Going High - Asserted
    54 09/22/2018 07:32:23 System Temp Temperature Upper Non-Critical - Going High - Asserted
    53 09/22/2018 07:32:11 P2-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    52 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
    51 09/22/2018 07:32:10 P2-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    50 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    49 09/22/2018 07:32:08 P2-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
    48 09/22/2018 07:32:07 P2-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    47 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    46 09/22/2018 07:32:05 P2-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
    45 09/22/2018 07:32:04 P2-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    44 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    43 09/22/2018 07:32:02 P2-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
    42 09/22/2018 07:32:01 P2-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    41 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    40 09/22/2018 07:31:59 P2-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
    39 09/22/2018 07:31:58 P2-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    38 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    37 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
    36 09/22/2018 07:31:56 P2-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    35 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    34 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Critical - Going High - Asserted
    33 09/22/2018 07:31:53 P1-DIMM3B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    32 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    31 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Critical - Going High - Asserted
    30 09/22/2018 07:31:50 P1-DIMM3A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    29 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    28 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Critical - Going High - Asserted
    27 09/22/2018 07:31:47 P1-DIMM2B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    26 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    25 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Critical - Going High - Asserted
    24 09/22/2018 07:31:44 P1-DIMM2A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    23 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    22 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Critical - Going High - Asserted
    21 09/22/2018 07:31:41 P1-DIMM1B-TEMP Temperature Upper Non-Critical - Going High - Asserted
    20 09/22/2018 07:31:39 P1-DIMM1A-TEMP Temperature Upper Non-Recoverable - Going High - Asserted
    19 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Critical - Going High - Asserted
    18 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
    17 09/22/2018 07:30:18 Fan5 Fan Lower Non-Recoverable - Going Low - Asserted
    16 09/22/2018 07:30:17 Fan5 Fan Lower Critical - Going Low - Asserted
    15 09/22/2018 07:30:17 Fan5 Fan Lower Non-Critical - Going Low - Asserted
    14 09/22/2018 07:29:09 Fan7 Fan Lower Non-Recoverable - Going Low - Asserted
    13 09/22/2018 07:29:08 Fan7 Fan Lower Critical - Going Low - Asserted
    12 09/22/2018 07:29:08 Fan7 Fan Lower Non-Critical - Going Low - Asserted
    11 09/22/2018 07:29:06 Fan6 Fan Lower Non-Recoverable - Going Low - Asserted
    10 09/22/2018 07:29:05 Fan6 Fan Lower Critical - Going Low - Asserted
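
    For anyone who wants to look at the ordering and timing of these events rather than eyeball the list, here is a minimal Python sketch that parses lines in the format shown above. The sample is a small subset of the log (including its oldest and newest entries), and the regex and the names parse_sel and summarize are my own assumptions based only on how these lines look, not an official IPMI/SEL format.

```python
import re
from datetime import datetime

# A few lines copied from the event log above (subset, including the oldest and newest entries).
SEL_LINES = """\
86 09/22/2018 07:32:53 VBAT Voltage Upper Non-Recoverable - Going High - Asserted
57 09/22/2018 07:32:26 CPU1 Vcore Voltage Upper Non-Critical - Going High - Asserted
18 09/22/2018 07:31:38 P1-DIMM1A-TEMP Temperature Upper Non-Critical - Going High - Asserted
17 09/22/2018 07:30:18 Fan5 Fan Lower Non-Recoverable - Going Low - Asserted
10 09/22/2018 07:29:05 Fan6 Fan Lower Critical - Going Low - Asserted
""".splitlines()

# Assumed layout: id, MM/DD/YYYY HH:MM:SS, sensor name, sensor type, threshold level, direction.
LINE_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<sensor>.+?)\s+(?P<type>Voltage|Temperature|Fan)\s+"
    r"(?P<level>(?:Upper|Lower) (?:Non-Recoverable|Non-Critical|Critical))"
    r"\s+-\s+(?P<direction>Going (?:High|Low))\s+-\s+Asserted\s*$"
)


def parse_sel(lines):
    """Return parsed events sorted oldest-first; lines that do not match are skipped."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        event = match.groupdict()
        event["id"] = int(event["id"])
        event["ts"] = datetime.strptime(event["ts"], "%m/%d/%Y %H:%M:%S")
        events.append(event)
    return sorted(events, key=lambda e: (e["ts"], e["id"]))


def summarize(events):
    """Return the total time span in seconds and the first assertion time per sensor type."""
    span = (events[-1]["ts"] - events[0]["ts"]).total_seconds()
    first_seen = {}
    for event in events:
        first_seen.setdefault(event["type"], event["ts"])
    return span, first_seen


events = parse_sel(SEL_LINES)
span_seconds, first_seen = summarize(events)
print(f"{len(events)} events over {span_seconds:.0f} s")
for sensor_type, ts in first_seen.items():
    print(f"{sensor_type:12s} first asserted at {ts:%H:%M:%S}")
```

    Feeding it the full log instead of the subset would show whether the assertions arrived all at once or one sensor after another, which is easier to judge from a summary than from 77 raw lines.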
    
    Readings after a power off and on (and system is working as normal):

    Code:
    CPU1 Temp Normal Low
    CPU2 Temp Normal Low
    System Temp Normal  43 degrees C
    CPU1 Vcore Normal  0.92 Volts
    CPU2 Vcore Normal  0.952 Volts
    CPU1 DIMM Normal  1.56 Volts
    CPU2 DIMM Normal  1.56 Volts
    +1.5V Normal  1.504 Volts
    +5V Normal  5.056 Volts
    +12V Normal  12.084 Volts
    +3.3V Normal  3.24 Volts
    +3.3VSB Normal  3.216 Volts
    VBAT Normal  3.216 Volts
    Fan1 Not Available No Reading
    Fan2 Not Available No Reading
    Fan3 Not Available No Reading
    Fan4 Not Available No Reading
    Fan5 Normal  1080 RPM
    Fan6 Lower Critical  675 RPM
    Fan7 Normal  1620 RPM
    Fan8 Not Available No Reading
    Intrusion    Detected
    PS Status    OK
    P1-DIMM1A-TEMP Normal  54 degrees C
    P1-DIMM1B-TEMP Normal  50 degrees C
    P1-DIMM2A-TEMP Normal  55 degrees C
    P1-DIMM2B-TEMP Not Available No Reading
    P1-DIMM3A-TEMP Normal  52 degrees C
    P1-DIMM3B-TEMP Not Available No Reading
    P2-DIMM1A-TEMP Normal  51 degrees C
    P2-DIMM1B-TEMP Normal  53 degrees C
    P2-DIMM2A-TEMP Normal  54 degrees C
    P2-DIMM2B-TEMP Not Available No Reading
    P2-DIMM3A-TEMP Normal  52 degrees C
    P2-DIMM3B-TEMP Not Available No Reading
    
     
  2. Mega6

    Mega6 [H]ard|Gawd

    Messages:
    1,123
    Joined:
    Aug 13, 2017
    The voltages and temps in the 2nd log are normal, with minor fluctuation. The first log gives no information on the actual readings, so it is useless without actual values. Check your OS logs. This doesn't appear to be hardware with the limited info posted. "Not available" troubles me and points toward a busy system - doing what? Investigate the performance logging as well.

    Inventory your installed software too.

    edit: the monitoring software thresholds appear to be too narrow.
     
    Last edited: Sep 22, 2018
  3. NOTORIOUS VR

    Yes, as I said, the 2nd log snippet is from after the reboot, with everything working normally.

    The first log is the history, not actual readings. I didn't copy/paste the actual readings during the failure, but they look like the second log except that everything is 0 volts, 0 RPM, zero everything for ACTUAL values (even though the fans are still spinning, etc.)

    This isn't a software issue; the system (including the IPMI console) is completely unavailable/unresponsive as well. If it were software, a soft reboot would also work IMO, but it doesn't. I have to do a hard power off and power on to get the system to POST/boot (so not even close to loading ESXi).

    If you have any other suggestions for info I should be posting to give a better idea let me know.
     
  4. Mega6

    So let's go with the heat warnings and get some actual CPU temps, DIMM temps, MB temps, when critical. Can you unrack it and take the cover off to make sure heat is not the issue? Check and increase the fan setting in the BIOS. What are your ambient temps?
     
  5. NOTORIOUS VR

    I don't have issues with temperature so I won't be able to get you that; the system has never seen critical temps, ever. Also, none of the data points to a temp issue specifically from what I can see, just a general failure of ALL MB sensors at the same time: temp, voltage, VBATT. The first log is the event log after the fact (no actual numbers are recorded in that log). When the system halts/fails, ALL sensors are reading 0 as I've said before, so it is impossible for me to get you what you're asking for. All sensors remain at 0 even after a soft reset (where the system will continue not to boot); after a hard power off/on the sensors start to work again and report actual numbers as shown below and in my OP.

    Ambient temps are about 24 deg outside of the closet and 35 inside. As you can see the system/case temp is 43 deg... I never really see it higher than 45-46 deg even when it's warmer out.

    The system has been running in this config for about 3 years now.

    Here are ACTUAL sensor readings taken just now:

    Code:
    Name   Status  Reading   Low NR   Low CT   Low NC   High NC   High CT   High NR  
    CPU1 Temp Normal Low N/A N/A N/A N/A N/A N/A
    CPU2 Temp Normal Low N/A N/A N/A N/A N/A N/A
    System Temp Normal 43 degrees C 0 degrees C 0 degrees C 0 degrees C 81 degrees C 82 degrees C 83 degrees C
    CPU1 Vcore Normal 1.024 Volts 0.808 Volts 0.816 Volts 0.824 Volts 1.352 Volts 1.36 Volts 1.368 Volts
    CPU2 Vcore Normal 0.96 Volts 0.808 Volts 0.816 Volts 0.824 Volts 1.352 Volts 1.36 Volts 1.368 Volts
    CPU1 DIMM Normal 1.56 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
    CPU2 DIMM Normal 1.56 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
    +1.5V Normal 1.504 Volts 1.32 Volts 1.328 Volts 1.336 Volts 1.656 Volts 1.664 Volts 1.672 Volts
    +5V Normal 5.056 Volts 4.416 Volts 4.448 Volts 4.48 Volts 5.536 Volts 5.568 Volts 5.6 Volts
    +12V Normal 12.084 Volts 10.6 Volts 10.653 Volts 10.706 Volts 13.25 Volts 13.303 Volts 13.356 Volts
    +3.3V Normal 3.24 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
    +3.3VSB Normal 3.216 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
    VBAT Normal 3.216 Volts 2.88 Volts 2.904 Volts 2.928 Volts 3.648 Volts 3.672 Volts 3.696 Volts
    Fan1 Not Available No Reading 405 540 675 34155 34290 34425
    Fan2 Not Available No Reading 405 540 675 34155 34290 34425
    Fan3 Not Available No Reading 405 540 675 34155 34290 34425
    Fan4 Not Available No Reading 405 540 675 34155 34290 34425
    Fan5 Normal 1080 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
    Fan6 Normal 945 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
    Fan7 Normal 1620 RPM 405 RPM 540 RPM 675 RPM 34155 RPM 34290 RPM 34425 RPM
    Fan8 Not Available No Reading 405 540 675 34155 34290 34425
    Intrusion   Detected N/A N/A N/A N/A N/A N/A
    PS Status   OK N/A N/A N/A N/A N/A N/A
    P1-DIMM1A-TEMP Normal 54 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P1-DIMM1B-TEMP Normal 49 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P1-DIMM2A-TEMP Normal 54 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P1-DIMM2B-TEMP Not Available No Reading 0 0 0 75 80 85
    P1-DIMM3A-TEMP Normal 51 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P1-DIMM3B-TEMP Not Available No Reading 0 0 0 75 80 85
    P2-DIMM1A-TEMP Normal 51 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P2-DIMM1B-TEMP Normal 53 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P2-DIMM2A-TEMP Normal 55 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P2-DIMM2B-TEMP Not Available No Reading 0 0 0 75 80 85
    P2-DIMM3A-TEMP Normal 53 degrees C 0 degrees C 0 degrees C 0 degrees C 75 degrees C 80 degrees C 85 degrees C
    P2-DIMM3B-TEMP Not Available No Reading 0 0 0 75 80 85
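
    To put a number on how far these readings sit from their limits, here is a rough Python sketch that takes a few rows from the table above and computes the distance to the nearest non-critical threshold. The row selection and the headroom helper are just my own illustration; only the readings and the Low NC / High NC values come from the table.

```python
# (name, reading, low_non_critical, high_non_critical), values copied from the table above.
ROWS = [
    ("CPU1 Vcore", 1.024, 0.824, 1.352),
    ("+5V", 5.056, 4.48, 5.536),
    ("+12V", 12.084, 10.706, 13.25),
    ("+3.3V", 3.24, 2.928, 3.648),
    ("VBAT", 3.216, 2.928, 3.648),
    ("Fan6", 945.0, 675.0, 34155.0),
    ("System Temp", 43.0, 0.0, 81.0),
]


def headroom(reading, low_nc, high_nc):
    """Distance to the nearest non-critical threshold, absolute and as a percentage of the reading."""
    nearest = min(reading - low_nc, high_nc - reading)
    return nearest, 100.0 * nearest / reading


for name, reading, low_nc, high_nc in ROWS:
    margin, pct = headroom(reading, low_nc, high_nc)
    print(f"{name:12s} reading={reading:<8g} nearest NC threshold is {margin:.3f} away ({pct:.1f}% of reading)")
```

    A small percentage would mean normal fluctuation could trip an alarm; a large one means the reading has to move a long way before the first non-critical event is logged.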
    
    For another idea of the case temps, here is what the 13 HDDs report:

    Code:
    Smartinfos (buffered info max s old. Health-check can increase the counter for Soft-Errors!! Click on smart_sn to display Smart details)
     id   diskcap   pool   vdev   state   error   smart_model   smart_type   smart_health   temp   smart_sn   smart_selftest   smart_check
     c2t0d0       rpool   basic   ONLINE    S:0 H:0 T:0   -   -   -   -      -   -
     c5t50014EE20A467FA6d0       SCRATCH_TMP   mirror-0   ONLINE    S:0 H:0 T:0   WDC WD10EFRX-68PJCN0   sat,12   PASSED   31 °C   WDWCC4J4385481   --   short long abort log
     c5t50014EE20C8A235Cd0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   34 °C   WDWCC4E0TRF0NK   --   short long abort log
     c5t50014EE20C8A61A8d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E0RLC2XL   --   short long abort log
     c5t50014EE20C8A802Ad0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E6PVPYN7   --   short long abort log
     c5t50014EE25F9BC588d0       SCRATCH_TMP   mirror-0   ONLINE    S:0 H:0 T:0   WDC WD10EFRX-68PJCN0   sat,12   PASSED   29 °C   WDWCC4J4549648   --   short long abort log
     c5t50014EE261DFBF31d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E6PVPACX   --   short long abort log
     c5t50014EE261DFCCC6d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E7FZ2SL8   --   short long abort log
     c5t50014EE2B7350224d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   37 °C   WDWCC4E1SANR1T   --   short long abort log
     c5t50014EE2B7353608d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   40 °C   WDWCC4E0TRFL0A   --   short long abort log
     c5t50014EE2B7355807d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E4SZ275E   --   short long abort log
     c5t50014EE2B735BA1Ad0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E6PVPZUE   --   short long abort log
     c5t50014EE2B737F8D9d0       storage_z2   raidz2-0   ONLINE    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   PASSED   38 °C   WDWCC4E7JLA4TL   --   short long abort log
    
    
     id   diskcap   pool   vdev   state   error   smart_model   smart_type   health   temp   smart_sn   smart_selftest   smart_check
     c5t50014EE2B734C691d0       -   -   -    S:0 H:0 T:0   WDC WD40EFRX-68WT0N0   sat,12   -   38 °C   WDWCC4E0TRFDFK   without error   short long abort log
    
     
  6. Mega6

    Run Memtest and/or start a case with the motherboard manufacturer. Run any diagnostics you have available. Without being in front of it, it's hard to get a better idea.
     
  7. NOTORIOUS VR

    Understandable, not a lot of info for someone over the internet to assess... Since the last crash nothing has happened.

    I think I will start to plan my next upgrade path just in case as this is most likely a failure of the mainboard somehow.
     
  8. Spartacus

    Spartacus [H]ard|Gawd

    Messages:
    1,891
    Joined:
    Apr 29, 2005
    Sounds like a bad PSU to me.
    It's very common for them to have marginal and intermittent failures.

    They don't always fail in a totally dead condition. I had a very nice and expensive PSU fail after a couple of years. The 5v would drop under load and cause reboots.

    Also, clearly the 0 volts reading on everything with fans spinning is not possible.
    Those readings are incorrect, don't rely on them.

    If the failure is still present with a different PSU, then the next likely item is the mobo.
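
    On the point about all-zero readings not being believable: a monitoring script could flag that case explicitly instead of treating it as real data. This is only a sketch. The HEALTHY values are taken from the readings posted earlier in the thread, while FAILED is a hypothetical reconstruction of the failure state (no snapshot was actually captured), and all_readings_zero is a made-up helper name.

```python
def all_readings_zero(snapshot):
    """True when every available reading is exactly zero.

    Sensors with no reading (None) are ignored; an empty snapshot returns False.
    """
    values = [value for value in snapshot.values() if value is not None]
    return bool(values) and all(value == 0 for value in values)


# Readings from the working system, as posted earlier in the thread (Fan1 has no reading).
HEALTHY = {"+12V": 12.084, "+5V": 5.056, "+3.3V": 3.24, "VBAT": 3.216, "Fan5": 1080, "Fan1": None}

# Hypothetical reconstruction of the failure state: every available sensor reports 0.
FAILED = {name: (None if value is None else 0) for name, value in HEALTHY.items()}

for label, snapshot in (("healthy", HEALTHY), ("failed", FAILED)):
    if all_readings_zero(snapshot):
        print(f"{label}: every sensor reads 0, do not take these values at face value")
    else:
        print(f"{label}: readings look usable")
```

    The check says nothing about which part is at fault; it only marks a snapshot as one that should not be used as evidence about the rails or temps.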
