4P G34 reboots spontaneously

plext0r

So last week I decided to try something new and upped the voltage to 1.225V on my SM GL board, leaving the clock at 3500. Something went amiss over the weekend. I came in today to find the box running but the console unresponsive. I had to hard power it off and back on. Now it comes up for 2-4 minutes and then spontaneously reboots, even with Folding disabled.

Any suggestions on what to look at first?

Thanks.
 
Just curious: was it having trouble before you upped the voltage? It sounds like it may be a heat issue, due to the increased voltage and possibly ambient temps.
 
tpc -mtemp was showing CPU temps on two CPUs around 51C. The other two were a lot lower. I was hoping to convince it to run at 3.6GHz with 1.225V, but I think something else broke. :(

This is the same motherboard where I may have damaged the underneath (broke off a cap) when removing the stock aluminum VRM heatsinks and replacing them with copper heatsinks.

I just re-enabled BMC/IPMI via jumper so I can remote control it. I PXE-booted a memtest86+ 4.10 image and it started up fine. It's been running for 20 minutes without a reboot.
 
Run it for 6+ hours.

10-4. When I hit ESC to reboot out of Memtest86+ v4.10 and PXE-boot v4.20, it hung (never got anything on the remote KVM). I powered it off via IPMI, waited 10 seconds, powered it back on via IPMI, and got a beep code not mentioned in the manual: 5 long beeps, 2 short beeps.

I physically powered off the PS, powered it back on, then booted the box and it's been running MemTest86+ v4.20 ever since.
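
For anyone wanting to script the same off / wait / on sequence, here is a rough sketch. It assumes the BMC is driven with ipmitool over the lanplus interface, which is only a guess at the setup (the sequence above may just as well have been done from a web UI). The function name, host and user are made up for illustration, and the password is expected in the IPMI_PASSWORD environment variable (ipmitool's -E option).

```shell
# Sketch only: power a box off, wait, and power it back on through its BMC.
# Assumptions: ipmitool is installed, the BMC speaks IPMI 2.0 (lanplus),
# and IPMI_PASSWORD is exported. ipmi_power_cycle is a hypothetical helper.
ipmi_power_cycle() {
    local host="$1" user="$2" delay="${3:-10}"
    # IPMITOOL can be overridden, e.g. with a stub for a dry run.
    local tool="${IPMITOOL:-ipmitool}"

    # Hard power off; give up if the BMC refuses the command.
    "$tool" -I lanplus -H "$host" -U "$user" -E chassis power off || return 1

    # Let the board sit unpowered for a few seconds (10 by default).
    sleep "$delay"

    "$tool" -I lanplus -H "$host" -U "$user" -E chassis power on
}

# Usage (hypothetical host and user):
#   export IPMI_PASSWORD=...
#   ipmi_power_cycle gkar-bmc.example.lan admin 10
```

Note that this does not replace pulling power at the PSU, which is what actually cleared the beep code here.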
 
What do tpc -dram and clockspeed show, please?
 
I'm not running tpc tweaks during startup any longer. I ran MemTest86+ until the pass completed and got 0 errors. For about two weeks, I had been running the NorthBridge at 2200MHz (-nbfid 7, soft reboot), vcore 1.2, freq 3500. Last Thursday I upped the voltage to 1.225 since I was looking for a correlation between PPD and vcore, not just freq. Over the weekend the system went unresponsive and here we are. :)

clockspeed:
Code:
Clockspeed (OCNG4.2)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.010 MHz
Clockspeed: 3004.671 MHz

tpc -dram shows the following:
Code:
TurionPowerControl 0.44-rc2 (export)
Turion Power States Optimization and Control - by blackshard

DRAM Configuration Status

Node 0 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=9 TrwtTO=8 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=59
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 1 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=9 TrwtTO=8 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=59
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 2 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=62
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=60
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 3 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 4 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=59
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 5 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=60
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=59
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 6 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=9 TrwtTO=8 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=2 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=61
LDIMM0=OK/OK LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=60
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

Node 7 ---
DCT0: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=59
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY

DCT1: DDR3 frequency: 1333 MHz
Tcl=9 Trcd=9 Trp=9 Tras=24 Access Mode:1T Trtp=5 Trc=33 Twr=10 Trrd=4 Tcwl=7 Tfaw=20
TrwtWB=8 TrwtTO=7 Twtr=5 Twrrd=1 Twrwrsdsc=1 Trdrdsdsc=1 Tref=2 Trfc0=3 Trfc1=4 Trfc2=4 Trfc3=4 MaxRdLatency=60
LDIMM0=OK/EMPTY LDIMM1=EMPTY/EMPTY LDIMM2=EMPTY/EMPTY LDIMM3=EMPTY/EMPTY
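
That dump is a lot to eyeball across 8 nodes, so here is a small sketch that boils it down to one line per node/DCT (frequency, MaxRdLatency and the LDIMM0 status) so outliers stand out. It assumes the output layout is exactly what is pasted above for tpc 0.44-rc2; dram_summary is a made-up helper name, not part of tpc.

```shell
# Sketch only: condense "tpc -dram" output read from stdin.
# Assumes the layout shown in this thread (Node / DCT / timing / LDIMM lines).
dram_summary() {
    awk '
        # "Node 3 ---" starts a new node.
        $1 == "Node"          { node = $2 }

        # "DCT0: DDR3 frequency: 1333 MHz" gives the controller and its speed.
        $1 ~ /^DCT[0-9]+:$/   { dct = $1; sub(/:$/, "", dct); freq = $4 }

        # Remember the most recent MaxRdLatency value seen on any line.
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^MaxRdLatency=/) lat = substr($i, 14)
        }

        # The LDIMM line closes a DCT block, so print the summary there.
        $1 ~ /^LDIMM0=/ {
            printf "node %s %s %s MHz MaxRdLatency=%s LDIMM0=%s\n",
                   node, dct, freq, lat, substr($1, 8)
        }
    '
}

# Usage: tpc -dram | dram_summary
```

It only reports what tpc prints; it makes no attempt to interpret what the OK/EMPTY pairs mean.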
 
As I write this, gkar is running with tpc -psmax 1 and it's completed 2 frames of 8101. I'll let it run like this for a few days if it remains stable and go from there.
 
See if you have traces of hardware errors / machine check exceptions in /var/log/syslog* files and/or /var/log/messages* -- including the rotated ones (.1, .2.gz, .3.gz and so on).

Search for "hardware error" and "machine check events" (case insensitive).

Something like:
Code:
sudo zgrep -Ei 'hardware.error|machine.check' $(ls -rt /var/log/{messages,syslog}*)
should do it.

I'd examine the VRMs closely as well... (such a drastic change in stability strongly suggests a hardware issue).
 
Post pics of the cooling as well.

God help you if you need to RMA the board...
 
I ran your command and only found the machine check events I mentioned a few weeks ago, not tied to the latest issues.

Code:
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: CPU:21 MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000110151
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: 	MC1_ADDR: 0x0000000000000001
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: Instruction Cache Error: Decoder uop queue parity error.
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: CPU:21 MC5_STATUS[-|CE|MiscV|-|-|-|-]: 0x98000000000c0e0f
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: Execution Unit Error: DE error occurred.
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
/var/log/messages-20130324:Mar 19 22:04:39 gkar kernel: [Hardware Error]: Machine check events logged
/var/log/messages-20130324:Mar 19 22:25:39 gkar kernel: [Hardware Error]: Machine check events logged
/var/log/messages-20130324:Mar 22 10:38:03 gkar kernel: [Hardware Error]: CPU:54 MC5_STATUS[Over|CE|MiscV|-|AddrV|-|-]: 0xdc00000000020e0f
/var/log/messages-20130324:Mar 22 10:38:03 gkar kernel: [Hardware Error]: 	MC5_ADDR: 0x000000000000001e
/var/log/messages-20130324:Mar 22 10:38:03 gkar kernel: [Hardware Error]: Execution Unit Error: AG payload array parity error.
/var/log/messages-20130324:Mar 22 10:38:03 gkar kernel: [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
/var/log/messages-20130324:Mar 22 10:38:03 gkar kernel: [Hardware Error]: Machine check events logged

These were mentioned in this thread:
http://hardforum.com/showthread.php?t=1752033
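
To see whether logged events cluster on particular cores, a rough tally per CPU and MCA bank can help. The sketch below assumes the kernel's status lines look like the ones pasted above ("[Hardware Error]: CPU:<n> MC<b>_STATUS[...]"); other kernel versions may format them differently, and mce_tally is just an illustrative name.

```shell
# Sketch only: count "[Hardware Error]" status lines per CPU and MCA bank.
# Reads log lines on stdin; assumes the line format shown in this thread.
mce_tally() {
    awk '
        /\[Hardware Error\]: CPU:[0-9]+ MC[0-9]+_STATUS/ {
            for (i = 1; i <= NF; i++) {
                # "CPU:21" -> 21
                if ($i ~ /^CPU:[0-9]+$/) cpu = substr($i, 5)
                # "MC1_STATUS[...]:" -> MC1
                if ($i ~ /^MC[0-9]+_STATUS/) { bank = $i; sub(/_STATUS.*/, "", bank) }
            }
            count[cpu " " bank]++
        }
        END {
            for (k in count) {
                split(k, p, " ")
                printf "CPU %s bank %s: %d\n", p[1], p[2], count[k]
            }
        }
    ' | sort -k2,2n -k4,4    # order by CPU number, then bank
}

# Usage: pipe any log search into it, for example
#   sudo zgrep -Ei 'hardware.error' /var/log/messages* | mce_tally
```

On the lines above it reports one event each for CPU 21 bank MC1, CPU 21 bank MC5 and CPU 54 bank MC5, so nothing repeats on the same core and bank.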

I ended up leaving the rig at 3.5GHz, 1.2V for two weeks without any problems. Then I decided to mess it up and change vcore to 1.225. It seemed stable for a few hours, then I went on vacation and came back to find it hung. All fans were running, but I could not get a response on the console or via SSH. I ended up performing a hard reboot yesterday morning.
 
Ah, and try eliminating the PSU as possible cause of your issues...
 
That's going to be more difficult since I only have the one 1250W ATX P/S. I'd have to rig something with a SuperMicro 4U case I have sitting around with dual 1200W redundant P/S which would mean removing the motherboard from the current case, etc. So far, the rig has been stable back at 3GHz. I'll let it crunch until tomorrow and then start cranking it back up where it used to be.
 