2P crashes

tear · Apr 20, 2013

Spazturtle said:
I've heard people say that before; somebody said that the configuration data the BIOS reads is in CPU1 and that swapping the chips over can help in some cases.

Indeed, first core of CPU1 is the bootstrap processor but as far as _data_ that are _in_
the CPUs are concerned (actually there's not a lot of it and most pertains to P-states) all
sockets must be processed. It's not like BIOS takes shortcuts...

Similarly, memory and HT calibrations are done for each socket.

If anything, I'd attribute "I rotated CPUs and things started to work" issues to
physical/electrical phenomena, but definitely not logical (EDIT: logical software-wise).

Don't get me wrong, I'm not saying that rotating CPUs is a bad idea (it's actually
very good), I just don't like the "brains" argument...

EDIT:
Yeah I'd have to turn the BMC back on to use IPMI, right? That shouldn't be a problem with Gi boards I think.

Umm, damn. I'm an idiot. These are 2 CPUs in a Gi? They are.

I'm sorry, Spazturtle. Keep IPMI disabled... then run voltcheck while folding -- http://hardforum.com/showpost.php?p=1039265047&postcount=665

EDIT: I'd also consider (just for kicks) loading "some other" BIOS; if you're running 3.0 (or 3.x) -- load 2.0; if you're running 2.0 -- load the most recent 3.x...

Qinsp · Apr 20, 2013

Just another dumb thought.

Running in Pstate 1 might have proven something.

If those chips are a known folder, it means the settings aren't to blame? It's got to be heat/HW?

Spazturtle · Apr 20, 2013

With just CPU1 folding, confirmed with temps (Node0 and 1 are hottest)

Code:

root@speedstar:/home/spazturtle# ./voltcheck.sh
CPU1 (min/avg/max): 1126/1128.8/1130 mV
CPU2 (min/avg/max): 1184/1188.8/1190 mV
CPU3 (min/avg/max): 24/24.0/24 mV
CPU4 (min/avg/max): 24/24.0/24 mV

With both CPUs folding:

Code:

root@speedstar:/home/spazturtle# ./voltcheck.sh
CPU1 (min/avg/max): 1126/1128.8/1134 mV
CPU2 (min/avg/max): 1102/1110.8/1120 mV
CPU3 (min/avg/max): 24/25.6/26 mV
CPU4 (min/avg/max): 22/22.8/24 mV
root@speedstar:/home/spazturtle# ./voltcheck.sh
CPU1 (min/avg/max): 1126/1132.0/1134 mV
CPU2 (min/avg/max): 1102/1112.0/1118 mV
CPU3 (min/avg/max): 24/25.2/26 mV
CPU4 (min/avg/max): 22/23.6/24 mV
root@speedstar:/home/spazturtle#

Running the 3.0 bios btw, don't know where I would get a copy of 2.0.

tear · Apr 20, 2013

So.. CPU1 dips from 1.175 V nominal to 1.129-ish and CPU2 dips to 1.112-ish.
That's pretty significant (in general).

BUT

The fact that CPU1 voltage didn't dip (further) as you added load on CPU2 tells us that the regulation
problems are contained to per-CPU VRMs.

This is good, because it means that if the CPUs themselves were the issue, you should
see a reboot even with one CPU loaded....

Another thing to consider -- moving second CPU to socket 4. Sockets 1 and 4 have the best regulation
on Supermicro 4P boards (in case it turns out to be some weird regulation issue).

I'll find 2.0 for you.

EDIT: H8QGi/6 2.0 -- http://www.supermicro.com/support/resources/getfile.aspx?ID=1479

tear · Apr 20, 2013

Here's Gi/4P (MC, however) at 1000W (courtesy of sc0tty8):

Code:

CPU1 (min/avg/max): 1276/1285.8/1294 mV (1.3250 nominal -- 39.2 mV dip vs avg)
CPU2 (min/avg/max): 1224/1233.4/1244 mV (1.3000 nominal -- 66.6 mV dip vs avg)
CPU3 (min/avg/max): 1228/1238.8/1254 mV (1.3000 nominal -- 61.2 mV dip vs avg)
CPU4 (min/avg/max): 1294/1299.6/1306 mV (1.3375 nominal -- 37.9 mV dip vs avg)

You're dipping mid-40s on CPU1 and low-mid 60s on CPU2. So.. not that much different... hmmm.
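For anyone who wants to redo the dip arithmetic on their own numbers, here is a minimal Python sketch. It parses lines in the voltcheck format shown above and subtracts the average from the nominal voltage. The function names are made up for illustration, and the nominal values have to be supplied by hand; 1175 mV for both CPUs is an assumption taken from the 1.175 V nominal mentioned earlier in the thread.

```python
import re

# Matches lines such as "CPU1 (min/avg/max): 1126/1128.8/1134 mV"
LINE_RE = re.compile(r"^(CPU\d+) \(min/avg/max\): (\d+)/([\d.]+)/(\d+) mV")

def parse_voltcheck(text):
    """Return {cpu: (min_mv, avg_mv, max_mv)} for every matching line."""
    readings = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            readings[m.group(1)] = (int(m.group(2)), float(m.group(3)), int(m.group(4)))
    return readings

def dips_vs_nominal(readings, nominal_mv):
    """Return {cpu: nominal - avg} in mV for CPUs with a known nominal voltage."""
    return {cpu: round(nominal_mv[cpu] - avg, 1)
            for cpu, (_, avg, _) in readings.items() if cpu in nominal_mv}

# Two lines copied from the "both CPUs folding" run above.
sample = """\
CPU1 (min/avg/max): 1126/1128.8/1134 mV
CPU2 (min/avg/max): 1102/1110.8/1120 mV
"""
# Assumed nominal: 1175 mV on both CPUs (not printed by voltcheck itself).
dips = dips_vs_nominal(parse_voltcheck(sample), {"CPU1": 1175.0, "CPU2": 1175.0})
for cpu, dip in dips.items():
    print(f"{cpu}: {dip} mV dip vs avg")
```

Empty sockets (the ~24 mV readings) are simply skipped if no nominal is given for them.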

Qinsp · Apr 20, 2013

OK, I'm a total idiot. No news there.

News:

Yes I was in turbo on the BIOS. Apparently I didn't reset it after reflash.

To get ASUS into pstate 1:

sudo tpc -set core all pstate 1 frequency 3xxx (I did 3300 this time)
sudo tpc -psmax 1
sudo tpc -psmax 3
sudo tpc -psmax 1

then it clicked. But it took a few tries this time. Odd. First time, I'm 99% sure I did what I posted earlier. This time required effort. Whether this works with SM I dunno.

Footnote to self - Check your facts before using a keyboard. Temp is 2°C higher, still at 1.200v. Bumped RAM from 1.35v to 1.500v (default). Air director had no effect on temps.

tear · Apr 20, 2013

Qinsp, this is not voodoo.

You need to have at least a little bit of understanding of what you're doing. I did refer you to a document
earlier. I strongly recommend that you read everything pertaining to P-states, P-state transitions, and
the difference between SW and HW P-states (esp. when boosted P-states are in the picture).

What you're doing is manipulating your system in a non-deterministic fashion.

No wonder you're not getting deterministic results...

Qinsp · Apr 20, 2013

I have YY chromosomes. Not only do I ignore the instructions, I go one step further and do the complete opposite.

Yes, I did briefly browse the AMD doc. Most of it was WAY over my head. Without you though, I'd be stuck at SMP. Thanks!

PS - 3300 is the 2 Core Pstate 0 speed on this chip. So I reached my target speed. Testing stability right now. Due to NOT reading instructions, I had one bank of RAM at 1.35v and the other at 1.500v. So I balanced them. If it folds well, I will drop back to 1.35v on both banks, test, then start to drop Vcore.

Spazturtle · Apr 20, 2013

OK, did the thekraken -i -c startcpu=16 test on CPU2 and here are the voltcheck results:

Code:

root@speedstar:/home/spazturtle# ./vol*
CPU1 (min/avg/max): 1196/1196.8/1198 mV
CPU2 (min/avg/max): 1106/1117.6/1146 mV
CPU3 (min/avg/max): 26/26.0/26 mV
CPU4 (min/avg/max): 24/24.0/24 mV

Gonna see if it crashes on CPU2 only (it was stable on CPU1 as far as I could tell)

tear · Apr 20, 2013

Wow, these swings (min/max) are much higher than before. Interesting...

Red Falcon · Apr 20, 2013

I've seen issues like this before.
It almost sounds like what was happening to me, which ended up being a bad CPU core.

The only way I discovered it was through a crash log using Scientific Linux (RHEL).
The processor was a Phenom 9850 quad-core, and core 0 ended up being bad on it.

Swapping processors fixed every issue I had immediately, as I too was seeing the random crashes across multiple OSes.

Spazturtle, if your latest test doesn't check out properly, try running Scientific Linux for a bit.
If it crashes, it will automatically output an error log stating the exact problem; Debian distros do not produce as detailed a log as far as I can tell, so it's a bit harder to track down with those than with RHEL distros.

Hope it works out for you, good luck.

Spazturtle · Apr 20, 2013

Seems to be stable folding just on CPU2 (at 3 GHz, 1.175 V) as well, so it's only when both CPUs are folding and overvolted that it crashes/reboots.

Gonna try BIOS 2.0 tomorrow.

tear · Apr 20, 2013

He's not seeing any MCEs, just insta-reboot without any events.

Though, now that I think of it, MCEs could be consumed by the BIOS.

Spazturtle, have you ever witnessed such insta-reboot (with VGA connected) ?
Did BIOS print any add'l information in the POST screen? [you'd probably need
to disable the splash -- Disable Quiet Boot IIRC]

Spazturtle · Apr 20, 2013

After hours of uptime on each CPU alone, I switched back to both (after re-wrapping cores) and it crashed pretty quickly.

I do have a VGA connected so I will have a look tomorrow (well, today, as it's 4am, but whatever)

Qinsp · Apr 21, 2013

Sure sounds like something is getting hot when you double the power used. Can you declock and see if it's stable? Like ps 6 or whatever the lowest voltage is.

Qinsp · Apr 21, 2013

I'm clueless but want to learn.

My Intel workstation mobo has an application that reads all the temps: RAM, PSU, chipset, etc. And a light comes on if any get hot.

Is there something like that for SM?

Spazturtle · Apr 21, 2013

Yeah it's stable at lower voltage.
The crashes have gotten more frequent since adding a GPU (HD5450, around 15 W).
Everything seems cool; the Enzotech heatsinks on the MOSFETs are OK to touch.
The case has 2*230mm fans, 1*200mm fan and 1*140mm fan.

Qinsp · Apr 21, 2013

Damn. Almost sounds like it's actually GPU crash?

Did you jumper On-Board VGA to OFF with the 5450 installed?

What I'm thinking is the mobo has a bad VGA chip in it?

Just throwing it out there.

Spazturtle · Apr 21, 2013

Yeah the jumper is set for on board vga to off.
I think the extra 15 W the 5450 draws might be pushing the PSU over the edge. It's a 750 W PSU so it should be able to handle it all, but maybe it is just dying.

tear · Apr 21, 2013

What brand/model is the PSU?

Spazturtle · Apr 21, 2013

It's a Corsair TX750W (version 1); it's a single 12 V rail @ 60 A

Qinsp · Apr 21, 2013

750 W should be plenty if all the 8-pins are getting full juice.

Spazturtle · Apr 21, 2013

Well 2 of the 8 pin connectors are connected. I'll see if it boots with only 1 connected.
1 of the connectors is done with a pci-e to 8 pin converter so I will have to check if that is working.

Quisarious · Apr 21, 2013

Spazturtle said:
Well 2 of the 8 pin connectors are connected. I'll see if it boots with only 1 connected.
1 of the connectors is done with a pci-e to 8 pin converter so I will have to check if that is working.

2 is plenty. I've pulled 850 at the wall for months with just 2. Your 2p will be a fraction of that (I'd guess at 3.0 you're probably pulling ~400 max, including GPU).

Qinsp · Apr 21, 2013

Heat increases with resistance. Do anything you can to minimize resistance (lots of wires at low amps trumps a few wires at high amps).
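To put rough numbers on the "lots of wires at low amps" point, here is a small Python sketch of the I²R arithmetic. It assumes identical wires sharing the current equally, which is an idealization; the 30 A and 0.01 ohm figures are hypothetical and not measured from any board or PSU in this thread.

```python
def wire_heat_watts(total_amps, wire_ohms, n_wires):
    """Total I^2 * R heat when the current is shared equally by n identical wires."""
    per_wire_amps = total_amps / n_wires
    return n_wires * per_wire_amps ** 2 * wire_ohms

# Hypothetical numbers: 30 A total, 0.01 ohm per wire.
for n in (1, 2, 4):
    print(n, "wire(s):", wire_heat_watts(30.0, 0.01, n), "W")
```

Under those assumptions, doubling the number of wires halves the total heat dissipated in the wiring, since the total works out to I²R/n.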

tear · Apr 21, 2013

Viva La Resistance!

Spazturtle · Apr 21, 2013

OK, the PSU has one 8-pin connector (A) and I am using a PCI-E to 8-pin adapter (B).
OC is done with rc.local and FAH starts as a service.
With only A connected thekraken-FahCo runs for around 21:00:00 (using top to measure TIME+).
With only B connected it runs for around 54:00:00.
Both were connected to the main 8-pin socket.

tear · Apr 21, 2013

Have you seen any errors reported by the BIOS after spontaneous reboot?

Has there been any change w/2.0 BIOS?

Do you have another 500W+ PSU to try?

I would also consider moving second CPU to socket 4.

Spazturtle · Apr 21, 2013

Would a Corsair 500w psu do?

Moving sockets may take a while if it comes down to that
here is inside the case: http://oi47.tinypic.com/264qk2u.jpg

tear · Apr 21, 2013

Spazturtle said:
Would a Corsair 500w psu do?

As long as it's single rail -- it definitely should do.

I've got 4P 6200 ES in same configuration as yours -- 3.0 GHz / ~1.175V and I'm pulling 800-ish Watt.
500W should be enough to power 2 CPUs in a 4P board.

Moving sockets may take a while if it comes down to that
here is inside the case: http://oi47.tinypic.com/264qk2u.jpg

I hear you -- that's why I listed it last

Spazturtle · Apr 22, 2013

Same issue with a different PSU, and BIOS 2.0 didn't help (funnily, 3.0 sees the RAM as 1600 MHz but 2.0 as 800 MHz)

When I removed the 4 newest sticks of RAM (so from quad in both to dual) it takes longer to crash but still crashes. Memtest 86+ seemed to crash with 8 sticks this time; it was running and then the machine powered off with no reboot.

Took a while to boot after bios update, had to remove video card before it would cold boot.

Note I am testing the cpus at 3GHz @ 1.2250v

arestavo · Apr 22, 2013

My 2P Asus G34 board was SUPER finicky about 8-pin EPS adapters. I eventually had to use my 6-pin PCI-E connectors (two of them, turned around backwards) in order to gain stability. Have you tried this trick yet?

Spazturtle · Apr 22, 2013

Err, could you draw a diagram in MS Paint or something? I don't think I understand how to plug two 6-pin connectors into one 8-pin.

tear · Apr 23, 2013

Note I am testing the cpus at 3GHz @ 1.2250v

Won't hurt...

Spazturtle said:
Memtest 86+ seemed to crash with 8 sticks this time; it was running and then power off and no reboot.

But that is pretty disturbing.... smells like overcurrent protection kicked in on the board
(though I'm not sure if SM actually has it anywhere) or the PSU. A short?

arestavo · Apr 23, 2013

Take the two 6-pin adapters, turn them 180° from normal (clips are now backwards), and wiggle four pins of each 6-pin connector into the 8-pin socket; you will be left with 2 pins sticking out on either side. This turns it into an 8-pin EPS 12V, or the closest thing to one without buying a new PSU.

Spazturtle · Apr 23, 2013

I thought I saw a kernel panic that said something like "CPU #31 is already initialised"
Been scouring the kern.log as well, here are some things I found interesting.

Code:

Apr 23 20:05:17 speedstar kernel: [    0.172925] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 OK
Apr 23 20:05:17 speedstar kernel: [    0.370354] smpboot: Booting Node   2, Processors  #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
Apr 23 20:05:17 speedstar kernel: [    0.675552] Brought up 32 CPUs
Apr 23 20:05:17 speedstar kernel: [    0.675560] smpboot: Total of 32 processors activated (179203.25 BogoMIPS)

Code:

Apr 21 19:43:34 speedstar kernel: [    0.440432] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Apr 21 19:43:34 speedstar kernel: [    0.440537] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 OK
Apr 21 19:43:34 speedstar kernel: [    0.533885] smpboot: Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 OK
Apr 21 19:43:34 speedstar kernel: [    0.639818] smpboot: Booting Node   2, Processors  #16 #17 #18 #19 #20 #21 #22 #23 OK
Apr 21 19:43:34 speedstar kernel: [    0.843302] smpboot: Booting Node   3, Processors  #24 #25 #26 #27 #28 #29 #30 #31

The last node doesn't have an OK on it (node 2 in the first example and 3 in the second).
EDIT: My IRC's resident Linux SysAdmin says he thinks the last node omits the OK because, if it works, boot just continues.

Code:

Apr 23 20:03:13 speedstar kernel: [    0.208785] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 OK
Apr 23 20:03:13 speedstar kernel: [    0.406165] smpboot: Booting Node   2, Processors  #16 #17 #18 #19 #20 #21
Apr 23 20:03:13 speedstar kernel: [    9.707076] smpboot: CPU21: Not responding
Apr 23 20:03:13 speedstar kernel: [    9.707234]  #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
Apr 23 20:03:13 speedstar kernel: [    9.838532] Brought up 31 CPUs
Apr 23 20:03:13 speedstar kernel: [    9.838539] smpboot: Total of 31 processors activated (173596.17 BogoMIPS)

Huh, I swear I had a core 21 earlier, must have left it on the train.
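Scouring kern.log by eye gets tedious, so here is a minimal Python sketch that pulls out just the two things that matter in the excerpt above: any "Not responding" CPUs and the "Brought up N CPUs" count. The function name is made up for illustration, and the patterns only cover the message formats shown in this post; other kernels may word these lines differently.

```python
import re

def summarize_smpboot(log_text):
    """Collect unresponsive CPU numbers and 'Brought up N CPUs' counts from kern.log text."""
    not_responding = [int(n) for n in re.findall(r"smpboot: CPU(\d+): Not responding", log_text)]
    brought_up = [int(n) for n in re.findall(r"Brought up (\d+) CPUs", log_text)]
    return not_responding, brought_up

# Two lines copied from the 20:03:13 boot above.
sample = """\
Apr 23 20:03:13 speedstar kernel: [    9.707076] smpboot: CPU21: Not responding
Apr 23 20:03:13 speedstar kernel: [    9.838532] Brought up 31 CPUs
"""
missing, counts = summarize_smpboot(sample)
print("not responding:", missing, "brought up:", counts)
```

Feeding it the whole of kern.log would show whether the same core goes missing on every bad boot or whether it moves around.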

Code:

Apr 23 19:19:52 speedstar kernel: [    0.821655] mtrr: your CPUs had inconsistent fixed MTRR settings
Apr 23 19:19:52 speedstar kernel: [    0.821656] mtrr: your CPUs had inconsistent variable MTRR settings
Apr 23 19:19:52 speedstar kernel: [    0.821657] mtrr: probably your BIOS does not setup all CPUs.
Apr 23 19:19:52 speedstar kernel: [    0.821658] mtrr: corrected configuration.

Dunno what this means.

EDIT: Linux SysAdmin has taught me how to stop the rig from rebooting on kernel panic, so if it kernel panics I will be able to see the error message.

402blownstroker · Apr 23, 2013

Spazturtle said:
EDIT: Linux SysAdmin has taught me how to stop the rig from rebooting on kernel panic, so if it kernel panics I will be able to see the error message.

The best method that I have found and use is to boot off a livecd once the machine crashes. Once the livecd has booted up, mount the system partition and then look at the logs. Usually the last thing or two in the messages file is the thing that caused the issue.

tear · Apr 23, 2013

If kernel.panic sysctl is set to zero (or something sufficiently large) the machine should not
reboot. Make sure not to use X11 either -- panics typically only make it to the console.

EDIT: if it does reboot with kernel.panic=0 then it's *extremely* unlikely that it's
Linux-triggered
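A quick way to confirm what the setting currently is: the kernel.panic sysctl is exposed through procfs at /proc/sys/kernel/panic. Below is a minimal Python sketch that reads it and spells out the meaning (zero: no reboot; positive: reboot after that many seconds; negative: reboot immediately). The function name is made up for illustration.

```python
from pathlib import Path

def describe_panic_timeout(raw_value):
    """Interpret the kernel.panic value (seconds to wait before rebooting after a panic)."""
    seconds = int(raw_value.strip())
    if seconds == 0:
        return "no automatic reboot on panic"
    if seconds < 0:
        return "immediate reboot on panic"
    return f"reboot {seconds} s after panic"

panic_file = Path("/proc/sys/kernel/panic")  # procfs view of the kernel.panic sysctl
if panic_file.exists():
    print(describe_panic_timeout(panic_file.read_text()))
else:
    print("kernel.panic not available on this system")
```

If this reports no automatic reboot and the box still resets itself under load, that points away from a Linux panic, as noted above.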

CPU (core) failing to boot is disturbing -- never seen anything like that.

To ensure reproducibility I'd consider doing an AC power-cycle before each test (EDIT:
or after each crash, if you will) -- IOW it makes sense to remove warm reset as a variable.

I wouldn't worry about mtrrs. The kernel detected that memory view of individual cores
was not consistent and fixed it up. I don't have a machine running 2.0 BIOS handy but
Quisarious was kind enough to check his 3.x-based system and he didn't see those
messages.

Could be that his machine doesn't suffer from the issue or that his kernel isn't as verbose
as yours.

Spazturtle · Apr 23, 2013

Yeah the 2.0 bios shows my ram as 800mhz whilst the 3.0 showed it as 1600mhz, I will probably go back to 3.0 soon now that I have eliminated the bios. Also doesn't look like the OC method Qinsp showed us works in bios 2.0

In the next few days I do plan to take the CPUs out, check the pads are clean and then check the CPU sockets for bent or missing pins. Then place them in the reverse sockets (so what is now CPU 0 becomes CPU 1 and the current CPU 1 becomes CPU 0). At this point a bad connection would be the nicest thing to find.

I have taken the original 4 RAM sticks out, cleaned the sockets and put the new 4 RAM sticks in, and it still crashes. So having 8 RAM sticks in makes it more unstable than 4 regardless of what RAM is used.

tear · Apr 23, 2013

Yeah the 2.0 bios shows my ram as 800mhz whilst the 3.0 showed it as 1600mhz

This could just be cosmetic difference. I'd probably double check configured DDR3
mode with tpc, for instance:

Code:

sudo tpc -dram

or (w/filtered speeds):

Code:

sudo tpc -dram | grep -E 'Node|frequency'

If it shows DDR3-1600 -- you're golden. If it doesn't -- we have a problem.
