Trying to tune 16 core 62xx ES

theGryphon · Jan 6, 2013

I have a question, anyone jump in: did you have to put heatsinks on the taller components? Is it recommended, or did you do it just in case?

overdoze said:
I used 3M thermal tape; a cut-up aluminum frame from a thrown-away marker board seems to work OK as well.
Not 24, but a lot more than that; see a few posts above:
http://hardforum.com/showpost.php?p=1039350752&postcount=178
http://hardforum.com/showpost.php?p=1039367180&postcount=246

overdoze · Jan 6, 2013

They are 6.5 x 6.5 mm
The MOSFETs will burn at high OC, not the ferrite chokes (the taller components). It does not hurt to put sinks on them. The more the merrier, but most of the time they are in the way of the CPU heatsink mount.

theGryphon · Jan 6, 2013

Yeah, I thought the chokes would be fine without heatsinks, so I skipped them. That photo gave me second thoughts.

overdoze said:
They are 6.5 x 6.5 mm
The MOSFETs will burn at high OC, not the ferrite chokes (the taller components). It does not hurt to put sinks on them. The more the merrier, but most of the time they are in the way of the CPU heatsink mount.

overdoze · Jan 6, 2013

Yeah, basically the VRM area must dissipate 20% of the total CPU heat. If your OC causes the CPU to consume 200W, that is a whopping 40W of heat that the VRM has to dissipate. In my case, I had the heatsink on but no fan, and it still smoked those MOSFETs.
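
As a quick sanity check on that rule of thumb, the arithmetic can be sketched in shell. This is only an illustration: the 20% share is the estimate given in this post, not a measured figure, and `vrm_heat_watts` is a made-up helper name.

```shell
# Rough VRM heat estimate from CPU power draw.
# Assumes the 20% rule of thumb quoted above; it is not a measured value.
vrm_heat_watts() {
    # $1 = CPU power draw in watts
    awk -v p="$1" 'BEGIN { printf "%.0f\n", p * 0.20 }'
}

vrm_heat_watts 200   # the 200W example from the post
vrm_heat_watts 300   # a heavier overclock
```

Plugging in your own per-CPU wattage gives a ballpark for how much heat the VRM heatsinks and airflow need to handle.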

overdoze · Jan 6, 2013

-alias- said:
Can only answer part of your question. I have made some notes from my own testing of my 4P SM H8QGI-F & 4 x Opteron 6276es. Memory is 16 x 4GB Corsair Vengeance® Low Profile Blue 4GB Dual Channel DDR3 Memory Kit (CML4GX3M2A1600C9B). The memory is working only at 1333MHz.

Testing OC for 4P Opteron 6276es pr. 21.11.2012
P8101 3.0GHz VCore 1.1000V TPF = 10:32 PPD = 508K Power Draw = 780W stable
P8101 3.1GHz VCore 1.1500V TPF = 10:37 PPD = 502K Power Draw = 835W stable
P8101 3.2GHz VCore 1.16125V TPF = 10:10 PPD = 535K Power Draw = 855W stable
I cannot explain the big difference in TPF between 3.2 and 3.3GHz!
P8101 3.3GHz VCore 1.1750V TPF = 9:36 PPD = 583K Power Draw = 880W stable

Do not overclock these boards any further than this without heatsinks on all components;
you can read why here: #162

P8101 3.4GHz VCore 1.2000V TPF = 9:20 PPD = 609K Power Draw = 950W stable
P8101 3.5GHz VCore 1.2500V TPF = 9:04 PPD = 635K Power Draw = 990W stable
P8101 3.6GHz VCore 1.2750V TPF = 8:48 PPD = 665K Power Draw = 1035W stable
P8101 3.7GHz VCore 1.3000V TPF = 8:xx PPD = xxxK Power Draw = 1150W unstable

P8102 3.4GHz VCore 1.2000V TPF = 7:18 PPD = 880K Power Draw = 905W stable

P6958 3.6GHz VCore 1.2750V TPF = 36sec PPD = 237K Power Draw = 940W stable
In my experience the power draw varies from one reboot to another on the same WU, and from WU to WU!

Update: Added 3.0 - 3.2GHz values.
Update 29.12.2012, 3.0GHz values.

I finally got mine running stable at 3.7GHz 1.275V. My frame time is around 8:50 on 8101, possibly because of my slow RAM.
Temperature at the VRM is 70C, so I'm at the limit unless I watercool the VRM as well.

theGryphon · Jan 6, 2013

I have 3 x 140mm fans blowing on the front, and 2 x 200mm fans sucking at the back. Nothing blowing vertically on the board, but I'm hoping I'm covered

overdoze said:
Yeah, basically the VRM area must dissipate 20% of the total CPU heat. If your OC causes the CPU to consume 200W, that is a whopping 40W of heat that the VRM has to dissipate. In my case, I had the heatsink on but no fan, and it still smoked those MOSFETs.

knopflerbruce · Jan 6, 2013

overdoze said:
Yeah, basically the VRM area must dissipate 20% of the total CPU heat. If your OC causes the CPU to consume 200W, that is a whopping 40W of heat that the VRM has to dissipate. In my case, I had the heatsink on but no fan, and it still smoked those MOSFETs.

I pulled 300W per CPU on an ASUS KGPE-D16 rig with bad airflow around the MOSFET/ATX/8-pin connector area and smoked one of the 8-pin connectors.

thank god I used a modular PSU, just need to replace that cable - and the 8pin on the mobo.

Mr.Nosmo · Jan 12, 2013

Just got my 6282SE ES CPUs today (well, DHL tried to deliver them 4 days ago, but I wasn't home).

Does anybody have any good suggestions on where to install them, and what do the numbers mean?

CPU#1:
ZS262669TGG45
FA 1123DPM
FD18042F10051

CPU#2:
ZS262669TGG45
FA 1123DPM
FD18042F10054

CPU#3:
ZS262669TGG45
FA 1123DPM
FD18042F10057

CPU#4:
ZS262669TGG45
FA 1123DPM
FD11602F10055

CPU#5:
ZS262669TGG45
FA 1123DPM
FD13902F10089

tear · Jan 12, 2013

Given they are ES, it shouldn't matter where you install them (as you have ultimate control of supply voltages, frequencies, etc.).

Side note -- even if they were production CPUs, you wouldn't be able to judge by CPU's
cover (so to speak).

From memory (I'm sure rest is covered in the interwebz with some confidence level...):

ZS262669TGG45: (first) 26: 2600 MHz nominal frequency, GG45: 16 cores (G), 16 MB cache (G), revision B1 (45)
FA 1123DPM: 1123: manufactured in week 23 of 2011

knopflerbruce · Jan 12, 2013

Are GG44 chips B0? A guy in my team grabbed 4 of those, 2.3GHz ones.

The F1 part is in some way also related to production week, btw. 1 means the last digit of the year is 1, so for you it means 2011, and F means June. The odd part is that the equivalents of F1 and 1123 on other chips don't ALWAYS match, but 90+% of the time they do. The D is related to a day of the week (I've never seen anything but A-G, R-X and M. M could mean 'mixed', A and R = Monday, G and X = Sunday). D is then Thursday. P means Malaysia, it's a factory ID. The last M I do not know, it's always there.

The S in ZS2... could mean 'server', too. (desktop chips got a D there).
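
The decoding described in the last two posts can be sketched as a small shell helper. Treat it as a guess written down as code: the field meanings are forum recollections rather than anything official, the letters between the stated endpoints (B-F, S-W) are assumed to run in order, the century is assumed to be 20xx, and `decode_lid_code` is a made-up name.

```shell
# Decode a lid date code such as "1123DPM" (from the line "FA 1123DPM").
# Field meanings follow the recollections in this thread, not AMD documentation.
decode_lid_code() {
    code=$1
    yy=$(printf '%s' "$code" | cut -c1-2)    # two-digit year
    ww=$(printf '%s' "$code" | cut -c3-4)    # week of that year
    day=$(printf '%s' "$code" | cut -c5)     # day-of-week letter
    fab=$(printf '%s' "$code" | cut -c6)     # factory ID (P = Malaysia per the post)
    case "$day" in
        A|R) dayname=Monday ;;
        B|S) dayname=Tuesday ;;      # assumed: letters run in order
        C|T) dayname=Wednesday ;;    # assumed
        D|U) dayname=Thursday ;;
        E|V) dayname=Friday ;;       # assumed
        F|W) dayname=Saturday ;;     # assumed
        G|X) dayname=Sunday ;;
        M)   dayname="mixed?" ;;     # meaning of M is a guess in the post
        *)   dayname=unknown ;;
    esac
    echo "20${yy} week ${ww}, ${dayname}, factory ${fab}"
}

decode_lid_code 1123DPM
```

For the chips listed above this prints a 2011, week 23, Thursday result, which agrees with both posts. The trailing M is left undecoded since nobody in the thread knows what it means.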

Mr.Nosmo · Jan 13, 2013

Thanks for the info guys!

I'm about to order the ram and I have nailed it down to 2 options:

http://www.alternate.de/html/product/G.Skill/DIMM_16_GB_DDR3-1600_Quad-Kit/984610/?
or
http://www.alternate.de/html/product/Corsair/DIMM_16_GB_DDR3-1600_Quad-Kit/962979/?

Should I just get the G.Skill (lowest price) or are the Corsair better?

AXm77 · Jan 13, 2013

So what about ZS272957TGG47?
2.7GHz 16 core 16MB L3 revision B3? Or it is PD?

Mr.Nosmo · Jan 13, 2013

The seller's description says it's a 6284 (Interlagos), not Piledriver. Based on the ..GG47, I guess it's a Rev B2.

Jeanjean · Jan 13, 2013

Hi.

Does anybody know the difference in terms of PPD between IL and PD at the same speed?

knopflerbruce · Jan 13, 2013

Mr.Nosmo said:
The seller's description says it's a 6284 (Interlagos), not Piledriver. Based on the ..GG47, I guess it's a Rev B2.

I think so, too. B3 would be strange at this point. Sucks that there is no date code. I wonder if the temp sensors work on that one; I got a pair of MC ES spicy chips with the same "layout", and those sensors read 0C all the time.

Mr.Nosmo · Jan 18, 2013

Before I install my coolers, should I lap my four 6282SE ES chips? I did it with success on my i7-980X (got 12C lower temps on cores 0+1 & 4+5 @ 4.4GHz/1.35V).

bowlinra · Jun 6, 2013

I've spent 3 nights reading everything about the ES IL I can find here, and I must be missing something.

I've taken my 6166HE SM H8QGi+-F system with OCNG ver 4, flashed the BIOS back to stock v3.0b, and installed the 4x 6272 ES GG44 B0.

When I disable PowerNow, I can't adjust the freq beyond 2.1 (stock). I created a script using tear's steps and it appears to run, but I get no change in clockspeed. Are reboots required for changes? Do the changes survive reboots? I must be missing a step.

Where does one start troubleshooting?

tear · Jun 6, 2013

I believe brilong has run into a similar issue. Keep PowerNow enabled in the BIOS.
We do disable APM/CPB/Turbo with TPC anyway so it shouldn't hurt anything
(EDIT: though it would be cool to find why this happens one of these days...).

I would expect the script/approach to still be viable even with PowerNow enabled.

One caveat, I would make sure that your scaling governor is set to performance
so your (tpc) P-state changes don't interfere with OS P-state changes.
If you used [H]'s fahinstall, you should be set in this department. Otherwise (if on Ubuntu
12.04) run:

Code:

sudo update-rc.d ondemand disable

and then reboot/power-cycle.
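
To confirm the governor after the reboot, something along these lines should work. It is a minimal sketch that reads the standard Linux cpufreq sysfs files; `governor_summary` is a made-up helper name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Summarize the cpufreq governor in use across all CPUs.
# $1 optionally overrides the sysfs root (default: /sys/devices/system/cpu).
governor_summary() {
    root=${1:-/sys/devices/system/cpu}
    # One line per distinct governor, prefixed by the number of CPUs using it.
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

governor_summary
```

On a 4P box you would hope to see a single line with all 64 CPUs on `performance`. Any `ondemand` entries mean the OS is still changing P-states underneath tpc. No output at all means the kernel is not exposing cpufreq for these CPUs.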

bowlinra · Jun 6, 2013

tear said:
(EDIT: though it would be cool to find why this happens one of these days...).

I'll be glad to help figure this out. I'm an Ubuntu noob. I'm running the [H] Ubuntu install.

Code:

bowlinra@amd4p:~$ sudo update-rc.d ondemand disable
update-rc.d: warning: ondemand start runlevel arguments (none) do not match LSB Default-Start values (2 3 4 5)
 Disabling system startup links for /etc/init.d/ondemand ...
 Removing any system startup links for /etc/init.d/ondemand ...
   /etc/rc2.d/S99ondemand
   /etc/rc3.d/S99ondemand
   /etc/rc4.d/S99ondemand
   /etc/rc5.d/S99ondemand
 Adding system startup for /etc/init.d/ondemand ...
   /etc/rc2.d/K01ondemand -> ../init.d/ondemand
   /etc/rc3.d/K01ondemand -> ../init.d/ondemand
   /etc/rc4.d/K01ondemand -> ../init.d/ondemand
   /etc/rc5.d/K01ondemand -> ../init.d/ondemand
bowlinra@amd4p:~$

I'll do the reboot and BIOS changes.

bowlinra · Jun 6, 2013

tear said:
One caveat, I would make sure that your scaling governor is set to performance
so your (tpc) P-state changes don't interfere with OS P-state changes.
If you used [H]'s fahinstall, you should be set in this department. Otherwise (if on Ubuntu
12.04) run:

Code:

sudo update-rc.d ondemand disable

and then reboot/power-cycle.

I'm running the [H] install and Ubuntu 12.04. I ran the command to be sure. Upon reboot I get a pop-up error message, "System program problem detected." I'm not sure what to do with this; I haven't seen it before.

I went back into the BIOS and disabled PowerNow, CPB Mode, CPU DownCore Mode, and Clock Spread Spectrum.
BIOS history: started stock -> ocng v1 -> ocng v3 -> ocng v4 -> SM 3.0b.
Memory is the recommended Crucial 1600 CAS 8 v1.5 memory.
Running tpc 0.44-rc2; I also have a special version of "TurionPowerControl" that was used in one of the threads.

Rebooted
Installed the voltcheck.sh

Code:

bowlinra@amd4p:~$ chmod +x ./voltcheck.sh
bowlinra@amd4p:~$ ./voltcheck.sh
CPU1 (min/avg/max): 966/968.0/970 mV
CPU2 (min/avg/max): 1030/1032.0/1036 mV
CPU3 (min/avg/max): 1044/1044.4/1046 mV
CPU4 (min/avg/max): 1082/1084.4/1090 mV

bowlinra@amd4p:~$ sudo clockspeed
Clockspeed (OCNG4.2)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.100 MHz
Clockspeed: 2113.629 MHz
bowlinra@amd4p:~$

I created a ./oc26.sh script

Code:

bowlinra@amd4p:~$ ls -l oc26.sh
-rwxr-xr-x 1 bowlinra bowlinra 309 Jun  6 19:35 oc26.sh

bowlinra@amd4p:~$ more oc26.sh
FREQ=2600
VCORE=1.1750
sudo ~/TurionPowerControl -boostdisable
sudo ~/TurionPowerControl -fo 1
sudo ~/TurionPowerContorl -ps 2 -vcore $VCORE -freq $FREQ
sudo ~/TurionPowerControl -ps 1 -vcore $VCORE -freq $FREQ
sudo ~/TurionPowerControl -ps 0 -vcore $VCORE -freq $FREQ
sleep 1
sudo ~/TurionPowerControl -fo 0


bowlinra@amd4p:~$ sudo ./oc26.sh
TurionPowerControl 0.44-rc2 (trunk-r177M)
Turion Power States Optimization and Control - by blackshard
Boost Lock Disabled.  Unlocked processor
Fid, Did, Vid, NodeTdp, NumBoostStates and CStateBoost can be edited
Boost disabled
APM disabled
Done.

TurionPowerControl 0.44-rc2 (trunk-r177M)
Turion Power States Optimization and Control - by blackshard
PState set to 1
Done.

sudo: /home/bowlinra/TurionPowerContorl: command not found
TurionPowerControl 0.44-rc2 (trunk-r177M)
Turion Power States Optimization and Control - by blackshard
All nodes all cores pstate: 1 -- set core voltage to 1.1750
All nodes all cores pstate: 1 -- set frequency to 2600.0000
Done.

TurionPowerControl 0.44-rc2 (trunk-r177M)
Turion Power States Optimization and Control - by blackshard
All nodes all cores pstate: 0 -- set core voltage to 1.1750
All nodes all cores pstate: 0 -- set frequency to 2600.0000
Done.

TurionPowerControl 0.44-rc2 (trunk-r177M)
Turion Power States Optimization and Control - by blackshard
PState set to 0
Done.
bowlinra@amd4p:~$

bowlinra@amd4p:~$ sudo clockspeed
Clockspeed (OCNG4.2)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.000 MHz
Clockspeed: 2103.089 MHz

bowlinra · Jun 6, 2013

Progress..

Found a typo in the script, which I apparently copied from this post: http://hardforum.com/showpost.php?p=1039853330&postcount=32

Code:

...
sudo ~/TurionPowerControl -fo 1
sudo ~/TurionPowerContorl -ps 2 -vcore $VCORE -freq $FREQ    # <-- typo: "Contorl"
sudo ~/TurionPowerControl -ps 1 -vcore $VCORE -freq $FREQ
...

Is there a difference between TurionPowerControl and tpc? I have both installed. When do you use one or the other?

The script is now working. I jumped to FREQ=2800 and VCORE=1.1750. It folded for an hour, and I assume it crashed, as it was powered down. I rebooted and got another pop-up error, "Sorry, Ubuntu 12.04 has experienced an internal error", with problem type Crash and title "colord crashed with SIGSEGV in dbus_message_get_reply_serial()".

I dropped to FREQ=2700 and VCORE=1.1750 and started folding again.
Here is the ./voltcheck.sh and clockspeed

Code:

bowlinra@amd4p:~$ sudo ./voltcheck.sh
CPU1 (min/avg/max): 1118/1123.2/1130 mV
CPU2 (min/avg/max): 1108/1117.2/1126 mV
CPU3 (min/avg/max): 1112/1116.0/1120 mV
CPU4 (min/avg/max): 1110/1115.2/1120 mV

bowlinra@amd4p:~$ sudo clockspeed
Clockspeed (OCNG4.2)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.101 MHz
Clockspeed: 2701.363 MHz
bowlinra@amd4p:~$

bowlinra@amd4p:~$ sudo tpc -temp
[sudo] password for bowlinra:
TurionPowerControl 0.44-rc2 (export)
Turion Power States Optimization and Control - by blackshard

Detected processor: Family 15h (Bulldozer/Interlagos/Valencia) Processor
Machine has 8 nodes
Processor has 8 cores
Processor has 7 p-states
Processor has 2 boost states
Processor temperature slew rate:9.0°C

Temperature table:
Node 0  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37   C6:37   C7:37
Node 1  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37   C6:37   C7:37
Node 2  C0:46   C1:46   C2:46   C3:46   C4:46   C5:46   C6:46   C7:46
Node 3  C0:46   C1:46   C2:46   C3:46   C4:46   C5:46   C6:46   C7:46
Node 4  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33
Node 5  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33
Node 6  C0:31   C1:31   C2:31   C3:31   C4:31   C5:31   C6:31   C7:31
Node 7  C0:31   C1:31   C2:31   C3:31   C4:31   C5:31   C6:31   C7:31

Done.
bowlinra@amd4p:~$

What kind of Freq and Vcore should I be aiming for? Also, should I be concerned about the error message popping up in the GUI?

Grandpa_01 · Jun 7, 2013

TurionPowerControl and tpc are the same; tpc is just the newest version. CPU #2 appears to be running a bit warmer than its counterparts, but well within limits, so no big deal. It may be the TIM, but it may also be the CPU. For tpc just run
sudo tpc -fo 1
sudo tpc -ps 2 -vcore $VCORE -freq $FREQ
sudo tpc -ps 1 -vcore $VCORE -freq $FREQ
I am not sure what you should be shooting for as far as Freq and Vcore, but any error when booting usually points to some sort of instability. I would try upping the voltage a little if possible.

bowlinra · Jun 7, 2013

I noticed that yesterday, so I pulled the CM 212 off, re-positioned it, and thought I had tightened it down pretty well. Maybe it needs to be cleaned off and the TIM reapplied.

I guess I also need to look into the voltage regulator heatsinks. Do we know the optimal mix of Enzotech MOS-C10 and Enzotech MOS-C1 for an SM H8QGi+-F board?

tear · Jun 7, 2013

bowlinra, nice find, thank you

Still, you copied the instructions from troubleshooting thread with WIP, non-released version of TPC.

These instructions only work with the TPC quoted in the post and I would generally recommend
against using them (both instructions and the TPC) for typical OC (what you're after) as the syntax
of WIP TPC is different compared to regular TPC and it may change again before release.

What's most appropriate for your chips is installing TPC per the G34 checklist thread (see guides)
and using something along these lines: http://hardforum.com/showpost.php?p=1039116320&postcount=7
for OC (like you've observed, you may need to have PowerNow enabled).

Also, to remove WIP TPC you can use:

Code:

rm ~/TurionPowerControl

tpc and TurionPowerControl are the same (tpc is just a symlink) if you install an official version.
It's not necessarily true in case of one-off versions (as I don't usually bother with making them
lean and mean...).

bowlinra · Jun 7, 2013

Oh... removed TurionPowerControl..

What about the pop-up error message? That appeared upon reboot after this "sudo update-rc.d ondemand disable"


tear · Jun 7, 2013

Just to be clear, the idea was to remove the WIP TurionPowerControl binary from your home directory.
Stock TPC should still be installed as _both_ TurionPowerControl and tpc in /usr/bin.

That said,

Code:

ls -l /usr/bin/Turion* /usr/bin/tpc

should return two lines, one binary (TurionPowerControl) and one symlink (tpc).

And

Code:

ls -l $HOME/TurionPowerControl

should return 'No such file or directory' error.

About colord -- its function is quite far from what we're doing so the crash is unlikely to be related.

More than that, in this post I said that update-rc.d line was applicable only to non-[H] installations
(as [H]'s fahinstall performs this operation anyway). In other words, you didn't need to run it
and running it didn't in fact modify the configuration (fahinstall did it first).

So... colord's crash is likely caused by something else. Indeed, google returns quite many hits https://www.google.com/search?q=colord+"dbus_message_get_reply_serial"

Colord isn't something we critically need for FAH operation so I'd probably ignore it...

bowlinra · Jun 7, 2013

Excellent! colord sounds unimportant and tpc appears to be installed fine. On to more OC'ing and watching the heat.

Thanks for all the awesome help once again!

plext0r · Jun 7, 2013

tear said:
I believe brilong has run into a similar issue. Keep PowerNow enabled in the BIOS.
We do disable APM/CPB/Turbo with TPC anyway so it shouldn't hurt anything

True. Until my GL board bought the farm (in the RMA process now), I left PowerNow enabled, but then I software-disabled it using tpc. Only this allowed me to run the IL ES chips above their stock clocks.

bowlinra · Jun 9, 2013

System has locked up 3 or 4 times today.. I can't seem to push the 6272 ES to 2.7Ghz, even had a lockup at 2.6Ghz. I would have thought I should be able to do better. Any thoughts? Ideas?

tear · Jun 9, 2013

Few things you can do:

1. Check/monitor temps at load (tpc -temp / tpc -mtemp).

2. What vcore are you at? Does cranking it up a little bit help?

3. Check for MCEs: http://hardforum.com/showpost.php?p=1039812753&postcount=2 (second code
  box)

4. Try isolating the issue, run FAH on two CPUs at a time (-smp 32); take advantage of Kraken's
  'startcpu' configuration option (see http://hardforum.com/showthread.php?p=1039857981 on how to
  use it) to test CPUs 32-63

bowlinra · Jun 9, 2013

At FREQ=2700 and VCORE=1.1750 the server locks up in minutes. (By lock-up I mean it appears to shut down with no fans spinning, and only powers back on if the cord is physically disconnected from the PSU.) I've even pulled it off the UPS, in case there was an issue with overdraw, and I'm seeing the same problem.

I'm not sure how much headroom I have on the vcore. I changed it to 1.1800, but it appears to round down to 1.1750.

1. Temps appear fine. I can confirm that with the 62xx chips, nodes 2 & 3 are CPU3 (not CPU2).

Code:

Node 0  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38   C6:38   C7:38
Node 1  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38   C6:38   C7:38
Node 2  C0:45   C1:45   C2:45   C3:45   C4:45   C5:45   C6:45   C7:45
Node 3  C0:45   C1:45   C2:45   C3:45   C4:45   C5:45   C6:45   C7:45
Node 4  C0:34   C1:34   C2:34   C3:34   C4:34   C5:34   C6:34   C7:34
Node 5  C0:35   C1:35   C2:35   C3:35   C4:35   C5:35   C6:35   C7:35
Node 6  C0:32   C1:32   C2:32   C3:32   C4:32   C5:32   C6:32   C7:32
Node 7  C0:32   C1:32   C2:32   C3:32   C4:32   C5:32   C6:32   C7:32

2. I can't get all the scripts to run at 2700 before it crashes, so I've dropped back to FREQ=2600 and VCORE=1.1750.
3. sudo tpc -l Output
sudo tpc -CM

Code:

Ts:695466 -
Node 0  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 40
Node 1  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 40
Node 2  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 48
Node 3  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 48
Node 4  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 36
Node 5  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 36
Node 6  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 34
Node 7  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 34
Node0
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node1
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node2
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node3
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node4
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node5
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node6
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
Node7
 C0:     0     0   589     0     0     0     0       C1:     0     0   589     0     0     0     0
 C2:     0     0   589     0     0     0     0       C3:     0     0   589     0     0     0     0
 C4:     0     0   589     0     0     0     0       C5:     0     0   589     0     0     0     0
 C6:     0     0   589     0     0     0     0       C7:     0     0   589     0     0     0     0
MinTctl:33       MaxTctl:48

Code:

bowlinra@amd4p:~/fah$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(lsrt /var/log/{messages,syslog}*)
ls: cannot access /var/log/messages*: No such file or directory
bowlinra@amd4p:~/fah$

4. I keep thinking back to the fact that CPU4 isn't an exact match for the other CPUs; could this be creating these problems? http://hardforum.com/showpost.php?p=1039938919&postcount=17
I'll need to read up on how to isolate the CPUs with thekraken and get that working.
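
The MCE check in step 3 above depends on an `lsrt` helper that is not shown in this thread, and it stumbles on Ubuntu because /var/log/messages does not exist there. A minimal stand-in, assuming you only need to grep whichever plain-text logs are present, could look like this (`scan_mce` is a made-up name, and the search pattern is copied from the command above):

```shell
# Grep whichever of the given log files exist for machine-check traces.
# Missing or unreadable paths (e.g. an unmatched glob) are skipped quietly.
scan_mce() {
    for f in "$@"; do
        { [ -f "$f" ] && [ -r "$f" ]; } || continue
        grep -E -i -H 'mcelog|machine.check|hardware.error' "$f"
    done
    return 0
}

scan_mce /var/log/messages* /var/log/syslog*
```

Run it with sudo if the logs are not world-readable. Rotated `.gz` logs are not decompressed by this sketch, so older events would need `zgrep` or similar. No output simply means nothing matched in the files that could be read.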

tear · Jun 9, 2013

A-ha! Very important detail. So, not a lock-up (freeze) but spontaneous poweroff/shutdown.

Give this one a read in your spare time (somewhat similar): http://hardforum.com/showthread.php?t=1760178

The kind of poweroff/shutdown you've described suggests PSU's over-current protection
kicking in. What PSU are you using? Did you connect at least two EPS12V cables?
Are you using any X->EPS12V adapters? If so, what kind and how many?
Do you have another PSU you could try?

I'm wondering if it's possible for something-else-than-+12V to hit the current limit...
that would be quite... strange.

bowlinra · Jun 9, 2013

Using an EVGA Classified SR-2 1200 watt (Silver) PSU http://www.evga.com/products/pdf/100-ps-1200-gr.pdf
I have both EPS12V cables plugged in. The same PSU and board combo has been running a set of 6166HE @ 2.27GHz at 522 watts for about a year.

I have a brand new SeaSonic 850W (Gold) sitting here for another project, but that would likely be tight. The AMD 6272ES rig @ 2.6GHz is currently pulling 773 watts per the Kill-A-Watt. I have a Corsair 1200 on my AMD 6176SE rig (2.748GHz, 699 watts), but it would be a little work to swap the two.

Should I look to isolate CPU4 and OC CPUs 1-3, or does this point toward another solution?

bowlinra · Jun 9, 2013

bowlinra said:
Should I look to isolate CPU4 and OC CPUs 1-3, or does this point toward another solution?

Status update:
ran my oc27.sh

Code:

bowlinra@amd4p:~$ cat ./oc27.sh
FREQ=2700
VCORE=1.1825
sudo tpc -boostdisable
sudo tpc -fo 1
sudo tpc -set ps 2 freq $FREQ vcore $VCORE
sudo tpc -set ps 1 freq $FREQ vcore $VCORE
sudo tpc -set ps 0 freq $FREQ vcore $VCORE
sleep 1
sudo tpc -fo 0

bowlinra@amd4p:~$ sudo clockspeed
Clockspeed (OCNG4.3)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.000 MHz
Clockspeed: 2700.598 MHz

Uninstalled thekraken, rewrapped it starting at CPU 32, and started folding with these commands.

Code:

thekraken -u
thekraken -i -c startcpu=32
screen ./fah6 -smp 32

Confirm Temps on Nodes 4-7 (Knowing this is physical CPU socket 2 & 4)

Code:

Temperature table:
Node 0  C0:24   C1:24   C2:24   C3:24   C4:24   C5:24   C6:24   C7:24
Node 1  C0:24   C1:24   C2:24   C3:24   C4:24   C5:24   C6:24   C7:24
Node 2  C0:25   C1:25   C2:25   C3:25   C4:25   C5:25   C6:25   C7:25
Node 3  C0:25   C1:25   C2:25   C3:25   C4:25   C5:25   C6:25   C7:25
Node 4  C0:36   C1:36   C2:36   C3:36   C4:36   C5:36   C6:36   C7:36
Node 5  C0:36   C1:36   C2:36   C3:36   C4:36   C5:36   C6:36   C7:36
Node 6  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33
Node 7  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33

Working a P8836 TPF 3:17, will need to run about 5 hours, assuming success.
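
For anyone trying to follow which socket is being loaded, here is a rough sketch of the core-to-node-to-socket mapping as reported in this thread. The 8 cores per node figure comes from the tpc output, and nodes 0-3 belonging to CPU1 and CPU3 is from bowlinra's observations. How nodes 4-7 split between CPU2 and CPU4 is an assumption, and the whole table may differ on other boards. `core_to_socket` is a made-up name.

```shell
# Map a Linux CPU number to its node and physical socket on this 4P board.
# 8 cores per node per the tpc output; the node-to-socket table is pieced
# together from posts in this thread and is partly a guess.
core_to_socket() {
    node=$(( $1 / 8 ))
    case "$node" in
        0|1) socket=CPU1 ;;
        2|3) socket=CPU3 ;;
        4|5) socket=CPU2 ;;   # assumed split within nodes 4-7
        6|7) socket=CPU4 ;;   # assumed split within nodes 4-7
        *)   socket=unknown ;;
    esac
    echo "core $1 -> node $node -> $socket"
}

core_to_socket 0
core_to_socket 32
core_to_socket 63
```

With `startcpu=32` and `-smp 32`, the work lands on cores 32-63, i.e. nodes 4-7, which matches the warm nodes in the temperature table above.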

tear · Jun 9, 2013

My guess is as good as yours

The 'binary search' that you're doing lets us kill two birds with one stone:
1. It helps us isolate the issue to a single CPU (should it be a CPU that's causing it)
2. If it's an overall power delivery issue, then you should be able to crank 3.0 GHz out of a CPU
  pair (but one pair at a time!) without 'the symptom' manifesting... (EDIT: and, ofc, operate
  at 2.7 w/o issues as well).

I would _not_ obsess about the odd CPU, personally...

EDIT: vcore resolution is 12.5 mV, so the next step after 1.1750V is 1.1875V.
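
With a 12.5 mV resolution, the valid settings around a requested vcore can be worked out directly. The sketch below snaps a request down to the grid and prints the next step up. Rounding down is an assumption chosen to match the behaviour reported earlier (1.1800 ending up as 1.1750); tpc itself may round differently, and `vcore_steps` is a made-up name.

```shell
# Snap a requested vcore down to a 12.5 mV grid and show the next step up.
# Rounding *down* is an assumption; it matches the 1.1800 -> 1.1750 report.
vcore_steps() {
    awk -v v="$1" 'BEGIN {
        step = 0.0125
        n = int(v / step + 1e-6)      # small epsilon guards against float error
        printf "%.4f %.4f\n", n * step, (n + 1) * step
    }'
}

vcore_steps 1.1800   # effective setting, then the next valid step
vcore_steps 1.1750
```

Both calls print the same pair, 1.1750 and 1.1875. Under these assumptions, a value that falls between grid points (such as 1.1800) gives no increase over 1.1750.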

bowlinra · Jun 10, 2013

So upon successful completion, I should go freq=2800, instead of switching the CPU pairs correct?
About 2 hours in and still running.

Code:

bowlinra@amd4p:~$ sudo ./voltcheck.sh
CPU1 (min/avg/max): 1206/1206.0/1206 mV
CPU2 (min/avg/max): 1114/1121.2/1128 mV
CPU3 (min/avg/max): 1220/1220.0/1220 mV
CPU4 (min/avg/max): 1118/1123.6/1128 mV

tear · Jun 10, 2013

I think doing either makes sense, that is, cranking the clock on current CPU pair
or switching to the other pair.

You've had the most experience -- if you're certain that 4P config would've collapsed by now,
then you can vouch this pair as "good" (or at least not horribly broken) and switch to the
other pair.

Though, like I said, I don't have a strong preference

bowlinra · Jun 10, 2013

Completed WU at Freq=2800. (So 1 WU at 2700 and another at 2800)

Changed to run with the other pair at 2700. Used the following commands.

Code:

sudo ./oc27.sh
thekraken -u
thekraken -i
screen ./fah6 -smp 32

Confirm Temps on Nodes 0-3 (Knowing this is physical CPU socket 1 & 3)

Code:

Temperature table:
Node 0  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38   C6:38   C7:38
Node 1  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38   C6:38   C7:38
Node 2  C0:44   C1:44   C2:44   C3:44   C4:44   C5:44   C6:44   C7:44
Node 3  C0:44   C1:44   C2:44   C3:44   C4:44   C5:44   C6:44   C7:44
Node 4  C0:19   C1:19   C2:19   C3:19   C4:19   C5:19   C6:19   C7:19
Node 5  C0:19   C1:19   C2:19   C3:19   C4:19   C5:19   C6:19   C7:19
Node 6  C0:20   C1:20   C2:20   C3:20   C4:20   C5:20   C6:20   C7:20
Node 7  C0:20   C1:20   C2:20   C3:20   C4:20   C5:20   C6:20   C7:20

bowlinra@amd4p:~$  sudo clockspeed
Clockspeed (OCNG4.3)
Family 15h
Turbo is supported. 2 boost state(s).
Running, please wait...
Refclock: 200.001 MHz
Clockspeed: 2700.014 MHz

bowlinra@amd4p:~$ sudo ./voltcheck.sh
CPU1 (min/avg/max): 1126/1130.8/1136 mV
CPU2 (min/avg/max): 1218/1219.6/1220 mV
CPU3 (min/avg/max): 1124/1127.6/1132 mV
CPU4 (min/avg/max): 1228/1228.0/1228 mV

Working a P8571 TPF 3:20, will need to run about 5 hours, assuming success. Then will move to 2800 and run another WU.

bowlinra · Jun 11, 2013

Status update.
Running on cores 0-31 for almost a full day. Ran 1 WU at 2700 and one WU each at 2800 and 2900. Currently running a WU at 3000 overnight.

sudo tpc -l | pastebinit #Output

Cores 32-63 have completed WUs at both 2700 and 2800. Hope I can get a 3000 run kicked off in the morning.

Next step back to -smp 64 at 2700?

tear · Jun 11, 2013

Guess so! Just to confirm it collapses like it did in the past...

If it does, I would definitely experiment with PSUs. AX1200 is a solid performer many of us
use with 4Ps, some at 90%+ of its output. Right, Grandpa?
