Trying to tune 16 core 62xx ES

Guess so! Just to confirm it collapses like it did in the past...

If it does, I would definitely experiment with PSUs. AX1200 is a solid performer many of us
use with 4Ps, some at 90%+ of its output. Right, Grandpa? ;)

My only beef with the AX1200 is that Corsair shipped my PSU with badly-wired 8-pin cables (all four positive wires were swapped with all four grounds). I ended up returning it for a Seasonic 1250W PSU. I've used Seasonic for nearly all my other PSUs; I went with Corsair once and got screwed.

Yes, I could have used my pin-removal tool to rewire the 8-pin cable correctly, but who knows what else they did wrong on that unit?

:confused:
 
I wasn't suggesting a purchase (since you own one) but trying a unit that is known to have withstood
systems consuming close to 1.2kW.

Your 850W unit should be fine as well (at least enough to eliminate/confirm the PSU being a variable).
My 3.0 IL at ~1.175V consumed 800-820W at the wall depending on ambient temp.

Both these units are single rail designs which is worth noting.
 
Completed 2 WUs at 3000 and finishing a 3rd. Maybe it has something to do with the UPS; I'm plugged directly into the wall for all these tests. I plan to leave it alone and validate that the PSU is solid before moving back to the UPS. <Side note - now that I'm watching very carefully, it's working just fine...>
 
I have 2 AX1200s and had to RMA both of them at some point. I've had no problem getting either of them replaced; at least Corsair stands behind the product. I have no issues with them, they're just in use in other systems currently. I'll pull one if it comes to that.
 
Some folks have seen issues folding BA units but didn't see issues w/SMP.
Similarly, configurations w/o Kraken are less likely to crash the H/W compared
to configurations with Kraken.

Just two more cents from me.

If you have one, use a backed-up BA unit to stress the system (remember to incl. Kraken).
If not, drop me a PM. More info on how to prep a unit for standalone testing: http://hardforum.com/showpost.php?p=1038639726&postcount=433
 
Ran everything fine in 32-core mode up to 3.0GHz; I didn't push it past that. Moved to 64 cores and 3.0GHz: SMP folds, and I believe it completed 2 WUs. Any bigadv WU dies as soon as all the threads ramp up to 6400%. At 2.7GHz, I started the FAH client, then ran to the garage to check the Kill-a-Watt and watched the machine power off as it was pushing 795 Watts. (I assume this was the problem for the failed attempts at 3.0, 2.9 and 2.8; I didn't have the meter on it at the time.) Then I went to 2.6GHz, where it seems to max out at about 775 Watts and is still folding.

I'll swap PSU tomorrow night, looks like you nailed this one.
 
Swapped PSUs to a brand new Seasonic 850W; it's been running above 2.7GHz for 2 hours, currently at 2.9GHz @ 816W.
I'll let that bake for another 24 hours, then play musical PSUs and move the Corsair AX1200 to this rig to see how much we can OC. I've also looked into ordering some Enzotech MOSFET coolers (MOS-C1, MOS-C10); any idea how many of each to buy for an SM H8QGi board?
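A rough way to put the 816W wall reading in context is to estimate how much of the PSU's rated output it represents. The sketch below assumes a conversion efficiency of 90%, which is a guess and not a measured figure for this unit; `psu_load_pct` is a made-up helper name.

```shell
# Estimate DC-side load as a percentage of the PSU's rated output.
# usage: psu_load_pct WALL_WATTS EFFICIENCY RATED_WATTS
# EFFICIENCY is an assumption (e.g. 0.90), not a measured value.
psu_load_pct() {
    awk -v wall="$1" -v eff="$2" -v rated="$3" \
        'BEGIN { printf "%.0f\n", wall * eff / rated * 100 }'
}

psu_load_pct 816 0.90 850   # wall reading quoted above, 850W unit
```

Under that assumed efficiency the 850W unit would be sitting at roughly 86% of its rating, so there is not much margin left for further OC on it.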
 
I ended up using the Enzotech BMR-C1 since the MOS-C1 was out of stock or discontinued at multiple stores. I got two packs from Amazon for my GL board and I had to cut two in half to fit around some capacitors between the VRMs. I believe the Gi board can make do with all 14x14mm heatsinks.

The BMR-C1 ends up covering 2, 3 or 4 VRMs at once and I believe I ended up using 14 of them for the GL board.

http://www.amazon.com/ENZOTECH-Memory-Ramsink-BMR-C1-Heatsink/dp/B002BWXW6E/ref=sr_1_1?ie=UTF8&qid=1371131668&sr=8-1&keywords=bmr-c1&tag=hardfocom-20
 
With the new 850W PSU, the system has "locked up" twice (i.e. can't ping, can't SSH, BUT the motherboard fans are still running and it continues to draw some 645W), once at 2.9GHz and once at 2.8GHz. I can't say that's an improvement.

Any ideas?

Code:
bowlinra@amd4p:~/fah$ dmesg | tail -n 20 | pastebinit
Pastebinit Output
 
That's one nasty SOB you've got there.

Try folding at 2.8 one/two more times and see what happens (if it crashes the same or different way).

Next (after one/two extra runs) I would check for MCEs: http://hardforum.com/showpost.php?p=1039812753&postcount=2
(second code box). If there are no MCEs, I'd bump vcore to 1.20 (still at 2.8) and see if it makes
the machine collapse sooner (or perhaps later).
 
Been folding for 23hrs at 2800.

bowlinra@amd4p:~$ sudo tpc -l | pastebinit #Output

sudo tpc -CM
Code:
MinTctl:27       MaxTctl:42

Ts:83073349 -
Node 0  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 35
Node 1  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 35
Node 2  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 42
Node 3  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 42
Node 4  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 33
Node 5  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 33
Node 6  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 27
Node 7  c0:ps2 - c1:ps2 - c2:ps2 - c3:ps2 - c4:ps2 - c5:ps2 - c6:ps2 - c7:ps2 - Tctl: 27
Node0
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node1
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node2
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node3
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node4
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node5
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node6
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
Node7
 C0:     0     0   591     0     0     0     0       C1:     0     0   591     0     0     0     0
 C2:     0     0   591     0     0     0     0       C3:     0     0   591     0     0     0     0
 C4:     0     0   591     0     0     0     0       C5:     0     0   591     0     0     0     0
 C6:     0     0   591     0     0     0     0       C7:     0     0   591     0     0     0     0
MinTctl:27       MaxTctl:42

I've never seen this command work. I assume the file is only created when there are errors? Or am I missing some files?

Code:
bowlinra@amd4p:~$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(ls -rt /var/log/{messages,syslog}*)
ls: cannot access /var/log/messages*: No such file or directory
 
It checks two file patterns: /var/log/messages* and /var/log/syslog*. Some Linux distributions
use only one of them -- in that case grep will report it missing but will still check the other one.
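A minimal sketch of the same check, written so that a missing log family is reported plainly and does not show up as an `ls` error. `scan_logs` is a hypothetical helper name, and the sketch only reads uncompressed logs (rotated `.gz` files would need `zgrep`).

```shell
# Search both log families for hardware-error hints, reporting a family
# that does not exist instead of failing with "ls: cannot access".
# usage: scan_logs [LOG_DIR]   (scan_logs is a hypothetical helper name)
scan_logs() {
    dir="${1:-/var/log}"
    pattern='mcelog|machine.check|hardware.error'
    for base in messages syslog; do
        found=0
        for f in "$dir/$base"*; do
            [ -f "$f" ] || continue               # glob matched nothing
            found=1
            case "$f" in *.gz) continue ;; esac   # compressed logs need zgrep
            hits=$(grep -Ei "$pattern" "$f") || continue   # no hits in this file
            printf '%s\n' "$hits" | awk -v f="$f" '{ print f ": " $0 }'
        done
        [ "$found" -eq 1 ] || echo "no $base* files in $dir"
    done
}
```

Run as `sudo sh -c '. ./scan_logs.sh; scan_logs'` (the file name is only an example). Seeing "no messages* files" and nothing else means the syslog files were searched and came up clean.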

Extra cooling helps only a little -- it doesn't bring you closer to understanding which
specific component/area needs attention. That's why I was hoping to see *something*
in the logs.

Also, your crashes could be kernel panics (which, by nature, can't be logged) -- can you
switch to the text console so we can see it should you run into one? -- press Ctrl+Alt+F1.
 
Code:
bowlinra@amd4p:~$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(ls -rt /var/log/{messages,syslog}*)
ls: cannot access /var/log/messages*: No such file or directory

Ok, so not having a /var/log/messages file means there have been no error messages recorded. (Probably shut down before they were written?)

Also, your crashes could be kernel panics (which, by nature, can't be logged) -- can you
switch to the text console so we can see it should you run into one? -- press Ctrl+Alt+F1.
So if I find the box locked up (unresponsive over the network, but the fans still spinning), I should plug in the monitor & keyboard, hit "Ctrl+Alt+F1" and take a picture of the screen.

Since I'm home and able to watch it more carefully, should I move up to 2.9GHz to increase the likelihood of a problem?
 
[Image: p10707752.jpg]

That is either the cleanest garage or dirtiest living room I have seen :D
 
Code:
bowlinra@amd4p:~$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(ls -rt /var/log/{messages,syslog}*)
ls: cannot access /var/log/messages*: No such file or directory

Ok, so not having a /var/log/messages file means there have been no error messages recorded. (Probably shut down before they were written?)
Rather, it's the fact that nothing was found in the syslog file :)

So if I find the box locked up (unresponsive over the network, but the fans still spinning), I should plug in the monitor & keyboard, hit "Ctrl+Alt+F1" and take a picture of the screen.
You need to do this ahead of time. When locked up, Ctrl+Alt+F1 will yield nothing...


Since I'm home and able to watch it more carefully, should I move up to 2.9GHz to increase the likelihood of a problem?
It's really your call... I would probably leave it hoping for some less catastrophic event to occur.
 
Well, it's still running 3 days later. I'm getting ready to head out of town for a week and am thinking about just leaving it running.
 
You won't believe what happened today. The circuit breaker blew before I could get home, and blew again an hour or so after I got home. I couldn't figure out why I couldn't power up the UPS until....




Fortunately nothing caught fire and I discovered it right before leaving on a 9 day vacation. (I normally leave my systems running while I'm out.)

Anyway, long story. Still nothing in the MCE logs, and the OC seems pretty stable *knocking on wood*
 
It looks to me like you are using an extension cord that is still rolled up in its box, and a rolled-up cord gets hot under load. Always pull the cable fully out of the box before putting load on it; this may well be the reason why things went wrong.
 
Thanks AndyE, and as they say, it is the first that is difficult, so I hope the next one comes easier.:D
 
220V is an interesting thought. Would that lower the amps and the electric bill? I.e. swap the backplane on the UPS to 220V and connect the servers, which normally run on 110V? Or is that wishful thinking?
 
220V supply should improve power conversion efficiency in practically all cases by like 1-3%, but you would want to run 220 into the PSU. Nearly all recent PSUs of quality are 110/220 auto-sensing and even cheaper or older PSUs will have a 110/220 switch, but you will probably want to confirm for your specific model anyway. I am not sure if it's practical to reconfigure a UPS for 220V output aside from models designed for it, you'd probably be better off finding a 220V UPS (possibly one at surplus with dead batteries that you can swap your good batteries into, if budget conscious). That said, when running close to or exceeding the rated amperage of your power connectors or wiring at 110V, or close to the rated power handling capacity of the PSU, you likely stand to gain somewhat more than that 1-3% improvement in efficiency by using 220V with connectors appropriate for your load :p
 
Good answer Tiberville.
I was once told to balance the load in my house, as it would increase the efficiency of the transformer at the pole. The theory, which makes sense to me, was that if one side (110v) of the 220v transformer was loaded more heavily, the other side would be somewhat inefficient and the loaded side would draw slightly more amperage. Slight being the key word here. So from time to time I go with a meter and check my two 220 legs to see what each is pulling. If one side is higher than the other, I try switching breakers around till I get it as close as possible to balanced.
Probably doesn't make too much difference but every penny counts.
 
This probably needs to be moved to a separate thread, BUT I was told by someone that moving from 110V to 220V cuts the amps used in half. Would that not cut the watts used in half, and thus my folding electric bill in half?
 
P = VA. (P = watts, V = Volts, A = amps)

If you double the volts, you halve the amps, but the power remains exactly the same. Sorry. :(
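To put numbers on it, here is a small sketch that uses the 816W wall reading from earlier in the thread as the load. `amps_at` is just an illustrative helper name.

```shell
# Current drawn by a fixed load at a given supply voltage: A = P / V.
# usage: amps_at WATTS VOLTS
amps_at() {
    awk -v w="$1" -v v="$2" 'BEGIN { printf "%.1f\n", w / v }'
}

amps_at 816 110   # current at 110V
amps_at 816 220   # half the current at 220V, same 816W on the bill
```

The current goes from about 7.4A to about 3.7A, but the meter bills watt-hours, and the watts have not changed.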
 
Back from vacation and discovered I'm completely maxing out the 15Amp circuit to my rack, which would appear to be the root cause of these problems. I'm in the process of figuring out how to get more power to the garage. Leaning toward a 60Amp sub panel in the garage, so 240V could be an option.
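As a rough sanity check on a 15Amp circuit, the sketch below works out how much headroom a single load leaves. The 80% derating for continuous loads is a commonly quoted rule of thumb that is assumed here, not something stated in this thread; 110V matches the figure used above, and `headroom_watts` is a hypothetical helper. Check the local electrical code before relying on any of it.

```shell
# Watts left on a circuit after derating it for continuous load.
# usage: headroom_watts BREAKER_AMPS VOLTS LOAD_WATTS
# The 80% factor is an assumed rule of thumb, not taken from the thread.
headroom_watts() {
    capacity=$(( $1 * $2 * 80 / 100 ))
    echo $(( capacity - $3 ))
}

headroom_watts 15 110 816   # one rig at the 816W reading quoted earlier
```

With those assumptions a single rig at 816W leaves only about 500W for everything else on the rack, so a second folding box or a UPS charging its batteries would be enough to push the circuit over.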
 
Also, your crashes could be kernel panics (which, by nature, can't be logged) -- can you
switch to the text console so we can see it should you run into one? -- press Ctrl+Alt+F1.

Ok.. Finally captured something worth looking at: folding at 2900, it hung within about 3 hours. Here is the screen picture.



Nothing in the MCE logs either.
 
No MCEs, k...

It looks like other things are going on, though... seems the machine may have run
into some rcu issues -- they triggered the NMI backtraces that filled the whole
screen (and then some). Finally (5 minutes later) we can see a soft-lockup on CPU 56.

RCU issues are fairly common when OCing (though I do not know what exactly is causing them)
and, from my experience, are harmless as long as you don't get too many of them
("inexact science").

Soft-lockup -- haven't seen those in context of OCing -- I got nothing. Though at least
we do have a CPU number.

Was the machine completely hung when you saw the screen?

In any event, I'd like to see the logs (hoping most of it got saved) -- can you tar/zip
all syslog* files from /var/log and e-mail them to me? -- [email protected]

Also, what I would probably do (and yes, it's painful) -- run at 2900 again and wait
for another lockup. The hope is that, with enough data, patterns will emerge.
 
I've PMed the logs and am running at 2900 and watching.
Here are the contents of /var/log:
Code:
bowlinra@amd4p:/var/log$ dir
alternatives.log    bootstrap.log   hp                     samba
alternatives.log.1  btmp            installer              speech-dispatcher
apache2             btmp.1          jockey.log             syslog
apport.log          ConsoleKit      jockey.log.1           syslog.1
apport.log.1        cups            kern.log               syslog.2.gz
apport.log.2.gz     dist-upgrade    kern.log.1             syslog.3.gz
apport.log.3.gz     dmesg           kern.log.2.gz          syslog.4.gz
apport.log.4.gz     dmesg.0         kern.log.3.gz          syslog.5.gz
apport.log.5.gz     dmesg.1.gz      kern.log.4.gz          syslog.6.gz
apport.log.6.gz     dmesg.2.gz      lastlog                syslog.7.gz
apport.log.7.gz     dmesg.3.gz      lightdm                udev
apt                 dmesg.4.gz      mail.err               ufw.log
auth.log            dpkg.log        mail.log               unattended-upgrades
auth.log.1          dpkg.log.1      news                   upstart
auth.log.2.gz       dpkg.log.2.gz   pm-powersave.log       wtmp
auth.log.3.gz       dpkg.log.3.gz   pm-powersave.log.1     wtmp.1
auth.log.4.gz       faillog         pm-powersave.log.2.gz  Xorg.0.log
boot                fontconfig.log  pm-powersave.log.3.gz  Xorg.0.log.old
boot.log            fsck            pm-powersave.log.4.gz

bowlinra@amd4p:/var/log$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(ls -rt /var/log/{messages,syslog}*)
ls: cannot access /var/log/messages*: No such file or directory
 
Ok, I have extended/revised the grep to look like this (you may wish to update your notes):
Code:
sudo zgrep -Ei 'mcelog|machine.check|hardware.error|detected.stall|lockup|oops:' $(ls -rt /var/log/{messages,syslog}* 2> /dev/null)
I'm thinking we should integrate it with fahdiag somehow...

Anyway, when I run this zgrep on your logs, I get:
Code:
syslog.2:Jun 29 00:57:44 amd4p kernel: [174602.212241] Oops: 0000 [#1] SMP 
syslog.2:Jun 29 00:58:46 amd4p kernel: [174602.437479] Watchdog detected hard LOCKUP on cpu 60
syslog.2:Jun 29 00:58:46 amd4p kernel: [174603.000194] Watchdog detected hard LOCKUP on cpu 21
syslog.2:Jun 29 00:58:46 amd4p kernel: [174662.444131] INFO: rcu_sched detected stalls on CPUs/tasks: { 60} (detected by 6, t=15002 jiffies)
syslog.1:Jun 29 13:57:32 amd4p kernel: [41040.680246] INFO: rcu_bh detected stall on CPU 26 (t=0 jiffies)

We see CPU 60 failing, possibly taking CPU21 with it.
Zero-jiffy long stall (CPU26) can be ignored given we're seeing more serious issues.

In your pic, we see a collapse of CPU56 -- same socket as CPU60.

You could consider cranking vcore only on the last CPU -- nodes 6/7 (but not the others).
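The socket reasoning above can be sketched as follows. It assumes logical CPUs are numbered contiguously with 8 per node and 2 nodes per socket, which matches the 8 nodes of 8 cores in the tpc output but is otherwise unverified; `lscpu` or /sys/devices/system/cpu/cpuN/topology/physical_package_id gives the real mapping. The helper names are made up.

```shell
# Assumed layout (not confirmed by the thread): 8 logical CPUs per node,
# 2 nodes per socket, numbered contiguously from 0.
cpu_node()   { echo $(( $1 / 8 )); }
cpu_socket() { echo $(( $1 / 16 )); }

# Read kernel log lines on stdin, pull out every "on cpu N" number and
# print one "cpu N node X socket Y" row per distinct CPU.
# usage: dmesg | locate_cpus
locate_cpus() {
    grep -Eio 'on cpu [0-9]+' | awk '{ print $3 }' | sort -nu |
    while read -r cpu; do
        echo "cpu $cpu node $(cpu_node "$cpu") socket $(cpu_socket "$cpu")"
    done
}
```

Under that layout CPU 56 and CPU 60 both land on node 7 of socket 3, while CPU 21 sits on socket 1, which is what points at the last CPU as the one to try first.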

This actually brings us to a fine-tuning topic, that is, configuring different vcores
for different CPUs (for optimal performance*). There's a process we've been using
for a while but it has never been written down. I may go about posting it once
I settle in the EU (3 weeks from now); in the meantime, feel free to jump into #hardocp
or #hardfolding on freenode.

*) in other words, it may be you've been cranking voltages of all CPUs even though
   only one CPU actually needed it
 