OC 4x 6172 on [H] Appliance

bowlinra · Jul 15, 2013

I ran across a deal on some 6172s that I just couldn't pass up (I'm sure you understand), and I want to make sure I'm getting the most I can out of them. I can't seem to get anything over 232 (2.436 GHz) to fold for more than 24 hours. So I rebuilt the system with the [H] Appliance image and started over, getting basically the same results. Am I topped out, or does anyone have some ideas?
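For readers following the numbers, here is a minimal sketch of how a reference clock maps to core clock. It assumes the 6172 is a 2.1 GHz part at a stock 200 MHz reference clock (a 10.5x multiplier), which matches the 232 -> 2.436 GHz figure above; the function and constant names are made up for illustration.

```python
# Assumption: Opteron 6172 runs 2100 MHz at a 200 MHz stock reference clock,
# i.e. a fixed 10.5x core multiplier. Names below are illustrative only.
STOCK_REFCLOCK_MHZ = 200.0
STOCK_CORE_MHZ = 2100.0
CORE_MULTIPLIER = STOCK_CORE_MHZ / STOCK_REFCLOCK_MHZ  # 10.5


def core_clock_mhz(refclock_mhz):
    """Core clock that results from a given reference clock."""
    return refclock_mhz * CORE_MULTIPLIER


# Print the core clock for the refclocks discussed in this thread.
for refclock in (230, 232, 234, 237, 240):
    print(f"refclock {refclock} MHz -> core {core_clock_mhz(refclock):.0f} MHz")
```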

System: H8QGi-+f w/ 4x 6172 w/ CM 212s
Memory is the recommended 16 sticks - Crucial (Yellow ones) 1600 Cas-8
PSU : Corsair AX1200

Running at 233 - 234, I would get a crash reboot back to the login prompt.

While attempting 235 - 237, I would get a crash reboot and a hang at BIOS boot with the following:

Code:

Node: Bank interleave requested but not enabled
Node0: HT Link SYNC Error
Node1: HT Link SYNC Error
Node2: HT Link SYNC Error
Node3: HT Link SYNC Error
Node4: HT Link SYNC Error
Node5: HT Link SYNC Error
Node6: HT Link SYNC Error
Node7: HT Link SYNC Error
Press F1 to Resume

Using the non-Appliance install, I also saw the following at the console (Ctrl+Alt+F1). <Side question: is there any way to go back to the GUI from the console?>

Code:

I Ignored bunches of Stack / Call Traces
INFO: rcu_bh detected stall on CPU 19 (t=0 jiffies)
[Hardware Error]: CPU 22: MC0_STATUS[-|CE|-|AddrV|CECC]: 0x946e4000da000145
[Hardware Error]: Data Cache Error: Data/Tag DWR error.
[Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR
INFO: rcu_bh detected stall on CPU 23 (t=0 jiffies)

horde@amd-6172:~$ fahdiag | pastebinit
http://paste.ubuntu.com/5879518/

horde@amd-6172:~/fah$ sudo tpc -dram |pastebinit
http://paste.ubuntu.com/5879528/

horde@amd-6172:~$ sudo egrep -i 'mcelog|machine.check|hardware.error' $(ls -rt /var/log/{messages,syslog}*) |pastebinit
ls: cannot access /var/log/messages*: No such file or directory
http://paste.ubuntu.com/5879521/

PS> I didn't want to clutter the guide with this stuff; if it turns out to be useful we can link it.

tear · Jul 16, 2013

"Bank interleaving requested but not enabled" is effectively informational -- tells you that
you're using single rank DIMMs (which is completely ok).

HT Link Sync Error OTOH indicates HT troubles. Try alternative tuning.
If that doesn't help (specifically with HT link sync errors) then you should consider cleaning
CPU pads if you haven't done so before installation.

Spontaneous reboots *could* indicate memory issues. I'd recommend running a pass of
memtest86* at the failing OC (233/234). If it fails, try relaxing memory timings (advanced
option in ocng-cu) _or_ enabling 'allow speeds > DDR3-1333' (ditto).

*) it should already be installed in the appliance and available in GRUB menu

Zero-jiffies long stalls are *generally* harmless (even though something funky is happening
in the CPU).

The L1 cache error was corrected (CECC), but it is telling you that you're pushing that CPU.
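As a rough illustration of where "corrected" comes from, the sketch below pulls the flag bits out of the status value in the console output. The bit positions are an assumption based on the usual machine-check status layout on these AMD parts (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, CECC=46, UECC=45); check them against AMD's documentation before relying on them.

```python
# Status value copied from the console output earlier in the thread.
STATUS = 0x946E4000DA000145

# Assumed bit positions (not taken from this thread).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # uncorrectable ECC error
}


def decode_mc_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


flags = decode_mc_status(STATUS)
print("set flags:", ", ".join(name for name, is_set in flags.items() if is_set))
```

Under those assumptions UC is clear and CECC is set, which agrees with the `CE` and `CECC` markers the kernel printed.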

Also, interesting how CPUs 19, 22 and 23 (vide issues from your console) are all on the
same socket (socket 2).
Sockets on a Gi, in order of power regulation quality (best to worst): 4-1-3-2.
That kinda suggests that CPU in socket 2 is "struggling". Could it be responsible for your
spontaneous reboots? Too early to tell. It does make sense to try to eliminate HT and
memory first, however.
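A small sketch of the CPU-number-to-socket reasoning, assuming Linux enumerates cores linearly, each node has 6 cores, and each socket holds 2 consecutive nodes (the temperature table later in the thread shows 8 nodes of 6 cores). Whether the resulting socket index matches the board's silkscreen numbering is also an assumption.

```python
# Assumed topology for 4x Opteron 6172: 6 cores per node, 2 nodes per socket,
# cores numbered linearly from 0. Function name is illustrative only.
CORES_PER_NODE = 6
NODES_PER_SOCKET = 2


def locate_cpu(cpu):
    """Return (socket, node) for a logical CPU number; sockets count from 1."""
    node = cpu // CORES_PER_NODE
    socket = node // NODES_PER_SOCKET + 1
    return socket, node


# The three CPUs named in the console messages.
for cpu in (19, 22, 23):
    socket, node = locate_cpu(cpu)
    print(f"CPU {cpu}: socket {socket}, node {node}")
```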

To go back to GUI, press Alt+F7 (or otherwise sweep over Alt+Fx keys, one of them should work).

In any event, I consider 230 already a _very_ healthy OC on 6172/4 chips.

bowlinra · Jul 16, 2013

I only get the HT errors following a spontaneous reboot, and they all clear with an additional reboot. I'll finish up the current WU and get memtest86 started. 232 seems to be stable; what would you suggest as the next speed?

tear · Jul 16, 2013

Ok, I was under the impression that you witnessed spontaneous reboots that were _not_ accompanied by HT
link sync errors. That's a false impression, basically?

If so, then forget memory suspicion -- enable alternative HT tuning, see how that affects the issue.

m33pm33p · Jul 16, 2013

I've had problems like this before with my first set of 6166s. I was unable to get ANYTHING over 230 without the system crashing at the login prompt and receiving HT retries. Tear and I worked on it for a few days and came to the conclusion that it just wasn't going to clock that high.

I now have 72s and am faced with a similar problem. No crashes at the login prompt, but anything over 220 and the WU will fail.

If everything else checks out, that may just be your end-of-the-road OC. I've got to constantly tell myself, "OCing is not a given, it's a possibility." Each chip is different; that socket 2 CPU may just have come from a weaker batch.

If you're getting 230, I'll trade you my 220 72s?

musky · Jul 16, 2013

I still have a set of 6180s with a top O/C of 218 base clock. The best chips I ever had would do 225 base clock. I don't think I would cry too much @ 230...

Deleted member 12106 · Jul 16, 2013

m33pm33p said:
I've had problems like this before with my first set of 6166s. I was unable to get ANYTHING over 230 without the system crashing at the login prompt and receiving HT retries. Tear and I worked on it for a few days and came to the conclusion that it just wasn't going to clock that high.

I now have 72s and am faced with a similar problem. No crashes at the login prompt, but anything over 220 and the WU will fail.

If everything else checks out, that may just be your end-of-the-road OC. I've got to constantly tell myself, "OCing is not a given, it's a possibility." Each chip is different; that socket 2 CPU may just have come from a weaker batch.

If you're getting 230, I'll trade you my 220 72s?

Try a different mobo

m33pm33p · Jul 16, 2013

sc0tty8 said:
Try a different mobo

Sure, I'll trade you?

Deleted member 12106 · Jul 16, 2013

m33pm33p said:
Sure, I'll trade you?

Sure, money for board works for me

m33pm33p · Jul 16, 2013

sc0tty8 said:
Sure, money for board works for me

My board + $25?

Anywho OP... I think what everyone is saying is that's not a bad OC in the slightest. That may just be what you're stuck with.

tear · Jul 16, 2013

For the record, ballpark OC does depend on chip 'class'.

To me, a *good* OC is:
6164 HE/6166 HE: 250
6168/6172/6174: 230
6176 SE/6180 SE: 225
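To put those ballpark figures in perspective, here is a minimal sketch that expresses each one as a percentage over stock. It assumes 200 MHz is the stock reference clock (the low end of the 200-262 range in the ocng-cu prompt quoted later in the thread); the helper names are invented for this example.

```python
# Assumption: 200 MHz is the stock reference clock.
STOCK_REFCLOCK_MHZ = 200

# Ballpark "good" refclocks per chip class, copied from the list above.
GOOD_REFCLOCK_MHZ = {
    "6164 HE": 250, "6166 HE": 250,
    "6168": 230, "6172": 230, "6174": 230,
    "6176 SE": 225, "6180 SE": 225,
}


def overclock_percent(refclock_mhz):
    """Overclock relative to the stock reference clock, in percent."""
    return (refclock_mhz - STOCK_REFCLOCK_MHZ) / STOCK_REFCLOCK_MHZ * 100


def is_good_oc(model, refclock_mhz):
    """True if refclock_mhz reaches the ballpark figure for this model."""
    return refclock_mhz >= GOOD_REFCLOCK_MHZ[model]


for model, refclock in GOOD_REFCLOCK_MHZ.items():
    print(f"{model:8s} {refclock} MHz (+{overclock_percent(refclock):.1f}%)")
```

By this yardstick, the 232 reported as stable in the opening post already clears the 6172 mark.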

Regardless, it makes sense to exhaust all options before giving up

bowlinra · Jul 16, 2013

Tear> Thanks for all the OC reference numbers. I didn't realize I'm already at the top end.

If it ain't broke, I haven't fixed it enough. So I'll push on, as long as we don't feel it's a waste of time.

And to be overly clear, this is what I was seeing on the box:
1. Box folding at 233 - 237 would run for 10 mins to an hour, then HFM would lose connection.
2. I would find the box paused in the BIOS POST with the above HT Link SYNC Errors.
3. Pressing F1 would finish the BIOS POST and launch Ubuntu.
4. I'd either lower the OC or just reboot.
5. BIOS & Ubuntu would POST & boot without errors.
6. Start folding again.

Status update:
I decided to attempt 234 and run memtest. I did see one error on test 4, so I enabled 'allow speeds > DDR3-1333' and am retesting. Currently running.

tear · Jul 17, 2013

Got it.

To recap, it seems there are two separate items to tackle: memory and HT.
Once the memory item is addressed you will likely need to look at HT.

Also, side question, do you remember getting any HT retries while running at 233+?

runs2far · Jul 17, 2013

What BIOS are you running?

bowlinra · Jul 17, 2013

Running the G60NG4.A11 BIOS.

I haven't seen any ht-retries counts. Nor am I getting anything from the Console (Ctrl+Alt+F1) because of the reboot.

Attempted 234 w/ 'allow speeds > DDR3-1333' enabled: failed, back to #2.
Attempted 234 w/ 'allow speeds > DDR3-1333' & alternative HT tuning enabled: failed, back to #2.

> #2. I would find the box paused in the BIOS post with the above HT Link SYNC Errors.

Currently attempting 232 w/ 'allow speeds > DDR3-1333' & alternative HT tuning enabled.

horde@amd-6172:~$ sudo tpc -dram |pastebinit
http://paste.ubuntu.com/5886148/

horde@amd-6172:~$ sudo zgrep -Ei 'mcelog|machine.check|hardware.error|detected.stall|lockup|oops:' $(ls -rt /var/log/{messages,syslog}* 2> /dev/null) |pastebinit
http://paste.ubuntu.com/5886150/
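For anyone who would rather script this check, here is a rough Python equivalent of the zgrep one-liner above. It is only a sketch: the regular expression is copied from the command, while the function names and the choice to skip unreadable files are assumptions made for the example.

```python
import glob
import gzip
import os
import re
import sys

# Same alternation as the zgrep command, matched case-insensitively.
PATTERN = re.compile(
    r"mcelog|machine.check|hardware.error|detected.stall|lockup|oops:",
    re.IGNORECASE,
)


def open_log(path):
    """Open a plain or gzip-compressed log file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def matching_lines(paths):
    """Yield 'path:line' for every matching line, oldest file first."""
    for path in sorted(paths, key=os.path.getmtime):
        try:
            with open_log(path) as handle:
                for line in handle:
                    if PATTERN.search(line):
                        yield f"{path}:{line.rstrip()}"
        except OSError as err:
            # Unreadable or corrupt files are reported and skipped.
            print(f"skipping {path}: {err}", file=sys.stderr)


if __name__ == "__main__":
    logs = glob.glob("/var/log/messages*") + glob.glob("/var/log/syslog*")
    for hit in matching_lines(logs):
        print(hit)
```

Like the original command, it usually needs to run with sudo so the rotated logs are readable.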

tear · Jul 17, 2013

OK... I've got an idea worth considering (no matter the outcome of your current test).
-- going 240 with allowed speeds > DDR3-1333 and trying both (normal and alternative) HT tunes.

If this also fails with #2 (HT issue) then I think there are two options left:
- settling with last stable OC configuration
- cleaning CPU pads with alcohol and (disclaimer: a gamble) *possibly* swapping CPU2 with CPU4

A question related to CPU swap: can you paste nominal CPU voltages?

Code:

sudo tpc -l | grep 0.pstate.0 | pastebinit

bowlinra · Jul 17, 2013

I'll give 240 a shot.

I also just happen to have another 6172 sitting on the shelf that I could bring into the mix. Another interesting point: CPU4 has a broken register, so I don't know the effects (hence the extra CPU).

horde@amd-6172:~$ sudo tpc -l | grep 0.pstate.0 | pastebinit
http://paste.ubuntu.com/5886232/

Interesting result; I would have thought all the voltages would be the same.

Temps too:

Code:

Temperature table:
Node 0  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38
Node 1  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37
Node 2  C0:40   C1:40   C2:40   C3:40   C4:40   C5:40
Node 3  C0:39   C1:39   C2:39   C3:39   C4:39   C5:39
Node 4  C0:44   C1:44   C2:44   C3:44   C4:44   C5:44
Node 5  C0:44   C1:44   C2:44   C3:44   C4:44   C5:44
Node 6  C0:40   C1:40   C2:40   C3:40   C4:40   C5:40
Node 7  C0:38   C1:38   C2:38   C3:38   C4:38   C5:38

Assuming these settings:

Code:

Detected board: 'H8QG6'

Reset OCNG configuration to defaults (no/yes) [no]?
Reference clock (200-262) [232]? 240
Configure advanced options (no/yes) [yes]?
    Perform multiple refclock sets (not recommended) (no/yes) [no]?
    Configure refclock on warm reset only (not recommended) (no/yes) [no]?
    Dynamic HT configuration (not recommended) (no/yes) [no]?
    Prevent use of XMP profile 1 (no/yes) [no]?
    Allow effective memory frequencies above DDR3-1333 (no/yes) [yes]?
    Allow unsafe CAS latencies (not recommended) (no/yes) [no]?
    Force 1.5V DIMM Vdd (DANGEROUS) (no/yes) [no]?
    Use alternative HT tuning (no/yes) [yes]?
    Relax memory timings (no/yes) [no]?

Configuration stored successfully.
To ensure proper application, POWER-OFF the machine, then power it back on again.

tear · Jul 17, 2013

Re ocng-cu settings -- yup, they look ok
Could also flip HT tuning (and retest) just to cover both cases.

Semi-random comments:
HT-related issues have been so rare that we haven't really dug into getting
information on the link that causes the HT link sync flood (reset) == can't
really suggest looking into any specific CPU, unfortunately.
General recommendation has been: clean pads of all CPUs and cross your
fingers (if alternative tuning doesn't help and if no HT-retries are reported
before the crash).

Re CPUs in general
AMD does fine-binning of CPUs -- different nominal voltages are common.

At a very high level, one could say that better silicon ends up with lower nominal voltage
but don't make it a general rule.

With retail CPUs the best socket population strategy we have boils down to determining
maximal refclock of each CPU while in socket 1 and while trying to maintain ~constant
temperature of the chip/ambient.

Once order of CPUs (strongest to weakest) is determined, you populate the CPUs
by matching strongest CPU with weakest socket, weakest CPU w/strongest socket and
so on... (socket order on Gi/G6: 4-1-3-2).
Then your initial refclock is the lowest common one from the tests (i.e., the lowest of the individual maximums).

This is extremely laborious and I'm not sure if you want to go there...
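The population strategy described above can be sketched as follows. The per-CPU maximum refclocks are made-up numbers for illustration, and the socket list assumes 4-1-3-2 runs from strongest to weakest power regulation.

```python
# Socket order on Gi/G6 from the post above, assumed strongest to weakest.
SOCKETS_BEST_TO_WORST = [4, 1, 3, 2]


def plan_population(max_refclock_by_cpu):
    """Pair the weakest CPU with the strongest socket, and so on.

    Returns (placement, initial_refclock): placement maps socket number to
    CPU label, and initial_refclock is the lowest individual maximum.
    """
    if len(max_refclock_by_cpu) != len(SOCKETS_BEST_TO_WORST):
        raise ValueError("expected exactly one test result per socket")
    weakest_first = sorted(max_refclock_by_cpu, key=max_refclock_by_cpu.get)
    placement = dict(zip(SOCKETS_BEST_TO_WORST, weakest_first))
    initial_refclock = min(max_refclock_by_cpu.values())
    return placement, initial_refclock


# Hypothetical results from testing each CPU alone in socket 1.
results = {"cpu_a": 238, "cpu_b": 233, "cpu_c": 236, "cpu_d": 240}
placement, initial = plan_population(results)
for socket in sorted(placement):
    print(f"socket {socket}: {placement[socket]}")
print(f"start at refclock {initial}")
```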

Broken/missing cap array (that's what I think it is -- http://en.wikipedia.org/wiki/Decoupling_capacitor)
affects voltage regulation and could affect the ability to OC. I'm not an EE, so don't quote me on that.

bowlinra · Jul 18, 2013

Attempted 240 w/ 'allow speeds > DDR3-1333' enabled & alternative HT tuning enabled: failed, back to #2.
Attempted 240 w/ 'allow speeds > DDR3-1333' enabled & alternative HT tuning disabled: failed, back to #2.
Attempted 232 w/ 'allow speeds > DDR3-1333' enabled & alternative HT tuning disabled: failed, back to #2.
Attempted 232 w/ 'allow speeds > DDR3-1333' enabled & alternative HT tuning enabled: currently running for just over 45 mins (longest of the group).

> #2. I would find the box paused in the BIOS post with the above HT Link SYNC Errors.

Both 240 runs spontaneously rebooted as the FAH client was resuming from checkpoint. The first 232 run seemed to make it ~20 mins.

Nothing new in the logs either.
horde@amd-6172:~/fah$ sudo zgrep -Ei 'mcelog|machine.check|hardware.error|detected.stall|lockup|oops:' $(ls -rt /var/log/{messages,syslog}* 2> /dev/null) |pastebinit
http://paste.ubuntu.com/5886487/

tear · Jul 18, 2013

What I'm finding interesting is that time-to-failure appears to depend on refclock* even though it shouldn't
(at least not that much) as OCNG adjusts HT multi so HT always stays < 3 GHz...

*) the higher the OC the quicker it dies -- correct me if I'm wrong
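To illustrate the point about the HT multiplier, here is a minimal sketch that picks the highest multiplier keeping the HT clock under 3 GHz. The integer candidate multipliers and the selection rule are assumptions for the example, not OCNG's actual logic.

```python
HT_LIMIT_MHZ = 3000  # "HT always stays < 3 GHz", per the post above


def pick_ht_multiplier(refclock_mhz, candidates=range(1, 17)):
    """Largest candidate multiplier whose HT clock stays below the limit.

    The candidate set (integers 1..16) is hypothetical.
    """
    return max(m for m in candidates if refclock_mhz * m < HT_LIMIT_MHZ)


# Show the resulting HT clock for the refclocks tried in this thread.
for refclock in (232, 234, 237, 240):
    multi = pick_ht_multiplier(refclock)
    print(f"refclock {refclock}: HT multi {multi} -> {refclock * multi} MHz")
```

Under these assumptions the multiplier stays the same from 232 through 240, so the HT clock moves by only about 100 MHz across the whole range tested.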

Hmm, there is one more thing I'd like to try if you can stop by IRC (fewer forum round-trips) one evening

bowlinra · Jul 18, 2013

Yes, the higher the speed, the faster the collapse / reboot, as I'm seeing it. I'm an IRC noob; can someone post the links and information to connect?

Folding now 6+ hrs

Code:

Clockspeed (OCNG4.3)
Family 10h
Refclock: 231.954 MHz
Clockspeed: 2435.518 MHz

runs2far · Jul 18, 2013

Windows:
get nettalk: http://www.ntalk.de/Nettalk/en/
add a server with the following setup:
IP or server name: irc.freenode.net
leave the rest at default and press next.
pick a nick-name, no passwords or any other fun and press next.
Put this in the "enter the following in commands after login" box:
/join #hardfolding
And you are done, finish and double click the new server to join the IRC fun.

Ubuntu/most distros
install irssi
start irssi in a console window
type the following commands in irssi
/connect irc.freenode.net
/nick pickANick
/join #hardfolding

You can configure irssi with a well fleshed-out config file (~/.irssi/config) in your home folder, but I am too lazy to describe that here.
If you want to be on IRC for ever and ever, start a screen session and run irssi in that.
PM me if you want more IRC stuff.

bowlinra · Jul 18, 2013

Thanks for the instructions! Worked like a champ!

m33pm33p · Jul 19, 2013

bowlinra said:
Thanks for the instructions! Worked like a champ!

What did you end up getting the rig stable at?

scubadiver59 · Jul 20, 2013

m33pm33p said:
What did you end up getting the rig stable at?

Yes, what did you get stable at, or what fix did you apply that allowed you to fold w/o reboots?

I bought a 6166HE turnkey rig off of Core32 and he said he had it up to 258, and that it was folding perfectly with no HT errors; however, when I tried 258 I ended up with multiple reboots and the dreaded "HT Link SYNC Error" in the opening post almost immediately when running a fold.

I cleaned my pads before installing, the latest BIOS is installed, but I gave up on the reboots and came down from Core32's 258 and I'm folding at stock speed until I'm sure everything is working okay. Once this WU is complete, I'll start at tear's recommended 250 and work up from there.

I guess I'm rather lucky that my 6176SEs went up to 238, from tear's acceptable 225, but something is wrong here with my HEs... at least according to Core32's previously stated numbers.

bowlinra · Jul 21, 2013

Still working on it. Had an IRC session with tear and tried a bunch of things. The next step would be to swap CPU2 and CPU4 and see if that does anything. It may or may not. I've backed down to running at 230 (~2.41 GHz) and am getting some WUs done.

I had my 6166HE set up to 257 and just didn't want to push it any further at the time. As the WUs changed, I got a few hangs and temperatures got hotter, so I backed it down to 255 or 253 for more stability.

Here is what I was using in the /etc/rc.local for the vcore voltage on those 6166 OCs

Code:

tpc -set vcore 1.0500 && sleep 1 && tpc -fo 1 && sleep 1 && tpc -fo 0

I didn't put a lot of research into it, or even try to tune it. It was my first 4P setup and I was very happy it was just working!

tear · Jul 22, 2013

A bit OT. One needs to crank volts to maximum allowable on a HE system w/high overclocks.

Yes, 1.05 is a typical maximum but, in case it isn't, one should first determine maximum possible voltage
by means of tpc -l.

runs2far · Jul 22, 2013

What limits the voltage on these boards?

tear · Jul 22, 2013

CPUs.
Retail HEs have some voltage headroom which we are taking advantage of when overclocking.

scubadiver59 · Jul 22, 2013

I finally got the information I needed from Core32--I bought one of his boards--and I have been stable since then.

I upped the vcore to 1.05 across the chips, following the correct steps posted elsewhere, which Core32 pointed me to, and I settled in on his previous 258, that he told me he had been folding at w/o HT Retries.

Two of my chips are 1.075 and two are at 1.05. I have four other chips, for another board, so I may look and see if two of those are 1.075 and then swap them out so I have all the same max vcore on one board. Then I can see how far the 1.075 will let me go.

Of course I may go up to 260+ anyway, as Core32 said he had the board there at one time, and see what happens, but for now I'm satisfied since there are no more reboots and HT Sync errors.

Now I have a 2P Asus board with some 6124HE's I need to play with...

tear · Jul 22, 2013

Having higher vcore limits (within given bin) _usually_ implies _worse_ silicon, not better.

Ideally, each CPU should be tested individually, then all in set.

Deleted member 12106 · Jul 22, 2013

tear said:
Having higher vcore limits (within given bin) _usually_ implies _worse_ silicon, not better.

Ideally, each CPU should be tested individually, then all in set.

It should be noted that this method, while it takes time, works exceptionally well. This is the sniper rifle approach.

The shotgun approach is to give voltage until no errors are present. This may also be referred to as the "horseshoes or hand grenades" approach (close enough).

scubadiver59 · Jul 22, 2013

Points taken gentlemen...that's what I have my Asus 2P for--testing.

The 258 setting is running solidly now and I'll leave it there until I feel like shooting for higher!

Thanks to both of you!

tear said:
Having higher vcore limits (within given bin) _usually_ implies _worse_ silicon, not better.

Ideally, each CPU should be tested individually, then all in set.

sc0tty8 said:
It should be noted that this method, while it takes time, works exceptionally well. This is the sniper rifle approach.

The shotgun approach is to give voltage until no errors are present. This may also be referred to as the "horseshoes or hand grenades" approach (close enough).

Deleted member 12106 · Jul 22, 2013

Do note that your testing starts over, as it won't scale to 4P, if that's the testing you're doing.

scubadiver59 said:
Points taken gentlemen...that's what I have my Asus 2P for--testing.

The 258 setting is running solidly now and I'll leave it there until I feel like shooting for higher!

Thanks to both of you!

scubadiver59 · Jul 22, 2013

sc0tty8 said:
Do note that your testing starts over, as it won't scale to 4P, if that's the testing you're doing.

Oh no...just referring to the checking of the max vcore's on each of my other chips in the 2P.

I understand where tear is coming from with regards to the 1.05 v 1.075 differences, where the 1.05 is the more desirable of the two.

I'll be checking my other chips this weekend...
