Supermicro H8QGi/6 and H8QGL Next Generation OC BIOS

Status
Not open for further replies.
Thanks a lot. :)

And does it add up retries?

I had this idea too, if I'm getting you right. I know ht-retries is giving you cumulative number but it does so for each node and you have to look at the whole table.

Would it be possible for it to give a sum of all ht retries first (so that you know if there are _any_ ht retries), then list the table (so that you know where those ht retries happened)? Just an idea, if it's easy to make :)
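
To make the idea concrete, here is a rough Python sketch. It assumes the per-node counts have already been read into a dict; how the real ht-retries tool collects or prints them is not shown in this thread, so the input shape and the function names are made up for illustration.

```python
def summarize_ht_retries(per_node):
    """Return (total, rows) for a mapping of node id -> cumulative HT retries.

    The dict input is an assumption; the actual tool's data format is unknown.
    """
    total = sum(per_node.values())
    rows = sorted(per_node.items())
    return total, rows


def format_report(per_node):
    """Print-friendly report: the grand total first, then the per-node table."""
    total, rows = summarize_ht_retries(per_node)
    lines = [f"Total HT retries: {total}"]
    lines += [f"  node {node}: {count}" for node, count in rows]
    return "\n".join(lines)
```

With the total on the first line, a zero there means the table can be skipped entirely.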

In the meantime, I've been folding successfully at 251 since last night. Zero ht retries. I bumped up the voltage to 1.025 just to be on the safe side. At these settings, my bench WU had a TPF of 13:42, but the other two were at around 13:48. I'm trying to find out if the voltage supplied is too little or too much.

Your magic is working, tear! :D Thanks for everything!
 
I have also been folding since last night with my H8QGL and four 6172s at refclock=240 (which means the CPU frequency is 2520 MHz), and I have a TPF of 18:00 on a 6904 WU with zero retries.

When this WU is finished, I will try refclock=245 and so on.

For the time being I have limited the CPU voltage to 1.07 V because CPU temperatures are a bit high: 50-55°C.
 
Well, if I did this right, I am OC'd from 1900 to 2.5? I bought both CPUs and they were listed as 6168s (1900 MHz); one is a 6174 according to the stock BIOS (2200). When I installed the OC BIOS, it showed each one at 2200 in the BIOS. With the 15% OC, I am showing 2530. Not sure if this is one or both.
 
I think you are charting waters no one has gone before :eek:
 
Does anyone have H8QGL overclocking results?
Which motherboard runs at a higher bus speed? :confused:

Also, what can you do with the HT connect slot on this new board? Can you connect two 4P boards together to make an 8P system? :eek:
 
Well, if I did this right, I am OC'd from 1900 to 2.5? I bought both CPUs and they were listed as 6168s (1900 MHz); one is a 6174 according to the stock BIOS (2200). When I installed the OC BIOS, it showed each one at 2200 in the BIOS. With the 15% OC, I am showing 2530. Not sure if this is one or both.

From what I came to understand, if you put a 6168 and 6174 in the same system, they work at 6168's speed. I'm not sure but it makes sense too. I don't know why the OC BIOS is reporting both at 2200 though... :confused:
 
It's most likely a cosmetic bug.

sc0tty8 checked tpc -l and it gave 1900 (nominal frequency of P-state 0) across the board.

That said, the 6174 is running at the 6168's multiplier, so the pair is operating (with refclock 240) at 1900*240/200 = 2280 MHz.

EDIT: no idea where it may be taking 2200/2640 from...
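
For anyone who wants to redo this arithmetic for their own chips, a minimal Python sketch follows. It assumes the stock reference clock is 200 MHz (as in the formula above) and that the effective clock scales linearly with refclock; the function names are made up for illustration.

```python
BASE_REFCLOCK_MHZ = 200  # stock reference clock assumed by the formula above


def effective_mhz(nominal_mhz, refclock_mhz):
    """Effective core clock: nominal frequency scaled by refclock / 200."""
    return nominal_mhz * refclock_mhz / BASE_REFCLOCK_MHZ


def overclock_percent(refclock_mhz):
    """Overclock relative to the stock 200 MHz reference clock, in percent."""
    return (refclock_mhz - BASE_REFCLOCK_MHZ) * 100 / BASE_REFCLOCK_MHZ


print(effective_mhz(1900, 240))  # 2280.0
print(effective_mhz(2200, 240))  # 2640.0
```

The second result is just the 6174's nominal 2200 scaled the same way, so the 2200/2640 readout looks like the same formula applied to the wrong nominal frequency, though that is only a guess.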
 
Does anyone have H8QGL overclocking results?
Which motherboard runs at a higher bus speed? :confused:

I'm not sure if any one of the models is superior to others, I don't think so. It's all case-by-case, depending on your particular motherboard-CPU combo.

I have an H8QGL-iF+, the cheapest 4P model out there, and I have the 6166HE's. Now I'm folding at 253, zero ht-retries since morning. I bumped the voltage up to 1.0375. Temps as reported by TPC are 47-52. Power draw reported by the UPS is around 650W. Average TPF on 6903 is 13:38.
 
I'm sorry if I'm posting too often here but this is so awesome, I have to share :)
I'm at 257 now, still with 1.0375 Vcore, still no ht retries.
That's freakin' 2.3GHz with 6166HE's! :D
TPF is 13:29 on 6903, power draw is around 650-660W.

No, this is not an April Fools' joke :p
 
OC-ing the heck out of things is the whole reason [H]ard|OCP exists ;)

Fold on!
 
Hard lockup again at 215 with my 6176s on a 6904; it did, however, fold a complete 6903 first... odd.
Back to 208, at which the timings on my RAM are now 6-6-5-17 at 1066; so far 20% of the 6904 and still going.

Really wish I knew for sure what my problem is, probably the RAM... but I don't feel like buying a third set for this rig.
 
In 16-DIMM config you can mix and match any way you want.

Have you tried running FAH in non-XMP config or running memtest (in XMP config) yet?
 
Anyone else have trouble getting into the BIOS on a GL? It boots to Ubuntu OK, but when I bang the Del or F1 key after the [H] boot logo it just goes to a blank screen. I did use the hotkey (F9) to load the optimal defaults; I just read through the thread and saw the post recommending not to do that....

I just wanted to check the clock speed in the bios to see if it "Took".

Will it always require a complete shutdown? Or is that only for when one sets a new clock speed?

Thanks
 
dfonda said:
Anyone else have trouble getting into the BIOS on a GL? It boots to Ubuntu OK, but when I bang the Del or F1 key after the [H] boot logo it just goes to a blank screen.
That's odd... hasn't been seen before...

dfonda said:
I did use the hotkey (F9) to load the optimal defaults, I just read through the thread and saw the post recommending not to do that....
Lies -- I didn't know what I was talking about. F9 is fine.

dfonda said:
I just wanted to check the clock speed in the bios to see if it "Took".
You will not find the net clock speed in there. 'dmesg | grep -o Detected.*' is the best source of data atm.
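
If you would rather pull the number out in a script, here is a small Python sketch. It assumes the kernel line has the form "Detected NNNN.NNN MHz processor", which is what the grep above matches; the function name is made up for illustration.

```python
import re

# Matches the kernel boot message that the grep above looks for.
DETECTED_RE = re.compile(r"Detected (\d+(?:\.\d+)?) MHz processor")


def detected_mhz(dmesg_text):
    """Return every 'Detected ... MHz processor' value found in dmesg output."""
    return [float(value) for value in DETECTED_RE.findall(dmesg_text)]
```

Feed it the saved output of dmesg as a string; an empty list means the line has already scrolled out of the kernel log buffer or the format differs on your kernel.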

dfonda said:
Will it always require a complete shutdown? Or is that only for when one sets a new clock speed?
Many GLs are experiencing the 'reboot' problem == board unable to reboot after refclock application and after power-cycle. You may or may not be affected by it.

Fix is planned but no research has been done due to other priorities.


HTH!
tear
 
I'm sorry if I'm posting too often here but this is so awesome, I have to share :)
I'm at 257 now, still with 1.0375 Vcore, still no ht retries.
That's freakin' 2.3GHz with 6166HE's! :D
TPF is 13:29 on 6903, power draw is around 650-660W.

No, this is not an April Fools' joke :p

Those chips are freaks! I am having trouble getting over 220. I am still happy though; getting 1.98GHz out of 1.8GHz is nice. Making about 16m20s TPF on 6903.
 
Do you feel like troubleshooting it further, 402?
 
Those chips are freaks! I am having trouble getting over 220. I am still happy though; getting 1.98GHz out of 1.8GHz is nice. Making about 16m20s TPF on 6903.

I know! :D
To see the effect of voltage, I reduced it to 1.025 and the TPF went up to 13:40's, then I bumped it up to 1.05, the TPF went down to 13:24.
Given that the max VID is 1.05 for these chips, I feel like I'm getting to the point (if I'm not already there) beyond which there won't be any performance improvement.

I'll start trying 259 with 1.05, and see what happens... :)
 
Don't assume a production efficiency, positive or negative, based on CPU vCore. The 16-second variation you document in your post above can be observed simply by rebooting the computer or stopping/restarting the Folding client.

Could core voltage, in and of itself, make a difference? I suppose yes, but it would seem to me that either the cores function properly or they don't, nothing in between.
 
I agree that the difference could easily be a fluke or simply a variation within normal limits. However, I also came to believe from my OC'ing experience with Intel chips that there is always a sweet spot for core voltage. Not exact science at all, but my gut feeling is that vcore in this case made a difference :)
 
Linden said:
Don't assume a production efficiency, positive or negative, based on CPU vCore. The 16-second variation you document in your post above can be observed simply by rebooting the computer or stopping/restarting the Folding client.

Yes, those are exclusively GROMACS variations. If clocks are there, they are there.

Single-shot-resumed-from-checkpoint data points are virtually useless and only create noise.
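
To compare settings on more than a single data point, something like the Python sketch below could be used. It only converts mm:ss TPF strings to seconds and reports the mean and sample standard deviation; the function names are made up, and deciding how many frames count as enough is left to the reader.

```python
import statistics


def tpf_to_seconds(tpf):
    """Convert a 'mm:ss' time-per-frame string to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def mean_and_spread(tpfs):
    """Mean and sample standard deviation (in seconds) of several TPF readings."""
    secs = [tpf_to_seconds(t) for t in tpfs]
    spread = statistics.stdev(secs) if len(secs) > 1 else 0.0
    return statistics.mean(secs), spread


# The two single readings quoted above differ by 16 seconds:
print(tpf_to_seconds("13:40") - tpf_to_seconds("13:24"))  # 16
```

If the difference between two settings is no larger than the spread seen within one setting, it is noise.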
 
Just more noise but....... ;)
Could it be that when he adjusts the vcore, then adjusts the clock multiplier, and powers off and back on, the [H] BIOS "tunes" differently and thus he might be getting tighter RAM timings?
 
No, I don't think it affects the BIOS at all. Vcore application is made after a boot. During the boot, the CPUs are at their stock voltages.

No, I agree that these fluctuations are pretty common in folding and my feeling above is not a conclusion derived from any scientific methodology. Maybe I just like the idea that I'm getting slightly better performance :)
 
Hey tear,
I'm getting errors when attempting to resume from a checkpoint. It happened with 257, I thought it was a fluke with that WU. I upped to 259 and I'm getting the same error with a different WU. I had no errors with 255. I stop the client as usual with Ctrl+C and it looks like it's exiting cleanly.

Do you think high refclock is causing some IO issues? Not being able to write to disk nicely, etc?

Could it mean it's not processing the WU correctly due to instability of CPUs?

Thanks for your input, as always! :)
 
Or you've just been unlucky and happened to run into the checkpoint write bug.
Code:
    CAUTION: FahCore_a5 is known to be problematic at user-induced shutdowns*.
             To be on a safe side make a backup of complete client directory
             before hitting Ctrl+C!

    To tell whether checkpoint was written correctly check the size
    of work/wudata_XX.ckp file (XX being current slot number).
    It should be 75160 (for core A5). If it's not -- better switch to backed up
    directory.

    *) http://foldingforum.org/viewtopic.php?f=55&t=17774

Best practice is _not_ experimenting on live WUs.
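
The size check from the note above can be scripted. This is only a sketch in Python: it assumes the slot number in the file name is zero-padded to two digits and that 75160 bytes is the right size for core A5, as stated in the note; the function name is made up.

```python
import os

EXPECTED_A5_CKP_SIZE = 75160  # bytes, per the note above (core A5 only)


def checkpoint_size_ok(client_dir, slot, expected=EXPECTED_A5_CKP_SIZE):
    """True if work/wudata_XX.ckp exists and has exactly the expected size.

    The two-digit zero padding of the slot number is an assumption.
    """
    path = os.path.join(client_dir, "work", f"wudata_{slot:02d}.ckp")
    return os.path.isfile(path) and os.path.getsize(path) == expected
```

If it returns False after a Ctrl+C, switch to the backed-up directory before restarting the client.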
 
That's a relief actually, thanks! I was not aware of such a bug. Is it just a coincidence that I got the bug twice in a row, on different WUs?

The ckp file is really smaller at 75140!

Is there a way to avoid this bug? Thanks again! :)
 
It is a coincidence.

Despite what you can read in the FF, there is no way to avoid this bug.
 
Yeah, after a few hours of controlled testing, I kinda showed myself that performance with vcore at 1.0375 and at 1.05 is equivalent. I'll take 1.0375 as it results in 20 W less power draw.
By the way, I've been doing this experiment at 260 :D It's doing just fine with zero ht-retries, and it survived several power-cycles. Before this, I tried 261 but was denied. So, this is the end of the road for me.
At 260, 2.34GHz, TPF on my bench 6903 WU is 13:16. Power draw is around 660-670W. Temps are in check at 48-52. I'll keep testing of course and make sure it's 100% stable; hopefully I won't have to back down.

That's a 30% OC, ladies and gents, and it's all thanks to tear, musky and others who helped develop this thing! You guys deserve a huge THANK YOU! :)
 
That's a relief actually, thanks! I was not aware of such a bug. Is it just a coincidence that I got the bug twice in a row, on different WUs?

The ckp file is really smaller at 75140!

Is there a way to avoid this bug? Thanks again! :)

Same problem here at 240, so I don't think it is a coincidence.
 
Run it stock, restart the WU 20 times (each time let it fold for 5+ minutes). Then let us know your findings :D
 
In other words, you are trying to add an overclock correlation that does not exist. If you really had I/O issues with any given overclock, you would have difficulty running the OS and the client would fail to write results and checkpoints all the time. That does not happen. What you are running into is stopping the client while it is writing checkpoints. That would happen with the same frequency at stock clock speeds. You see it more often when overclocking simply because you are stopping and starting the client more often while dialing in the overclock.

If you want some insurance against this happening, use the stop script from the backup script. It will create two backups before it kills the client. If you have a checkpoint issue, you simply need to restore one of the backups, restart the client, and you are set. You can remove the actual cron job if you don't want the automatic backups.
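
For readers without that script at hand, the general idea (copy the client directory before stopping the client) can be sketched in Python. This is not the backup script mentioned above, which is not reproduced in this thread; the function name, the directory naming, and the single-copy behaviour are all assumptions.

```python
import os
import shutil
import time


def backup_client_dir(client_dir, backup_root):
    """Copy the whole client directory to a timestamped folder and return its path.

    Hypothetical helper: it makes one copy and does not stop the client itself.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_root, f"fah-backup-{stamp}")
    shutil.copytree(client_dir, dest)  # raises if dest already exists
    return dest
```

Run it right before hitting Ctrl+C; to restore, stop the client and copy the backup folder back over the client directory.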
 
And if you're looking at your terminal window to determine when checkpoints are written, DON'T. The info isn't there. You have to go into the work folder and look at the time stamp on the checkpoint file. Then make sure the client isn't about to write a checkpoint before shutting down the client.
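
The time stamp check can be scripted too. The Python sketch below assumes you know the checkpoint interval configured in your client (it is passed in as interval_s) and treats the file's modification time as the moment of the last checkpoint; the function names and the 60-second margin are made up.

```python
import os
import time


def checkpoint_age_seconds(ckp_path, now=None):
    """Seconds elapsed since the checkpoint file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(ckp_path)


def safe_to_stop(ckp_path, interval_s, margin_s=60, now=None):
    """True if the last write is at least margin_s old and the next one is
    at least margin_s away, assuming checkpoints come every interval_s."""
    age = checkpoint_age_seconds(ckp_path, now)
    return margin_s <= age <= interval_s - margin_s
```

This only reduces the odds of stopping mid-write; keeping a backup of the client directory is still the safer habit.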
 
Thanks a lot, Tear and Musky.
I will try this when I have time. :eek:

Moreover, with refclock=245 my computer suddenly rebooted during folding (WU 6904, 22% completed) and stopped with this message: HT link sync error.
How is that possible when I never detected any HT retries?
 
I get this too. I think mine is related to the power supply. I get it at 220.

I'm going to pick another up this week.
 
Not thinking power supply on this one. I have seen this issue maybe once/twice early on with the new 6166HE rig.
I have a dual redundant 1400W PSU, 80 Plus Gold Certified. Should be plenty of power.
 
Quote: Moreover, with refclock=245 my computer suddenly rebooted during folding (WU 6904, 22% completed) and stopped with this message: HT link sync error.
How is that possible when I never detected any HT retries?

I get this too. I think mine is related to the power supply. I get it at 220.

I'm going to pick another up this week.

I believe core32 is correct here. I have also seen this when there were no HT retries showing while running, and I also have a 1400W redundant power supply. While it may be a problem of the CPUs not getting enough voltage, I do not believe it is caused by the power supply.

Speaking of which, how is the v-modding of the motherboard coming along, core32? Any progress?
 