Supermicro H8QGi/6 and H8QGL Next Generation OC BIOS

Status
Not open for further replies.
Hey tear,
I just flashed the new BIOS for GL and my 6166's show as 8 cores each. Not OC'ed yet, not used any sh script. Just flashed, rebooted, loaded defaults (f9), saved (f10), and rebooted again to setup.
Any thoughts, solutions?
Thanks man!
 
Oops. I must've uploaded shareware version by accident... :D
 
But really. No idea what F9 really loads. I explicitly pick Load Optimal Defaults.
Will double check later but for now avoid loading settings by means of a hotkey.

Also, per our IRC conversation; power-cycling the machine resolved this issue.

I revised instructions as well -- power-cycle the machine after BIOS flash.
 
I would also like to note that rebooting my two GL systems fails after I OC the CPUs (run smocng.sh with any value greater than 200). The systems never POST after a reboot. I have to turn the power off and back on to get the system to boot again.

This isn't an issue for me, but I wanted to note it in case others run into the problem.
 
I thought the instructions said to power cycle the machine and not to reboot it :confused:

After you run smocng.sh, you have to power cycle the system (power off and then back on). What I was pointing out is a reboot later on (after the initial power cycle from changing the OC) will not POST for me. Basically rebooting capability no longer works. If I run smocng.sh to put the system back to stock speeds, then rebooting works again.

I experience this issue on both of my GL systems. Tear has not been able to replicate the problem. I just wanted to point out this is a potential issue.
 
My 6166s ran successfully through the night at 230 :)
232 did not post :(
Now running 231, so far so good... It's possible that's where I'll end up.

Is it possible that the board is not liking some numbers like 232 but it would work at say 235?
 
Well done!

That's very unlikely. But feel free to try, say, 235...

Also, while at 231, can you show me output of tpc -htstatus ?
 
TPF increased from 230 to 231, from 14:55 to 15:30 :(
ht-retries is giving full zeroes though :confused:
 
A case of DLB not engaged perhaps?

tear said:
Also, while at 231, can you show me output of tpc -htstatus ?
^^^ please :)
 
It said DLB is engaging...
Here comes htstatus:

Hypertransport Status:
Node 0 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 1 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 2 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 3 Sublink 0 Bits=16 Coh=0 SpeedReg=11 (2000MHz)

Node 1 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 1 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 1 Link 1 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 1 Link 2 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 1 Link 3 Sublink 0 not connected

Node 2 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 2 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 2 Link 1 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 2 Link 2 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 2 Link 3 Sublink 0 not connected

Node 3 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 3 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 3 Link 1 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 3 Link 2 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 3 Link 3 Sublink 0 not connected

Node 4 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 4 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 4 Link 1 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 4 Link 2 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 4 Link 3 Sublink 0 not connected

Node 5 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 5 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 5 Link 1 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 5 Link 2 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 5 Link 3 Sublink 0 not connected

Node 6 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 6 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 6 Link 1 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 6 Link 2 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 6 Link 3 Sublink 0 not connected

Node 7 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 7 Link 0 Sublink 1 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 7 Link 1 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 7 Link 2 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 7 Link 3 Sublink 0 not connected
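
For anyone who wants to compare htstatus dumps between OC settings, here is a rough Python sketch that tallies link speeds. The regular expression is an assumption based only on the output as posted above; tpc may print other layouts on other systems. The names parse_htstatus and coherent_speeds are hypothetical helpers, not part of tpc.

```python
import re
from collections import Counter

# A few lines copied from the htstatus output above, used as sample input.
SAMPLE = """\
Node 0 Link 0 Sublink 0 Bits=8 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 1 Sublink 0 Bits=16 Coh=1 SpeedReg=13 (2400MHz)
Node 0 Link 3 Sublink 0 Bits=16 Coh=0 SpeedReg=11 (2000MHz)
Node 1 Link 3 Sublink 0 not connected
"""

# Assumed line layout, inferred from the posted output only.
LINK_RE = re.compile(
    r"Node (\d+) Link (\d+) Sublink (\d+) "
    r"Bits=(\d+) Coh=(\d) SpeedReg=(\d+) \((\d+)MHz\)"
)


def parse_htstatus(text):
    """Return one dict per connected link; 'not connected' lines are skipped."""
    links = []
    for line in text.splitlines():
        match = LINK_RE.match(line.strip())
        if match is None:
            continue
        node, link, sublink, bits, coh, speedreg, mhz = map(int, match.groups())
        links.append({
            "node": node, "link": link, "sublink": sublink,
            "bits": bits, "coherent": coh == 1,
            "speedreg": speedreg, "mhz": mhz,
        })
    return links


def coherent_speeds(links):
    """Count coherent (CPU-to-CPU) links by reported speed in MHz."""
    return Counter(l["mhz"] for l in links if l["coherent"])


if __name__ == "__main__":
    print(coherent_speeds(parse_htstatus(SAMPLE)))
```

In the dump above every coherent link reports 2400MHz; the only 2000MHz entry is the non-coherent link on Node 0.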
 
(thanks for htstatus)

Nope. Insufficient CPU supply voltage doesn't cause things to slow down... it causes things
to crash and burn (figuratively speaking).

Not sure what is up but I'd let it complete one unit. And, for sake of performance comparisons,
completely switch to dedicated bench unit (each time started from scratch, etc.)

I'd also double check that The Kraken is running ('top' showing thekraken-FahCo).

Some systems are known to give POST issues above 230 whereas others work fine -- currently
under investigation.
 
Being as naughty as I am, I went and increased the voltage to 1.05, and the TPF went to 14:51, from 15:24! :)
Temps increased by 4C though :(
I don't know if it's a glitch, but I'll let it run for the time being. I guess I'll have a better idea after a reboot tonight.
Thanks man! :cool:
 
Right, but you didn't perform the voltage/p-state change in-flight (mid-WU), or did you? (Yeah, I know I advised against doing so :)
 
I stopped the client, waited a few minutes, then set the vcore, and then restarted the client. This is fine, right?
 
Yes, it is.

My point was that, for proper experiment (verification of your claim), one should up supply
voltage while WU is running...

But anyway, just like you're saying, more data points will be gathered eventually.
 
Alright, kind of confusing, but I rebooted (and just as firedfly is describing, it failed to reboot, it just powered-off), and without playing with the voltages, the TPF went down to 14:49 (by the way, I am doing a controlled experiment with the WUs/frames).
So, it's possible, I think, that the 231 thing didn't actually engage somehow and it needed another power-cycle or a nudge of sorts. I'm very happy with the things now and the temps are nice at 43-46C.

I'll experiment with 235 tonight :D Can't help it!
 
My board seems to have maxed out with 235. I was running 240 for a day or so and everything seemed fine, but I started to notice I was getting memory errors. After the unit completed I queried the ht-retries and one of them was up to 160+ :( Backed it down a bit and it seems happier now. Still all 0's as of this morning, though my TPF increased from 19:45 to 20:30 or so (on 6128's).
 
How are your temperatures?
 
As long as the WUs aren't corrupted, isn't lower TPF what you want? Seems to me in your case I wouldn't worry about the ht-retries if your TPF is 45 seconds less.
 
Up to a certain number of retries that may hold true, but more retries mean less stability and a greater chance of crashing and/or screwing up a WU.
 
Drawing conclusions from single run at 230 followed by single run at 231 is something that
should be avoided.

Gryphon, this must have been a WU 'fluke'. Multithreaded GROMACS (esp. w/DLB)
is not a deterministic piece of software. Same WU may sometimes be slower with
exactly same machine setup. Flukes mostly happen at resumption from checkpoint
and carry through the rest of WU.

That said, comparisons should always be done with same unit, always started from
scratch (wudata_XX.dat being only file in work/ directory). Yes, that sacrifices production
but otherwise one's just generating noise.
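
When comparing TPF between two runs of the same unit, it helps to convert mm:ss to seconds and look at the relative change. A minimal Python sketch, with tpf_seconds and tpf_change_percent as hypothetical helper names and the sample TPFs taken from earlier in this thread:

```python
def tpf_seconds(tpf):
    """Convert a 'mm:ss' time-per-frame string to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def tpf_change_percent(old, new):
    """Relative TPF change in percent; positive means the new run is slower."""
    old_s = tpf_seconds(old)
    return (tpf_seconds(new) - old_s) / old_s * 100.0


if __name__ == "__main__":
    # 14:55 at 230 versus 15:30 at 231, as reported above.
    print(round(tpf_change_percent("14:55", "15:30"), 2))
```

A single pair of numbers like this is still only one data point; the helper just makes the size of the difference explicit.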

You suggested it may have been misapplication of refclock. No cases of misapplication
have been reported to date. Not to say that's impossible but checking effective CPU
frequency (by typing 'dmesg | grep -o Detected.*') is a good practice after changing OC
settings.
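
As a rough cross-check of what dmesg reports, the expected frequency can be estimated from the refclock. The Python sketch below assumes a stock refclock of 200 (as implied in this thread), a fixed CPU multiplier, and a 1800 MHz stock clock for the example; substitute the rated clock of your own CPU. The helper names and the 5 MHz tolerance are hypothetical.

```python
import re

STOCK_REFCLOCK_MHZ = 200  # assumption: stock value implied by the thread


def expected_mhz(stock_mhz, refclock_mhz):
    """Expected core clock, assuming it scales linearly with the refclock."""
    return stock_mhz * refclock_mhz / STOCK_REFCLOCK_MHZ


def detected_mhz(dmesg_line):
    """Pull the frequency out of a kernel 'Detected ... MHz processor' line."""
    match = re.search(r"Detected (\d+(?:\.\d+)?) MHz", dmesg_line)
    return float(match.group(1)) if match else None


def refclock_applied(stock_mhz, refclock_mhz, dmesg_line, tolerance_mhz=5.0):
    """True if the detected clock is within tolerance of the expected one."""
    found = detected_mhz(dmesg_line)
    if found is None:
        return False
    return abs(found - expected_mhz(stock_mhz, refclock_mhz)) <= tolerance_mhz


if __name__ == "__main__":
    # Example line only; paste the output of the dmesg command instead.
    line = "Detected 2079.012 MHz processor."
    print(expected_mhz(1800, 231), refclock_applied(1800, 231, line))
```

If the detected value still matches the stock clock after a power cycle, the new refclock most likely did not take effect.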

Posidon42, how do you observe "memory errors" ? Memory errors either go unnoticed
(silent corruption) or cause kernel/WU crash. What is it you observed?

Note that HT-retries and memory errors are completely different issues. Per FAQ,
absolute number of HT retries is not that relevant. It's their rate (how often they
occur) that is most important.
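
To illustrate the rate idea, here is a small Python sketch that compares two snapshots of the retry table and reports retries per hour. It assumes one row per node with hexadecimal counter fields, as in the tables posted in this thread; that interpretation, the helper names, and the lack of counter wrap-around handling are all assumptions.

```python
def parse_retry_table(text):
    """Map node number -> list of counters, reading each field as hex."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) > 2 and parts[0] == "Node":
            table[int(parts[1])] = [int(field, 16) for field in parts[2:]]
    return table


def retry_rates(before, after, hours):
    """Retries per hour for every counter that grew between two snapshots."""
    rates = {}
    for node, counters in after.items():
        for index, value in enumerate(counters):
            delta = value - before[node][index]
            if delta > 0:  # counter wrap-around is ignored in this sketch
                rates[(node, index)] = delta / hours
    return rates


if __name__ == "__main__":
    # Made-up snapshots taken two hours apart, for illustration only.
    earlier = parse_retry_table("Node 5 0000 0010 0000 0000")
    later = parse_retry_table("Node 5 0000 0030 0000 0000")
    print(retry_rates(earlier, later, hours=2.0))
```

A counter that sits at the same value across snapshots has a rate of zero, no matter how large the absolute number is.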

Improved HT retry script is in the works. It will free users from retry analysis and provide
easily digestible OK/FAIL indication.
 
You are right, I was mixing terms. I wasn't getting memory errors, but I was getting ht-retries errors. I understand that it is the rate of retries that is more of a concern. I tracked this throughout the work unit. At around 50% complete, only one core was showing 5 retries. At around 80%, that one was the same, but another one had popped up to about 12 retries, and a new one was showing around 8 or so. At 95% the core with 5 was the same, the one with 12 was about the same, yet, the one with 8 went up to ~160. That's when I decided to back it back down a bit and just re-evaluate how things were going. At 55% of my current work unit, I still have 0 ht-retries.

Possibly related to all of this is the fact that I moved the rig from an open spot in my basement to the top of my desk. So maybe some of my issues are due to the difference in cooling. Evaluations continue. Regardless, shaving some time off the frames isn't worth me cooking some processors.

Also, my current temps are below. I think they are pretty good.

Temperature table:
Node 0 C0:31 C1:31 C2:31 C3:31
Node 1 C0:31 C1:31 C2:31 C3:31
Node 2 C0:31 C1:31 C2:31 C3:31
Node 3 C0:31 C1:31 C2:31 C3:31
Node 4 C0:28 C1:28 C2:28 C3:28
Node 5 C0:27 C1:27 C2:27 C3:27
Node 6 C0:26 C1:26 C2:26 C3:26
Node 7 C0:25 C1:25 C2:25 C3:25
 
Is the system folding? If so, those are damn good temps. What are the coolers and ambient temp?
 
Yes it is folding. I am using the 212s like everyone else.

My A/C is set to 72F, though the basement is a little warmer due to folding.
 
Are the BIOS downloads OK on the OP? I get a Play button... Is it just a matter of DLing the MP3 file then changing the file ext?

Thanks for the hard work on the GL BIOS.
 
OP lists direct links... where are you seeing the icon?

Perhaps you tried to open URL pointing to the directory instead?
If that's what you did then just ignore the icon... these aren't mp3s
and your browser shouldn't attempt to play them.

If still in doubt: right-click and select "Save as..." (or alike).
 
Tear, I have a question
Node 0 0000 0000 0000 0000 001c 0000 0000 0000
Node 1 0000 0000 0000 0000 0000 0000 0000 0000
Node 2 0000 0000 0000 0000 0000 0000 0000 0000
Node 3 0000 0000 0000 0000 0000 0000 0000 0000
Node 4 0000 0000 0000 0000 0000 0000 0000 0000
Node 5 0000 51a2 0000 0000 0000 0000 0000 0000
Node 6 0000 0000 3928 0000 0000 0000 0000 0000
Node 7 0000 0000 0000 0000 0000 0000 0000 0000

As you can see, I have three positions with retry numbers. It's always the same three. And this is at a pretty low 10% OC.

Do they correspond to individual memory sticks?
I ordered three new sticks based on the theory that they do.
How would I go about mapping the above output to individual slots?
 
HT-retries have nothing to do with RAM. HyperTransport (HT) is the link that nodes on the CPUs use to communicate with each other. When there is an error sending/receiving an HT packet, an HT-retry occurs.
 
I had that issue with IE saving as a .mp3. I just deleted it and put the 511 that was there, correct?

Personally, I just right-click and save-as on the links. However you end up downloading the file, I would recommend performing an md5sum on the downloaded file to compare it against the md5sum tear provides. If they differ, then something went awry.
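
If md5sum isn't handy (e.g. on Windows), the same check can be sketched in Python with the standard hashlib module. The function names below are hypothetical, and the file name and checksum in the usage example are placeholders; use the downloaded file and the md5sum listed in the OP.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's MD5 against a published checksum string."""
    return md5_of_file(path) == published_md5.strip().lower()


if __name__ == "__main__":
    import sys
    # Usage: python check_md5.py <downloaded-file> <md5-from-the-OP>
    if len(sys.argv) == 3:
        ok = matches_published(sys.argv[1], sys.argv[2])
        print("OK" if ok else "MISMATCH: download again before flashing")
```

A mismatch means the download was altered or truncated, so the file should not be flashed.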
 
^^ what Mr. Firedfly says.

HT links in a Supermicro 4p G34 system:
ht-links-Gi-6-D-800.png
 