Dual E7-2870 Not Using All 40 Cores

plext0r

I mentioned this on IRC yesterday, and I just found another system running Linux where an 8101 work unit is only using 32 of the 40 cores. The only way I could get more PPD out of similar systems was to install VMware ESXi and configure two virtual machines with half the CPUs each. Is there any other way to get all 40 cores busy?

Code:
pk cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
            80.02 2.53 2.39  19.98   0.00   0.00   0.00   0.00
 0   0   0 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   0  20 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1   1 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1  21 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2   2 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2  22 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8   3 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8  23 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9   4 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9  24 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16   5 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16  25 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17   6 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17  26 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18   7 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18  27 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24   8 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24  28 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25   9 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25  29 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  10 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  30 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  11 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  31 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  12 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  32   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 1   8  13 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   8  33   0.15 2.53 2.39  99.85   0.00   0.00   0.00   0.00
 1   9  14 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   9  34   0.03 2.52 2.39  99.97   0.00   0.00   0.00   0.00
 1  16  15 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  16  35   0.03 2.52 2.39  99.97   0.00   0.00   0.00   0.00
 1  17  16 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  17  36   0.05 2.52 2.39  99.95   0.00   0.00   0.00   0.00
 1  18  17 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  18  37   0.19 2.53 2.39  99.81   0.00   0.00   0.00   0.00
 1  24  18 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  24  38   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  25  19 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  25  39   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
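
For anyone not used to turbostat output: pk is the physical package (socket), cor the core ID within it, CPU the logical CPU number, and %c0 the fraction of time that logical CPU spent executing. I grab these with something along these lines (a sketch; the msr module needs to be loaded first, and newer turbostat builds spell the interval option differently):
Code:
# load the MSR driver turbostat reads, then sample every 10 seconds (needs root)
sudo modprobe msr
sudo turbostat -i 10    # newer builds: turbostat --interval 10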
 
A single client cannot use 40 cores. For all 40 cores/threads, you will need two clients: one with 32 cores and a second with 8.
 
Interesting idea. I had forgotten about running two clients on one host. Is there a smart way to do this with fahclient 7.2.9 or should I go back to 6.34? I assume I start one client with -smp 32 (instead of letting it auto-detect 40 and then map to 32).

The nice thing about running two VMs with 20 CPUs each is that both can handle bigadv 8101 units. If I run 32 + 8, the 8-core client will only get regular SMP work units. Maybe I can run two -smp 20 clients? :)
 
A single client cannot use 40 cores. For all 40 cores/threads, you will need two clients: one with 32 cores and a second with 8.

I came up with the following config.xml, and it's currently running two FahCores with 20 CPUs each.

Code:
  <slot id="0" type="smp">
    <cpus v="20"/>
  </slot>
  <slot id="1" type="smp">
    <cpus v="20"/>
  </slot>

I removed the smp and gpu lines completely.
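
For anyone wanting to reproduce this, roughly the steps involved (a sketch; the init script and paths assume the stock fahclient package layout, so adjust to your install):
Code:
# stop the client, add the two <slot> blocks above to the config, restart,
# then check the log to confirm both smp:20 slots come up
sudo /etc/init.d/FAHClient stop
sudo vi /etc/fahclient/config.xml
sudo /etc/init.d/FAHClient start
grep -i slot /var/lib/fahclient/log.txt | tail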
 
turbostat looks worse in this config. The "Mapping" entry in the log shows 20, but about half the cores are idle, and top is showing two theKraken-wrapped FahCore processes at only ~800% CPU each.

Code:
pk cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
            50.05 2.53 2.39  49.95   0.00   0.00   0.00   0.00
 0   0   0 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   0  20   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
 0   1   1 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1  21   0.26 2.53 2.39  99.74   0.00   0.00   0.00   0.00
 0   2   2 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2  22   0.19 2.53 2.39  99.81   0.00   0.00   0.00   0.00
 0   8   3 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8  23   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 0   9   4 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9  24   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 0  16   5 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16  25   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 0  17   6 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17  26   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 0  18   7 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18  27   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 0  24   8 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24  28   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 0  25   9 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25  29   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
 1   0  10 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  30   0.23 2.53 2.39  99.77   0.00   0.00   0.00   0.00
 1   1  11 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  31   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1   2  12 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  32   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1   8  13 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   8  33   0.14 2.53 2.39  99.86   0.00   0.00   0.00   0.00
 1   9  14 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   9  34   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  16  15 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  16  35   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  17  16 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  17  36   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  18  17 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  18  37   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  24  18 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  24  38   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  25  19 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  25  39   0.15 2.53 2.39  99.85   0.00   0.00   0.00   0.00
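
One way to tell whether the idle HT siblings are simply being excluded by the processes' CPU masks (a sketch, assuming pgrep and util-linux's taskset are available):
Code:
# print the allowed-CPU list for each running FahCore
for pid in $(pgrep -f FahCore_a5); do taskset -pc "$pid"; done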
 
BOINC will use all available cores (physical or virtual). ;):D Sounds like a sweet setup.
 
BOINC will use all available cores (physical or virtual). ;):D Sounds like a sweet setup.

Funny. :)

I ended up unwrapping the cores with theKraken and specifying
Code:
<cpu-affinity v="true"/>
in the config.xml. I'm now getting two processes with ~2000% usage.

Code:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   
29239 fahclien  39  19 2409m 1.2g 3176 S 1994.2  1.0  33:27.81 FahCore_a5                                                                                    
29234 fahclien  39  19 2584m 1.2g 3176 S 1895.6  1.0  32:39.61 FahCore_a5
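
A quick way to confirm the threads are really spreading across all 40 logical CPUs (a sketch using ps thread output; psr is the processor each thread last ran on):
Code:
# count how many distinct logical CPUs the FahCore_a5 threads are landing on
ps -eLo psr,comm | awk '$2 ~ /^FahCore/ {print $1}' | sort -n | uniq | wc -l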
 
HFM history shows I was getting 140-145K PPD on 6901, 136-145K PPD on 8101 (32 theKraken-wrapped cores), and 186K PPD on 8102.

With two unwrapped clients running 20 cores each (6901 WUs), each is estimated at 56-58K PPD. turbostat shows 1 or 2 cores idle; not sure why yet. I'll wait to see how 2x20 does with 8101.

Another system with identical CPUs is running ESXi with a single VM. The VM has been allocated 20 vCPUs (VMware allocates full cores, not hyperthreaded cores). That VM is getting 158-168K PPD on 8101 WUs (compared to the 136-145K on the bare-metal 32-core setup shown above).

EDIT: Considering VMware does not expose hyperthreading to the VMs, I wonder if I should disable HT on the bare metal and reconfigure for a single client running 20 cores.
 
IIRC, E7s are hyperthreaded. Shut off 8 HT (pseudo) cores, not the real ones. I doubt you will see much of a performance hit at 32C if you shut off just HT cores.

Then the system overhead can run on the 8 HT cores?

I could be wrong, but IIRC, an HT core acts like about 20% of a real core.
 
So this box has two 10-core (20-thread) CPUs. I just disabled HT in the BIOS (not sure how to disable only the 8 HT cores??). We'll see how bare metal without HT compares to the other host running VMware + HT + 20 vCPUs.
 
I have not done it, but it's done through Affinity commands. Every other core # is HT, so you first select all the Real cores, then pick another 12 alternate cores, and leave the rest open?

Not a wiz at this stuff, just throwing out ideas.
 
When you just chop off the top of the cores, 1/2 are real, 1/2 are HT. Don't shut off Real cores.
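
If anyone does want to go that route, the sibling mapping doesn't have to be guessed; it can be read straight from sysfs, and individual logical CPUs can be taken offline without a BIOS change. A rough sketch (assumes kernel CPU hotplug support; per the turbostat output earlier in the thread, CPU N and CPU N+20 share a core on this box):
Code:
# show which logical CPUs share each physical core
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
# take one HT sibling offline (its physical twin keeps running) ...
echo 0 | sudo tee /sys/devices/system/cpu/cpu20/online
# ... and bring it back later
echo 1 | sudo tee /sys/devices/system/cpu/cpu20/online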
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...

There's no such thing as "virtual core" or "real core" either. They are _always_ equivalent.

And watch closely FahCore boot-up messages -- no mystery there:
Code:
[23:27:01] Mapping NT from 40 to 32

For more issues -- keep using V7.
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...

There's no such thing as "virtual core" or "real core" either. They are _always_ equivalent.

And watch closely FahCore boot-up messages -- no mystery there:
Code:
[23:27:01] Mapping NT from 40 to 32

For more issues -- keep using V7.

Thanks. I guess I really don't know what HT is. I do know that turning off certain core #'s fixed problems when the Linux BigAdv core was in beta.

When I turned off HT, I used to take a 15-30% hit in TPF instead of the ~50% hit from turning off half the cores.
 
Today's HT is far better than what shipped in, say, the early 2000s. Actually, I think a turd on a stick is slightly better than early-2000s HT :D
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...
(snip)
For more issues -- keep using V7.

Thanks tear. If I wanted to migrate from V7's /var/lib/fahclient back to V6 + origami to manage my installs, any idea how compatible the two are? Are the "work" subdirectories compatible, or only the cores?

On this particular host, it's at 94% on a WU. I restarted FAHClient with max-units=1 and will manually migrate back to 6.34 when it's done.

I have a ton of hosts running V7.2.9 and it would be nice to automate going back to 6.34. I scripted the transition from origami to V7, but now I forget where origami's client stored the core files, etc. I'd like to just move the cores from /var/lib/fahclient/cores to somewhere under /var/lib/origami/foldingathome, and have "origami deploy" automate most of it. :)
 
FYI, I installed 6.34 with origami and wrapped the a5 core. I configured origami to start fah6 with -smp 44 -bigadv. It mapped 44 to 40.

turbostat also shows all 40 cores are busy. I sure wish the SMP autodetection were more robust so all this manual configuration wasn't needed. When you're managing a ton of clients with origami, logging into each one to tweak client.cfg, fah_config, etc. can be a pain. One of the reasons I had migrated to V7 was the ease of deployment and management.

Code:
Launch directory: /var/lib/fahclient/origami/foldingathome/CPU1
Executable: /var/lib/origami/foldingathome/CPU1/fah6
Arguments: -smp 44 -bigadv

<snip>
[19:30:12] Loaded queue successfully.
[19:30:12]
[19:30:12] + Processing work unit
[19:30:12] Core required: FahCore_a5.exe
[19:30:12] Core found.
[19:30:12] Working on queue slot 01 [March 1 19:30:12 UTC]
[19:30:12] + Working ...
[19:30:13]
[19:30:13] *------------------------------*
[19:30:13] Folding@Home Gromacs SMP Core
[19:30:13] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[19:30:13]
[19:30:13] Preparing to commence simulation
[19:30:13] - Looking at optimizations...
[19:30:13] - Files status OK
[19:30:16] - Expanded 30300756 -> 33158020 (decompressed 109.4 percent)
[19:30:16] Called DecompressByteArray: compressed_data_size=30300756 data_size=33158020, decompressed_data_size=33158020 diff=0
[19:30:16] - Digital signature verified
[19:30:16]
[19:30:16] Project: 8101 (Run 17, Clone 1, Gen 142)
[19:30:16]
[19:30:16] Assembly optimizations on if available.
[19:30:16] Entering M.D.
[19:30:23] Mapping NT from 44 to 40
[19:30:29] Completed 0 out of 250000 steps  (0%)
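
A couple of quick sanity checks (a sketch; the FAHlog.txt path assumes the launch directory shown in the log above):
Code:
# how many FahCore_a5 threads are actually running?
ps -eLf | grep -c '[F]ahCore_a5'
# what thread count did the core settle on?
grep "Mapping NT" /var/lib/fahclient/origami/foldingathome/CPU1/FAHlog.txt | tail -1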
 
Excellent. How's TPF looking?

BTW, don't get me wrong brilong, V7 can be made to do what it's supposed to, but doing so requires much more effort, care, and expertise.

The processor-count detection bug in the 7.3 series that you found, and that we discussed on IRC, is just another reason to stay away from it. In V6 one would see such an issue right away (I would think).

[The first reason is this: http://foldingforum.org/viewtopic.php?f=67&t=18304 -- going on its 4th year now]

The way I see it today -- for a typical BA folder, the risks of using V7 outweigh the benefits...
 
Excellent. How's TPF looking?

Previously I saw 132-144K PPD on 8101 and 186K PPD on 8102.

Over the weekend it got 165K PPD on 8101, and it's working on an 8102 right now. HFM.net estimates PPD at 243K. :D
 
I sure wish the SMP autodetection were more robust so all this manual configuration wasn't needed. When you're managing a ton of clients with origami, logging into each one to tweak client.cfg, fah_config, etc. can be a pain. One of the reasons I had migrated to V7 was the ease of deployment and management.

As odd as it may seem, I had never heard of origami until you mentioned it last week. I spent a little time messing with it over the weekend and, I must say, it definitely has potential even if you don't need to deploy F@H to many machines. It looks like it just needs a refresh to handle today's hardware and software. It may be something we tackle in the future.
 
A year or two ago, when I wanted to automate F@H v6 deployment to 20+ Linux nodes, I found origami. I set up an SSH trust from one of the Linux nodes to the rest and then used origami to deploy the clients. I tweaked origami for my needs (making it rely less on external URLs during deployments) and it worked well.

When v7 came out and I saw a supported way to handle deployments (RPM-based install, config.xml, etc.), I tried it on a few nodes and it worked well. Over a couple of days I migrated all my origami nodes to v7 (automating the LVM rename of the /var/lib/origami ext3 LUN to /var/lib/fahclient, etc.).

With tear's suggestion that v7 needs more babysitting than v6 in certain instances, I put origami back on the 2x E7-2870 node, but I'm not using it to deploy to multiple nodes again. We'll see how things progress, since I'm not sure how long v6 clients will last or whether new cores will be released that require the v7 client.
 