Dual E7-2870 Not Using All 40 Cores

plext0r

I mentioned this on IRC yesterday, and I just found another system running Linux where an 8101 work unit is only using 32 of the 40 cores. The only way I could get more PPD out of similar systems was to install VMware ESXi and configure two virtual machines with half the CPUs each. Is there any other way to get all 40 cores busy?

Code:
pk cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
            80.02 2.53 2.39  19.98   0.00   0.00   0.00   0.00
 0   0   0 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   0  20 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1   1 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1  21 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2   2 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2  22 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8   3 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8  23 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9   4 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9  24 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16   5 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16  25 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17   6 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17  26 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18   7 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18  27 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24   8 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24  28 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25   9 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25  29 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  10 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  30 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  11 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  31 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  12 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  32   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 1   8  13 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   8  33   0.15 2.53 2.39  99.85   0.00   0.00   0.00   0.00
 1   9  14 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   9  34   0.03 2.52 2.39  99.97   0.00   0.00   0.00   0.00
 1  16  15 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  16  35   0.03 2.52 2.39  99.97   0.00   0.00   0.00   0.00
 1  17  16 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  17  36   0.05 2.52 2.39  99.95   0.00   0.00   0.00   0.00
 1  18  17 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  18  37   0.19 2.53 2.39  99.81   0.00   0.00   0.00   0.00
 1  24  18 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  24  38   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  25  19 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  25  39   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
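
For anyone not used to turbostat output: pk is the physical package (socket), cor the core ID within it, CPU the logical CPU number, and %c0 the fraction of time that logical CPU spent executing. I grab these with something along these lines (a sketch; the msr module needs to be loaded first, and newer turbostat builds spell the interval option differently):
Code:
# load the MSR driver turbostat reads, then sample every 10 seconds (needs root)
sudo modprobe msr
sudo turbostat -i 10    # newer builds: turbostat --interval 10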
 
A single client cannot use 40 cores. For all 40 cores/threads, you will need two clients: one with 32 cores and a second with 8.
 
Interesting idea. I had forgotten about running two clients on one host. Is there a smart way to do this with fahclient 7.2.9 or should I go back to 6.34? I assume I start one client with -smp 32 (instead of letting it auto-detect 40 and then map to 32).

The nice thing about running two VMs with 20 CPUs each is that both can handle bigadv 8101 units. If I run 32 + 8, the 8-core client will only get regular SMP work units. Maybe I can run two -smp 20 clients? :)
 
A single client cannot use 40 cores. For all 40 cores/threads, you will need two clients: one with 32 cores and a second with 8.

I came up with the following config.xml, and it's currently running two FahCores with 20 CPUs each.

Code:
  <slot id="0" type="smp">
    <cpus v="20"/>
  </slot>
  <slot id="1" type="smp">
    <cpus v="20"/>
  </slot>

I removed the smp and gpu lines completely.
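
For anyone wanting to reproduce this, roughly the steps involved (a sketch; the init script and paths assume the stock fahclient package layout, so adjust to your install):
Code:
# stop the client, add the two <slot> blocks above to the config, restart,
# then check the log to confirm both smp:20 slots come up
sudo /etc/init.d/FAHClient stop
sudo vi /etc/fahclient/config.xml
sudo /etc/init.d/FAHClient start
grep -i slot /var/lib/fahclient/log.txt | tail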
 
turbostat looks worse in this config. The "Mapping" entry in the log shows 20, but about half the cores are idle, and top is showing two theKraken-wrapped FahCore processes at only ~800% CPU each.

Code:
pk cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
            50.05 2.53 2.39  49.95   0.00   0.00   0.00   0.00
 0   0   0 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   0  20   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
 0   1   1 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   1  21   0.26 2.53 2.39  99.74   0.00   0.00   0.00   0.00
 0   2   2 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   2  22   0.19 2.53 2.39  99.81   0.00   0.00   0.00   0.00
 0   8   3 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   8  23   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 0   9   4 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0   9  24   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 0  16   5 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  16  25   0.12 2.53 2.39  99.88   0.00   0.00   0.00   0.00
 0  17   6 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  17  26   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 0  18   7 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  18  27   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 0  24   8 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  24  28   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 0  25   9 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 0  25  29   0.10 2.53 2.39  99.90   0.00   0.00   0.00   0.00
 1   0  10 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   0  30   0.23 2.53 2.39  99.77   0.00   0.00   0.00   0.00
 1   1  11 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   1  31   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1   2  12 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   2  32   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1   8  13 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   8  33   0.14 2.53 2.39  99.86   0.00   0.00   0.00   0.00
 1   9  14 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1   9  34   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  16  15 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  16  35   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  17  16 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  17  36   0.07 2.53 2.39  99.93   0.00   0.00   0.00   0.00
 1  18  17 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  18  37   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  24  18 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  24  38   0.06 2.53 2.39  99.94   0.00   0.00   0.00   0.00
 1  25  19 100.00 2.53 2.39   0.00   0.00   0.00   0.00   0.00
 1  25  39   0.15 2.53 2.39  99.85   0.00   0.00   0.00   0.00
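
One way to tell whether the idle HT siblings are simply being excluded by the processes' CPU masks (a sketch, assuming pgrep and util-linux's taskset are available):
Code:
# print the allowed-CPU list for each running FahCore
for pid in $(pgrep -f FahCore_a5); do taskset -pc "$pid"; done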
 
BOINC will use all available cores (physical or virtual). ;):D Sounds like a sweet setup.
 
BOINC will use all available cores (physical or virtual). ;):D Sounds like a sweet setup.

Funny. :)

I ended up unwrapping the cores with theKraken and specifying
Code:
<cpu-affinity v="true"/>
in the config.xml. I'm now getting two processes with ~2000% usage.

Code:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   
29239 fahclien  39  19 2409m 1.2g 3176 S 1994.2  1.0  33:27.81 FahCore_a5                                                                                    
29234 fahclien  39  19 2584m 1.2g 3176 S 1895.6  1.0  32:39.61 FahCore_a5
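
A quick way to confirm the threads are really spreading across all 40 logical CPUs (a sketch using ps thread output; psr is the processor each thread last ran on):
Code:
# count how many distinct logical CPUs the FahCore_a5 threads are landing on
ps -eLo psr,comm | awk '$2 ~ /^FahCore/ {print $1}' | sort -n | uniq | wc -l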
 
HFM history shows I was getting 140-145K PPD on 6901, 136-145K PPD on 8101 (32 theKraken-wrapped cores), and 186K PPD on 8102.

With two unwrapped clients running 20 cores each (6901 WUs), each is estimated at 56-58K PPD. turbostat shows 1 or 2 cores idle; not sure why yet. I'll wait to see how 2x20 does with 8101.

Another system with identical CPUs is running ESXi with a single VM. The VM has been allocated 20 vCPUs (VMware allocates full cores, not hyperthreaded cores). That VM is getting 158-168K PPD on 8101 WUs (compared to the 136-145K on the bare-metal 32-core setup shown above).

EDIT: Considering VMware does not expose hyperthreading to the VMs, I wonder if I should disable HT on the bare metal and reconfigure for a single client running 20 cores.
 
IIRC, E7s are hyperthreaded. Shut off 8 HT (pseudo) cores, not the real ones. I doubt you will see much of a performance hit at 32C if you shut off just HT cores.

Then the system overhead can run on the 8 HT cores?

I could be wrong, but IIRC, an HT core acts like about 20% of a real core.
 
So this box has two 10-core (20-thread) CPUs. I just disabled HT in the BIOS (not sure how to disable only the 8 HT cores??). We'll see how bare metal without HT compares to the other host running VMware + HT + 20 vCPUs.
 
I have not done it, but it's done through Affinity commands. Every other core # is HT, so you first select all the Real cores, then pick another 12 alternate cores, and leave the rest open?

Not a wiz at this stuff, just throwing out ideas.
 
When you just chop off the top of the cores, 1/2 are real, 1/2 are HT. Don't shut off Real cores.
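
If anyone does want to go that route, the sibling mapping doesn't have to be guessed; it can be read straight from sysfs, and individual logical CPUs can be taken offline without a BIOS change. A rough sketch (assumes kernel CPU hotplug support; per the turbostat output earlier in the thread, CPU N and CPU N+20 share a core on this box):
Code:
# show which logical CPUs share each physical core
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
# take one HT sibling offline (its physical twin keeps running) ...
echo 0 | sudo tee /sys/devices/system/cpu/cpu20/online
# ... and bring it back later
echo 1 | sudo tee /sys/devices/system/cpu/cpu20/online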
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...

There's no such thing as "virtual core" or "real core" either. They are _always_ equivalent.

And watch closely FahCore boot-up messages -- no mystery there:
Code:
[23:27:01] Mapping NT from 40 to 32

For more issues -- keep using V7.
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...

There's no such thing as "virtual core" or "real core" either. They are _always_ equivalent.

And watch closely FahCore boot-up messages -- no mystery there:
Code:
[23:27:01] Mapping NT from 40 to 32

For more issues -- keep using V7.

Thanks. I guess I really don't know what HT is. I do know that turning off certain core #'s fixed problems when the Linux BigAdv core was in beta.

When I turned off HT, I used to take a 15-30% hit in TPF instead of the ~50% hit from turning off half the cores.
 
Today's HT is far better than what shipped in, say, the early 2000s. Actually, I think a turd on a stick is slightly better than early-2000s HT :D
 
Use -smp 44 for 40 threads. Do not shut HT off. Wrap cores back again...
(snip)
For more issues -- keep using V7.

Thanks tear. If I wanted to migrate from V7's /var/lib/fahclient back to V6 + origami to manage my installs, any idea how compatible the two are? Are the "work" subdirectories compatible, or only the cores?

On this particular host, it's at 94% on a WU. I restarted FAHClient with max-units=1 and will manually migrate back to 6.34 when it's done.

I have a ton of hosts running V7.2.9 and it would be nice to automate going back to 6.34. I scripted the transition from origami to V7, but now I forget where origami's client stored the core files, etc. I'd like to just move the cores from /var/lib/fahclient/cores to somewhere under /var/lib/origami/foldingathome, and have "origami deploy" automate most of it. :)
 
FYI, I installed 6.34 with origami and wrapped the a5 core. I configured origami to start fah6 with -smp 44 -bigadv. It mapped 44 to 40.

turbostat also shows all 40 cores are busy. I sure wish the SMP autodetection were more robust so all this manual configuration wasn't needed. When you're managing a ton of clients with origami, logging into each one to tweak client.cfg, fah_config, etc. can be a pain. One of the reasons I had migrated to V7 was the ease of deployment and management.

Code:
Launch directory: /var/lib/fahclient/origami/foldingathome/CPU1
Executable: /var/lib/origami/foldingathome/CPU1/fah6
Arguments: -smp 44 -bigadv

<snip>
[19:30:12] Loaded queue successfully.
[19:30:12]
[19:30:12] + Processing work unit
[19:30:12] Core required: FahCore_a5.exe
[19:30:12] Core found.
[19:30:12] Working on queue slot 01 [March 1 19:30:12 UTC]
[19:30:12] + Working ...
[19:30:13]
[19:30:13] *------------------------------*
[19:30:13] Folding@Home Gromacs SMP Core
[19:30:13] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[19:30:13]
[19:30:13] Preparing to commence simulation
[19:30:13] - Looking at optimizations...
[19:30:13] - Files status OK
[19:30:16] - Expanded 30300756 -> 33158020 (decompressed 109.4 percent)
[19:30:16] Called DecompressByteArray: compressed_data_size=30300756 data_size=33158020, decompressed_data_size=33158020 diff=0
[19:30:16] - Digital signature verified
[19:30:16]
[19:30:16] Project: 8101 (Run 17, Clone 1, Gen 142)
[19:30:16]
[19:30:16] Assembly optimizations on if available.
[19:30:16] Entering M.D.
[19:30:23] Mapping NT from 44 to 40
[19:30:29] Completed 0 out of 250000 steps  (0%)
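
A couple of quick sanity checks (a sketch; the FAHlog.txt path assumes the launch directory shown in the log above):
Code:
# how many FahCore_a5 threads are actually running?
ps -eLf | grep -c '[F]ahCore_a5'
# what thread count did the core settle on?
grep "Mapping NT" /var/lib/fahclient/origami/foldingathome/CPU1/FAHlog.txt | tail -1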
 
Excellent. How's TPF looking?

BTW, don't get me wrong brilong, V7 can be made to do what it's supposed to, but doing so requires much more effort, care, and expertise.

The processor-count detection bug in the 7.3 series that you found, and that we discussed on IRC, is just another reason to stay away from it. In V6 one would see such an issue right away (I would think).

[The first reason is this: http://foldingforum.org/viewtopic.php?f=67&t=18304 -- going on its 4th year now]

The way I see it today -- for a typical BA folder, the risks of using V7 outweigh the benefits...
 
Excellent. How's TPF looking?

Previously I saw 132-144K PPD on 8101 and 186K PPD on 8102.

Over the weekend it got 165K PPD on 8101, and it's working on an 8102 right now. HFM.net estimates PPD at 243K. :D
 
I sure wish the SMP autodetection were more robust so all this manual configuration wasn't needed. When you're managing a ton of clients with origami, logging into each one to tweak client.cfg, fah_config, etc. can be a pain. One of the reasons I had migrated to V7 was the ease of deployment and management.

As odd as it may seem, I had never heard of origami until you mentioned it last week. I spent a little time messing with it over the weekend and, I must say, it definitely has potential even if you don't need to deploy F@H to many machines. It looks like it just needs a refresh to handle today's hardware and software. It may be something we tackle in the future.
 
A year or two ago, when I wanted to automate F@H v6 deployment to 20+ Linux nodes, I found origami. I set up an SSH trust from one of the Linux nodes to the rest and then used origami to deploy the clients. I tweaked origami for my needs (making it rely less on external URLs during deployments) and it worked well.

When v7 came out and I saw a supported way to handle deployments (RPM-based install, config.xml, etc.), I tried it on a few nodes and it worked well. Over a couple of days I migrated all my origami nodes to v7 (automating the LVM rename of the /var/lib/origami ext3 LUN to /var/lib/fahclient, etc.).

With tear's suggestion that v7 needs more babysitting than v6 in certain instances, I put origami back on the 2x E7-2870 node, but I'm not using it to deploy to multiple nodes again. We'll see how things progress, since I'm not sure how long v6 clients will last or whether new cores will be released that require the v7 client.
 