NUMA performance comparison between NUMA kernels and mainline

theGryphon

Hey guys,
I found this read (via a link at Phoronix) very interesting, and I'm wondering whether these NUMA-optimized kernels could have any effect on folding performance: http://lkml.indiana.edu/hypermail/linux/kernel/1212.0/03076.html

tl;dr: They ran NUMA-centric benchmarks and measured latency and bandwidth on a 32-way, 4-node Opteron system (à la a 6128 4P), and found that their NUMA-optimized v3 kernel (named "numa-u-v3") outperforms the mainline kernel by pretty large margins.

AFAIK, folding workloads are multiple processes with multiple threads each (you can actually see 4x8 when you run a 32-way system, 4x12 on a 48-way system, or 8x8 on a 64-way system). Hence, I think the most relevant results in the benchmarks above are the 4x8 (or 8x4) workloads, and in those cases the NUMA-optimized kernels perform way better than the mainline kernel.

The thing is, what this may mean for folding is still speculation until someone with some Linux prowess actually builds and runs that NUMA v3 kernel. A few names pop into my mind :D

Can we even get our hands on it?
 
The 3.X.Y kernels in Fedora 16 and 17 x86_64 have been compiled with the NUMA config options enabled, and I have not seen anything that makes me go "Wow these are the shit!!! I need to get me some more!!!"
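
For anyone who wants to check which NUMA options their own kernel was built with, here is a rough Python sketch. It only parses the text of a kernel config file; the `numa_options` helper and the sample text are made up for illustration, and I'm assuming the usual `CONFIG_NAME=value` format with unset options written as comments.

```python
def numa_options(config_text):
    """Return {option: value} for every NUMA-related option that is set.

    Hypothetical helper: it just scans kernel-config-style text for
    lines of the form CONFIG_<something with NUMA>=<value>.
    """
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as comments ("# CONFIG_FOO is not set").
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_") and "NUMA" in name:
            found[name] = value
    return found


# Made-up sample text, not taken from any real Fedora kernel.
sample = """\
CONFIG_SMP=y
CONFIG_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_X86_64_ACPI_NUMA=y
"""

print(numa_options(sample))
```

On many distributions the config of the running kernel is readable at a path like `/boot/config-<kernel release>`, so you could feed that file's contents to the function instead of the sample. That path is an assumption about your install, not something guaranteed.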
 
With thekraken, turning on dynamic load balancing and setting CPU affinity is the best we can do with the current version of the cores. There is a lot of data that moves from each thread back to the main client every few steps. On p8101 I have 24 threads with 3185 MB of memory and see about 800M jumps on the fahclient; numactl reports 4 nodes with 6 CPUs each, and each node has 2048 MB in use.
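
To make the affinity idea concrete, here is a small Python sketch of how 24 threads could be laid out over 4 nodes with 6 CPUs each. This is not how thekraken does it internally (I haven't checked); `plan_affinity` and `pin_current_thread` are hypothetical names, and the sketch assumes CPU ids are numbered contiguously per node, which is not true on every board. Check the real layout with numactl before trusting it.

```python
import os


def plan_affinity(n_threads, n_nodes, cpus_per_node):
    """Map each thread index to a (node, cpu) pair, filling node by node.

    Assumption: node k owns CPU ids k*cpus_per_node .. (k+1)*cpus_per_node - 1.
    """
    if n_threads > n_nodes * cpus_per_node:
        raise ValueError("more threads than CPUs")
    plan = {}
    for thread in range(n_threads):
        node = thread // cpus_per_node
        cpu = node * cpus_per_node + thread % cpus_per_node
        plan[thread] = (node, cpu)
    return plan


def pin_current_thread(cpu):
    """Pin the calling thread to one CPU (Linux only; does nothing elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})


# The p8101 case from above: 24 threads, 4 nodes, 6 CPUs per node.
plan = plan_affinity(24, 4, 6)
for thread in (0, 5, 6, 23):
    print(thread, plan[thread])
```

The point of filling one node before moving to the next is that threads which exchange data often end up on the same node, so fewer of their memory accesses have to cross to a remote node.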

In Gromacs 4.6, they have added NUMA-aware memory allocation, plus something about processing the memory sequentially by thread when it resyncs back to the controlling thread every few steps.
 