Simple Dual Processor question

Andyk5

How does working on a dual-CPU system change the way Windows assigns cores to applications?

As an example, let's say I have a program that can be configured to run on 2 cores. If I have a system with 2 physical CPUs with 8 cores each, will it use
a) 2 out of 8 available cores on 1 CPU,
b) 2 out of 16 available cores across 2 CPUs, or
c) 4 out of 16 available cores across 2 CPUs (2 cores on each CPU)?

If option b, will this run better than a? It's still 2 cores, but on 2 physically separate CPUs, so could they run at a higher frequency? Same heat output, but better cooling due to more cooling surface?
 
Here's the word from Microsoft for Windows 7:

PCs with multi-core processors:
Windows 7 was designed to work with today's multi-core processors. All 32-bit versions of Windows 7 can support up to 32 processor cores, while 64-bit versions can support up to 256 processor cores.

PCs with multiple processors (CPUs):
Commercial servers, workstations, and other high-end PCs may have more than one physical processor. Windows 7 Professional, Enterprise, and Ultimate allow for two physical processors, providing the best performance on these computers. Windows 7 Starter, Home Basic, and Home Premium will recognize only one physical processor.

The requirements page for Windows 8 makes no mention of multiple CPUs.

This page indicates that Windows 8 Pro and Enterprise can handle 2 CPUs.
 
C won't happen. Having multiple CPUs doesn't allow an app that uses two threads to somehow use four.

Often the ideal result is A. Local memory access is faster than access to memory on a different NUMA node, so in the usual case where the two threads share data, keeping them on one CPU is often faster. Dropping down one turbo bin usually doesn't cost much, and eliminating the penalty for non-local memory access usually outweighs the loss of clock speed. This assumes your OS does a decent job of allocating memory: seeing the advantage requires RAM to be allocated on the node the app is running on. On top of that, there is usually more memory bandwidth available than one or two cores can actually use.

Of course, it is possible to write an application that works better spread across two CPUs than on one. If the interaction between the threads is really minimal, the higher turbo boost wins, and if the code is memory bound, the extra bandwidth on the second socket may help too.
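If you want to force the option-A behavior yourself rather than trusting the scheduler, Windows exposes the NUMA topology through Win32. A minimal C sketch (node 0 picked arbitrarily) that pins the calling thread to the cores of one NUMA node, so its allocations land in that node's local RAM:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest_node;
    ULONGLONG mask;

    /* How many NUMA nodes (usually one per socket) does this box have? */
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        fprintf(stderr, "GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("highest NUMA node: %lu\n", highest_node);

    /* Fetch the processor mask for node 0 and pin this thread to it,
     * so Windows allocates this thread's pages from node-0 RAM. */
    if (!GetNumaNodeProcessorMask(0, &mask)) {
        fprintf(stderr, "GetNumaNodeProcessorMask failed: %lu\n", GetLastError());
        return 1;
    }
    if (!SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask)) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("thread pinned to node 0 (mask 0x%llx)\n", mask);
    return 0;
}
```

Windows generally prefers allocating from the running thread's node anyway, so this mostly just removes the chance of the scheduler bouncing the threads across sockets.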
 
Great detailed answer, thank you. I actually did not think about the memory-locality issues. Looks like a single 6-core CPU with the highest possible clock speed and some 1866 MHz RAM are my best options. The software I am talking about is Xilinx Vivado, and generating the bit file for a fully utilized Virtex-7 FPGA takes about 5 hours and uses 15 gigabytes of RAM. I am trying to shave off as much time as I can, but unfortunately only certain steps of the flow are multithreaded, and then only across two cores. I wonder how massive companies like Intel and AMD solve these issues.
 
I'm vaguely familiar with that problem; I used to sit next to the hardware engineers at work. They described bitfile generation as being analogous to a travelling salesman problem: there are an obscene number of ways to lay out the logic on a big FPGA, and generating a bit file is more like searching for one that works. They usually run several builds at a time and see which one finishes first. If I understood it right, the tools we're using can be set up to try different layouts on the FPGA so multiple copies don't attempt the same one. It still takes hours, though.
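For what it's worth, the "run several and keep the winner" part is easy to script on Windows. A rough C sketch; "build.exe" and its "--seed" flag are made-up stand-ins for whatever the real tool's command line looks like:

```c
#include <windows.h>
#include <stdio.h>

#define RUNS 4

int main(void)
{
    HANDLE procs[RUNS];

    /* Launch several place-and-route runs with different seeds. */
    for (int i = 0; i < RUNS; i++) {
        char cmd[64];
        STARTUPINFOA si = { sizeof si };
        PROCESS_INFORMATION pi;
        snprintf(cmd, sizeof cmd, "build.exe --seed %d", i);  /* hypothetical CLI */
        if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
            fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
            return 1;
        }
        CloseHandle(pi.hThread);
        procs[i] = pi.hProcess;
    }

    /* Block until the first run exits, then kill the stragglers. */
    DWORD winner = WaitForMultipleObjects(RUNS, procs, FALSE, INFINITE) - WAIT_OBJECT_0;
    printf("run %lu finished first\n", winner);
    for (int i = 0; i < RUNS; i++) {
        if ((DWORD)i != winner)
            TerminateProcess(procs[i], 1);
        CloseHandle(procs[i]);
    }
    return 0;
}
```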

Just out of curiosity, what are you using the FPGAs for? We use them to process market data from securities exchanges.
 
Mostly flight projects. The one I am working on is for a communication device. Multiple layouts is not the biggest problem; it is place and route. When you code hardware using RTL (Verilog), you don't do structural logic design. You don't write out the gate equations, S = (A xor B) xor Cin and Cout = (A and B) or ((A xor B) and Cin). You just say S = A + B; synthesis turns that into the expressions above, and place and route then turns those into transistors. The number of inputs determines how many transistors a gate is made of, so a simple 16-bit adder can easily take 30 × 16 = 480 transistors' worth of FPGA fabric. Routing all those damn pins requires a lot of RAM and CPU, especially when you start dealing with more complicated designs. The Virtex-7 I am working with has 6.8 billion transistors.
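To make the expansion concrete, here is a small C sketch (same boolean algebra, just in software) that builds a 16-bit ripple-carry adder out of nothing but the full-adder gate expressions above, and checks it against the behavioral +:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One full adder, exactly the gate-level expressions from the post:
 * sum  = (a xor b) xor cin
 * cout = (a and b) or ((a xor b) and cin)                           */
static unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout)
{
    unsigned axb = a ^ b;
    *cout = (a & b) | (axb & cin);
    return axb ^ cin;
}

/* 16 full adders chained through the carry: a ripple-carry adder,
 * the sort of structure synthesis derives from behavioral "S = A + B". */
static uint16_t add16(uint16_t a, uint16_t b)
{
    unsigned carry = 0;
    uint16_t sum = 0;
    for (int i = 0; i < 16; i++) {
        unsigned bit = full_adder((a >> i) & 1u, (b >> i) & 1u, carry, &carry);
        sum |= (uint16_t)(bit << i);
    }
    return sum;
}

int main(void)
{
    assert(add16(1234, 4321) == (uint16_t)(1234 + 4321));
    assert(add16(0xFFFF, 1) == 0);   /* wraps around, like the real hardware */
    printf("gate-level adder matches behavioral +\n");
    return 0;
}
```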
 
My understanding is that most modern placement algorithms start with everything located at a single coordinate, and then use some form of simulated annealing to find a reasonably optimal and legal placement.
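The core loop of that is simple enough to sketch. A toy annealer in C with big simplifications (a chain netlist, wirelength-only cost, and no legality, so cells can overlap; real placers have to handle all of that):

```c
/* compile: cc anneal.c -lm */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define GRID  16             /* 16x16 grid of placement sites */
#define CELLS 64             /* toy netlist: cell i is wired to cell i+1 */

static int px[CELLS], py[CELLS];

/* Cost: total Manhattan wirelength of the chain netlist. */
static int wirelength(void)
{
    int total = 0;
    for (int i = 0; i + 1 < CELLS; i++)
        total += abs(px[i] - px[i + 1]) + abs(py[i] - py[i + 1]);
    return total;
}

int main(void)
{
    /* Start with every cell crammed into one corner of the grid. */
    for (int i = 0; i < CELLS; i++) { px[i] = i % 4; py[i] = (i / 4) % 4; }

    srand(42);
    int cost = wirelength();
    for (double temp = 10.0; temp > 0.01; temp *= 0.995) {
        /* Propose a move: kick one random cell to a random site. */
        int c = rand() % CELLS;
        int ox = px[c], oy = py[c];
        px[c] = rand() % GRID;
        py[c] = rand() % GRID;

        int delta = wirelength() - cost;
        /* Always accept improvements; accept uphill moves with
         * probability exp(-delta/temp), which shrinks as we cool. */
        if (delta <= 0 || exp(-delta / temp) > (double)rand() / RAND_MAX)
            cost += delta;
        else { px[c] = ox; py[c] = oy; }   /* reject: undo the move */
    }
    printf("final wirelength: %d\n", cost);
    return 0;
}
```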
 
Something to consider: you will not save time if your application is limited by memory bandwidth. This happened to my brother: he has an app that is single-threaded and uses gobs of RAM. He tried running multiple instances and the slowdown was almost linear: running two instances doubled the time for each, running three tripled it, and so on.
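A quick way to tell whether you're in that regime: time a streaming pass over a buffer much bigger than any cache, then run several copies of the program at once and see whether the per-copy number drops. A rough C11 sketch (the buffer size and rep count are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (256 * 1024 * 1024 / sizeof(double))   /* 256 MB per buffer */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b) return 1;
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int rep = 0; rep < REPS; rep++)
        for (size_t i = 0; i < N; i++)
            a[i] = a[i] + 3.0 * b[i];             /* STREAM-style triad */
    timespec_get(&t1, TIME_UTC);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb  = (double)REPS * 3.0 * N * sizeof(double) / 1e9;  /* 2 reads + 1 write */
    printf("%.2f GB/s\n", gb / sec);
    return 0;
}
```

If two copies each report roughly half the single-copy number, the memory bus is saturated and more cores won't help.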
 
You need premium single-thread performance. If you need ECC RAM to improve RAS, there are single-socket Xeon v2 chips that can be overclocked. If you are OK running non-ECC RAM, Intel launched Devil's Canyon yesterday, a 4-core, 8-thread chip capable of unprecedented overclocking. There is no need to go higher than 4 cores to run code that uses at most 2.

Still no overclocking results, but there is also a new unlocked 2-core Pentium, which could make for a very cheap machine to run your Vivado. :cool:
 