Dual Opterons or Xeons?

mikeblas said:
NUMA can provide independent pipelines between the processors and memory. The only reason I had focused on random accesses is to avoid having the cache prefetch lines of memory. If the cache is prefetching the data effectively (as it would with linear access to memory), there's still a measurable performance penalty accessing memory cross-node, though it isn't nearly as pronounced.

No. Simply put, NUMA is non-uniform memory access; in other words, a non-shared memory bus. How the Opterons do it is different from how POWER4/5 does it and how SPARC did it. The fact that the Opterons give you a side band of HyperTransport and some extra data handling capability in the CPU itself to hand off data from a different CPU's memory cache, without having the instruction go through that CPU's pipeline, is specific to Opteron. Yes, in your particular example the Opterons will pay a penalty higher than Xeon. Move to a different memory addressing pattern and it will go either way, depending on whether the data is treated as independent or shared.


mikeblas said:
Your summary that this issue is "because you are doing RANDOM memory accesses" isn't accurate. It's because the program ends up running code on one processor that accesses memory physically associated with the other processor.

No again. You did a random memory address lookup. Where in a shared bus this is all in one memory controller's space, in NUMA it is either in one or the other memory controller's space, and may or may not need to be fetched with a slight latency penalty. The point of NUMA is to provide a CPU its own memory space. When you decide to share, you go and fetch out of another memory controller's area, or have a CPU write the data over into another's area. This provides two things: data integrity, and the speedup of not having a CPU wait on a memory bus being used by another CPU. The Intel platforms have a very fast memory controller with low latency, plus prefetching algorithms in the CPU itself to minimize cache misses and keep the pipeline full. What your test showed was that the Opteron memory controller is a bit more latent than the Xeon in fetching data: not that the Opteron's memory crosslinks are slower, but that the memory controllers themselves are slower.




mikeblas said:
I don't think it's flawed at all. The intention is to show and measure the difference between NUMA and shared memory buses, and it does exactly that. To measure a difference, it has to be identifiable.

Well, it is flawed, as it is assuming from the beginning that your memory addressing is always going to be random, when in reality it NEVER will be unless you are writing a malicious piece of software that is out to F@#$ up a system's stability. As a synthetic benchmark it's useless, because no real software is going to look anything up in a random memory address unless to pick the contents of another app's memory space. There is no pattern other than malicious intent that a random memory address lookup can be used for. Well, I guess you could use it for crypto encoding and key creation, but with Xeons and Opterons the pipeline-empty penalty you pay is way too high to make it useful vs. another scheme.



mikeblas said:
I'm not trying to simulate anything. Meanwhile, I don't think it would be easy to find a good or experienced software engineer who would agree with the balance of your point.

What? Find me any software engineer who isn't about to get fired who would agree that picking out random memory addresses to read data from is a way to write software, period; let alone specifically coding random memory address lookups between 2 CPUs haphazardly. Your example is a very poor way to show one tiny aspect of each architecture's strong point and weak point, and you even drew the wrong conclusions from that to boot. You didn't narrow down the test enough to prove a point about Opteron or Xeon, nor did you involve the architectures' real-world performance in memory reads. The latency could be from a number of different sources other than the memory cross connect in the Opteron. Frankly, using a random address to avoid how the CPU would actually fetch data, and not involving the pipelines of each and their strengths and weaknesses, isn't good testing methodology. In the real world, large data sets get a big boost from Opteron vs. Xeon. Smaller data sets favor Xeon, down to the point of filling the cache and residing in cache, or are more linear so the prefetch can work its magic. Also, clock speed plays a big role in how fast a simple instruction can be handled, and a more complex CPU can handle varied items more easily, etc.

mikeblas said:
Any memory access might be non-sequential: any lookup table or hash causes strides through memory to be at irregular intervals, for example. Code itself does too, as it executes: branches are taken, functions are called, and so on. Traversing a linked list, tree, or heap provides additional examples of non-predictive accesses through the address space, since the next node of the structure might lie at a location beyond the cache line length from the current node.

Why would you identify non-consecutive memory access by an application as a symptom of malicious code?

I didn't say anything about non-sequential. You picked a memory address at random to fetch from. You didn't put anything there ahead of time, and you're not relying on the actual data that's there, just discarding it. I didn't say sequential, I said random: as in, whatever is there is there, and I won't know where I'm getting it from ahead of time, nor did I plan to put what I want where it does the most good.



mikeblas said:
Multi-threading optimization is about maximizing concurrency, not assuming memory access patterns are linear.


Protect data integrity? If an application is relying on the distance between two data sets in memory for data integrity, that application is fundamentally flawed. Code should deliberately write to the memory it owns and therefore avoid touching memory it doesn't own.

You misunderstand what I said. I said multi-PROCESSING requires data integrity, not multi-threading (which should also have data integrity as its foremost precondition: if you can't rely on the data being correct given the execution order, for true coherency you should not multi-thread).

mikeblas said:
Regardless, in this test app only one thread was active at a time. No memory was shared between threads or processes, period. The addresses shown are the same because the memory was freed, then reallocated without any intervening allocations. Plus, they're virtual addresses and not physical locations.

Since there was only one thread running at a time, there was no data interdependency between threads in this application at all. There could not have possibly been because there was only one runnable thread at any moment.

In fact, it's possible to demonstrate similar numbers without ever creating a second thread.



Sure, the two Xeons will trip over each other if they're not in a NUMA system.

I described this in my previous post, by the way. And I gave a link to a program that can help actually demonstrate the phenomenon, and further included some of the results from my own machine.



Is it polling, or just entering a wait state?



I fully understand that much about NUMA, at least.

There is no debate on this point, as I can't see your code; nor does it really matter, for the reasons I stated above.



mikeblas said:
My example is exactly intended to show that NUMA-unaware code on a NUMA-enabled system can pay a significant memory access penalty.

My claim is that the vast majority of developers don't think about heap-allocated memory being affined to a particular processor or thread, and so the vast majority of code in shipping products today isn't NUMA-ready.

I wouldn't call this optimization: the penalty of 40% that I measured is pretty substantial, especially considering the other processor wasn't busy at all. Writing code with the unique requirements of NUMA in mind is a necessity.



Of course, this leads to a very interesting problem. If you want to use both processors cooperatively, then you eventually have to copy the data from memory owned by one processor to the other so the processor accepting the data isn't paying an access penalty (or disrupting the first processor) during every access.

This isn't an issue on non-NUMA systems, since both processors have direct access to all available memory. And so that's another facet of NUMA which will hinder applications not specifically written with the characteristics of the architecture in mind.

I think it's very important that someone planning on building a NUMA-based system understands the shortage of software available that's able to run well on NUMA systems.

Well, I find fault in your conclusions, based as they are on a limited code example in a very non-real-world scenario. NUMA is designed for separate processes to run at greater speed. Memory penalties aside, the platform strengths of the Opteron lean toward database apps and non-SSE3-optimized code requiring floating-point power. Render farms will be faster on Xeon or Opteron platforms depending on the render engine you choose. Some workstation apps rely on higher clock speed and favor the Xeon when they don't have large data sets. The Opteron's memory controllers don't hurt it at all, as it excels at large data set manipulation over the Xeon. So, as I said, your attempt was a good thought, but not a real representation of real-world performance, as it doesn't factor in enough elements to judge a platform by.

And I still don't think you get the point of NUMA. So here we go, one last time.

If you have a separate PROCESS, such as running Windows Virtual Server with Win2k3 installed as the virtual environment, you don't want the second virtual server sharing the same memory space as the real Win Server 2k3 the virtual environment is running under anyway. So separate memory, entirely devoted to a PROCESS or set of PROCESSES, is a big speed benefit if it has its own dedicated memory controller.

Same scenario on non-NUMA:
you get a protected memory area to use as if it's dedicated. However, each process or set of processes now has to wait on the other for memory, causing a split of memory bandwidth and latency issues, resulting in harmful cache read misses or empty pipelines waiting on instructions. On Xeon this translates into a further slowdown of the CPU's efficiency; on Opteron it doesn't hurt as much, as it has a shorter pipeline. And yes, there are non-NUMA dual Opteron systems out there, as well as NUMA Xeon systems.

This is a real-world example of what NUMA was designed FOR: virtual environments.
Machine partitioning, as in mainframe/mini/supercomputers, is another real-world situation that NUMA excels at.
Separate processes, such as a database running under its own process, benefit from NUMA as well, as the data set is localized to the CPU handling it.

Also, NUMA can see a penalty in other situations, where multithreading and larger data sets exceeding the local CPU's onboard cache come into play, as both CPUs would be working on the same data set in different pieces, and you then get a latency penalty for the data transfer back and forth. Opterons, however, handle this gracefully if coded correctly to do so, and like a piece of $h|7 if you don't. Xeons in NUMA handle it like crap, period, because NUMA is tacked on at the chipset level with no CPU-specific instructions to deal with it, so you have to really watch it on NUMA Xeons. But considering their numerical insignificance, you won't see one unless for a specific app.
 
wetware interface said:
No. Simply put, NUMA is non-uniform memory access; in other words, a non-shared memory bus.

Think it through: if the memory bus isn't shared, then you can have an independent path of access between the processor and the memory. If the processors are clustered on nodes, sure -- it might not be independent. That's why I said "can", not "does".

Your response says "no", but then you go on to re-state what I've said.

wetware interface said:
No again. You did a random memory address lookup. Where in a shared bus this is all in one memory controller's space, in NUMA it is either in one or the other memory controller's space, and may or may not need to be fetched with a slight latency penalty.

The latency penalty isn't always slight; it can be quite expensive.

In this test, I've assured that the memory accessed by one CPU is either entirely on the same node as that CPU, or entirely on the node of the other CPU. The whole point is to compare access to local memory with access to remote memory.

wetware interface said:
What your test showed was that the Opteron memory controller is a bit more latent than the Xeon in fetching data: not that the Opteron's memory crosslinks are slower, but that the memory controllers themselves are slower.

Have I, really? I'm not sure that's true. How do you work out the numbers?

The way I look at it, the Intel chip actually has more of a latency penalty whether you measure by time or by clock cycle count. On the Opteron box, if I'm accessing memory on the same node, I'm getting about 50.5 megabytes/second throughput. That's great, considering almost every access I perform is a cache miss. On the Intel box, I'm finding about 48 megabytes/second throughput. Only about five percent, but the Opteron is faster.

wetware interface said:
Well, it is flawed, as it is assuming from the beginning that your memory addressing is always going to be random, when in reality it NEVER will be unless you are writing a malicious piece of software that is out to F@#$ up a system's stability.

Some applications use sequential access patterns. An application might compute a checksum of a block of memory, for example. That code will access each byte (or word, or DWORD) by reading it, adding it to the total, and then moving to the next byte.
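In code, that loop might look something like this (a minimal sketch, not taken from any particular application):

Code:
#include <cstddef>  // size_t

unsigned long Checksum(const unsigned char* pBlock, size_t cb)
{
    // Walk the block front to back. The next address is always
    // predictable, so the prefetcher keeps the cache warm.
    unsigned long ulTotal = 0;
    for (size_t i = 0; i < cb; i++)
        ulTotal += pBlock[i];
    return ulTotal;
}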

Sequential access patterns are pretty common, but not every problem can be solved with such a simple memory access pattern. In fact, more random patterns are used very frequently, and don't mean the application is malicious. Consider a very simple linked list structure:

Code:
class node
{
     char szName[80];
     char szAddress[250];
     node* pNext;
};

Say I build a list of these nodes:

Code:
node* pRoot = new node;
pRoot->pNext = new node;
pRoot->pNext->pNext = new node;
pRoot->pNext->pNext->pNext = NULL;

Traversing this structure involves visiting a node, then finding the next node and visiting it, and so on. There is zero relationship between the nodes; the value of pRoot is not related to the value of pRoot->pNext. Maybe we're lucky and operator new returned two blocks of sizeof(node)-sized memory consecutively, or nearly consecutively. Maybe we're unlucky and it didn't. In a real application, there can be many memory allocations and frees between the creation of two successive nodes of a linked list, unlike in this sample.

Even if we were lucky enough to have consecutive blocks, the list might later be modified so that pNext points at something very far away.

And so the code which does the traversal can't make any assumption about where the next node is and has to go get the pointer. In other words, it can't possibly compute pRoot->pNext; it has to go and actually get the value to find the next block.
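The traversal itself is only a few lines; the point is that every step depends on a value loaded from memory. A sketch, using the node class above (Visit() is a hypothetical stand-in for whatever per-node work is done):

Code:
void Visit(node* p);   // hypothetical per-node work, defined elsewhere

void TraverseList(node* pRoot)
{
    // Each step must load pNext from the current node before the
    // next address is known; it can't be computed in advance.
    for (node* p = pRoot; p != NULL; p = p->pNext)
        Visit(p);
}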

This is random memory access: it's not sequential, and it's not predictable. Linked lists (and any number of other extensible data structures that make use of pointers to dynamically allocated memory) end up following the same access patterns.

Applications that use linked lists aren't all malicious, and therefore your assertion about random access meaning malware is false.

wetware interface said:
As a synthetic benchmark it's useless, because no real software is going to look anything up in a random memory address unless to pick the contents of another app's memory space. There is no pattern other than malicious intent that a random memory address lookup can be used for.

Consider further an application that hashes. Maybe I read a key and touch each byte to compute a hash of that key. Given that hash, I'll access an element in an array within memory to tally something about the key that I have just read.

A good hash function has even distribution over its key space for its range of input keys. Then, for two consecutive and unequal keys, my access to the array of hash buckets should be random. Given the current key and the bucket it accessed, I can't predict the bucket I will access for the next key.
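A sketch of what I mean (hypothetical code; the hash function and the table size are just examples):

Code:
#include <cstddef>  // size_t

const size_t cBuckets = 65536;
int anTally[cBuckets];   // one counter per hash bucket

unsigned int Hash(const char* pszKey)
{
    // Simple multiplicative string hash; assume it distributes evenly.
    unsigned int h = 5381;
    while (*pszKey)
        h = h * 33 + (unsigned char) *pszKey++;
    return h;
}

void TallyKey(const char* pszKey)
{
    // Which bucket gets touched depends entirely on the key's hash,
    // so consecutive keys hit unrelated addresses in the table.
    anTally[Hash(pszKey) % cBuckets]++;
}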

Again, it is shown that random access does not mean malware.

This sort of memory access isn't limited to a particular genre of application.

wetware interface said:
What? Find me any software engineer who isn't about to get fired who would agree that picking out random memory addresses to read data from is a way to write software, period.

Maybe the problem is that you're assuming I mean "arbitrary" when I mean "unpredictable". Think about people talking about disk drive access; they'll say the access is either "sequential" or "random".

Sequential means that some stream of data is read in the same order it is physically stored. An application that has a random access pattern over a file doesn't mean that someone is just seeking around, touching random parts of the disk -- it just means "not sequential", or, at least, not sequential enough to be predictable. Within real-application random disk access, there are usually small runs of sequential access.

If I was truly reading a random memory location not limited to a memory range that I've allocated for myself, then I would crash with an access violation because I end up touching memory I don't own. Then how would I possibly have anything to benchmark? How could I have shown runtime numbers?

If I am limiting my random reads to locations that I've allocated, my memory access pattern isn't much different than a hash.
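A sketch of that distinction (rand() just stands in for any source of unpredictable offsets):

Code:
#include <cstdlib>  // rand()

unsigned char ReadUnpredictably(const unsigned char* pBlock, size_t cb)
{
    // The offset is unpredictable, but always within a block I own;
    // never an arbitrary address that might not belong to me.
    size_t nOffset = (size_t) rand() % cb;
    return pBlock[nOffset];
}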

But the point is that I'm using "random" to mean "unpredictable", not "arbitrary". Perhaps you're having trouble understanding my point because you've assumed the latter and ignored the fact that the former is really the only viable interpretation.

wetware interface said:
I didn't say anything about non-sequential. You picked a memory address at random to fetch from. You didn't put anything there ahead of time, and you're not relying on the actual data that's there, just discarding it. I didn't say sequential, I said random: as in, whatever is there is there, and I won't know where I'm getting it from ahead of time, nor did I plan to put what I want where it does the most good.

You seem to have done a really poor job of reading my code.

Actually, I did put something there ahead of time: information about what memory to read next. What my test application does is not much different than traversing a very long linked list. I'm storing a distance to the next node rather than a pointer to the next node; I'm also visiting ever-increasing memory addresses, while a linked list might not do so.
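In outline, the pattern is something like this (a sketch of the idea, not the actual test source):

Code:
// adwDistance[] is filled in ahead of time; each element holds the
// distance to the next element to visit. Like pNext in a linked list,
// the next address isn't known until the current element is read.
// (Assumes every stored distance is non-zero.)
size_t WalkDistances(const size_t* adwDistance, size_t cElements)
{
    size_t cVisited = 0;
    for (size_t nIndex = 0; nIndex < cElements; nIndex += adwDistance[nIndex])
        cVisited++;
    return cVisited;
}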

wetware interface said:
There is no debate on this point, as I can't see your code; nor does it really matter, for the reasons I stated above.

What you stated above was apparently based on very bad guesses about what my code actually does. My code aside, the assertion that random memory accesses don't happen, even over very large and known blocks of memory, is false. Because what you stated above was based on those guesses and that assertion, it can't be trusted.

wetware interface said:
Also, NUMA can see a penalty in other situations, where multithreading and larger data sets exceeding the local CPU's onboard cache come into play, as both CPUs would be working on the same data set in different pieces, and you then get a latency penalty for the data transfer back and forth. Opterons, however, handle this gracefully if coded correctly to do so, and like a piece of $h|7 if you don't. Xeons in NUMA handle it like crap, period, because NUMA is tacked on at the chipset level with no CPU-specific instructions to deal with it, so you have to really watch it on NUMA Xeons. But considering their numerical insignificance, you won't see one unless for a specific app.

This is exactly what my test code demonstrates. It's intended to let me measure that difference, and that is all.

NUMA has not been widely available to the desktop, workstation, or server developer until very recently. Sure, it existed in high-end servers; all machines costing several times what the most expensive dual-proc workstation of the time cost. In those applications, the boxes would be used for either server consolidation or high-end, NUMA-aware applications.

The Opteron brings NUMA architectures to the desktop, and into a price range where they're being purchased by consumers who run general-purpose applications on them.

Do you think application vendors have made their consumer-oriented products NUMA-aware? Has Adobe made Photoshop NUMA-aware, for example? If they haven't, their code expects uniform memory access per processor, and they're not paying attention to scheduling the right processor for the memory the workload uses. When these applications run on NUMA machines, they won't allocate memory while concerning themselves with node affinity.

As a consequence, they'll access the memory from whatever processor got scheduled instead of the processor most local to the physical memory.

Making an application NUMA-aware, even if that application is already a good multithreaded application, is not trivial. It will take a while for vendors to appreciate the market and see the difference, and even longer to code for it: don't AMD processors have less than 20% of the x86-architecture CPU market share? If so, then we can't expect them to have strong influence over many application vendors.

For at least a little while, then, most anyone who runs an application on a multi-proc Opteron machine is going to be accessing memory cross-node at least part of the time, since their application isn't NUMA-aware. While the OS is trying to schedule threads near memory they own, it can't do a perfect job. And the user might not even be running a NUMA-aware OS in the first place!

When an unsuspecting NUMA user runs a NUMA-unaware application, what performance can they expect? Well, that depends on the performance of cross-node memory access, and the ability of the operating system to schedule that access. And it depends on some other things, but the speed of cross-node memory access is the first one I decided to measure.

I've only just started, but I think I'm quite successful with it so far. Even when I have measured it, all I've learned is how expensive that memory access is. As a result of measuring and studying those exact costs, I can make educated decisions about tradeoffs when designing my own applications.

Since I don't know how much cross-thread memory access other applications do, even in the aggregate, I can't predict their performance. But I do know they will do some such access, and with the result I have I know their performance certainly isn't going to increase as a result of those access patterns. And that's why I wonder about using Opteron-based NUMA machines for general-purpose computing.

For someone like our original poster, who's not well-read in these issues and is thinking about using multi-processor Opteron machines, I think this is a valid concern.
 
Well I think that the problem with your benchmark is this.....

It is a worst case scenario.... Real world programs WILL NOT have penalties anywhere near that severe... Plus with the introduction of NUMA-aware OS's like MS's new x64, that penalty will be reduced further... Then a little later, when NUMA-aware applications become the norm, that penalty will be gone completely...

I think that the fact that Opteron completely annihilates Xeon in every possible category confirms my conclusion.... Not to mention superior scalability, better cache hit rates, faster reorder buffer, etc, etc, etc.......

Have you checked out Anand's DC (Dual Core) Opteron benches??? Most of them show it completely annihilating a quad Xeon system that costs 5 TIMES as much........

Again your benchmark is a worst case scenario that is not indicative of real world performance
 
duby229 said:
Well I think that the problem with your benchmark is this.....

Maybe I've even called it a "benchmark" myself, but I didn't write a benchmark; I wrote a test. It's my intention to understand how expensive cross-node NUMA memory access is, and to measure it. Writing something that mimics real-world performance is impossible, since there's an infinite variety of real-world applications.

My test does however access memory in a pattern which some real world applications use. The test runs in a matter of a few seconds, exercising that pattern over a large amount of memory -- ten megs, if I recall correctly. Applications which have similar access patterns certainly exist, though they're far more likely to do so in small bursts.

This isn't a worst-case scenario; the second processor isn't busy. The numbers are worse for that case, for both processors. Regardless, measuring a worst-case scenario is still a valid measurement as it gives a bound for how bad performance can possibly get. Performance will be as bad as the worst-case scenario in some situations. Otherwise, the worst-case wouldn't be a valid case and wouldn't be achievable.

What percentage of time does a typical application spend in the worst case arena? Hopefully, very little.

The code was run on a NUMA-aware OS. You're certainly right: we need NUMA-aware operating systems and applications before we can realize the full benefit of NUMA. Until then, NUMA systems show a very sharp edge which can significantly hurt NUMA-unaware code. The larger the cross-node latency penalties, the worse the performance.

Note that a NUMA-aware OS doesn't guarantee perfect, or even good, scheduling for applications because there's no way for the OS to predict what memory a given thread will access during a particular quantum. It doesn't even try to guess: it just tries to satisfy memory requests from a particular processor with memory that's local to that processor.

Is the only difference between a Xeon system and an Opteron system that one is SMP and the other is NUMA?
 
Is the only difference between a Xeon system and an Opteron system that one is SMP and the other is NUMA?

umm, no????

If you're asking this question, I have to wonder what you're doing talking about NUMA as if you're an expert.... Not saying that I'm an expert or anything, but I can clearly see that your methods and conclusions are flawed.. I mean for cryin' out loud, don't let facts stop you in your quest to discredit the benefits of NUMA....

Let me give you a suggestion for a new "Test" to try.... I'm no programmer so I don't have the know-how to do this, but if you will humor me for a moment.... First let me say that it is critical to have more than one thread running. Try running your "random" access pattern, then try to do a sequential pattern in another thread; they don't have to be related, although they can be.

Then try to do a "semi-random" pattern in another "Test"... What I mean by semi-random is this: set up the code so that you run a sequence of sequential access patterns that are then randomized; you can set up each sequential pattern in its own thread... Basically run multiple sequential accesses at the same time in random locations... Also you may want to consider using a MUCH larger data set....

I think this situation will more closely resemble real world performance... But still not a realistic benchmark.

Then tell me where that penalty went...
 
duby229 said:
umm, no????

Then the benchmark scores from AnandTech that you reference don't prove anything about the performance of NUMA. The performance differences in those benchmarks can be the result of any of the differences between the platforms, though they're probably a result of some combination of the differences working together.

I read the AnandTech article (assuming you mean this one) as closely as I had time to do so, and while I don't find the results unexpected, it leaves me with some questions.

Like most benchmarks you see on review sites these days, the method is very poor: there's no description of the variance in the tests involved. Anyone with experience doing any performance work has a big enemy: variance. Running a test might net a particular result, then a second run won't show the same result. Variance in results is completely expected when testing complicated systems, where lots of variables can influence the results in subtle, significant ways.

One way to deal with variance is to average it out: take the average of the running time for three different runs of the test, for instance. Another way is to observe the variance and indicate it, or try to compensate for it. If you take your car to the dyno, a competent dyno operator will do that. They'll tell you the dyno has an error of more or less three percent, for example; or that they tried to normalize the results using measured humidity and temperature readings.
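The first approach is trivial to express in code (a sketch; pfnRunTest stands in for whatever is being timed):

Code:
double AverageRuntime(double (*pfnRunTest)(), int cRuns)
{
    // Run the timed test several times and average, to damp variance.
    double dTotal = 0.0;
    for (int i = 0; i < cRuns; i++)
        dTotal += pfnRunTest();
    return dTotal / cRuns;
}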

Either way, the result they give is within a particular tolerance. Because of the margin of error involved, the results might actually be different than the simple numerical result shown. That's particularly true of close tests, like the ones in the AnandTech article.

If my car pulled 50 horsepower, and your car ran 48 horsepower, yours might be the more powerful rig -- if the margin of error at the dyno we used was more than 5%. Lots of people try to use the results sites publish as basis for very broad claims and conclusions.

Yet AnandTech (and almost every other hardware site where I've read a review) doesn't publish the margin of error expected for their tests.

This specific AnandTech article fails to mention some very important issues about the configuration of the involved machines. For example, on the Tyan motherboard they're using, there are settings which will affect how memory is mapped by the machine. That is, they can set interleaving to be done by node, or by bank. Which setting did they use for their benchmarks?

Another important issue is how they boot Windows Server 2003, which they're using for their tests. Did they turn on the /PAE boot option? If they did, then the machines are truly running NUMA. If not, then the operating system isn't scheduling threads or allocating memory in a node-aware fashion. That setting will certainly affect the benchmarks, though I can't forecast by what magnitude.

duby229 said:
I can clearly see that your methods and conclusions are flawed.. I mean for cryin' out loud, don't let facts stop you in your quest to discredit the benefits of NUMA....

If you don't think I'm an expert, maybe you can educate me.

Before you do, bear in mind that I'm not discrediting the benefits of NUMA. My intention was to measure the penalty involved in cross-node NUMA access. I've done that very successfully, I think.

In this thread, I'm simply pointing out that the benefits of NUMA are realized in very little commercial software today. NUMA-aware applications are very scarce. In fact, good SMP-aware applications are also pretty scarce; for SMP, it's a situation that's gotten lots better, but for NUMA the migration hasn't even started in earnest.

What I've learned about cross-node NUMA memory access demonstrates how bad the problem can be. Code that's touching lots of memory and isn't NUMA aware is, by definition, accessing the memory in a pattern which might match the worst-case memory access for that NUMA machine. On an SMP system, the access isn't penalized nearly as much.

Considering that reminder, why do you specifically think my methods and conclusions are flawed? Do you really think that there's no penalty in accessing memory cross-node on a NUMA system? What facts do you think I'm ignoring?

duby229 said:
Let me give you a suggestion for a new "Test" to try.... I'm no programmer so I don't have the know-how to do this, but if you will humor me for a moment.... First let me say that it is critical to have more than one thread running. Try running your "random" access pattern, then try to do a sequential pattern in another thread; they don't have to be related, although they can be.

Why do you think it is critical to have more than one thread running? That is, critical to what goal or condition?

Since I wanted to measure the latency of cross-node access, I did so by holding one processor idle and making the other busy. This shows me how long it takes to access memory not owned by the requesting processor when the servicing processor is idle.

This is why I said my test isn't the worst case. What you're proposing (and what I later measured) was the worst-case scenario: the requesting processor doesn't own the memory, and the processor that does own the memory is actually busy. It must synchronize its own work with the request.

I suppose even this still isn't the worst case. The timing might be worse when the servicing processor and the requesting processor are both accessing the same memory region. The servicing processor must snoop the request from the requesting processor to assure cache coherency.

duby229 said:
Then try to do a "semi-random" pattern in another "Test"... What I mean by semi-random is this: set up the code so that you run a sequence of sequential access patterns that are then randomized; you can set up each sequential pattern in its own thread... Basically run multiple sequential accesses at the same time in random locations...

I'm not sure I clearly understand what you're suggesting. How long are these sequential runs, in bytes? How many sequential runs do I do before going back to the randomized patterns? Or by "are then randomized", do you mean that the individual sequential runs are scattered throughout memory randomly? Do they ever overlap?

How many threads do you expect I need to run? Why would I need to execute more threads than the number of processors on my machine?

I think this situation will more closely resemble real world performance...

Certainly, there's code that will randomly access memory in small streaks. I'm not sure why you think it more closely resembles real-world performance, though. Real-world applications are a blend of many different memory access patterns. Some applications stress certain memory access patterns more than others; some patterns appear in performance-critical code, and some don't. And what the user or developer perceives as "performance critical" is very specific to the application.

This is, in fact, why I claim my test not to be a benchmark. It is difficult, though possible with expensive equipment, to measure the memory access patterns of an application. Deciding which blend of which patterns best represents real world application usage I think is a task that's between futile and impossible.

Then tell me where that penalty went...

If memory access is sequential, the cross-node access penalty doesn't go away. It's very substantially helped by cache prefetching, however. With a strictly sequential access pattern, I found that the penalty for accessing memory cross-node was between three and four percent with the target processor completely idle. This happens because prefetching bursts the reads; the processor reads a whole cache line on the first cache miss. When I read address 0x0000, the processor reads data from 0x0000 through 0x007F into the cache in one gulp. Then, as I access 0x0001 through 0x007F, there are no more hits to the memory bus because my reads are satisfied from cache. When I finally get to 0x0080, another 128-byte read is performed in one transfer.

When I read randomly, the processor ends up prefetching 128-byte lines just to satisfy requests for a single byte. It is simply anticipating that my memory access will be localized and sequential. But it isn't: I purposefully go beyond that range in order to avoid the cached memory. While this isn't great for performance, it is a pattern that applications certainly make use of. I outlined some of the algorithms which have such patterns in my responses to wetware_interface.
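To make the two cases concrete, here's a sketch of the difference (the 128-byte line size is the figure for my machine; the 256-byte stride is arbitrary, just something larger than a line):

Code:
unsigned long SumSequential(const unsigned char* pBlock, size_t cb)
{
    unsigned long ulTotal = 0;
    for (size_t i = 0; i < cb; i++)       // one miss per 128-byte line
        ulTotal += pBlock[i];
    return ulTotal;
}

unsigned long SumStrided(const unsigned char* pBlock, size_t cb)
{
    unsigned long ulTotal = 0;
    for (size_t i = 0; i < cb; i += 256)  // steps past every fetched line,
        ulTotal += pBlock[i];             // so nearly every access misses
    return ulTotal;
}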
 
Ok, let me ask you a quick question... Are you, or are you not trying to say that your Test reflects actual real world usage? In other words are you trying to say that your test is indicative of real performance?

If not then why is this conversation still going on?

You're right about benchmarks to an extent... It's impossible to get a 100% accurate, always repeatable result... However when we are talking about consistently 40% to 80% advantages.... Those kind of numbers are hard to ignore....

I was just simply offering an alternative to a purely random access pattern. Sure, random access patterns are used, but so are sequential patterns, so why not try and use them too... The reason you should use more than one thread is because we are talking about an SMP system here... NUMA first and foremost requires SMP, so you might as well take advantage of the hardware that you've been given, otherwise your test is even less accurate.... You should use as large a data set as possible to flood the bus... Fill that bandwidth up for cryin' out loud!! If you don't then once again your test becomes even less accurate.

What I've learned about cross-node NUMA memory access demonstrates how bad the problem can be

The statement above is flawed.... The only thing you've learned about is cross-node RANDOM access.... Which is not representative of real world performance in a properly tuned configuration... If you're running a system that relies on random memory access then I really feel for you!!

Considering that reminder, why do you specifically think my methods and conclusions are flawed? Do you really think that there's no penalty in accessing memory cross-node on a NUMA system? What facts do you think I'm ignoring?

I think you're ignoring real world usage patterns....

Why do you think it is critical to have more than one thread running? That is, critical to what goal or condition?

Because you're running an SMP system; running a single thread is not indicative of SMP performance....



do you mean that the individual sequential runs are scattered throughout memory randomly

Yes that was exactly what I meant... I know it sounds weird, but you gotta figure that RAM gets fragmented during usage, so this should more closely reflect a real world usage pattern. I'll admit that it's not even close, because of all of the numerous other memory access patterns that are being ignored, but it should still be closer than your solely random pattern

While this isn't great for performance, it is a pattern that applications certainly make use of.

Ok, so now that you've admitted this.... Of an average multithreaded program..... How much of it is random???

Very, Very small percentage is my guess, which will become even smaller as applications are optimized for NUMA.
 
[benchmark chart: 6933.png]



I mean come on!! how can you ignore numbers like this!!
 
duby229 said:
I mean come on!! how can you ignore numbers like this!!

What makes you think I'm ignoring those numbers? Just the fact that I hadn't seen them until about three hours ago?

The numbers are very interesting, but we can't use them for a conclusion about NUMA vs. SMP.

I believe they're using SQL Server for that particular test. SQL Server is the only commercially shipping NUMA-tuned application that I know about. Therefore, I'd expect the Opteron-based systems to shine, because the application has been tuned for their strength.

One of the things that's missing from this AnandTech review is detailed information about the configuration of the software and hardware used for the tests. Is NUMA enabled in the operating system when that test is running? In SQL Server, too?

If it is, then the better numbers for the Opteron-based system might go to my point: NUMA applications need NUMA-tuned code in order to shine. Regardless, we don't know how much of that performance comes from NUMA or from the Opteron's architectural differences.
 
The numbers are very interesting, but we can't use them for a conclusion about NUMA vs. SMP.

I think that is where you are going wrong... It's not a question of NUMA vs SMP.....

You CAN'T have NUMA without SMP... NUMA cannot run on a single CPU there must be multithreading to take advantage of NUMA.... NUMA is an extension of SMP....

Kinda like how SSE is an extension to x86....
 
duby229 said:
Are you, or are you not trying to say that your Test reflects actual real world usage? In other words are you trying to say that your test is indicative of real performance?

I'm not sure how to answer your question.

I think the simple answer is: no, I don't think my test application demonstrates exactly what real-world applications do. My test application exercises only one memory access pattern, and real applications are a blend of many different memory access patterns. The performance of any particular application depends on many things aside from memory access performance, as well.

However, I do think, with complete certainty, that my application demonstrates one of the many memory access patterns exercised by real-world programs.

Maybe an oversimplified (and probably busted, anyway) analogy helps: perhaps I'm only measuring the effectiveness of the field goal kicker. I think I've accurately measured one aspect of the kicker's performance. The performance of a real football team depends on many, many things. The performance of the kicker might be a big indicator for a particular team, and less of a factor for another team.

Am I reflecting the performance of real football teams? To some extent, but certainly not in a deterministic way.

Will real world programs hit penalties that severe? Absolutely, they certainly will. Will the performance of those applications overall change as much as that test shows? Probably not; that depends on what percentage of the timed portion of the application is exercising memory in that same way.

It depends on how often your field goal kicker is on the field, in other words.

duby229 said:
The reason you should use more than one thread is because we are talking about an SMP system here... NUMA first and foremost requires SMP, so you might as well take advantage of the hardware that you've been given, otherwise your test is even less accurate....

But if we're so concerned with "real world applications", and many real-world applications don't have multiple threads (or, at least, don't have multiple runnable threads -- exactly as the first iteration of my test did), then the first run of my tests was accurately simulating more apps than it wasn't.

What definition of "accurate" are you trying to use, by the way?

I've said many times that I'm not trying to simulate real applications. So, I'm not true to real applications; but the information I've collected and shown about cross-node NUMA latency is no less valid. Cross-node memory access is going to have some non-zero effect on the performance of a NUMA-unaware application running on a NUMA-architecture machine, so it's relevant even if it isn't a dominating predictor.

Your point seems moot, then, from any way I can look at it. I am using multiple threads, and that actually makes the difference worse; and my goal is not to closely model real world applications -- only to study a particular access pattern unique to NUMA systems.

duby229 said:
I think you're ignoring real world usage patterns....

I am indeed ignoring other real world patterns. There are many memory and processor usage patterns. I've modeled only one of them. (Meanwhile, I don't think I've ever claimed to be modeling all real world usage patterns and at least twice in this thread I've explained that my goal isn't to do so.)

I'm ignoring other patterns for a lot of different reasons. The foremost is because I'm not trying to model real-world applications in my test. The goal of my test was to measure the penalty of cross-node NUMA memory access. Another reason is that there are many possible patterns, and every application is going to exercise different patterns at different times.

Now that I understand the substantial cost cross-node NUMA memory access incurs, I can design better software. I can also make an educated guess about the performance of real-world software: I've hypothesized that applications which aren't NUMA-aware and are multithreaded are very likely to incur this cross-node access penalty, and therefore will run slower on NUMA hardware than they will on symmetrical hardware to the extent that they access that memory.

duby229 said:
I know it sounds weird, but you gotta figure that RAM gets fragmented during usage, so this should more closely reflect a real world usage pattern. I'll admit that it's not even close, because of all of the numerous other memory access patterns that are being ignored, but it should still be closer than your solely random pattern

I'm having a hard time connecting the access pattern you're proposing with any meaningful measure of fragmentation. Of more direct concern is structure size and the algorithm used to traverse the data structure in question.

A pattern of randomly accessing groups of sequential locations is no closer to or further from real-world usage than the test I've exercised, because an application will use many different access patterns throughout its execution. Some applications, at some times, may do more random access than scattered-sequential access. For those applications, my test is a better predictor. For applications that do more scattered-sequential access, the test you suggest is a better predictor.

The problem is that modeling a real-world application accurately means providing a mix of access patterns. Even if I managed to model them all, I'd still be left with trying to figure out what percentage of the time which patterns are exercised in the universe of all applications.

Doing that would be fun, if I had a research grant and a couple of years to spend. Since my goal is to measure the worst-case cross-node NUMA access penalty, I don't think it's too interesting to try and model everything that's possible.

I could probably implement your suggested access pattern in the next couple of days if you'd care to completely specify it. The performance will be directly proportional to the length and stride of the sequential reads, so I need you to answer those questions from your insight into real world applications.

duby229 said:
Ok, so now that you've admitted this....

Gosh! When did I ever deny it?

duby229 said:
Of an average multithreaded program..... How much of it is random???

There are an unknown number of programs, and so there's no way to guess what "an average program" really means. As such, that question is impossible to accurately answer. Any meaningful answer would require a lifetime of research and work. And that's exactly why I didn't follow that detour when trying to measure one parameter of memory access on NUMA systems.

You could even quibble about what "random" really means. I usually take it to mean non-sequential to such a degree that a cache miss is more likely than not. With my definition, even what appears to be sequential might be random: if you're sequentially visiting items, but those items are each larger than the size of a cache line on your processor, then the accesses are effectively random. In my work, I have found this definition quite useful. Even so, perhaps you'd like to suggest a different definition.
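For instance (a hypothetical record, just to illustrate the definition):

Code:
struct record
{
    char szPayload[512];   // bigger than a 128-byte cache line
    int  nCount;
};

int CountAll(const record* aRecords, size_t cRecords)
{
    int nTotal = 0;
    // A strictly sequential visit, but each nCount lives on a
    // different cache line, so a miss is more likely than not.
    for (size_t i = 0; i < cRecords; i++)
        nTotal += aRecords[i].nCount;
    return nTotal;
}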

Here's a paper that tries to examine the access patterns of different scientific applications. You can see, among three or four different applications, the pattern varies vastly even though the applications are indisputably all scientific applications.

Stipulating those definitions, we're still left with a question that's impossible to answer even for a single application. Memory might be accessed differently at start-up than it is in "steady state"; and it might depend on exactly what the application is doing when run. Certainly, the memory access pattern of SQL Server when computing a SUM aggregation is vastly different than when it is doing a COUNT DISTINCT.

As such, all we can hope to do is time a few different access patterns and use the results to try to estimate performance based on the anticipated patterns of a particular application.

If we limit the question further to a particular sub-program -- an algorithm, say -- then we might have a chance. It's easy to pick up an algorithms book and see what different patterns different algorithms might use. But even then, the real answer depends on how the algorithm was actually implemented, as well as on the volume and type of data it is processing.

This paper studies memory access patterns of a single application running on the DEC Alpha. (It's an old paper.) The "application" is absolutely trivial: it's less than 40 lines of simple pseudocode. But the analysis, which is in some dimensions not very deep, is twelve pages long.
 
duby229 said:
I think that is where you are going wrong... It's not a question of NUMA vs SMP.....

You CAN'T have NUMA without SMP... NUMA cannot run on a single CPU there must be multithreading to take advantage of NUMA.... NUMA is an extension of SMP....

Kinda like how SSE is an extension to x86....

While I'd agree that NUMA is symmetrical in the sense that there aren't dedicated-use processors, I don't think this is correct. The "S" in "SMP" stands for "symmetric". NUMA isn't symmetrical; it's "non-uniform".

Wikipedia defines SMP as a machine where multiple processors are connected to "a single shared main memory". NUMA machines don't have a single shared main memory; they have multiple banks of memory which are associated with processing resources. Memory on one node can, at an expense of access time, be accessed from other nodes. But that expense is what makes NUMA not symmetrical.

If you write software and you look at NUMA as "an extension of SMP", you're going to write software that performs poorly because it's not accurately aware of its target environment.

I'd love to find a definitive description of how this terminology should be used, but I think it's not relevant to my measurement of cross-node NUMA memory access times. And it was you who was trying to draw "conclusions" based on some of the numbers:

duby229 said:
I think that the fact that Opteron completely annihilates Xeon in every possible category confirms my conclusion....

Did I misunderstand your conclusion? Was it not about SMP vs. NUMA? Was it just Opteron vs. Xeon MP? Or something else?
 
Well, I guess there are no changes going to be made here, sooooo I'm not going to debate this with you...

I'm no expert so I guess my observations are not important...

I do believe that you're interpreting SMP and NUMA wrong however.... SMP for PROCESSING....... NUMA for MEMORY....... see the difference? Symmetric Multi-Processing.......... Non-Uniform Memory Access

SMP means that more than one processor can communicate with the others to share a load..... (provided OS support) Symmetric because of how the communications happen through AMD's MOESI communications protocol

NUMA means that in an SMP system each processor has its own memory banks.... (provided OS support) Non-uniform because each processor has its own banks; therefore the total amount of RAM is not in a uniform location, hence access cannot be uniform either.....

However, memory access has nothing to do with CPU communications; instead that task falls on AMD's MOESI protocol

And I also believe that you are putting way too much faith in the usefulness of your test. Maybe only 0.5% of all code uses access patterns similar to your test? I don't know, take a guess... If you'd throw in another thread you may just see a different result.... Also CPU affinity is critical....

I'm sure you probably already stated this earlier in the thread.... But what purpose did you create this test for.... Looking for bottlenecks? If so then you now know one to avoid, so why defend it so much? I just don't understand?


Stand out on a limb a little man! If everyone was as defensive and cautious as you, we would still be living in the stone age.... The industrial revolution would never have even happened..... Every single valid argument I've thrown at you, you've turned down because you are unwilling to speculate.....

Here's a paper that tries to examine the access patterns of different scientific applications. You can see, among three or four different applications, the pattern varies vastly even though the applications are indisputably all scientific applications.

This is exactly my point. Your test is not accurate because it is only one very uncommon access pattern
 
I'm having a hard time connecting the access pattern you're proposing with any meaningful measure of fragmentation. Of more direct concern is structure size and the algorithm used to traverse the data structure in question.

Think of the brick wall analogy.... Each brick is a chunk of memory that the OS assigns to a thread or process.... One of those bricks becomes free, and then another, and then another.... Those free bricks create "fragments" in memory....

Now imagine reading from these fragments... This is no better than your purely random test, but should be closer to real world than yours....

The problem is that modeling a real-world application accurately means providing a mix of access patterns. Even if I managed to model them all, I'd still be left with trying to figure out what percentage of the time which patterns are exercised in the universe of all applications.

That's exactly correct... So what was the point of this test? For the sake of?

There are an unknown number of programs, and so there's no way to guess what "an average program" really means. As such, that question is impossible to accurately answer. Any meaningful answer would require a lifetime of research and work. And that's exactly why I didn't follow that detour when trying to measure one parameter of memory access on NUMA systems.

Oh come on, dude!! You're a programmer, right? You should at least have a guesstimate, right? Just a quick guess... You'd know much better than I...
 
duby229 said:
I do believe that you're interpreting SMP and NUMA wrong however.... SMP for PROCESSING....... NUMA for MEMORY....... see the difference? Symmetric Multi-Processing.......... Non-Uniform Memory Access

I'm well aware of what the acronyms stand for, and very familiar with what the words mean. What I question is that "processing" is in this case separable from "memory access". Processing can't happen without some memory access, and thus the memory access has to be implemented in a platform-appropriate way. Symmetrical scheduling means that any work can be scheduled on any processor; and in the case of a NUMA machine, that's certainly not an optimal approach.

If I'm wrong about that usage, that's fine by me. The only reason I use "SMP" when I mean "non-NUMA" is because I get tired of writing "non-NUMA" and have a very hard time seeing NUMA as "symmetrical".

duby229 said:
However, memory access has nothing to do with CPU communications; instead that task falls on AMD's MOESI protocol

You've lost me again. Memory access has everything to do with CPU communications, because the CPU is moving data to and from memory. Moving data is communications. The cache coherency protocol has to do with inter-processor communications as well; otherwise, the caches that exist in the multiple processors can't be made coherent.

duby229 said:
And I also believe that you are putting way too much faith in the usefulness of your test... If you'd throw in another thread you may just see a different result....

Why do you think so? The test helped me quantitatively measure the cross-node NUMA memory access penalty on my machine, and compare it to the timing of the same code on a non-NUMA machine.

duby229 said:
Also CPU affinity is critical....

Of course. I locked each thread to the appropriate processor. Otherwise, I wouldn't have been able to measure anything meaningful.
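For what it's worth, the pinning is just the standard Win32 affinity call; here's a sketch of the equivalent of what my test does (the processor numbering is specific to my box):

Code:
#include <windows.h>

// Pin the calling thread to a single processor so the scheduler
// can't migrate it in the middle of a timed run.
BOOL LockToProcessor(DWORD nProcessor)
{
    DWORD_PTR dwMask = ((DWORD_PTR) 1) << nProcessor;
    return SetThreadAffinityMask(GetCurrentThread(), dwMask) != 0;
}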

duby229 said:
I'm sure you probably already stated this earlier in the thread.... But what purpose did you create this test for.... Looking for bottlenecks? If so then you now know one to avoid, so why defending it so much? I just don't understand?


I write software for a living. If I don't understand the platforms where my software will run, I will necessarily have written pretty crappy software. If I know the architecture of popular machines more deeply, then I'll have a better understanding of how to write better software.

The application I've been working on is constrained by memory bandwidth. NUMA seems very interesting because it will increase the memory bandwidth available to my application, as long as I can effectively schedule the application's work across independent data sets local to the individual processors.

But what if I can't schedule the work in my application that way? Then I must copy the data from one node to the next. If I must do that, I want to do it in the most expeditious way possible, and that requires studying the case that my test program demonstrated.

I'm not sure how I came off as defending a bottleneck, and I'm sorry you have that impression. I'm just responding to your challenges that my conclusions were wrong and that my test was invalid. Considering that you've not seen the code in question and don't seem to understand many of the issues at hand -- even the purpose of the code in question -- I'm not sure what your intention was. But defending my ideas against those questions is thought-provoking, and I'm happy to go through the exercise.

duby229 said:
I'm no expert so I guess my observations are not important...

They're occasionally not clear to me, but I never regarded them as unimportant.
 
OK I see why you've done the test now... It does make a little more sense to me.. As I said I'm no programmer.

So let me get this right... The reason you've made this test was to identify a bottleneck so that you can avoid it in the future? How did you know to make this particular test?

Cache coherency is managed by AMD's MOESI protocol; it has nothing to do with NUMA.. NUMA is a way to access memory, not a way for CPUs to communicate... Unless you are copying data from one CPU's memory bank to another CPU's memory bank, which I don't think would be very optimal, NUMA is not needed for communications.. It is handled entirely by MOESI... However, having said that, SMP (MOESI) is needed for NUMA to work...

cross-node NUMA memory access

Again I think you should be stating cross-node RANDOM memory access, because your test only shows the latency involved in random accesses. Not ALL accesses.

Anyhow I think I've learned a bit from this debate.... As I said I'm no programmer, but I do understand how CPUs work to a degree and am always willing to learn a little more.. Sometimes a good debate can be frustrating but will bring out a little more knowledge than what went in... So I do appreciate your patience with me while I prodded you :D

Of course. I locked each thread to the appropriate processor. Otherwise, I wouldn't have been able to measure anything meaningful.

Ok, what does this mean about the bottleneck that you've found? Does it mean that it is not reproducible in an OS-managed thread? (Not sure if my terminology is correct.) In other words, if you hadn't locked the threads, would it still be a problem? Can you think of any real world examples of when this bottleneck would be a problem?
 
I got NUMA and SMP with my dual Opteron 244s and the Tyan S2895 (Thunder K8WE) board. I get about 9100MB/s in Sandra tests, but a flaw has been discovered where it reads the SPD of the memory and skews the results. Oh well. If you want to gain a lot of knowledge on SMP and NUMA things, go to www.2cpu.com and hit the forums. There are guys over there that will make your head spin.

Just to add my 2 cents in: NUMA and SMP are not comparable, since you have to have an SMP rig to even start to think about NUMA. Then you have to have a NUMA-capable OS to even get benefits. The OSes that do that now are mainly any 64-bit Windows version and XP Pro with SP2 included in the original install. You cannot even install XP SP1 and then upgrade to SP2 to get NUMA support. Weird, I know, but I have the machine to test on. I will say this machine beats my old 2x1.6LV Xeons @ 3.2 just due to the memory bandwidth, number crunching, and SLI capabilities. Just got the 6800GT OCs in to round out the machine and its full specs.

I will update later with results; for now, here are some screenshots of some benchmarks so you can get a feel for how NUMA helps SMP systems.


WARNING: PICTURES ARE PRETTY BIG!!!
4x HTT multiplier, 400FSB memory, and 1000 HTT
5x HTT multiplier, 500FSB memory, and 1250 HTT
Prime95 and Sandra
 
??? said:
Then you have to have a NUMA-capable OS to even get benefits. The OSes that do that now are mainly any 64-bit windows version and XP Pro w/SP2 included in the original install. You cannot even install XP SP1 then upgrade to SP2 to get NUMA-support.

Are you perfectly sure that is true? I can't find anything on the Microsoft website that says XP supports NUMA. In fact, what I do find in MSDN says that NUMA is not supported on any version of XP.

For clarification, I checked with the developers who work on the product. They said that the APIs were added for compatibility in the service packs, but are not the same code that is in Windows Server 2003 and might not work correctly. In other words, they're not to be relied on.

Even if you do have a NUMA-aware OS, you might have to take some steps to enable NUMA.
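
If you want to see what your OS actually reports, a quick check against the NUMA API looks something like this (a minimal sketch, not production code):

Code:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;

    /* On an OS without real NUMA support, this call fails or reports a
       single node; otherwise it returns the highest NUMA node number. */
    if (!GetNumaHighestNodeNumber(&highest))
        printf("NUMA APIs unavailable on this OS.\n");
    else
        printf("Highest NUMA node number: %lu\n", highest);
    return 0;
}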

??? said:
I will update later with results for now, here are some screenshots of some benchmarks so you can get a fill of how NUMA helps SMP systems.

Maybe, in all these long posts, my point has become unclear. I'm not saying that NUMA isn't advantageous. I think it's great, and that it's the only way to scale to more processors for high-bandwidth applications.

But the applications need to be NUMA-aware in order to enjoy that benefit. And they're penalized if they're not written to be NUMA-aware. Sandra is getting those scores by affinitizing the threads it is using, and then having each thread hit memory that's associated with the node where the thread using it is running.

Almost all shipping commercial applications aren't written to guarantee that they'll perform optimally on NUMA-based machines.

I'm sure you're getting better bandwidth and lower latency just because you're using an AMD processor, as its memory subsystem is more efficient than Intel's comparable processors'. (For now, at least.) The greater bandwidth you're seeing because of NUMA is not something you can always expect from your applications ... until they're NUMA-aware, like your benchmark is.
 
duby229 said:
Stand out on a limb a little, man! If everyone was as defensive and cautious as you, we would still be living in the stone age.... The industrial revolution would never have even happened..... Every single valid argument I've thrown at you, you've turned down because you are unwilling to speculate.....

Wow! Why would you ruin an interesting discussion by going ad hominem like that?

You've asked lots of questions, which is great. I've done my best to try to answer them for you. But I'm not really sure I understand what your "argument" is; what is your position? What is it that you disagree with?

duby229 said:
Your test is not accurate because it is only one very uncommon access pattern

I'll try it one last time: my test was never intended to mimic all access patterns. I'm trying to measure cross-node access time. If you want to try and use my test for something else, feel free to do so. But I can't help it if it wasn't intended for that purpose.

How did you arrive at the conclusion that my access pattern over memory isn't common?

duby229 said:
Now imagine reading from these fragments... This is no better than your purely random test, but it should be closer to the real world than yours....

I'm well-aware of what memory fragmentation is. What I'm trying to figure out is why you think memory fragmentation has anything to do with random memory access patterns.

When memory is allocated, the bricks you're talking about are usually large. Otherwise, the overhead of dividing things into bricks ends up being quite costly. Even when a program makes a small memory request of its memory manager, the manager usually rounds up to a larger size in order to manage the request efficiently. It also bumps the request size in order to align the resulting memory.

As such, it's rather rare that two consecutive dynamic allocations would end up fitting together in the same cache line in the order they'll be accessed. It's by no means impossible, but a developer who assumes this will happen is doing himself a disservice.
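
You can watch the rounding happen yourself; here's a throwaway sketch (the exact spacing depends entirely on your allocator):

Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Make a few 5-byte requests and look at where they land. Most
       allocators round each request up and add header and alignment
       overhead, so the addresses are spaced much more than 5 bytes
       apart. (The blocks are deliberately leaked; this is a demo.) */
    int i;
    for (i = 0; i < 4; i++)
        printf("allocation %d of 5 bytes landed at %p\n", i, (void *)malloc(5));
    return 0;
}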

duby229 said:
Again I think you should be stating cross-node RANDOM memory access, because your test only shows the latency involved in random accesses. Not ALL accesses.

This is the first time you've said that, so it's rather strange you'd qualify your statement with "again".

Actually, my test doesn't only show the latency caused by random access. I perform the random access on the local node's memory and time it. That time includes the latency due to cache misses. Then, I perform the same random access pattern on the remote node's memory and time it. That time includes the latency due to cache misses plus the latency due to the cross-node access.

The difference between the times is the overhead caused by cross-node access.

I've also measured sequential access. The penalty is there, but it's far less pronounced because the processor can batch the requests. (I described this in a previous post.) The latency I've measured in this case, for my byte-by-byte sequential access pattern, is between three and five percent on my machine.
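
Stripped to its skeleton, the timing harness looks something like this (a sketch, not my actual code; buf would live on the node under test, and next[] is a precomputed random walk so no random numbers are generated inside the timed loop):

Code:
#include <windows.h>

/* Time one pass of a random walk over 'buf'. Run it once over node-local
   memory and once over remote-node memory; subtracting the two results
   cancels the cache-miss cost common to both runs, leaving just the
   cross-node overhead. */
LONGLONG TimeRandomWalk(volatile const unsigned long *buf,
                        const unsigned long *next, unsigned long count)
{
    LARGE_INTEGER start, end;
    unsigned long i, index = 0, sink = 0;

    QueryPerformanceCounter(&start);
    for (i = 0; i < count; i++) {
        sink += buf[index];   /* the read under test */
        index = next[index];  /* dependent walk defeats the prefetcher */
    }
    QueryPerformanceCounter(&end);
    return end.QuadPart - start.QuadPart;
}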

duby229 said:
Unless you are copying data from one CPU's memory bank to another CPU's memory bank, which I don't think would be very optimal,

Exactly! The funny thing is that any application that's not NUMA aware is going to be doing that, at least some of the time, when run on a NUMA-enabled system. That's why this is important to measure and understand when designing software. Code that isn't NUMA aware is very likely to fall into this trap.

duby229 said:
Ok, what does this mean about the bottleneck that you've found? Does it mean that it is not reproducible in an OS-managed thread?

Accessing memory across nodes doesn't suddenly become cheaper just because the OS is managing the thread.

The OS will schedule threads to run on processors as they become available. Setting affinity for a thread (or a process) is a technique some developers use to try to aid performance. The idea is that leaving the thread on the same processor will keep the cache on that processor charged with the data the thread is likely to access. If a thread is switched to another processor, it probably has to charge the cache with its data again.

On a NUMA system, assigning a thread to a specific processor becomes more important because you want the thread to always run on the processor local to its data. Otherwise, you're always paying the cross-node access penalty.

Even when the OS knows it's running on a NUMA machine, it doesn't know what memory a thread plans to access. It tries to help the thread by allocating any memory it asks for on the node where the thread was running when it made the allocation. After the allocation happens, the operating system doesn't know what the thread intends to do with its quantum. All it can do is hope to improve the odds by trying to keep the memory local and trying to use the same processor any time a given thread needs to run.

But the OS must switch the processor sometimes. A thread that's running on the wrong processor isn't optimal, but it's better than a thread that's doing nothing at all while waiting for its favorite processor to become available.
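
On Linux, a NUMA-aware program doesn't have to hope: it can ask for node-local memory explicitly. A sketch using libnuma (assuming it's installed; link with -lnuma):

Code:
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Allocate 64MB on whatever node the calling thread is currently
       running on, instead of relying on the OS's default placement. */
    size_t size = 64ul * 1024 * 1024;
    void *buf = numa_alloc_local(size);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_local failed\n");
        return 1;
    }

    /* ... work on the buffer from this thread ... */

    numa_free(buf, size);
    return 0;
}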

duby229 said:
Can you think of any real world examples of when this bottleneck would be a problem?

Say you write an application. You want it to have two threads, but you didn't think about NUMA when you were writing the application. But I bought your application and I'm running it on my NUMA box. One thread reads data from disk and then does some processing on it. Then, it sends it to the other thread to process it more and write it out to a different file.

Your first thread allocates some memory, and reads a bunch of bytes from the file. You process the bytes, and things are going great. Since you're running on CPU0 and accessing MemoryBank0, you're golden.

But then you're done processing in that thread, and want to hand the data off to the other thread. That thread gets scheduled on CPU1. But it's accessing memory that came from MemoryBank0, so whatever access it is doing is penalized. From my measurements, the minimum penalty appears to be just more than three percent. The maximum penalty is around 35%, I think.

While CPU1 is trying to work, CPU0 is off reading more data and doing its own processing again. Now, it's busy and will delay CPU1's requests. This seems to cost another 5%. So you're paying between three and forty percent.

That cost is unfortunate, because this application works just great on a non-NUMA machine. There's nothing wrong with the design; it's actually a very normal approach to doing I/O and processing. In other words, there's no reason for a developer to avoid this design until they know about NUMA.

NUMA-aware developers have a funny challenge: how can I fix this design so it works optimally on NUMA machines, but have the same code work well on non-NUMA machines?
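
One answer I keep coming back to is making the handoff explicit: the consumer copies the buffer into its own node's memory once, pays the cross-node cost as a single sequential pass, and then does all its repeated processing locally. A sketch of the idea (hypothetical names; on a non-NUMA machine the extra copy is cheap, so the same code still behaves sensibly):

Code:
#include <stdlib.h>
#include <string.h>

/* Hypothetical consumer-side handoff. 'remote' was allocated (and first
   touched) by the producer's thread, so it lives on the producer's node.
   One sequential cross-node copy replaces many penalized accesses; the
   new allocation is made by the consumer, so a NUMA-aware OS can place
   it on the consumer's node. */
void *TakeLocalCopy(const void *remote, size_t size)
{
    void *local = malloc(size);
    if (local != NULL)
        memcpy(local, remote, size);
    return local;
}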
 
Here are the results from modifying the program to follow duby229's suggested access pattern. I visit a byte of memory, where I find a randomly initialized quantity between 0 and 255. I move ahead that many bytes in my buffer. At that location, I read between zero and 128 bytes of memory in a row, sequentially. Then, I move ahead a random amount again.

Code:
Windows Version: 5.2 Build 3790 ("")
2 processors
GetNumaHighestNodeNumber: 3
Performance Frequency is 1991120000
Other processor idle:
Processor 0 on Memory 0: 1133553605
Processor 0 on Memory 1: 1401580754
Processor 1 on Memory 0: 1404444219
Processor 1 on Memory 1: 1108713455
Other processor busy:
Processor 0 on Memory 0: 1116267259, 1113906380
Processor 0 on Memory 1: 1495308381, 1156863733
Processor 1 on Memory 0: 1461791937, 1156219493
Processor 1 on Memory 1: 1113090896, 1114465231

When the memory being accessed is not on the same node, and the other processor is idle, the time for the code to run is about 24% higher than when the memory being accessed is local.

When the other processor is busy, the penalty rises to about 31%.

Interestingly, the other processor isn't substantially affected by servicing the accesses from its partner; looks like the penalty is about 4%.
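
For reference, the inner loop of the modified pattern looks roughly like this (a sketch; the real program also pins the thread and runs the walk under the timer):

Code:
/* Walk a randomly initialized buffer the way duby229 suggested: read a
   byte, skip ahead by its value (0 to 255), then read a short sequential
   run of 0 to 128 bytes, and repeat. The sum is returned so the reads
   can't be optimized away. */
unsigned long SkipAndRun(const unsigned char *buf, unsigned long size)
{
    unsigned long pos = 0, sum = 0, i, run;

    while (pos + 255 + 128 + 1 < size) {
        pos += buf[pos];        /* skip ahead 0-255 bytes */
        run = buf[pos] % 129;   /* sequential run of 0-128 bytes */
        for (i = 0; i < run; i++)
            sum += buf[pos + i];
        pos += run + 1;
    }
    return sum;
}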
 
Hey, this was a good read. You two make some good points, and when you consider the collective expertise between the two of you, you pretty much understand this SMP and NUMA thing. I don't want to come off as a smart ass, but I write multi-threaded NUMA-aware software for a living, and I do consider myself an "expert". You have sparked my interest, and I will write a test program to demonstrate as many possibilities as I have time to cover. Perhaps this will show us a little more.

In the meantime, I would like to address one issue you two haven't really touched much upon. While it's true that software that is NUMA-aware is better, you don't need NUMA-aware applications to take advantage of it -- as long as you have a NUMA-aware OS, that is. If you have a NUMA-enabled OS, the OS does some smart things for you. A NUMA-aware OS tries not to move processes or threads around too much. It's true that the OS will move some threads around if there is a dire need to, but for the most part, it tries to let things be. Let's consider an example. When I start up firefox on my dual Opteron box, the OS starts it on one CPU, let's say CPU0. Since the OS knows what CPU firefox was started on, and will make an attempt to keep it on that same CPU, the OS will go ahead and allocate all of firefox's memory from the RAM connected to the CPU that it is running on (CPU0). So there is no "penalty". And this is from an app that knows nothing about NUMA. The OS is smart, and helps you out quite a bit. If you run two CPU-intensive, single-threaded apps, the OS will arrange things to take full advantage of NUMA. Threaded apps are a little trickier, with shared memory and all, but still.

I just wanted to point this out, because you make it out as if no software can use this currently. Any 2.6.0 or higher Linux kernel has full support for NUMA, and Windows XP 64 and the 64-bit version of Server 2003 have full NUMA support as well.
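
And if you want to force the issue for an app that knows nothing about NUMA, numactl on Linux can bind both the CPUs and the memory from the outside (recent numactl; adjust the node numbers for your box):

Code:
# Run an unmodified app with its threads and memory both bound to node 0:
numactl --cpunodebind=0 --membind=0 ./myapp

# Or spread its memory evenly across all nodes (sometimes better for shared data):
numactl --interleave=all ./myapp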
 
visaris said:
When I start up firefox on my dual Opteron box, the OS starts it on one CPU, let's say CPU0. Since the OS knows what CPU firefox was started on, and will make an attempt to keep it on that same CPU, the OS will go ahead and allocate all of firefox's memory from the RAM connected to the CPU that it is running on (CPU0). So there is no "penalty". And this is from an app that knows nothing about NUMA.

Are you very certain it works that way? Because, if it does, your machine is going to waste. If the OS simply doesn't schedule the thread because the CPU0 is busy, you're getting a lot of time-domain latency and only using half your machine.

Say CPU0 is busy with some other app. FireFox will have a couple of runnable threads because it has work to do; paint some frames from an animated GIF, say. If the OS were perfect at keeping FireFox on CPU0, then it just wouldn't paint them until CPU0 became available again. That might be a while, if there are lots of runnable threads on the system.

I think what you're calling a "dire need" is really quite an ordinary occurrence.

If the frames in that GIF are on MemoryBank0 when the drawing thread gets scheduled on CPU1, you're going to be non-optimal. Yeah, it's not heinously bad; but it certainly isn't perfect.

visaris said:
I just wanted to point this out, because you make it out as if no software can use this currently. Any 2.6.0 or higher Linux kernel has full support for NUMA, and Windows XP 64 and the 64-bit version of Server 2003 have full NUMA support as well.

I stand by my point that very little software uses it optimally. The operating systems you're talking about do have NUMA support, but those are just the operating systems. The applications you run on those systems, when not NUMA-aware, are not taking full advantage of NUMA. The OS tries its best, sure, but it's very easy to write software that's got innocuous performance characteristics on non-NUMA systems but trips on a bad bottleneck when run on a NUMA rig. And there isn't a darn thing the OS can do about it.

For desktop applications, I don't think the performance problem is much of an issue. Rarely more than 15%, say; and usually less than half that. But for servers and more demanding applications, particularly applications whose performance is already constrained by memory bandwidth, I think the numbers are substantially higher. For those apps, since performance is already critical, NUMA unawareness is a real issue.


visaris said:
You have sparked my interest, and I will write a test program to demonstrate as many possibilities as I have time to cover. Perhaps this will show us a little more.

Please do post what you find out. I mean to write a whitepaper and show my code, but I haven't had much free time lately. I'm in that mode where I just want to do one more thing before I post it, but then I find two more things, and I skip a weekend, and the next thing you know it's six weeks later. Plus, it's racing season.

It's very interesting that cache misses across nodes are so very expensive compared to local ones. I'm impressed, though, that the impact on the remote processor is so small even when it is busy.

What I want to look at next are read-write patterns, plus some sharing scenarios.
 
Are you very certain it works that way? Because, if it does, your machine is going to waste. If the OS simply doesn't schedule the thread because the CPU0 is busy, you're getting a lot of time-domain latency and only using half your machine.

Say CPU0 is busy with some other app. FireFox will have a couple of runnable threads because it has work to do; paint some frames from an animated GIF, say. If the OS were perfect at keeping FireFox on CPU0, then it just wouldn't paint them until CPU0 became available again. That might be a while, if there are lots of runnable threads on the system.

I think what you're calling a "dire need" is really quite an ordinary occurrence.

If the frames in that GIF are on MemoryBank0 when the drawing thread gets scheduled on CPU1, you're going to be non-optimal. Yeah, it's not heinously bad; but it certainly isn't perfect.
Yes. You're right. Maybe "dire need" isn't the best word for it. The point here is that the OS will not move a thread or process to another CPU on account of a few percent CPU usage. Each OS has its own design and threshold for this. The designer of the OS tries to balance CPU usage against memory efficiency. I can't claim to know exactly how each OS does this, but support for it is there. Also, when the original CPU becomes available, the OS will move the thread back to where it "belongs". In non-NUMA systems, this concept of "belonging" to a CPU (affinity) doesn't exist at all.


I stand by my point that very little software uses it optimally. The operating systems you're talking about do have NUMA support, but those are just the operating systems. The applications you run on those systems, when not NUMA-aware, are not taking full advantage of NUMA.
Yeah, I can't argue with that.
 
visaris said:
Each OS has its own design and threshold for this. The designer of the OS tries to balance CPU usage against memory efficiency. I can't claim to know exactly how each OS does this, but support for it is there.

It's just a part of the scheduler. The scheduler decides which thread gets to run next. It rips through its list of runnable threads and scores them. The score involves the thread's priority, the time since it last ran, and so on. The computation should also involve the last CPU and the preferred CPU.

It's surprisingly simple; it has to be, since that code runs very frequently -- at every context switch. Of the OS books I have, I think Tanenbaum's Modern Operating Systems gives the best description. It includes algorithms for multiprocessor scheduling.

The hardware vendor influences what happens by plugging code and some values into the scheduler. The HAL you get in Windows, for example, lets the scheduler know the anticipated cost of moving something to a different processor. The scheduler might pick up other values along the way while booting; from the CPUID instruction, or even from a loop that executes some known code and times it.
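
As a cartoon of the idea -- purely illustrative, not any real OS's algorithm:

Code:
/* A toy of the kind of scoring a scheduler might do; highest score runs.
   'affinity_bonus' stands in for the HAL-supplied value: what staying on
   the last CPU (warm cache) or the ideal CPU (local memory) is worth. */
typedef struct {
    int  priority;      /* static priority */
    long waited_ticks;  /* how long since the thread last ran */
    int  last_cpu;      /* CPU it ran on most recently */
    int  ideal_cpu;     /* CPU its memory is local to */
} thread_t;

long Score(const thread_t *t, int cpu, long affinity_bonus)
{
    long score = t->priority * 100 + t->waited_ticks;
    if (cpu == t->last_cpu)
        score += affinity_bonus;        /* warm cache */
    if (cpu == t->ideal_cpu)
        score += affinity_bonus * 2;    /* node-local memory */
    return score;
}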

visaris said:
Also, when the original CPU becomes available, the OS will move the thread back to where it "belongs". In non-NUMA systems, this concept of "belonging" to a CPU (affinity) doesn't exist at all.

I guess it will; that's kind of interesting. For the time the thread isn't on its "home" CPU, it's dealing with a cold cache. If it's moved back, then it gets memory locality again but has to refresh its cache. The values the HAL gives the scheduler for the expense of switching must account for cache hotness, and weigh the comparative costs of cross-node memory access and cache repopulation.

I was thinking a little more about your point about running single-threaded applications. Indeed, that's an interesting case where you haven't done any SMP-aware programming and still end up getting a better benefit from a NUMA system than from a UMA system. Since the NUMA system can run as if it's a share-something cluster, but keep the two single-threaded processes independent, a perf win over a UMA system is realized because on the UMA system the processors compete for memory access.

By the way, I came across the "UMA" term in one of the OS books I was looking through. I think I can use that to identify the opposite of NUMA (instead of SMP) and not generate any controversy. It's odd for me, though, as "uma" is the Japanese word for "horse".
 
Well, I finished all the basic tests. Here is the output... a little verbose, but that's how it goes. Take it or leave it. I was listening to some MP3s and browsing while this ran, so CPU0 (thread0) may be a little slower. Hell, they both may be a little slower than in a real "fair" test. Still, I think it's interesting.

Code:
avose@tcm NUMA $ ./NUMA
* Checking system config:
  Currently available CPUs: 2
* Done.

* Starting threads:
  Thread 0: Started.
  Thread 0: Current affinity mask: 0x3
  Thread 0: Setting mask to 0x1 (CPU0 only)
  Thread 0: Affinity mask is now: 0x1
  Thread 0: Allocating 512MB of memory
  Thread 0: Done with initialization.
  Thread 1: Started.
  Thread 1: Current affinity mask: 0x3
  Thread 1: Setting mask to 0x2 (CPU1 only)
  Thread 1: Affinity mask is now: 0x2
  Thread 1: Allocating 512MB of memory
  Thread 1: Done with initialization.
* Done.

* Testing local reads:
  Thread 0: Starting local read test:
  Thread 1: Starting local read test:
  Thread 0: Read local 512MB 10 times in 3.841246 seconds.
  Thread 1: Read local 512MB 10 times in 3.883761 seconds.
* Done.

* Testing local writes:
  Thread 0: Starting local write test:
  Thread 1: Starting local write test:
  Thread 1: Wrote local 512MB 10 times in 6.034309 seconds.
  Thread 0: Wrote local 512MB 10 times in 7.853928 seconds.
* Done.

* Testing remote reads:
  Thread 1: Starting remote read test (from CPU0's memory):
  Thread 0: Starting remote read test (from CPU1's memory):
  Thread 1: Read remote 512MB 10 times in 11.203189 seconds.
  Thread 0: Read remote 512MB 10 times in 11.496961 seconds.
* Done.

* Testing remote writes:
  Thread 1: Starting remote write test (to CPU0's memory):
  Thread 0: Starting remote write test (to CPU1's memory):
  Thread 1: Wrote remote 512MB 10 times in 20.612060 seconds.
  Thread 0: Wrote remote 512MB 10 times in 21.433938 seconds.
* Done.

* Testing remote reads (read, sleep):
  Thread 0: Starting remote read test (from CPU1's memory):
  Thread 1: Doing nothing.
  Thread 0: Read remote 512MB 10 times in 9.552590 seconds.
* Done.

* Testing remote writes (write, sleep):
  Thread 0: Starting remote write test (to CPU1's memory):
  Thread 1: Doing nothing.
  Thread 0: Wrote remote 512MB 10 times in 12.177972 seconds.
* Done.

* Testing remote reads (read, compute):
  Thread 0: Starting remote read test (from CPU1's memory):
  Thread 1: Spinning on some local math:
  Thread 0: Read remote 512MB 10 times in 9.836080 seconds.
  Thread 1: Finished spinning after 12.703246 seconds.
* Done.

* Testing remote writes (write, compute):
  Thread 0: Starting remote write test (to CPU1's memory):
  Thread 1: Spinning on some local math:
  Thread 1: Finished spinning after 11.831904 seconds.
  Thread 0: Wrote remote 512MB 10 times in 13.348314 seconds.
* Done.

* Testing reads (remote read, local read):
  Thread 1: Starting local read test:
  Thread 0: Starting remote read test (from CPU1's memory):
  Thread 1: Read local 512MB 10 times in 5.567625 seconds.
  Thread 0: Read remote 512MB 10 times in 10.603706 seconds.
* Done.

* Testing writes (remote write, local write):
  Thread 1: Starting local write test:
  Thread 0: Starting remote write test (to CPU1's memory):
  Thread 1: Wrote local 512MB 10 times in 10.855946 seconds.
  Thread 0: Wrote remote 512MB 10 times in 12.078677 seconds.
* Done.

* Shutting down:
  Thread 1: Shutting down.
  Thread 0: Shutting down.
* Done.

Well, the only test I think you could really complain about is the one where CPU1 finishes doing some math before the writes finish. Oh well. I'm done messing with it for today. I'll post the code in a second, but don't give me any crap about it. This isn't production code; it's a 1-hour hack. I wanted to point out that I didn't do ANY random access reads/writes. Why? Because it takes like 10 times longer to generate random numbers than it does to read anything. If you are trying to test memory speed with random reads/writes, good luck. In my testing it takes so long to generate good random numbers that you'll just be wasting your time. All that being said, here's my code. Written for Linux, though I'm sure you could port this to Windows in 30 seconds if you wanted to. (Why you could possibly want to do that is beyond me though ;) )
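
One more aside before the code: if you really want random access without paying the generator inside the timed loop, the trick is to precompute the indices first with something cheap -- a sketch with a bare LCG (fast, but exactly the kind of "not good" random numbers I mean):

Code:
/* Fill idx[] with pseudo-random offsets into a buffer of 'range' elements
   before the timed loop starts, so the loop itself does no random-number
   work. The LCG constants are the classic ones; the quality is poor,
   which is the speed-vs-quality tradeoff in question. */
void FillIndices(unsigned long *idx, unsigned long n, unsigned long range)
{
    unsigned long seed = 12345, i;

    for (i = 0; i < n; i++) {
        seed = seed * 1103515245ul + 12345ul;
        idx[i] = (seed >> 8) % range;
    }
}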

Maximum post size reached. Code to follow in next post:
 
Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <sys/time.h>   /* needed for gettimeofday() and struct timeval */
#include <pthread.h>
#include <sched.h>

/* I think a max of 128 threads is fair */
#define MAX_THREADS 128

/* These are the "work" flags to get the threads to do "stuff" */
#define TEST_FINISHED                 1
#define TEST_LOCAL_READS              2
#define TEST_LOCAL_WRITES             3
#define TEST_REMOTE_READS             4
#define TEST_REMOTE_WRITES            5
#define TEST_REMOTE_READS_CPU0        6
#define TEST_REMOTE_WRITES_CPU0       7
#define TEST_REMOTE_READS_CPU0_SPIN   8
#define TEST_REMOTE_WRITES_CPU0_SPIN  9
#define TEST_WRITES_REMOTE_LOCAL      10
#define TEST_READS_REMOTE_LOCAL       11

/* Stop-Watch timer */
typedef struct {
  struct timeval s;
  struct timeval e;
  unsigned long  t;
  int            f;
} StopWatch, *pStopWatch;

/* Thread info structure */
typedef struct {
  volatile int            id;
  volatile unsigned long *mem;
} ThreadData, *pThreadData;

/* This prototype is needed for CreateThreads() */
void* Thread(void *);

/***************************************************************************************
 * Globals, shared between all threads: mutexes and the like
 ***************************************************************************************/

/* Mutexes, condition variables, and flags for thread synchronization */
volatile pthread_cond_t  Master,Slaves;
volatile pthread_mutex_t Mutex;
volatile int             Go[MAX_THREADS],Done[MAX_THREADS];
volatile int             nThreads;
volatile pThreadData     Threads[MAX_THREADS];

/***************************************************************************************
 * "Util" functions 
 ***************************************************************************************/

/* 
   Similar to fprintf(stderr...); exit(-1);
*/
void Error(const char *fmt, ...)
{
  va_list ap;

  if (!fmt) return;
  va_start(ap, fmt);  
  vfprintf(stderr, fmt, ap);
  va_end(ap);
  exit(1);
}

/*
  Just a "safe" malloc wrapper
*/
void* Malloc(size_t s, const char *fmt, ...)
{
  void *d;  va_list ap;
  
  if( !(d = (void*) malloc(s)) ) {
    if(!fmt) return NULL;
    va_start(ap, fmt);  
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fprintf(stderr,"\n");
    exit(1);
  }
  
  return d;
}

/* 
   Starts the stopwatch 
*/
void swStart(pStopWatch sw)
{
  sw->t = 0;
  sw->f = 1;
  gettimeofday(&sw->s,NULL);
}

/* 
   Returns the amount of time that has passed since the stopwatch was started.
*/
unsigned long swTime(pStopWatch sw)
{
  gettimeofday(&sw->e,NULL);
  if(sw->f)
    return sw->t + ((sw->e.tv_sec*((unsigned long)1000000)+sw->e.tv_usec) - 
		    (sw->s.tv_sec*((unsigned long)1000000)+sw->s.tv_usec));
  else
    return sw->t;
}

/***************************************************************************************
 * Functions dealing with the creation and synchronization of threads
 ***************************************************************************************/

/* 
   Creates test threads and returns when they have finished setting themselves up
*/
void CreateThreads(int threads)
{
  int i,ids[MAX_THREADS];  pthread_t t;  pthread_attr_t a;

  /* Record the number of threads created */
  nThreads = threads;
  if(nThreads > MAX_THREADS) {
    printf("  This application supports a maximum of %d threads.\n", MAX_THREADS);
    printf("  Reducing threads from %d to %d.\n", nThreads, MAX_THREADS);
    nThreads = MAX_THREADS;
  }

  /* Create some condition variables and mutexes for thread management */
  pthread_cond_init((pthread_cond_t*)&Master,NULL);
  pthread_cond_init((pthread_cond_t*)&Slaves,NULL);
  pthread_mutex_init((pthread_mutex_t*)&Mutex,NULL);

  /* Set up the thread attributes */
  pthread_attr_init(&a);
  pthread_attr_setdetachstate(&a,PTHREAD_CREATE_DETACHED);
  pthread_attr_setscope(&a,PTHREAD_SCOPE_SYSTEM);

  /* Create worker threads */
  for(i=0; i<nThreads; i++) {
    ids[i] = i;
    if( pthread_create(&t, &a, Thread, (void*)(ids+i)) )
      Error("!! Could not create thread #%d.\n",i);
  }

  /* Wait for them to initialize */
  pthread_mutex_lock((pthread_mutex_t*)&Mutex);
  for(i=0; i<nThreads; i++) {
    while( !Done[i] ) 
      pthread_cond_wait((pthread_cond_t*)&Master,(pthread_mutex_t*)&Mutex);
    Done[i] = 0;
  }
  pthread_mutex_unlock((pthread_mutex_t*)&Mutex);
}

/*
  Signals threads with flag f, and waits for them to complete
*/
void SignalWork(int f)
{
  int i;

  /* Signal worker threads and wait for them to finish */
  pthread_mutex_lock((pthread_mutex_t*)&Mutex);  
  for(i=0; i<nThreads; i++)
    Go[i] = f;
  pthread_cond_broadcast((pthread_cond_t*)&Slaves);
  for(i=0; i<nThreads; i++) {
    while( !Done[i] ) 
      pthread_cond_wait((pthread_cond_t*)&Master,(pthread_mutex_t*)&Mutex);
    Done[i] = 0;
  }
  pthread_mutex_unlock((pthread_mutex_t*)&Mutex);
}

/*
  Signals that thread 'id' has completed, then waits for a go (or exit) signal
*/
int SignalDone(int id) 
{
  int f;

  /* Signal done and wait for Go (or exit) signal */
  pthread_mutex_lock((pthread_mutex_t*)&Mutex);
  Done[id] = 1;
  pthread_cond_signal((pthread_cond_t*)&Master);
  while( !Go[id] )
    pthread_cond_wait((pthread_cond_t*)&Slaves, (pthread_mutex_t*)&Mutex);
  f = Go[id];
  Go[id] = 0;
  pthread_mutex_unlock((pthread_mutex_t*)&Mutex);

  /* Return our "flag" value */
  return f;
}

/***************************************************************************************
 * "Work" threads and support functions
 ***************************************************************************************/

/* 
   Thread start function
*/
void* Thread(void *arg)
{
  ThreadData td;  int task;  unsigned long omask,mask,i,j,t = 0;  StopWatch sw;  /* 't' must start at 0 for the spin loops; unused 'flags' dropped */

  /* Initialize this thread */
  td.id = *((int*)arg);
  Threads[td.id] = &td;
  printf("  Thread %d: Started.\n",td.id);
  /* These calls use a raw bitmask; on current glibc you'd use cpu_set_t
     and CPU_SET(), but the cast works for machines with up to 64 CPUs. */
  sched_getaffinity(0, sizeof(unsigned long), (cpu_set_t*)&omask);
  printf("  Thread %d: Current affinity mask: 0x%lX\n", td.id, omask);
  mask = 1ul<<td.id;
  printf("  Thread %d: Setting mask to 0x%lX (CPU%d only)\n", td.id, mask, td.id);
  sched_setaffinity(0, sizeof(unsigned long), (cpu_set_t*)&mask);
  sched_getaffinity(0, sizeof(unsigned long), (cpu_set_t*)&omask);
  printf("  Thread %d: Affinity mask is now: 0x%lX\n", td.id, omask);
  if(omask != mask)
    Error("!! Thread %d: Affinity mask assignment failed.\n", td.id);
  printf("  Thread %d: Allocating 512MB of memory\n", td.id);
  /* Note: (1ul<<28) bytes is actually 256MB, despite what the messages say. */
  td.mem = Malloc((1ul<<28)*sizeof(char),"!! Thread %d: malloc failed.\n", td.id);
  printf("  Thread %d: Done with initialization.\n", td.id);

  /* Start task loop */
  while( (task = SignalDone(td.id)) ) {
    switch(task) {
    case TEST_LOCAL_READS:
      /* Read our local 512MB 10 times */
      printf("  Thread %d: Starting local read test:\n", td.id);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  t = td.mem[j];
      i = swTime(&sw);
      printf("  Thread %d: Read local 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_LOCAL_WRITES:
      /* Write our local 512MB 10 times */
      printf("  Thread %d: Starting local write test:\n", td.id);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  td.mem[j] = j;
      i = swTime(&sw);
      printf("  Thread %d: Wrote local 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_READS:
      /* Read a remote 512MB 10 times */
      printf("  Thread %d: Starting remote read test (from CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  t = Threads[nThreads-td.id-1]->mem[j]; /* was 'j = ...', which clobbered the loop counter */
      i = swTime(&sw);
      printf("  Thread %d: Read remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_WRITES:
      /* Write a remote 512MB 10 times */
      printf("  Thread %d: Starting remote write test (to CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  Threads[nThreads-td.id-1]->mem[j] = j;
      i = swTime(&sw);
      printf("  Thread %d: Wrote remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_READS_CPU0:
      /* Only CPU0 does work */
      if(td.id) {
	printf("  Thread %d: Doing nothing.\n", td.id);
	break;
      }
      /* Read a remote 512MB 10 times */
      printf("  Thread %d: Starting remote read test (from CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  t = Threads[nThreads-td.id-1]->mem[j];
      i = swTime(&sw);
      printf("  Thread %d: Read remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_WRITES_CPU0:
      /* Only CPU0 does work */
      if(td.id) {
	printf("  Thread %d: Doing nothing.\n", td.id);
	break;
      }
      /* Write a remote 512MB 10 times */
      printf("  Thread %d: Starting remote write test (to CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  Threads[nThreads-td.id-1]->mem[j] = j;
      i = swTime(&sw);
      printf("  Thread %d: Wrote remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_READS_CPU0_SPIN:
      /* Only CPU0 reads, we spin */
      if(td.id) {
	printf("  Thread %d: Spinning on some local math:\n", td.id);
	swStart(&sw);
	for(i=0; i<(1ul<<29); i++)
	  t += ((((i^t) % 24) + 10727) / 7) + i;
	i = swTime(&sw);
	printf("  Thread %d: Finished spinning after %lf seconds.\n", td.id, i/((double)1000000));
	break;
      }
      /* Read remote 512MB 10 times */
      printf("  Thread %d: Starting remote read test (from CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  t = Threads[nThreads-td.id-1]->mem[j];
      i = swTime(&sw);
      printf("  Thread %d: Read remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_REMOTE_WRITES_CPU0_SPIN:
      /* Only CPU0 writes, we spin */
      if(td.id) {
	printf("  Thread %d: Spinning on some local math:\n", td.id);
	swStart(&sw);
	for(i=0; i<(1ul<<29); i++)
	  t += ((((i^t) % 24) + 10727) / 7) + i;
	i = swTime(&sw);
	printf("  Thread %d: Finished spinning after %lf seconds.\n", td.id, i/((double)1000000));
	break;
      }
      /* Write to remote 512MB 10 times */
      printf("  Thread %d: Starting remote write test (to CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  Threads[nThreads-td.id-1]->mem[j] = j;
      i = swTime(&sw);
      printf("  Thread %d: Wrote remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_READS_REMOTE_LOCAL:
      /* Only CPU0 reads remote, we read local */
      if(td.id) {
	printf("  Thread %d: Starting local read test:\n", td.id);
	swStart(&sw);
	for(i=0; i<10; i++)
	  for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	    t = td.mem[j];
	i = swTime(&sw);
	printf("  Thread %d: Read local 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
	break;
      }
      /* Read remote 512MB 10 times */
      printf("  Thread %d: Starting remote read test (from CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  t = Threads[nThreads-td.id-1]->mem[j];
      i = swTime(&sw);
      printf("  Thread %d: Read remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_WRITES_REMOTE_LOCAL:
      /* Only CPU0 writes remote, we write local */
      if(td.id) {
	printf("  Thread %d: Starting local write test:\n", td.id);
	swStart(&sw);
	for(i=0; i<10; i++)
	  for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	    td.mem[j] = j;
	i = swTime(&sw);
	printf("  Thread %d: Wrote local 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
	break;
      }
      /* Write to remote 512MB 10 times */
      printf("  Thread %d: Starting remote write test (to CPU%d's memory):\n", td.id, nThreads-td.id-1);
      swStart(&sw);
      for(i=0; i<10; i++)
	for(j=0; j<((1ul<<28)/sizeof(unsigned long)); j++)
	  Threads[nThreads-td.id-1]->mem[j] = j;
      i = swTime(&sw);
      printf("  Thread %d: Wrote remote 512MB 10 times in %lf seconds.\n", td.id, i/((double)1000000));
      break;
    case TEST_FINISHED:
      /* "Cleanup" the thread here */
      printf("  Thread %d: Shutting down.\n", td.id);
      free((void*)td.mem); /* cast away the volatile qualifier for free() */
      SignalDone(td.id);
      break;
    }
  }

  return NULL;
}

/***************************************************************************************
 * Main and support functions
 ***************************************************************************************/

/*
  Returns the number of CPUs that the current process can currently run on
*/
int GetAvailableCpus()
{
  unsigned long i,n,mask;

  /* Get current affinity mask */
  if(sched_getaffinity(0,sizeof(unsigned long),(cpu_set_t*)&mask)) {
    perror("!! Could not get current thread affinity");
    exit(1);
  }

  /* Count the number of available processors */
  for(i=n=0; i<(sizeof(unsigned long)*8); i++)
    n += ((mask)>>i)&1;

  /* Return that number */
  return n;
}

/*
  Application entry point
*/
int main(int argc, char **argv)
{
  int cpus;

  /* Get CPU info */
  printf("* Checking system confg:\n");
  cpus = GetAvailableCpus();
  printf("  Currently available CPUs: %d\n",cpus);
  printf("* Done.\n");

  /* Start threads */
  printf("\n* Starting threads:\n");
  CreateThreads(cpus);
  printf("* Done.\n");

  /* Run tests */
  printf("\n* Testing local reads:\n");
  SignalWork(TEST_LOCAL_READS);
  printf("* Done.\n");

  printf("\n* Testing local writes:\n");
  SignalWork(TEST_LOCAL_WRITES);
  printf("* Done.\n");

  printf("\n* Testing remote reads:\n");
  SignalWork(TEST_REMOTE_READS);
  printf("* Done.\n");

  printf("\n* Testing remote writes:\n");
  SignalWork(TEST_REMOTE_WRITES);
  printf("* Done.\n");

  printf("\n* Testing remote reads (read, sleep):\n");
  SignalWork(TEST_REMOTE_READS_CPU0);
  printf("* Done.\n");

  printf("\n* Testing remote writes (write, sleep):\n");
  SignalWork(TEST_REMOTE_WRITES_CPU0);
  printf("* Done.\n");

  printf("\n* Testing remote reads (read, compute):\n");
  SignalWork(TEST_REMOTE_READS_CPU0_SPIN);
  printf("* Done.\n");

  printf("\n* Testing remote writes (write, compute):\n");
  SignalWork(TEST_REMOTE_WRITES_CPU0_SPIN);
  printf("* Done.\n");

  printf("\n* Testing reads (remote read, local read):\n");
  SignalWork(TEST_READS_REMOTE_LOCAL);
  printf("* Done.\n");

  printf("\n* Testing writes (remote write, local write):\n");
  SignalWork(TEST_WRITES_REMOTE_LOCAL);
  printf("* Done.\n");
 
 
  printf("\n* Shutting down:\n");
  SignalWork(TEST_FINISHED);
  printf("* Done.\n");
  
  /* Return success */
  return 0;
}
 
Edit: There were a couple of printf typos, but I went back and edited them out.

If you want me to write some more tests, I'll think about it. I want you to keep in mind that my Opterons are low-end (242s) with slow (DDR333) memory. My box cannot be used to represent a high-end Opteron system.

My advice to the original question: you know the answer! Don't waste your money on the Xeon system that cannot support future dual-core chips! Get the Opteron box! When your 2-way server gets too slow in the future, drop in some dual cores, in the same MB, for instant 4-way. I don't know why this is even debatable. Also, I don't think you should call the remote access in the Opteron NUMA system a penalty. The Xeon system always has to access remote memory, and with the Xeon system there is no possibility of using local memory on a local memory controller without causing a bottleneck for the other CPU. The Xeon system will cost more, is less advanced on an architectural level, and overall will perform worse.
 
EDIT: This post was nothing more than directions for fixing typos in the print statements, but I went back and edited my previous posts. I guess that means that this doesn't need to be here anymore. I would just delete it, but I don't know how.
 
Why don't you guys stop completely taking this guy's thread off course? You haven't referred to his question whatsoever and it's just turned into a pissing contest. You could even take it to PM...
 
Demon_of_The_Fall said:
You haven't referred to his question whatsoever and it's just turned into a pissing contest
Visaris said:
My advice to the original question....Get the Opteron box! When your 2-way server gets too slow in the future, drop in some dual cores, in the same MB, for instant 4-way....
I think I made a reference to the original question... I don't think you're being fair. Also, I don't think posting a little code should be considered "a pissing contest". I think both my software and mikeblas' software shed a little light on the NUMA performance of the Opteron system, which is related in every way to the subject at hand (Opteron vs. Xeon).
 
mikeblas said:
How do they dynamically reconfigure themselves? NUMA is a hardware issue. Look at the block diagram for the S2882, for example. If CPU1 wants memory that's directly connected to CPU0, it has to ask CPU0 for that memory.

What's the OS got to do with it? How would the OS override the way the board is actually wired?


I thought NUMA was the name of that one really annoying flash animation.
 
Demon_of_The_Fall said:
Why don't you guys stop completely taking this guy's thread off course? You haven't referred to his question whatsoever and it's just turned into a pissing contest. You could even take it to PM...

lol yeah...

For some closure, I decided to wait for the dual-core opties, get one, then another, for 4 core goodness.
 
visaris said:
I think I made a reference to the original question...

As have I. How can you decide between two different architectures without deeply understanding them?

I completely missed the pissing contest. Who won? Or was it just that you confused critical questioning with negative argument?
 