Dual Opterons or Xeons?

Also, can a dual-CPU mobo work with just one CPU? Like, say you buy a Xeon now, then a few days later get a second one. Or do they only work with two CPUs at once, i.e. they won't run with only one CPU installed?
Also also, are there any dual-Xeon mobos with dual PCI-E x16 slots (for the possibility of SLI, or AMR, or is that even possible, even if you have dual PCI-E x16 slots)?

Thanks
 
No Xeon boards have SLI...well, that Alienware one, but you can't purchase it outside of the Alienware system.

SMP boards will work fine with just one CPU. You'll probably just have to change the HAL to SMP instead of Uniprocessor, but that's pretty much it (unless you're using WinNT for some reason).
 
^^^ Whoa, I'm sorry, I didn't understand pretty much any of that.

HAL? SMP? Whaaa?

Err... sorry, so yeah, what about good dual-Xeon mobos with a PCI-E x16 slot?

Thanks
 
Are there any dual-Opteron boards that are SMP?
 
mikeblas said:
Are there any dual-Opteron boards that are SMP?

All dual boards are Symmetric Multi-Processing (SMP) capable. Otherwise they wouldn't be dual boards.

Are you thinking of something else?
 
defakto said:
All dual boards are Symmetric Multi-Processing (SMP) capable. Otherwise they wouldn't be dual boards.

Are you thinking of something else?

All the Opteron boards I know about are NUMA and not SMP.
 
Ah ok.

It's my understanding that a dual Opteron board will run SMP unless you have a NUMA-aware OS. A backwards-compatibility thing, I believe, since SMP is a lot simpler to work with than NUMA. Most of the dual boards are NUMA-capable, but they still run SMP if you're not using NUMA.
 
defakto said:
It's my understanding that a dual Opteron board will run SMP unless you have a NUMA-aware OS. A backwards-compatibility thing, I believe, since SMP is a lot simpler to work with than NUMA. Most of the dual boards are NUMA-capable, but they still run SMP if you're not using NUMA.

How do they dynamically reconfigure themselves? NUMA is a hardware issue. Look at the block diagram for the S2882, for example. If CPU1 wants memory that's directly connected to CPU0, it has to ask CPU0 for that memory.

What's the OS got to do with it? How would the OS override the way the board is actually wired?
 
Correct, the wiring does play into it. (The following is not a strictly accurate description, but a generalization.) How the OS assigns memory to processes is involved also. On a fully NUMA-enabled system, OS and hardware, with CPU0 and CPU1: if a process is running on CPU1, its memory will normally be addressed to the memory attached to CPU1. If it's running on CPU0, same thing. It looks for memory on the closest processor first and works its way out.

If the OS does not support NUMA, it doesn't do that. It fills the RAM attached to CPU0 first, then, if need be, fills CPU1's RAM. NUMA also allows CPU0/1 to access memory attached to any CPU if a process needs it, but at a more significant performance hit. This is why, with a NUMA-aware OS, memory benchmarks on a dual Opteron board with PC2700 in dual channel can hit close to 10GB/s of bandwidth, whereas with NUMA disabled you'll only hit near the 5GB/s mark.

The best way to explain it: you can have an SMP system without NUMA, but you can't have a NUMA system without SMP. NUMA is an addition to SMP, not a substitute for it. NUMA is just another way of addressing the memory efficiently; it does not change how an SMP system works.

http://whatis.techtarget.com/definition/0,289893,sid9_gci212678,00.html
 
Master [H] said:
No Xeon boards have SLI...well, that Alienware one, but you can't purchase it outside of the Alienware system.

SMP boards will work fine with just one CPU. You'll probably just have to change the HAL to SMP instead of Uniprocessor, but that's pretty much it (unless you're using WinNT for some reason).
???

Supermicro X6DAE is the first that comes to mind.
 
Morley said:
???

Supermicro X6DAE is the first that comes to mind.

Yeah, that one has SLI.

And there's an Opteron one:

Tyan Thunder K8WE

That has SLI support, except it's full SLI, meaning it doesn't switch down to two PCI-E x8 slots but runs two full PCI-E x16 slots (it has two chipsets).
 
defakto said:
If the OS does not support NUMA, it doesn't do that. It fills the RAM attached to CPU0 first, then, if need be, fills CPU1's RAM.

Right. Anything running on CPU1 has to go through CPU0 to get to that memory. That involves overhead, and it's not great for performance. If you're not running a NUMA-aware OS on such a machine, aren't you necessarily wasting performance? The OS will schedule and allocate assuming symmetrical access.

The Intel boards don't work this way. Even when running a NUMA-aware OS, you'll want NUMA-aware applications. And since software vendors screw up even regular multithreaded applications all the time, that seems like an advantage for Intel over the Opterons.
 
Hate_Bot said:
Yeah, that one has SLI.

And there's an Opteron one:

Tyan Thunder K8WE

That has SLI support, except it's full SLI, meaning it doesn't switch down to two PCI-E x8 slots but runs two full PCI-E x16 slots (it has two chipsets).
Yeah, the S2895, I think we still have an engineering sample of that in house...great board. Two CK8's provide for the extra PCI-E lanes.
 
mikeblas said:
Right. Anything running on CPU1 has to go through CPU0 to get to that memory. That involves overhead, and it's not great for performance. If you're not running a NUMA-aware OS on such a machine, aren't you necessarily wasting performance? The OS will schedule and allocate assuming symmetrical access.

The Intel boards don't work this way. Even when running a NUMA-aware OS, you'll want NUMA-aware applications. And since software vendors screw up even regular multithreaded applications all the time, that seems like an advantage for Intel over the Opterons.
...but since Intel still has to work with a memory hub, it's a moot point.
 
Morley said:
...but since Intel still has to work with a memory hub, it's a moot point.

Is it really? For all application patterns?
 
mikeblas said:
Is it really? For all application patterns?
Forget application patterns, look at the physical connectivity.

Both CPUs have to go through one memory hub. With dual-channel DDR400, current Xeons are limited to 6.4GB/s.

Opterons START at 6.4GB/s (248s and up), and only get better when NUMA is introduced.
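(For what it's worth, that 6.4GB/s figure is just the dual-channel DDR400 arithmetic: 2 channels x 8 bytes per transfer x 400 million transfers/sec = 6.4GB/s. The difference is that each Opteron brings its own dual-channel controller, so a two-way Opteron box has two of those 6.4GB/s paths instead of one shared one.)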
 
Morley said:
Forget application patterns, look at the physical connectivity.

Both CPUs have to go through one memory hub. With dual-channel DDR400, current Xeons are limited to 6.4GB/s.

Opterons START at 6.4GB/s (248s and up), and only get better when NUMA is introduced.

Sure. So, if one Xeon is busy and the other is idle, you get 6.4 GB/second. Your one busy proc is very happy.

If you have two Opterons and one isn't busy, but the other one is going through the first to service memory requests, you end up with less than 6.4 GB/sec.

That's why I think it's worthwhile to pay attention to application access patterns.
 
mikeblas said:
Sure. So, if one Xeon is busy and the other is idle, you get 6.4 GB/second. Your one busy proc is very happy.

If you have two Opterons and one isn't busy, but the other one is going through the first to service memory requests, you end up with less than 6.4 GB/sec.

That's why I think it's worthwhile to pay attention to application access patterns.
...and you get the same effect going through a memory controller. I guess the question is, if only one processor is busy, will that processor always be CPU 0?
 
Morley said:
...and you get the same effect going through a memory controller. I guess the question is, if only one processor is busy, will that processor always be CPU 0?

Well, no, you don't. If the 2nd CPU is not accessing memory, the controller doesn't do much -- unless there's false sharing between the two processor caches.

I'm not sure I understand your question about CPU 0. I guess for some motherboards (e.g., the S2875) you're right: you want CPU 0 to be doing all the work because CPU 1 always has latency.

For others, you just want to make sure the CPU running the code is the CPU that owns the memory the code is referencing. Assuring this in the design of the application is what I'd call making an application NUMA-aware.
 
No.

You can set affinity for a process to a particular processor. As well, the OS tries to balance things as evenly as possible; if you don't set affinity, a process can be switched between processors if load gets too high on one.
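If you want to pin things yourself rather than trusting the scheduler, the Win32 calls are SetProcessAffinityMask and SetThreadAffinityMask (or just right-click the process in Task Manager and pick Set Affinity). A minimal sketch, just to illustrate the idea:

Code:
#include <windows.h>
#include <stdio.h>

int main()
{
    // Pin this whole process to CPU 0 (bit 0 of the mask).
    // Use 0x2 to pin it to CPU 1 instead, or 0x3 to allow both.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1))
    {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    // ... do work; the scheduler will no longer move this process ...
    return 0;
}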
 
mikeblas said:
Well, no, you don't. If the 2nd CPU is not accessing memory, the controller doesn't do much -- unless there's false sharing between the two processor caches.
There's an inherent latency in having the processor request information from a memory hub that is running at a fraction of the processor's frequency.
 
The Opteron architecture will use NUMA memory addressing if it is enabled by a NUMA-aware OS.
NUMA on the Opterons is a memory-addressing system only, layered on top of SMP: not a replacement for the way SMP works, but a replacement for the way memory gets handled by the processor cores. With no NUMA-aware OS, the cores work in standard SMP and talk to each other to transfer memory data across the HyperTransport pipes, as if the second proc were a memory hub, as well as exchanging other data. The catch is that without the NUMA extensions, the processor tied directly to the memory the other proc needs has to take on the overhead of fetching the data itself and then passing it along to the requesting processor. With NUMA, any processor can access the other processor's memory with little overhead for the one the data is being requested of.

Otherwise it's simple SMP as far as the exchange of processor data goes.

If there's no SMP OS, then it's single-processor mode and the second core is just a dead paperweight, except that it's used to transfer memory requests through the HyperTransport pipe as if it were a memory-controller extension.

Basically it just opens up the HyperTransport switches to act as a pipeline to memory, even without an SMP-aware OS.
 
wetware interface said:
The Opteron architecture will use NUMA memory addressing if it is enabled by a NUMA-aware OS.
NUMA on the Opterons is a memory-addressing system only, layered on top of SMP: not a replacement for the way SMP works, but a replacement for the way memory gets handled by the processor cores. With no NUMA-aware OS, the cores work in standard SMP and talk to each other to transfer memory data across the HyperTransport pipes, as if the second proc were a memory hub, as well as exchanging other data. The catch is that without the NUMA extensions, the processor tied directly to the memory the other proc needs has to take on the overhead of fetching the data itself and then passing it along to the requesting processor. With NUMA, any processor can access the other processor's memory with little overhead for the one the data is being requested of.

Otherwise it's simple SMP as far as the exchange of processor data goes.

If there's no SMP OS, then it's single-processor mode and the second core is just a dead paperweight, except that it's used to transfer memory requests through the HyperTransport pipe as if it were a memory-controller extension.

Basically it just opens up the HyperTransport switches to act as a pipeline to memory, even without an SMP-aware OS.
That was my next post. The hypertransport 'hooks' on the Opteron architecture are pretty dynamic, and since the memory controller resides on the processor requesting the information from the memory, little or no work is done by CPU 0.
 
Also, in the case of the Xeon and memory addressing:

The second proc not only shares the memory bus, but has to request memory addresses through CPU0, with all the same latency penalties that the Opterons have.

SMP isn't two independent processors.

It's CPU0 handing data and instructions to CPU1 for processing, then retrieving the results and sending them on to memory or IRQs, etc.

SMP is called symmetric, but in reality it isn't. It's one CPU being the boss and the other the slave, but the boss still needs to fetch things for the slave, as well as tell it what to do and pass on the results of the slave's labor.
 
wetware interface said:
The second proc not only shares the memory bus, but has to request memory addresses through CPU0, with all the same latency penalties that the Opterons have.

Are you positive of that? I can demonstrate the latency problem that Opterons have with their NUMA architecture by writing a test program. I can't show the same latency with an Intel Xeon box.
 
Quite positive. All SMP systems use CPU0 as the arbiter of data and the assigner of tasks.

I can write a program to show latency hits on either, as can you.

The issue is writing something large enough to exhaust the Xeon or Opteron cache faster than the HyperTransport bus or the P4's memory bus on CPU1 specifically.

Once you exhaust the cache, you see the latency penalty in each configuration for standard SMP.

To really see NUMA you need to write a NUMA-specific memory-addressing program that exhausts the cache to compare with. Having a NUMA-aware OS is nice for the OS's own tasks to utilize NUMA, but you need to code your app to also use NUMA, and multithread it, to get NUMA on the Opteron to show its advantage.
 
Just re-reading both posts; I need to clarify something a bit more.

Not all memory tasks on a Xeon require the attention of CPU0; this is where DMA memory addressing pays off on a shared memory bus. If CPU1 knows where the data needs to go, and the software is coded (or assembled by an Intel multiprocessor-aware compiler) to let CPU1 write directly to memory without checking with CPU0 for coherency first, it can write to memory when the bus is idle.

For instance...
CPU1 may be doing some audio-filtering task and writing its results to a playback buffer in memory. That's a serialized data update, so CPU0 doesn't need to get involved other than to read memory to see if the task completed (costly), or to poll CPU1 and have a register updated (quicker).

That's a case where the second CPU didn't need CPU0 to intervene to get something done. For serial processes on a separate thread, with no dependencies on CPU0's tasks completing, the code can go ahead and write to memory when the bus is free. However, CPU0 has to initiate the whole process, "instruct" CPU1 which thread to handle by passing it on, etc., and handle the initial data exchange to feed CPU1 the data it works from.


On a NUMA architecture, though, this would carry a slight penalty, since CPU0 still needs to be polled if the memory address is in the range CPU0 is in charge of. If it isn't, then the NUMA architecture is even faster: there are more registers to assign to individual tasks, CPU0 won't need to maintain the result in the example above, and if the original data set is in CPU1's memory area, then CPU1 can go about its business while CPU0 simultaneously does another task from its own memory area.

This is where NUMA-aware coding comes into play: you have to make sure the data is local to the CPU that needs it, ahead of time, in order to see a speed benefit. Multiple operating systems or hardware partitioning are prime candidates for NUMA, which is what it was originally developed for.

Games are a poor choice, as they are a bitch to multithread due to all the interdependencies, and NUMA would be slower in 60% of game-type applications because of the dependencies between CPU0 and CPU1 needing coherency for physics, AI, and graphics, with user input driving it all.

Games = NUMA slower
Virtual environments = NUMA faster
 
Morley said:
Forget application patterns, look at the physical connectivity.

Both CPUs have to go through one memory hub. With dual-channel DDR400, current Xeons are limited to 6.4GB/s.

Opterons START at 6.4GB/s (248s and up), and only get better when NUMA is introduced.

Coulda sworn all current-stepping Opterons run DDR400 (200MHz) now, even the 240s.
 
wetware interface said:
Quite positive. All SMP systems use CPU0 as the arbiter of data and the assigner of tasks.

I can write a program to show latency hits on either, as can you.

What do you specifically mean by "arbiter of data" and "assigner of tasks"? Are you saying that, if CPU0 is busy, then CPU1 will block on it, waiting for a task to be "assigned" to it? Why isn't CPU1 just fetching and executing code?

I'd love to see what code you've got that will demonstrate that CPU0 is "the arbiter of data and assigner of tasks".

wetware interface said:
The issue is writing something large enough to exhaust the Xeon or Opteron cache faster than the HyperTransport bus or the P4's memory bus on CPU1 specifically.

Once you exhaust the cache, you see the latency penalty in each configuration for standard SMP.

It's not that hard to do so: just use a random memory access pattern, making reads more than a cache line's length apart. The cache won't be able to prefetch, and every request ends up being a cache miss.
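Here's a stripped-down sketch of that kind of access pattern (not the actual test program I ran, just the idea): commit a buffer much bigger than the cache, then read it in a pseudo-random order so each read lands well over a cache line away from the last one.

Code:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const size_t count = 16 * 1024 * 1024;     // 16M ints = 64MB, far past any cache
    int *buf = (int *) malloc(count * sizeof(int));
    if (!buf) return 1;

    // touch everything once so the pages are committed
    for (size_t i = 0; i < count; ++i) buf[i] = (int) i;

    // read in a pseudo-random order; every hop is much longer than a
    // cache line, so the prefetcher can't hide the memory latency
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    unsigned int idx = 1;
    int sink = 0;
    for (size_t i = 0; i < count; ++i)
    {
        idx = idx * 1664525u + 1013904223u;    // cheap LCG walk
        sink += buf[idx % count];
    }
    QueryPerformanceCounter(&t1);

    printf("random reads: %.0f ms (sink=%d)\n",
           (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart, sink);
    free(buf);
    return 0;
}

Add SetThreadAffinityMask calls and have one processor allocate the buffer while the other does the reading, and you get the kind of cross-node comparison shown below.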

With a Pentium 4 machine, it doesn't matter which processor is hitting memory, or to which processor the memory is physically attached. If I have one processor access memory allocated by another processor, I get roughly the same performance:

Code:
3360060000
3: Memory 0 is at 0x00440000
Processor 0 on Memory 0: 748115680
3: Memory 1 is at 0x00440000
Processor 0 on Memory 1: 688257488
3: Memory 0 is at 0x00440000
Processor 1 on Memory 0: 735990976
3: Memory 1 is at 0x00440000
Processor 1 on Memory 1: 735181120

On an Opteron box with NUMA, if one processor is touching memory physically associated with another processor, then there's a very significant penalty:

Code:
1990890000
3: Memory 0 is at 0x00440000
Processor 0 on Memory 0: 391625573
3: Memory 1 is at 0x00440000
Processor 0 on Memory 1: 556589206
3: Memory 0 is at 0x00440000
Processor 1 on Memory 0: 564060907
3: Memory 1 is at 0x00440000
Processor 1 on Memory 1: 390970665

The Opteron numbers show the expense of cross-node NUMA access. Since the Pentium machine I used isn't NUMA, it pays no such penalty. The benefit of NUMA is the increased bandwidth available if each processor touches only the memory it physically owns. This can be shown with MEMSPD, for example.

On that Pentium rig, running one thread with MEMSPD gets a copy speed of 1873 MB/sec, and with two threads gets 2404.2 MB/sec. On the Opteron NUMA box, one thread gets 1706 MB/sec while two threads gets 3757 MB/sec.

The Opteron is faster (and scales better) because the threads in MEMSPD are affinitized and touch memory they own, never (ever!) touching memory the other processor owns. I don't think there are many commercial programs available that have this kind of optimization.

wetware interface said:
To really see NUMA you need to write a NUMA-specific memory-addressing program that exhausts the cache to compare with. Having a NUMA-aware OS is nice for the OS's own tasks to utilize NUMA, but you need to code your app to also use NUMA, and multithread it, to get NUMA on the Opteron to show its advantage.

Right, that's what MEMSPD shows. And my cross-node test app shows what happens when it's not done right.

And that's my point for the original poster: since applications have to be NUMA-aware to make the most of the Opteron, but don't need to be NUMA aware to make the most of the Pentium, I wonder if the Pentium isn't a better choice for some users.
 
Yes, you are getting that result because you are doing RANDOM memory accesses.
Which means you aren't optimizing for NUMA but specifically penalizing it, as the randomness is counter to what NUMA accelerates. NUMA is good for local memory coherency, where CPU1 is mostly independent of CPU0, as in virtual environments and hardware partitioning, where most data is treated as being on a second system entirely, not just a second CPU.

Your example is flawed in that it favors a shared memory bus from the start, as it then becomes a question of CPU0 and CPU1 polling each other for access to the memory system to fetch data.

It isn't a very good real-world simulation, as you wouldn't randomly access memory in any multithreaded app unless you were writing a virus or something harmful. It is completely counter to what multithreading optimization is about. You want to keep your data compartmentalized and separate between the two processors to protect data integrity. Only when one CPU is doing work another is waiting for do you use a shared memory space.

Try using an example of two separate processes not dependent on each other's data.

Say CPU0 is running the OS and background tasks, and CPU1 is rendering a specific large graphic to a memory buffer. Now, on a Xeon, that buffer render is going to hurt CPU0's ability to access the memory controller more than 50% of the time if you exhaust the cache on CPU1. CPU1 first has to make sure the memory bus is free by polling CPU0 (eating up cycles for the request), or CPU0 is going to need to know when CPU1 is done with its task, as the OS needs to know CPU1 is done in order to assign it other tasks as necessary. There is overhead on CPU0 throughout the beginning and end of this process, as it is running the OS, which is monitoring CPU1's threads. You can't have two CPUs acting completely independently of each other; they would conflict for resources pretty damn fast.

With multithreaded apps you have to make sure the two CPUs aren't constantly stepping on the toes of each other's tasks.

My phrase "arbiter of data and assigner of tasks" refers to CPU0 being the one handling your OS's kernel from the moment the OS starts up. From there, the kernel uses CPU0 to assign tasks to CPU1, and if in doubt CPU0 always gets first shot at memory or IRQ resources. Unless you specifically code for CPU1 to override CPU0, the OS will always give preference to CPU0's tasks. And even if you specifically code for CPU1 to have priority, all you get is a crash of your OS if it causes a conflict, since the OS kernel assumes it is king, not an application, and you'll eventually end up corrupting a memory space the kernel expects to stay coherent.

And you don't seem to understand NUMA at all.
NUMA is designed to let you use specific memory for a specific CPU. It isn't a better implementation of a shared memory hub; it's a completely different way to give a separate thread speed without penalizing it for another thread on another CPU hogging the memory controller. With Opterons you have multiple memory controllers, not a shared hub. Yes, if you ignore optimizing for NUMA, you pay a penalty. If you do optimize (i.e. make sure data and instructions reside in the local memory area for that CPU's own controller), you don't have a shared memory bus and all the bandwidth is available; and further, if you write a small memory-range lock you have coherency, as CPU1 will not allow access to CPU0 and vice versa. Whereas if you lock memory on a Xeon from CPU0 and code CPU1 to access it anyway, you get a lockout. But if you flip it and lock memory on CPU1 and write code for CPU0 to access it, you may or may not get a lockout depending on what the kernel requires. Remember, CPU0 is boss in SMP.
 
wetware interface said:
Yes, you are getting that result because you are doing RANDOM memory accesses.
Which means you aren't optimizing for NUMA but specifically penalizing it, as the randomness is counter to what NUMA accelerates.

NUMA can provide independent pipelines between the processors and memory. The only reason I focused on random accesses is to avoid having the cache prefetch lines of memory. Even if the cache is prefetching data effectively (as it would with linear access to memory), there's still a measurable performance penalty for accessing memory cross-node, though it isn't nearly as pronounced.

Your summary that this issue is "because you are doing RANDOM memory accesses" isn't accurate. It's because the program ends up running code on one processor that accesses memory physically associated with the other processor.

wetware interface said:
Your example is flawed in that it favors a shared memory bus from the start, as it then becomes a question of CPU0 and CPU1 polling each other for access to the memory system to fetch data.

I don't think it's flawed at all. The intention is to show and measure the difference between NUMA and shared memory buses, and it does exactly that. To measure a difference, it has to be identifiable.

wetware interface said:
It isn't a very good real-world simulation, as you wouldn't randomly access memory in any multithreaded app unless you were writing a virus or something harmful.

I'm not trying to simulate anything. Meanwhile, I don't think it would be easy to find a good or experienced software engineer who would agree with the balance of your point.

Any memory access might be non-sequential: any lookup table or hash causes strides through memory at irregular intervals, for example. Code itself does too, as it executes: branches are taken, functions are called, and so on. Traversing a linked list, tree, or heap is another example of non-predictive access through the address space, since the next node of the structure might lie beyond a cache line's length from the current node.
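A trivial sketch of what I mean (nothing app-specific, just an illustration): walking a plain linked list jumps to wherever the allocator happened to put each node, so the stride is irregular and the prefetcher gets no help.

Code:
#include <cstddef>

struct Node
{
    int payload;
    Node *next;
};

// Each hop lands wherever the allocator put the next node; after the
// list has lived a while, those hops are anything but sequential.
int SumList(const Node *head)
{
    int total = 0;
    for (const Node *p = head; p != NULL; p = p->next)
        total += p->payload;
    return total;
}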

Why would you identify non-consecutive memory access by an application as a symptom of malicious code?

wetware interface said:
It is completely counter to what multithreading optimization is about.

Multi-threading optimization is about maximizing concurrency, not assuming memory access patterns are linear.

wetware interface said:
You want to keep your data compartmentalized and separate between the two processors to protect data integrity. Only when one CPU is doing work another is waiting for do you use a shared memory space.
Try using an example of two separate processes not dependent on each other's data.

Protect data integrity? If an application is relying on the distance between two data sets in memory for data integrity, that application is fundamentally flawed. Code should deliberately write to the memory it owns and therefore avoid touching memory it doesn't own.

Regardless, in this test app only one thread was active at a time. No memory was shared between threads or processes, period. The addresses shown are the same because the memory was freed, then reallocated without any intervening allocations. Plus, they're virtual addresses, not physical locations.

Since there was only one thread running at a time, there was no data interdependency between threads in this application at all. There could not have possibly been because there was only one runnable thread at any moment.

In fact, it's possible to demonstrate similar numbers without ever creating a second thread.

wetware interface said:
Say CPU0 is running the OS and background tasks, and CPU1 is rendering a specific large graphic to a memory buffer. Now, on a Xeon, that buffer render is going to hurt CPU0's ability to access the memory controller more than 50% of the time if you exhaust the cache on CPU1.

Sure, the two Xeons will trip over each other if they're not in a NUMA system.

I described this in my previous post, by the way. And I gave a link to a program that can help actually demonstrate the phenomenon, and further included some of the results from my own machine.

wetware interface said:
CPU1 first has to make sure the memory bus is free by polling CPU0 (eating up cycles for the request), or CPU0 is going to need to know when CPU1 is done with its task, as the OS needs to know CPU1 is done in order to assign it other tasks as necessary.

Is it polling, or just entering a wait state?

wetware interface said:
And you don't seem to understand NUMA at all.
NUMA is designed to let you use specific memory for a specific CPU. It isn't a better implementation of a shared memory hub; it's a completely different way to give a separate thread speed without penalizing it for another thread on another CPU hogging the memory controller.

I fully understand that much about NUMA, at least.

wetware interface said:
With Opterons you have multiple memory controllers, not a shared hub. Yes, if you ignore optimizing for NUMA, you pay a penalty. If you do optimize (i.e. make sure data and instructions reside in the local memory area for that CPU's own controller)

My example is exactly intended to show that NUMA-unaware code on a NUMA-enabled system can pay a significant memory access penalty.

My claim is that the vast majority of developers don't think about heap-allocated memory being affined to a particular processor or thread, and so the vast majority of code in shipping products today isn't NUMA-ready.

I wouldn't call this optimization: the 40% penalty I measured is pretty substantial, especially considering the other processor wasn't busy at all. Writing code with the unique requirements of NUMA in mind is a necessity.
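The usual pattern doesn't take much API, but you have to do it on purpose. Here's a minimal sketch (illustrative only, not production code; it assumes a NUMA-aware OS that satisfies page faults from the faulting thread's local node): pin each worker thread to a processor before it allocates and first touches its buffer, so the data lands on that processor's own controller.

Code:
#include <windows.h>

// Sketch of the "pin, then allocate and touch" pattern. The key
// assumption (noted above) is that a NUMA-aware OS hands a thread
// physical pages from its local node when the thread first touches them.
DWORD WINAPI Worker(LPVOID param)
{
    DWORD_PTR cpuMask = (DWORD_PTR) param;         // 0x1 = CPU 0, 0x2 = CPU 1
    SetThreadAffinityMask(GetCurrentThread(), cpuMask);

    const SIZE_T bytes = 64 * 1024 * 1024;
    char *buf = (char *) VirtualAlloc(NULL, bytes,
                                      MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!buf)
        return 1;

    // First touch: this is where the physical pages actually get
    // assigned, hopefully on this CPU's own memory controller.
    for (SIZE_T i = 0; i < bytes; i += 4096)
        buf[i] = 0;

    // ... run the bandwidth or latency test against buf here ...

    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}

int main()
{
    HANDLE workers[2];
    workers[0] = CreateThread(NULL, 0, Worker, (LPVOID) 0x1, 0, NULL);
    workers[1] = CreateThread(NULL, 0, Worker, (LPVOID) 0x2, 0, NULL);
    WaitForMultipleObjects(2, workers, TRUE, INFINITE);
    CloseHandle(workers[0]);
    CloseHandle(workers[1]);
    return 0;
}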

wetware interface said:
you don't have a shared memory bus and all the bandwidth is available

Of course, this leads to a very interesting problem. If you want to use both processors cooperatively, then you eventually have to copy the data from memory owned by one processor to the other so the processor accepting the data isn't paying an access penalty (or disrupting the first processor) during every access.

This isn't an issue on non-NUMA systems, since both processors have direct access to all available memory. So that's another facet of NUMA that will hinder applications not specifically written with the characteristics of the architecture in mind.

I think it's very important that someone planning on building a NUMA-based system understands the shortage of software available that's able to run well on NUMA systems.
 
Moog said:
I chuckle at how off course this thread went.

Sorry about that. I'd maintain that the point about understanding the ramifications of NUMA-unaware code running on a NUMA system is completely germane to choosing between dual Xeon and dual Opteron rigs, though.

Meanwhile, I had meant to make a post about my program and its results, but I haven't had time to finish measuring or clean up the code.
 
Moog said:
I chuckle at how off course this thread went.

Yeah, I know, everyone is talking about memory timings and whatnot... and I don't understand a thing...

I just want to know which choice is better.. :(
 
Hate_Bot said:
Yeah, I know, everyone is talking about memory timings and whatnot... and I don't understand a thing...

I just want to know which choice is better.. :(

Problem is, only you know that for yourself. We can only hope to give you enough information to make an educated decision when the time comes.

How much memory will you put on your new rig? Which operating system will you use?

Will you really install both processors? Seems like lots of people get the itch for a dual board, then buy one proc, then never pony up for the 2nd processor. You asked about this above -- and it certainly is possible to run only one processor. If you aren't going to go dual proc, then I think there's not enough performance difference between Xeon and Opteron to matter. You might lean towards Opteron for price or price/performance, though. That is, you might be able to afford a faster Opteron chip because it can be cheaper than a similarly performant Xeon.

What will you do with the machine? You mention 3D work and video editing, but what specific applications?
 
Specific apps: 3D Studio Max (7) and Adobe Premiere Pro.

I will buy both procs. It's just that I will have a $3000 budget when the time comes. I can't buy both procs with that budget, so I'll buy one then, and save up another $900 for the other one.

OS is XP Pro SP2.

I'm planning to start with four 512MB sticks of OCZ Platinum, 1 gig for each proc, then get another four sticks in a month or two, so I'll have 2 gigs for each proc.

Does that help?
 
He's about to blow a lot of money, in my opinion. Hate_Bot, have you ever put a computer together before? Judging from a lot of your posts, I don't think you have. I'm being completely honest when I say that you're probably getting more than you need.

Edited
 
Jerunk said:
He's about to blow a lot of money, in my opinion. Hate_Bot, have you ever put a computer together before? Judging from a lot of your posts, I don't think you have. I'm being completely honest when I say that you're probably getting more than you need.

Edited

Thanks for Helping :D
 