How do I get the maximum memory bandwidth from this board?

RADRaze2KX


Supermicro X9DAE

Looking for ECC Registered DDR3-1600 memory, because I'm pairing it with a pair of Intel Xeon E5-2687Ws. Not sure how quad-channel works on a dual-socket board. Do I fill up all the slots or just certain ones?
 
Populate 1 or 2 DIMMs per channel with 1.5V single- or dual-rank x8 or x4 DIMMs. See page 2-10 of the manual for more details.
 
Thank you for your quick answer. Let me ask a question more geared to the answer I need. :)

Am I correct in assuming a quad-channel set up on a dual-processor board still needs, at minimum, four DIMMs? Rather than four DIMMs per processor? :)
 
Am I correct in assuming a quad-channel set up on a dual-processor board still needs, at minimum, four DIMMs? Rather than four DIMMs per processor? :)

No, you'll need four DIMMs per processor. Take a look at page 1-10 of the manual; the memory is connected directly to each processor, so each processor has a direct path to its own four channels of memory and can talk to the other processor to request data from that processor's memory, for a total memory bandwidth of eight channels. That's quite a lot, something like 68 GB/s if Tom's Hardware and SiSoft managed to measure it right.
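If you want to sanity-check a number like that yourself once the parts are in, a quick triad-style loop gets you in the ballpark. This is just a sketch along the lines of the STREAM benchmark, not the real thing - the array size and repetition count are arbitrary, and it assumes a compiler with OpenMP (e.g. gcc -O3 -fopenmp):

Code:
/* Rough memory bandwidth estimate (triad-style, not the official STREAM). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N    (64 * 1024 * 1024)   /* 64M doubles per array = 512 MB each, far bigger than cache */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* first-touch init in parallel so pages land on the node of the thread that uses them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = a[i] + 3.0 * b[i];          /* triad: two reads + one write per element */
    }
    double secs = omp_get_wtime() - t0;

    double bytes = (double)REPS * 3.0 * N * sizeof(double);
    printf("~%.1f GB/s across %d threads\n", bytes / secs / 1e9, omp_get_max_threads());

    free(a); free(b); free(c);
    return 0;
}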
 
No, you'll need four DIMMs per processor to maximize memory bandwidth.

I re-read your question, so I'll add to my answer above. You could still get pretty good numbers by starting with a single processor and four DIMMs, and you could test how much going over the QPI interconnect hurts your particular application by adding a second processor and moving two DIMMs over to it.

What is the application? You might need a NUMA-aware application for a second socket to actually help things go much faster than a single socket.
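If you want to measure how much the QPI hop actually costs for a given access pattern, and you have a Linux box handy, libnuma lets you pin a thread to one socket and put the buffer on the other. A rough single-threaded sketch under those assumptions (streaming writes only, link with -lnuma), not a rigorous benchmark:

Code:
/* Local vs. remote (over QPI) memory throughput, libnuma, Linux only.
 * Build: gcc -O3 numa_probe.c -o numa_probe -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define SZ (1UL << 30)                       /* 1 GB test buffer */

static double write_gbs(char *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < 5; r++)
        memset(buf, r, SZ);                  /* streaming writes through the buffer */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return 5.0 * SZ / secs / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);                     /* keep this thread on socket 0 */

    char *local  = numa_alloc_onnode(SZ, 0); /* memory on the same socket */
    char *remote = numa_alloc_onnode(SZ, 1); /* memory behind QPI on socket 1 */
    if (!local || !remote) { fprintf(stderr, "NUMA allocation failed\n"); return 1; }

    printf("local : %.1f GB/s\n", write_gbs(local));
    printf("remote: %.1f GB/s\n", write_gbs(remote));

    numa_free(local, SZ);
    numa_free(remote, SZ);
    return 0;
}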
 
I'm building a workstation for a client who does architectural design. He uses programs like Sketchup (single core), Maya (multi-core) and VRay (multi-core). His current build is an older i7 960 with 12GB of RAM, and he's not happy with the speed of his renders (the renders reach a few gigs each).

I've monitored the applications' resource usage at various phases of his workflow and noted that his processor hits 100% on all eight threads and stays that way during his rendering process. I did research on other workstation builds and came up with the list of parts I'm planning to assemble, but couldn't find any information on how the RAM could achieve max bandwidth, as this is my first dually setup :)

He inquired about a computer he saw from a company that runs around $11K, so I budgeted $10K for the parts list. I've got a build at around $7K in parts, plus a markup that covers the build time, testing, data transfer, software installation, etc.:

Motherboard: Supermicro X9DAE

Processors: 2 x Intel Xeon E5-2687W @ 3.1GHz (3.8GHz Turbo Boost)
- 2 x Noctua NH-D14 coolers

Memory: 64GB Registered ECC DDR3-1600 (8 x 8GB)

Graphics Card: nVidia GTX Titan 6GB (he already has this)

Storage:
4 x 128GB SSDs in RAID0 (512GB total) - Operating System / Programs / Current project data
2 x 2TB HDs in RAID1 - long-term data storage
1 x 2TB external hard drive - Data backup

Case: Corsair Obsidian 800D - EATX compatible

Power Supply: 1200W PC Power & Cooling
- Going with high wattage as opposed to 800W-1000W for future graphics card expandability

Operating System: Windows 7 Professional x64

============

Is there anything that seems to be missing, or I should consider changing?
 
It seems that Sketchup and Maya both use V-Ray as their renderer. I'm not very experienced with this kind of software, but it seems nicely multi-threaded, and should deal well with NUMA, so dual-socket machines with four channels of memory should help. You could get a dual-socket machine for $8100 with a three-year warranty. You could even go crazy and get a quad-socket box for $13k, with a three-year warranty provided by someone else!

It's also worth looking into the licensing costs for V-Ray. It appears to be licensed by the machine, rather than by socket or by core, but if you can get a break on multiple licenses it might make sense to do something like four smaller systems sitting in another room (those fans will be loud!) with 8 E5-2620s instead of two 2687Ws. This is a nicely scalable problem, so that's 8*6*2 GHz = 96GHz effective crunching power instead of 2*8*3.1 = 49.6 GHz.
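(Back-of-the-envelope only - the multiplication above just counts cores times clock and ignores IPC, turbo and memory bandwidth, so treat it as a rough proxy for an embarrassingly parallel renderer:)

Code:
/* Aggregate "core-GHz" comparison from the post above; a rough proxy only. */
#include <stdio.h>

int main(void)
{
    double farm      = 8 * 6 * 2.0;   /* 8x E5-2620: 6 cores at 2.0 GHz each  -> 96.0 */
    double dual_2687 = 2 * 8 * 3.1;   /* 2x E5-2687W: 8 cores at 3.1 GHz each -> 49.6 */
    printf("8x E5-2620 : %.1f core-GHz\n", farm);
    printf("2x E5-2687W: %.1f core-GHz\n", dual_2687);
    return 0;
}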

But before I get too far ahead of myself: Does the hard drive get accessed while rendering? If you're touching disk, start by removing that from the equation: 24GB of memory is less than $200 these days. Try that out first, and move on to more expensive solutions only as necessary. How big are the models that he's rendering? Tens of megabytes? Tens of gigabytes?
 
For the PSU I would look at one of the Seasonic or Corsair units @ 1200W. Any unit from either of these two will be solid and last for years.

Maybe bigger SSD's?
 
All great feedback, mage and nathan.

I considered a cluster or a small rack mount, but he takes his computer with him to the houses he's had constructed and does some new rendering on site for people during open house (these are multi-million dollar homes... the last open house I attended was starting at a whopping $9M). So a clustered / rack solution won't work, or I'd definitely go that way. It's also the reason I'm avoiding a quad-processor setup (the weight and size would restrict mobility).

Warranties from online retailers are irrelevant to him; since he's constantly designing, he needs less than 48 hours of downtime.

I admit the RAM is overkill, but he wants overkill. I'm not sure why that much RAM would be necessary; I run virtualized machines on my computer and I never peak above 30GB of the 64GB I have myself.

I did consider getting larger SSDs, but currently he's working off a 256GB 1st-gen Corsair SSD... He's only using about a third of it, and he's had it for 3 years. I figured doubling the space and significantly increasing the speed would be sufficient since the long-term storage would be separate.

Not a big fan of Corsair PSUs, though Seasonic is another brand I do like. Would anyone recommend a Seasonic over a PC Power & Cooling?
 
- dedicated swap disk volume (SSD or RAM disk) physically distinct from the OS/applications drives
- more RAM (if the application does not need that much RAM for rendering scenes, make a RAM disk)
 
I considered a cluster or a small rack mount, but he takes his computer with him to the houses he's had constructed and does some new rendering on site for people during open house
What case were you going to use for the X9DAE? It's E-ATX, which isn't easy to find small cases for... The 4p box is still 4u, although it's 30kg, and the 4-node rackmount thing is 50kg. The 2p box in a reasonable case is likely to be 25kg. 4p might not be much more weight, and it's twice as many processor threads.
Warranties from online retailers are irrelevant to him; since he's constantly designing, he needs less than 48 hours of downtime.
Then he needs two systems, or at least a whole lot of spare parts. And a high priority on your time, so you can replace his motherboard if it goes. Going with a bigger company means (theoretically) more people available to solve his problem in a pinch.

Another possible advantage: you might be able to try before buying. If he spends $10k on a new system and it doesn't help, that's going to put you in a bad place.
I admit the RAM is overkill, but he wants overkill. I'm not sure why that much RAM would be necessary; I run virtualized machines on my computer and I never peak above 30GB of the 64GB I have myself.
Heh, I have a single VM with 32GB for building SmartOS. Some things just tend to use it up ;)
Not a big fan of Corsair PSUs, though Seasonic is another brand I do like. Would anyone recommend a Seasonic over a PC Power & Cooling?
These days, probably. PCP&C was at the top of the heap in 2006 or so, but I don't see them as having made real progress since then. Getting bought out by OCZ didn't help anything. Read power supply reviews here and on jonnyguru.com and you'll get a power supply that fits your needs.
 
I admit the RAM is overkill, but he wants overkill. I'm not sure why that much RAM would be necessary; I run virtualized machines on my computer and I never peak above 30GB of the 64GB I have myself.

Sorry for the late response to this thread, but as an architect myself who renders constantly, having access to tons of memory is a must when it comes to V-Ray rendering. Since we're applying tons of high-res textures to models, a render can easily go past 36GB on any given day; not only that, we're also running multiple other compositing programs in the background that can eat another 5-10GB themselves. What it boils down to, my friend, is that time is money, and any sort of delay or crash can cause havoc in our already stressful week.
 
I would do larger SSDs not just for capacity but for better wear leveling.
Also, SSDs under 256GB have performance losses compared to 256GB models due to having fewer NAND chips... which also decreases lifespan.

There are plenty of E-ATX cases... it's the other big form factors that get ya... I would know.

Seasonic for the PSU definitely.

RAM is cheap; keep it under 80% usage and spare the disks.
 
Where do you intend to connect the SSDs?
You measured the existing system. Is his application CPU or memory bound?

Depending on the memory access pattern, 8 DIMMs are OK. If you want more interleaving (DIMM, page, rank, etc.), 16 DIMMs add another boost.

If this is a compute workstation, don't put more than 2 GPUs in - they can have a significant impact on memory throughput for your CPUs.

What is the level of concurrency the user expects? (rendering one project, working on next)
What is your cooling solution wrt noise and efficiency?

48 hrs TTR? Who keeps the spare parts?

What compute options does the app have? Would a "silent" workstation and a "GPUbox" in the next room be more cost effective?
Or would 2 identical workstations with lower specs for 10k in total be the more resilient solution, if downtime is so expensive?

etc, etc,
Andy
 
^^ FYI This man knows his shit.
 
^^ FYI This man knows his shit.

.... just pretending :D .....

@RADRaze2KX,
the post above was written in a cab - now I have a keyboard and mouse :)

Please take this with a grain of salt: I am not a hardware expert, just playing around with such systems (dual E5-2687W and dual E5-2650).

Your system design is - from my current POV - inconsistent:

1. Does your customer need availability or performance?
2. Is throughput more important than latency, or vice versa?

Some comments on point 1:
If availability is important, don't put his work files on a 4-SSD RAID0 (I assume you're using consumer SSDs) until they end up in long-term archival (which, BTW, shouldn't be on the same machine).
One external hard disk as a backup drive is insufficient. A minimum of two sets is required, each with the capacity of the drive to be backed up. One set always needs to be offline, even while the other is being used for a backup. Is your user a savvy computer user who backs up regularly? What is the probability that he makes some errors? Will you practice a fast recovery with him for when the machine goes down? Will you educate him on what NOT to do in such a case?
3-4 SSDs in RAID0 will saturate the DMI interconnect between CPU0 and the C602 chipset. Basically all LAN ports, USB ports, one PCIe 2.0 slot and all SATA ports need to send their data over this 2 GB/s DMI link - for a high-performance machine, I would not recommend such an SSD config (connecting all the SSDs to the motherboard SATA ports); see the rough arithmetic sketch below. Use a dedicated SATA host bus adapter instead. I have good experience with the new-generation LSI 9207-8i (I use 6 of these with 48 SSDs in parallel; they deliver amazing performance and are very stable).
Hard disks are slow; they can still be connected to the mobo - no issue there.
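Rough arithmetic behind the DMI point above - assuming roughly 500 MB/s sequential per SATA SSD (a typical figure, not something I've measured on your drives) and about 2 GB/s usable on the DMI link:

Code:
/* Aggregate RAID0 throughput vs. the ~2 GB/s DMI link shared by LAN, USB and SATA. */
#include <stdio.h>

int main(void)
{
    double per_ssd_gbs = 0.5;   /* assumed ~500 MB/s sequential per SATA SSD */
    double dmi_gbs     = 2.0;   /* rough usable bandwidth of the DMI 2.0 link */

    for (int n = 1; n <= 4; n++) {
        double total = n * per_ssd_gbs;
        printf("%d x SSD RAID0: ~%.1f GB/s%s\n", n, total,
               total >= dmi_gbs ? "  <-- DMI saturated (and LAN/USB share it)" : "");
    }
    return 0;
}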


Some comments on point 2:
I am not familiar with the app, but ray tracing is usually embarrassingly parallel. All your 32 threads (16 cores + 16 HT) will max out when such a job is run - maximizing throughput. If your user wants to work in a speedy fashion at the same time, you need to throttle the ray-tracing (background) task to reserve some CPU capacity, so that the user still enjoys his fast workstation (latency optimized). If the interactive part of the workload is a heavy user of the built-in GPU(s), they will consume some of your memory bandwidth. Just something to be aware of.
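On Windows, one simple way to do that throttling is to launch the batch render at below-normal priority (Task Manager or "start /belownormal" achieve the same thing). A minimal Win32 sketch - "render.exe scene.vrscene" is a placeholder for whatever the real renderer command line is:

Code:
/* Launch a background render at below-normal priority so the interactive
 * apps keep their snappiness. Build with a Windows toolchain, e.g. cl launch_low.c
 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "render.exe scene.vrscene";     /* placeholder renderer + scene file */

    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE,
                        BELOW_NORMAL_PRIORITY_CLASS,   /* render yields CPU to foreground work */
                        NULL, NULL, &si, &pi)) {
        fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
        return 1;
    }

    WaitForSingleObject(pi.hProcess, INFINITE);        /* block until the render finishes */
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}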

The dual SB Xeons deliver approx. 78-80 GB/s with 1600 MHz ECC registered DRAM (STREAM benchmark). But memory bandwidth is heavily dependent on the application's memory access pattern, so this absolute number is app dependent. (Still, the dual Sandy Bridge is a super-fast workstation vs. other architectures.)

Is the application NUMA aware? If not, don't be surprised if application performance doesn't scale with the number of CPUs you throw at the problem. In extreme cases, a half-populated dual motherboard with only one E5-2687W can be faster than one with two CPUs. Recommendation: check first whether the app scales on dual-socket systems.

Cooling:
Throughput systems (running at high power for a long time) have to deal with the heat generated. If this system is not in a server room but under the desk, look for a silent and effective solution. I've done a couple of test runs on my dual E5-2687W, and just by changing the cooling system I gained 20% more performance. (Highly optimized code pulls so much energy that the fans could not keep the temperature under the threshold where CPU throttling kicks in, reducing the 8-core max frequency of 3400 MHz to whatever limit is necessary to avoid overheating.) I currently use Corsair H90 units on my workstation, which are smaller than the previous H100 and are basically silent. The goal is to keep the CPUs at 3400 MHz under load, have them run below 60 degrees, and do this in a silent way :)

I don't know the memory footprint of the application. If the data sets are large, size the total memory to avoid any swapping - swapping kills any of the other good performance attributes your system has - a performance system should never be bogged down by memory shortage.

For my own usage I run one WS with 256 GB and the other with 128 GB. On the larger WS, loading a 100GB data file into the app takes 5 seconds. Working on such sets is more or less instantaneous and profoundly changes the way you work with somewhat "big data".


To cut a long story short:
Try to get a good understanding of which applications your customer uses (when, how, concurrently or not, ...), and of what those apps do and what currently untapped options are still available. With this information, the probability of building a well-balanced dual SB system is very high. The SB architecture is really a game changer in the workstation field, imho.

Cheers,
Andy

Back to your original question:
Here is a good paper on memory configurations for your system (as said, it's app dependent):
http://globalsp.ts.fujitsu.com/dmsp...-sandy-bridge-ep-memory-performance-ww-en.pdf
 