Does NUMA Matter when Installing SAS HBAs, NVMe Drives and Other Expansion in MP Systems?

Zarathustra[H]

Hey all,

I have a dual socket server board (Supermicro X9DRI-F with two Xeon E5-2650 v2s). As I was installing a few SAS HBAs and a 10Gb Ethernet card, and preparing for future NVMe drives, it struck me.

Each of the PCIe slots is labeled with which CPU its lanes are attached to: two slots (16x and 8x) go to CPU0, three slots (16x, 8x and 8x) go to CPU1, and one 8x slot goes to the chipset.

In order to maximize the efficiency/performance of the system, should I be caring about which slot things go into? Grouping all storage onto one CPU? Or distributing it across them? Does it matter at all?
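
For reference, the same mapping can be read back from a running Linux install. Here's a rough sketch (nothing board-specific is hard-coded; it just lists whatever storage and network controllers sysfs reports and which NUMA node each one claims):

Code:
import glob, os

# List storage (class 0x01xxxx) and network (class 0x02xxxx) PCIe devices
# along with the NUMA node sysfs reports for them. A value of -1 means the
# kernel doesn't associate the device with a particular node.
for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    try:
        with open(os.path.join(dev, "numa_node")) as f:
            node = f.read().strip()
        with open(os.path.join(dev, "class")) as f:
            devclass = f.read().strip()
    except OSError:
        continue
    if devclass.startswith(("0x01", "0x02")):
        print(f"{os.path.basename(dev)}  class={devclass}  numa_node={node}")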

I've been running a couple of dual socket servers for a few years now, and I never even thought about this before, but googling doesn't seem to give me much in the way of answers.

Appreciate any input.
 
It may be theoretically faster if you distribute, but I don't think it matters unless you are using SSDs. With that said, I have subscribed.
 
Yes, it very much matters when you are actually running a system at full tilt, or are concerned about latency. Crossing NUMA domains is not free, especially not for I/O.

Some network monitoring boxes at work were misconfigured with processes on CPU1 monitoring traffic from cards on CPU0. Was NOT pretty.
 
Absolutely. I dealt with an issue where adding CPUs was the worst thing we could have done to a Cassandra cluster. We had to pin cores and banks of memory to a VM while reducing the total number of cores for the impacted VMs. A monster VM with 32 cores spread across multiple CPUs was a failure, whereas 8 cores on one CPU pinned to its own memory bank made the issue basically vanish. Once we started following a NUMA standard, a lot of the I/O contention vanished.

You definitely will get punched in the face if you don't follow NUMA layouts. My arrays can push a million+ IOPS, yet we didn't even see the problem at the array level; it manifested as I/O wait in the VM. Fastest way to tell if you are getting face-punched by it? If you have 10 VMs sharing the exact same datastore, but only one shows I/O wait, the others show zero, and the datastore itself has super low utilization? Yeah, that VM is your problem child.
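
If anyone wants to play with the idea by hand before touching hypervisor settings, here's a rough sketch of node pinning on Linux (the node number is an assumption, pick the socket you care about; for VMs you would do the equivalent through the hypervisor's CPU/NUMA affinity settings, not from inside the guest):

Code:
import os

NODE = 0  # assumed: the socket you want the workload to stay on

# sysfs publishes each node's CPUs as a list like "0-7,16-23"
with open(f"/sys/devices/system/node/node{NODE}/cpulist") as f:
    cpus = set()
    for part in f.read().strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))

os.sched_setaffinity(0, cpus)  # 0 = this process
print(f"pinned to node{NODE} CPUs: {sorted(cpus)}")

Memory locality isn't covered by that; on bare metal you'd reach for numactl --membind for that side of it.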
 
Hmm.

So I am thinking along the lines of this. You are sticking a single SAS HBA or a single 10Gbit NIC in a dual socket server. Do you choose socket 0 or socket 1?

Now assume you have a pool spread over multiple SAS HBA cards. Do you put one on each socket, or try to get them both on the same socket and tie storage-heavy workloads to that first socket?
 
Tough question. I would think I/O would be a bigger demand than networking, but based on your previous posts it almost seems like your workloads are pretty even. It's easy for me to pop off, but maybe do something like get two SAS cards and create two storage pools, one per SAS card. Run networking off a USB 10gig adapter? You can tailor your I/O workload to each pool, more so because it looks like you run a lot of mixed-load stuff. One card does RAID 1, the other does RAID 5 or something.

To help make your decision, why not monitor your hypervisor for a week. Watch I/O wait versus disk activity. That'll tell you your direction for sure. When I do this stuff at work, I put my sysadmin hat on and bust out my favorite collection tool, nmon. Nmon looks like a nicer version of top, but with a great collection mode: http://nmon.sourceforge.net/pmwiki.php Then get extra fancy with the Excel plug-in for it. That will answer pretty much any question you could ever have about the Linux OS you're getting benchmarks off of. I know 2003 called and wants its crap website design back, but this tool will help: https://www.ibm.com/developerworks/...ng=en#!/wiki/Power+Systems/page/nmon_analyser

Edit: I know IBM and AIX are plastered all over both of those websites, but nmon was ported to Linux years ago.
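
If you'd rather script a quick check than set up nmon collection, something like this works too (it uses psutil, which is a third-party module and not something mentioned here, so treat that as an assumption). On a two-socket box, iowait piling up only on one socket's CPUs while the other socket sits idle is a decent hint that your I/O and your threads live on different NUMA nodes:

Code:
import time
import psutil  # third-party (pip install psutil); an assumption, not part of this thread's toolbox

INTERVAL = 5  # seconds per sample; Ctrl-C to stop

while True:
    # per-CPU times over the sample window, including iowait (Linux)
    percpu = psutil.cpu_times_percent(interval=INTERVAL, percpu=True)
    worst = max(range(len(percpu)), key=lambda i: percpu[i].iowait)
    print(f"{time.strftime('%H:%M:%S')}  worst iowait: cpu{worst} at {percpu[worst].iowait:.1f}%")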
 
I would not use a dual socket server to drive only a single NIC and a single HBA; a fully populated system is rarely used that way, unless it is over-specced to leave room for future growth.

Usually a beefy server is running many VMs or doing multiple tasks, so I would try to align the physical I/O with whatever is running on each socket.

There are quite a few half-populated dual socket systems out there as well; it used to make sense to buy some servers that way because of the platform.

Otherwise I would balance the cards, e.g. with two NICs and two HBAs: split both across the sockets and assign the network ports and drives to the appropriate VMs/cores. This is what my work boxes do, at least when they are configured correctly.

Better answer for today: Go TR (ideally the next one) or Epyc and get 60 or 128 lanes in one domain ;)
 
Much obliged. Looks like my IO delays for the last week are pretty damned low, so maybe I am worrying for no reason. It seems to max out at about 0.5% during normal use, and only peaked at just over 1% once during a particularly heavy I/O transfer.

[Attached screenshot: hypervisor "IO Delay %" graph for the last week]


I'm guessing what happens is that if a thread on CPU1 needs to access a device connected to a PCIe port belonging to CPU0, it needs to traverse the QPI link, which is perfectly fine at low load but could get congested if the load is really high.

Maybe the way to do it is to distribute it: split it over the two CPUs. That way the QPI sees at most the load of half the pool reads, which minimizes the impact, and all workloads get hit evenly regardless of which CPU their threads are running on, rather than some being hit hard and others not at all.
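
Maybe before I commit to re-slotting anything I'll try a crude A/B test along these lines (rough sketch only; the file path, sizes and node numbers are made up, and for a halfway fair comparison the file needs to be much bigger than RAM, or the page cache dropped between runs):

Code:
import os, time

TEST_FILE = "/tank/bigfile.bin"  # hypothetical: a large file living on the pool
CHUNK = 1 << 20                  # 1 MiB reads
READ_BYTES = 4 << 30             # 4 GiB per run

def node_cpus(node):
    # /sys/devices/system/node/nodeN/cpulist looks like "0-7,16-23"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

for node in (0, 1):
    os.sched_setaffinity(0, node_cpus(node))  # pin this process to one socket
    fd = os.open(TEST_FILE, os.O_RDONLY)
    done = 0
    t0 = time.perf_counter()
    while done < READ_BYTES:
        buf = os.read(fd, CHUNK)
        if not buf:                            # hit EOF, start over
            os.lseek(fd, 0, os.SEEK_SET)
            continue
        done += len(buf)
    secs = time.perf_counter() - t0
    os.close(fd)
    print(f"node{node}: {done / secs / 2**20:.0f} MiB/s")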
 
Is the axis on the left response time in ms, or percent? If you are rocking 5 ms or better this very moment on that setup, don't change your setup. What's your average response time in ms during peak loads? 5 ms or better is what I'm required to keep at my desk at work. If you rock that at home, then you are already in a great spot. Your current workload doesn't appear to be a good driver to change it. However, if you want to just because it's your hobby, then absolutely, totally do it. The biggest boost in performance could be additional RAM or higher clocked CPUs. Gobs of storage is great for bragging rights, but if you are not consuming it, you just wasted a bunch of money.
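
If you want a rough millisecond number to compare against that 5 ms bar, a sketch like this times random 4 KiB reads (the file path is made up, and since these are plain buffered reads the file should be much larger than RAM, otherwise the cache flatters the result; a proper tool like fio gives cleaner numbers):

Code:
import os, random, statistics, time

PATH = "/tank/bigfile.bin"  # hypothetical: a large file on the pool under test
BLOCK = 4096
SAMPLES = 500

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size

lat_ms = []
for _ in range(SAMPLES):
    off = random.randrange(0, size - BLOCK)   # random offsets => mostly cache misses on a huge file
    t0 = time.perf_counter()
    os.pread(fd, BLOCK, off)                  # one 4 KiB read
    lat_ms.append((time.perf_counter() - t0) * 1000.0)
os.close(fd)

lat_ms.sort()
print(f"avg {statistics.mean(lat_ms):.2f} ms  "
      f"p95 {lat_ms[int(0.95 * len(lat_ms))]:.2f} ms  "
      f"max {lat_ms[-1]:.2f} ms")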
 
It's expressed in percent. I don't know the percent of what though. :p The legend says "IO Delay %"
 
Seems like your setup is tuned really well already. Clone your boot drive on this system and set it aside. Take this system apart and tinker with it to answer your own questions. Run benchmarks. If you see gains, sweet. If not, you've got a rollback process. It's been 100+ in Texas lately, and that kind of project is perfect for a Saturday when it's scorched earth outside. We can spitball all day, but I say grab a screwdriver and get to work. ;)
 