Choosing the right CPU for a system that will support 4-6 Titan RTXs, for the purposes of TensorFlow calculations.

Gatecrasher3000

Limp Gawd
Joined
Mar 18, 2013
Messages
320
I have been tasked with building a system for my company that will be completely dedicated to machine learning using TensorFlow. We have chosen GPUs with large amounts of VRAM, so we will be using 4 or 6 Titan RTX GPUs (we skipped Quadros because of the added cost).

I already have the dual PSUs and chassis purchased; the only parts I'm still debating are the CPU and mobo.
I'm unclear about what CPU to use because of the number of PCIe lanes that 4-6 Titans could need if they were all working at the same time. I don't know whether GPUs doing machine learning calculations need as many lanes as a GPU doing game rendering. Also, let's say I'm using 6 RTX GPUs, and each (I believe) uses 16 lanes: does that mean I would need a CPU that supports 96 lanes for them to run without a lane bottleneck?
I see that the 3000 series of Threadripper has 88 lanes for PCIe I/O. Is that the highest number of lanes a consumer-grade CPU offers?

Anyway, I don't know enough about machine learning or PCIe lanes to choose the best CPU or mobo for what we need, so if anyone could shed some light, that would be super helpful.

Thank you
 

E4g1e

Supreme [H]ardness
Joined
May 21, 2002
Messages
7,142
You'll need either a multi-CPU server platform (dual-socket Xeon or higher) or an AMD Threadripper CPU. Intel's HEDT platform has only enough PCIe lanes to run three GPUs at a full x16 each; Threadripper has enough for four. None of the mainstream CPU platforms has enough PCIe lanes for more than a single GPU at x16.
 

KATEKATEKATE

Limp Gawd
Joined
Jan 20, 2019
Messages
350
Sounds like a good fit for Epyc: 128 PCIe lanes on a single socket would do the trick. I don't know that running the GPUs at x8 would cause a performance hit, though, so a Threadripper would probably work too.
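
To make the lane arithmetic in this thread concrete, here is a rough sketch of a lane budget. It assumes every GPU gets the same link width and that all of a platform's lanes are usable for GPU slots, which real motherboards do not guarantee (some lanes go to the chipset, NVMe, and NICs). The lane totals are the ones quoted in this thread, and `widest_uniform_link` is a made-up helper name for illustration.

```python
# Standard PCIe link widths, widest first.
LINK_WIDTHS = (16, 8, 4, 2, 1)


def widest_uniform_link(total_lanes: int, num_gpus: int) -> int:
    """Return the widest link width every GPU can get at once, or 0 if none fits."""
    for width in LINK_WIDTHS:
        if width * num_gpus <= total_lanes:
            return width
    return 0


# Lane totals as quoted in this thread; treat them as illustrative inputs,
# not as the number of lanes a given motherboard actually wires to its slots.
platforms = {
    "Threadripper 3000 (88 lanes, per the OP)": 88,
    "Epyc single socket (128 lanes)": 128,
}

for name, lanes in platforms.items():
    for gpus in (4, 6):
        width = widest_uniform_link(lanes, gpus)
        print(f"{name}: {gpus} GPUs -> x{width} each ({width * gpus} lanes used)")
```

Under these assumptions, six GPUs at x16 need 96 lanes, so they only fit in the 128-lane budget; with 88 lanes they would have to drop to x8.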
 

Iratus

[H]ard|Gawd
Joined
Jan 16, 2003
Messages
1,335
Kinda need to know more about what you’re doing. I’d strongly advise starting from the data and the type of learning model. It makes a material difference to how you build the machine: whether you need linked GPUs, whether (and how) you are distributing work in TensorFlow, how much of the work is one-off versus ongoing, where the data is coming from, etc.

In any multi-GPU setup you need to make sure you have enough memory and I/O to feed it. You can easily saturate 40GbE NICs. For this type of stuff Google runs custom ASICs, and AWS has dual-CPU Intel 9242s.
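
As a rough illustration of the "feeding" problem, the sketch below compares theoretical peak bandwidth of a GPU's PCIe link with a 40GbE NIC. It assumes the GPUs run at PCIe 3.0 speeds (8 GT/s per lane with 128b/130b encoding) and ignores protocol overhead, so real throughput will be lower. The function names are made up for this example.

```python
PCIE3_GT_PER_S = 8.0        # PCIe 3.0 raw rate per lane, in gigatransfers per second
PCIE3_ENCODING = 128 / 130  # 128b/130b line encoding: 128 payload bits per 130 sent


def pcie3_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s for a link of `lanes` lanes."""
    return PCIE3_GT_PER_S * PCIE3_ENCODING * lanes / 8


def ethernet_gb_per_s(gigabits: float) -> float:
    """Line rate of an Ethernet link in GB/s, given its speed in Gb/s."""
    return gigabits / 8


for lanes in (8, 16):
    print(f"PCIe 3.0 x{lanes}: {pcie3_gb_per_s(lanes):.2f} GB/s per GPU")
print(f"40GbE NIC: {ethernet_gb_per_s(40):.2f} GB/s for the whole box")
```

Even a single x8 link can, on paper, carry more than one 40GbE NIC delivers, which is why where the data comes from matters as much as the lane count.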

I’d strongly recommend talking to the data guys and getting more details, then spinning something up in AWS to see how scaling the GPUs helps, playing around with CPU options to see how many cores you need, etc. Then buy hardware that reflects it, but only if you are going to use it more than 8-10 hours a day. Otherwise just use the cloud.
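
The buy-versus-cloud rule of thumb can be sanity-checked with a simple break-even estimate. This is only a sketch: the hardware cost and hourly cloud rate below are hypothetical placeholders, not real prices, and the calculation ignores power, cooling, admin time, and depreciation.

```python
def break_even_days(hardware_cost: float, cloud_rate_per_hour: float,
                    hours_per_day: float) -> float:
    """Days of use after which buying costs less than renting at the given daily usage."""
    return hardware_cost / (cloud_rate_per_hour * hours_per_day)


# Hypothetical placeholder figures; substitute your own quotes.
HARDWARE_COST = 20_000.0  # one-off cost of the box
CLOUD_RATE = 10.0         # cost per hour of a comparable cloud instance

for hours in (2, 8, 24):
    days = break_even_days(HARDWARE_COST, CLOUD_RATE, hours)
    print(f"{hours:>2} h/day -> break even after about {days:.0f} days")
```

The heavier the daily usage, the sooner owned hardware pays for itself; at light usage the break-even point can land beyond the useful life of the GPUs.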
 