NVIDIA DGX GB200 NVL72

erek

Interesting

"Hopefully, this was a quick and fun look at the NVIDIA DGX GB200 NVL72. For some who were unsure of what they were seeing, we had our original NVIDIA GTC 2024 Keynote Coverage that goes into many of the components. Still, it is cool to see in person and get a bit of a closer look."


Source: https://www.servethehome.com/this-is-the-nvidia-dgx-gb200-nvl72/
 
https://www.nextplatform.com/2024/0...ystems-attack-1-trillion-parameter-ai-models/
The one to compare, as Huang walked through during his keynote, was how to train the 1.8 trillion parameter GPT-4 Mixture of Experts LLM from OpenAI. On a cluster of SuperPODs based on the Hopper H100 GPUs using InfiniBand outside of the node and NVLink 3 inside of the node, it took 8,000 GPUs 90 days and 15 megawatts of juice to complete the training run. To do the same training run in the same 90 days on the GB200 NVL72, it would take only 2,000 GPUs and 4 megawatts. If you did it across 6,000 Blackwell B200 GPUs, it would take 30 days and 12 megawatts.

This is not really a computation issue as much as it is an I/O and computation issue, Buck explained to us. With these Mixture of Experts models, there are many more layers of parallelism and communication across and within those layers. There is the data parallelism – breaking the data set into chunks and dispatching parts of the calculation to each GPU – that is the hallmark of HPC and early AI computing. Then there is tensor parallelism (breaking a given calculation matrix across multiple tensor cores) and pipeline parallelism (dispatching layers of the neural network processing to individual GPUs to process them in parallel to speed them up). And now we have model parallelism as we have a mixture of experts who do their training and inference so we can see which one is the best at giving this kind of answer.

It hurts to think about it, and you need an AI to keep track of it all probably. . . . Buck says that to figure out the right configurations of parallelism to run GPT-4 training on the GB200 NVL72 cluster, Nvidia did more than 2,600 experiments to figure out the right way to create the hardware and dice and slice the model to make it run as efficiently as possible.
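To make the parallelism juggling above a bit more concrete, here is a minimal Python sketch (the class name and the specific degrees are made up for illustration, not NVIDIA's actual GPT-4 configuration): a training setup picks a degree for each axis (data, tensor, pipeline, expert), and the product of those degrees has to fill the GPU count, which is why there are thousands of candidate configurations worth experimenting over.

from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    data: int      # data parallelism: replicas each train on a slice of the batch
    tensor: int    # tensor parallelism: one matrix multiply split across GPUs
    pipeline: int  # pipeline parallelism: consecutive layers placed on different GPUs
    expert: int    # expert (model) parallelism: MoE experts spread across GPUs

    def total_gpus(self) -> int:
        # every combination of the four degrees maps to one GPU
        return self.data * self.tensor * self.pipeline * self.expert

# One of many ways to carve up a 2,000-GPU GB200 NVL72 cluster (hypothetical numbers):
cfg = ParallelismConfig(data=25, tensor=4, pipeline=5, expert=4)
assert cfg.total_gpus() == 2000
print(cfg)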


The networking part of it seems really big and complicated (and seeing just how much goes into that rack, it makes you understand how they can keep such margins in such a large market without a competitor cutting them down to at least regular very-high-margin levels... no one can do it even at that price, let alone cheaper).


[Image: nvidia-blackwell-gpt-moe-1.8t-versus-hopper.jpg, NVIDIA's chart comparing Blackwell and Hopper on the GPT-MoE-1.8T training run]
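A quick back-of-envelope script (Python, using only the GPU counts, days, and megawatts quoted above; energy is just power times time, ignoring utilization) shows what that chart boils down to: the GB200 NVL72 and B200 runs land on the same GPU-days and energy, they just trade GPU count against wall-clock time, while the Hopper run needs 4x the GPU-days and nearly 4x the energy.

scenarios = {
    "Hopper H100 SuperPOD": {"gpus": 8000, "days": 90, "megawatts": 15},
    "GB200 NVL72":          {"gpus": 2000, "days": 90, "megawatts": 4},
    "Blackwell B200":       {"gpus": 6000, "days": 30, "megawatts": 12},
}

for name, s in scenarios.items():
    gpu_days = s["gpus"] * s["days"]
    energy_mwh = s["megawatts"] * 24 * s["days"]  # MW times hours of the run
    print(f"{name:22s} {gpu_days:>9,} GPU-days  ~{energy_mwh:>7,} MWh")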



Things change so fast. I only learned about 4-bit inference last week... and now it seems that if you are about to launch an ML product that does inference without hardware acceleration for it... you are in trouble.
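For anyone who (like me) only just ran into 4-bit inference, here is a minimal sketch of the basic idea: plain symmetric int4 weight quantization in NumPy. This is a generic illustration, not NVIDIA's FP4 format or any particular library's implementation.

import numpy as np

def quantize_int4(w):
    """Map float weights to integers in [-8, 7] plus one per-tensor scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # only 4 bits of range used
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int4(w)
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))

Storing the weights in 4 bits instead of 16 cuts weight memory (and memory traffic) by 4x, which is why hardware that can multiply directly in a 4-bit format matters so much for inference.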
 