Nvidia Reinvents GPU, Blows Previous Generation Out of the Water: A100 Big Ampere

erek

Previous thread was about GA102. This thread is about Big Ampere or A100.

"Nvidia invented a new number format for AI, Tensor Float 32 (TF32), which its third generation Tensor Cores support. For AI acceleration, working with the smallest number of bits is desirable, since that’s more efficient for computation and data movement, but this is traded off with the accuracy of the final result. TF32 aims to strike this balance using the 10-bit mantissa (which determines precision) from half-precision numbers (FP16), and the 8-bit exponent (which determines the range of numbers that can be expressed) from single-precision format (FP32) (read more about AI number formats here).

“With this new precision, A100 offers 20 times more compute for single-precision AI, and because developers can continue to use the inputs as single-precision and get outputs back as single-precision, they do not need to do anything differently. They benefit from this acceleration automatically out of the box,” Kharya said.

The Tensor Cores now also natively support double-precision (FP64) numbers, which more than doubles performance for HPC applications."




Source: EETimes @ Nvidia Reinvents GPU, Blows Previous Generation Out of the Water: A100 Big Ampere
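
For reference, here is how those bit widths line up (a quick sketch; the sign/exponent/mantissa splits are the standard IEEE-style fields, with TF32's taken from Nvidia's description above):

    # (sign, exponent, mantissa) field widths for the formats discussed
    formats = {
        "FP32": (1, 8, 23),
        "TF32": (1, 8, 10),  # FP32's exponent (range) + FP16's mantissa (precision)
        "FP16": (1, 5, 10),
    }
    for name, (s, e, m) in formats.items():
        print(f"{name}: {s + e + m} bits total (exponent {e}, mantissa {m})")

So TF32 is effectively a 19-bit format carried in FP32 storage.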
 
Powered by 54 billion transistors, it’s the world’s largest 7nm chip, according to Nvidia, delivering more than one peta-operation per second.

This is the kind of thing expected for GTC, which tells us very little about Gaming Ampere.

But, unless that's a typo, NVidia somehow managed a 50% transistor density improvement over AMD on 7nm.

AMD's 7nm GPUs are ~40 MT/mm²; this GPU would appear to be ~60 MT/mm².

If they somehow actually hit that transistor density, then NVidia may have a lot more transistors for gaming GPUs...
 
Another source citing 54 billion transistors, and I don't think they got it from EETimes. Likely they have the same press release/briefing from NVidia:
https://www.marketwatch.com/story/n...nd-the-first-target-is-coronavirus-2020-05-14

Ampere, a 7-nanometer processor that holds more than 54 billion transistors, takes the idea of parallel processing and multiplies it — each individual A100 GPU, the first launched with Ampere, can be partitioned to run up to seven different actions or dedicated to a single need, Huang said. The company has bundled eight of those GPUs together into the DGX A100, which can handle up to 56 tasks at once or be combined into one large task, and reach up to 5 petaflops of AI performance.

Looking more like 54 billion is for real. The question is how? A 1200 mm² die? A massive transistor density improvement over and above what AMD has on 7nm??
 
This is the kind of thing expected for GTC, which tells us very little about Gaming Ampere.

But, unless that's a typo, NVidia somehow managed a 50% transistor density improvement over AMD on 7nm.

AMD's 7nm GPUs are ~40 MT/mm²; this GPU would appear to be ~60 MT/mm².

If they somehow actually hit that transistor density, then NVidia may have a lot more transistors for gaming GPUs...
Where did you get the die size? I didn't see it quoted anywhere.
 
Eh, I hope this GTC event will be a bit different, since they canceled the Ampere reveal earlier this year.
Most of us don't really care about server crap. Nor can we afford a $200,000 server. Man, for that amount of money I'd rather buy a 3D bioprinter, print Ebola, and then scavenge what's left when everyone's dead.
 
Where did you get the die size? I didn't see it quoted anywhere.

People were estimating up to 840 mm² based on the render in the story. If it were the same transistor density as AMD's 7nm, then the die would need to be >1300 mm², which seems equally extreme.
 
People were estimating up to 840 mm² based on the render in the story. If it were the same transistor density as AMD's 7nm, then the die would need to be >1300 mm², which seems equally extreme.
NVIDIA could be using TSMC's newer N7+ node, which has a 1.2x density improvement compared to the original N7 node AMD used with Navi. That would increase density from about 41 MT/mm² to about 49 MT/mm². A die with 54 billion transistors at that density would be about 1100 mm². I don't think it's outside the realm of possibility NVIDIA produced a die this large.
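
Running those numbers (a quick sanity check; the ~41 MT/mm² Navi density and 1.2x N7+ scaling are the assumptions stated above, not official figures):

    # Die-size estimate from transistor count and assumed density
    transistors = 54e9                  # A100, per Nvidia
    n7_density = 41e6                   # ~Navi density on N7, transistors/mm^2
    n7plus_density = n7_density * 1.2   # ~49 MT/mm^2 with N7+'s claimed 1.2x

    print(f"At N7 density:  {transistors / n7_density:,.0f} mm^2")      # ~1,317 mm^2
    print(f"At N7+ density: {transistors / n7plus_density:,.0f} mm^2")  # ~1,098 mm^2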
 
Haven't done my research, but this literally doesn't add up...

10-bit mantissa (which determines precision) from half-precision numbers (FP16), and the 8-bit exponent (which determines the range of numbers that can be expressed) from single-precision format (FP32)

and

and because developers can continue to use the inputs as single-precision and get outputs back as single-precision, they do not need to do anything differently

Not entirely sure how you can take a 32-bit FP number, convert it to a 19-bit one (1 sign + 8 exponent + 10 mantissa bits) and not do anything differently. The conversion will induce a loss of precision via truncation. Using TF32 natively, though, yeah, that's a pretty significant addition in precision over typical FP16. Not sure how exactly that translates into compute gains, but whatever.
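
For what it's worth, the truncation is easy to picture (a sketch, assuming TF32 conversion simply drops the low 13 of FP32's 23 mantissa bits; the actual hardware rounding behavior isn't specified in the article):

    import struct

    def to_tf32(x: float) -> float:
        """Truncate an FP32 value to TF32 precision (10-bit mantissa)."""
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        bits &= ~((1 << 13) - 1)  # zero the low 13 mantissa bits
        return struct.unpack("<f", struct.pack("<I", bits))[0]

    print(to_tf32(3.14159265))  # 3.140625 -- the precision lost in conversion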
 
Yeah, too bad my company blocked YouTube so we don't clog the network since everyone is working from home :(
I'll just read the summary once it's out.
 
Yeah, too bad my company blocked YouTube so we don't clog the network since everyone is working from home :(
I'll just read the summary once it's out.
It's just the intro, so nothing interesting yet. Just telling investors how awesome the company is and what it did for them this past year.
 
It's just the intro, so nothing interesting yet. Just telling investors how awesome the company is and what it did for them this past year.

There are 8 parts posted on YouTube. Are all 8 parts intro, or are you referring to part 1?
 
There are 8 parts posted on YouTube. Are all 8 parts intro, or are you referring to part 1?
Part 1. I replied before looking at the channel. Part 6 is Jensen's presentation of the A100 that this thread is about.
 
This GTC registration stuff is stupid. I waited around, then saw on here that it's happening on YouTube? What?


 
This is something the company I work for would likely be testing once available... if the oil industry weren't in the shitter. We analyze seismic data, but since all of the oil companies are tightening their belts, new jobs are gonna be scarce.
 
Anandtech is now reporting 826 mm², which makes the density ~65 MT/mm² versus AMD's ~40 MT/mm² on Navi.

That's some seriously magic process tuning.
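
The arithmetic from the reported numbers (a rough check; Navi 10's published figures are 10.3B transistors on 251 mm²):

    a100_density = 54e9 / 826    # ~65.4 MT/mm^2
    navi_density = 10.3e9 / 251  # ~41.0 MT/mm^2
    print(f"A100:    {a100_density / 1e6:.1f} MT/mm^2")
    print(f"Navi 10: {navi_density / 1e6:.1f} MT/mm^2")
    print(f"Ratio:   {a100_density / navi_density:.2f}x")  # ~1.59x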

Cache is much denser than logic. I'm sure the crapton of L2 cache is helping A100's density numbers.
 
That DLSS 2.0 540p -> 1080p was pretty impressive. And the 720p -> 1080p looked better than a native 1080p render. Kinda nuts...

I need to rewatch this segment. I thought I heard Jensen say that the 16K render was one of the training images. If that's the case, that's cheating - you can't use your training set to prove how well your trained model works. That would be a pretty sloppy mistake, so I'll just assume I misheard him.
 
This is the kind of thing expected for GTC, which tells us very little about Gaming Ampere.

But, unless that's a typo, NVidia somehow managed a 50% transistor density improvement over AMD on 7nm.

AMD's 7nm GPUs are ~40 MT/mm²; this GPU would appear to be ~60 MT/mm².

If they somehow actually hit that transistor density, then NVidia may have a lot more transistors for gaming GPUs...
TSMC has 3 different 7nm processes: N7, N7P, and N7+. AMD using N7 and nVidia using N7+ would account for that difference. N7P is the evolution of N7, and chips designed for N7 can be made on the N7P process, but N7+ is fundamentally different and requires its own design process.
 
This announcement covered the topics expected, but the content blew me away. The A100 performs way above what I was expecting, and they are much further along with integrating Mellanox tech into their designs than I thought they would be.
 
I need to rewatch this segment. I thought I heard Jensen say that the 16K render was one of the training images. If that's the case, that's cheating - you can't use your training set to prove how well your trained model works. That would be a pretty sloppy mistake, so I'll just assume I misheard him.

He never claimed the 16k image was produced by the model.
 
TSMC has 3 different 7nm processes: N7, N7P, and N7+. AMD using N7 and nVidia using N7+ would account for that difference. N7P is the evolution of N7, and chips designed for N7 can be made on the N7P process, but N7+ is fundamentally different and requires its own design process.
Cache is much denser than logic. I'm sure the crapton of L2 cache is helping A100's density numbers.
AMD Renoir (Zen 2 + Vega 7nm APU) is estimated at a bit above 63M transistors per mm², if something more cache-heavy is wanted as a recent 7nm reference. It's actually denser than the Zen 2 compute die (~52M transistors per mm²), and that compute die doesn't even have DDR4, PCIe, and USB controllers/PHYs.

Either way, afaik TSMC no longer uses the N7+ and N7P naming; it's all N7.
 
There's no new technology here. Floating point registers have always worked like this. All they've done is move the index which marks the boundary between the exponent and the mantissa. I wonder how hard it would be for them to implement an instruction set which allowed programmers to change that index to an arbitrary value? I imagine it could explode the transistor count, since you might need several extra permutations of the circuits for all of the operations coming out of those registers (i.e., *, /, +, -, ^, %, AND, OR, XOR, etc.). Maybe that's why they're doing it the easy way?
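
In software, at least, an arbitrary split is trivial to emulate; the hard part would be the variable-width hardware datapaths. A toy sketch (ignores subnormals, infinities, and NaNs):

    import math

    def quantize(x: float, exp_bits: int, man_bits: int) -> float:
        """Round x to a toy float format with the given exponent/mantissa widths."""
        if x == 0.0:
            return 0.0
        bias = (1 << (exp_bits - 1)) - 1
        e = math.floor(math.log2(abs(x)))
        e = max(-bias, min(bias, e))     # clamp to representable exponents
        scale = 2.0 ** (man_bits - e)
        return round(x * scale) / scale  # keep man_bits fractional bits

    print(quantize(3.14159265, 8, 10))  # TF32-like split -> 3.140625
    print(quantize(3.14159265, 5, 10))  # FP16-like split -> 3.140625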
 
He never claimed the 16k image was produced by the model.

Right, I never said he said that. I still need to go back and rewatch, but I thought he said that the 16K render was one of the training images. You can't simply test your trained model against the dataset used to train it. That's cheating. That's the same as handing someone an answer key to a physics test and then telling everyone what a brilliant teacher you are because you taught that guy physics.

I'll give it a rewatch tomorrow to see if I simply misheard him (this seems the most plausible to me).
 
This is the kind of thing expected for GTC, which tells us very little about Gaming Ampere.

But, unless that's a typo, NVidia somehow managed a 50% transistor density improvement over AMD on 7nm.

AMD's 7nm GPUs are ~40 MT/mm²; this GPU would appear to be ~60 MT/mm².

If they somehow actually hit that transistor density, then NVidia may have a lot more transistors for gaming GPUs...

Nvidia is probably manufacturing this die at Samsung instead of TSMC. We already had confirmation that Nvidia will be using Samsung for some of the dies, and typically HPC stuff is done at Samsung due to the lower cost (from Samsung "incentives"). It wouldn't surprise me if that's the case again, knowing that Nvidia has such large costs on these dies.
Samsung 7nm uses EUV, which will let them get much better transistor clarity in the etch. That not only allows for a higher percentage of good chips but could very well allow for higher density too, because you don't have to design in so much "wiggle room" around transistor feature sharpness.
Or, if Nvidia is using TSMC's 7nm+, that is the higher-density EUV version of the process node there, which has the same benefits listed above.


Anandtech is now reporting 826 mm², which makes the density ~65 MT/mm² versus AMD's ~40 MT/mm² on Navi.

Doesn't it just boggle the mind that you can fit millions of anything we deliberately create into a single square millimeter of space? :eek:
 
Nvidia is probably manufacturing this die at Samsung instead of TSMC. We already had confirmation that Nvidia will be using Samsung for some of the dies, and typically HPC stuff is done at Samsung due to the lower cost (from Samsung "incentives"). It wouldn't surprise me if that's the case again, knowing that Nvidia has such large costs on these dies.
Samsung 7nm uses EUV, which will let them get much better transistor clarity in the etch. That not only allows for a higher percentage of good chips but could very well allow for higher density too, because you don't have to design in so much "wiggle room" around transistor feature sharpness.
Or, if Nvidia is using TSMC's 7nm+, that is the higher-density EUV version of the process node there, which has the same benefits listed above.

According to Nvidia it's fabbed on TSMC 7nm; which exact 7nm is not specified. Nvidia themselves have not used Samsung for HPC products, so I'm not sure where you got that impression. The only Nvidia GPU that's been acknowledged to be fabbed by Samsung so far is GP107.

TSMC 7nm+ claims only a 20% density gain over its first-gen 7nm. That alone will not account for the density difference.
 
According to Nvidia it's fabbed on TSMC 7nm; which exact 7nm is not specified. Nvidia themselves have not used Samsung for HPC products, so I'm not sure where you got that impression. The only Nvidia GPU that's been acknowledged to be fabbed by Samsung so far is GP107.

TSMC 7nm+ claims only a 20% density gain over its first-gen 7nm. That alone will not account for the density difference.
Advances in dummy gate design could make up the rest. There have been some big breakthroughs in the last year or two.
 