erek ([H]F Junkie) · Joined Dec 19, 2005 · Messages 10,921
The previous thread was about GA102; this thread is about Big Ampere, the A100.
"Nvidia invented a new number format for AI, Tensor Float 32 (TF32), which its third generation Tensor Cores support. For AI acceleration, working with the smallest number of bits is desirable, since that’s more efficient for computation and data movement, but this is traded off with the accuracy of the final result. TF32 aims to strike this balance using the 10-bit mantissa (which determines precision) from half-precision numbers (FP16), and the 8-bit exponent (which determines the range of numbers that can be expressed) from single-precision format (FP32) (read more about AI number formats here).
“With this new precision, A100 offers 20 times more compute for single-precision AI, and because developers can continue to use the inputs as single-precision and get outputs back as single-precision, they do not need to do anything differently. They benefit from this acceleration automatically out of the box,” Kharya said.
The Tensor Cores now also natively support double-precision (FP64) numbers, which more than doubles performance for HPC applications."
Source: EETimes @ Nvidia Reinvents GPU, Blows Previous Generation Out of the Water: A100 Big Ampere
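To make the format concrete: TF32 keeps FP32's sign bit and 8-bit exponent but only the top 10 mantissa bits, so you can simulate TF32 rounding by clearing the low 13 of FP32's 23 mantissa bits. The sketch below is illustrative only (round-half-up via adding half a ULP, not the exact rounding mode the hardware uses, and no special handling for inf/NaN), but it shows the precision trade-off the article describes.

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 rounding of an FP32 value (illustrative sketch).

    TF32 = FP32's sign + 8-bit exponent, but only a 10-bit mantissa.
    We simulate it by rounding, then zeroing the low 13 of the
    23 FP32 mantissa bits. Not a hardware-exact model.
    """
    # Reinterpret the float's IEEE-754 single-precision bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest (half-up) by adding half of the dropped range...
    bits += 1 << 12
    # ...then truncate the low 13 mantissa bits.
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32(1.0))   # exactly representable, unchanged
print(to_tf32(0.1))   # slightly off: only 10 mantissa bits survive
```

This is why developers "do not need to do anything differently": inputs and outputs stay FP32, and only the internal multiply precision is reduced.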
"Nvidia invented a new number format for AI, Tensor Float 32 (TF32), which its third generation Tensor Cores support. For AI acceleration, working with the smallest number of bits is desirable, since that’s more efficient for computation and data movement, but this is traded off with the accuracy of the final result. TF32 aims to strike this balance using the 10-bit mantissa (which determines precision) from half-precision numbers (FP16), and the 8-bit exponent (which determines the range of numbers that can be expressed) from single-precision format (FP32) (read more about AI number formats here).
“With this new precision, A100 offers 20 times more compute for single-precision AI, and because developers can continue to use the inputs as single-precision and get outputs back as single-precision, they do not need to do anything differently. They benefit from this acceleration automatically out of the box,” Kharya said.
The Tensor Cores now also natively support double-precision (FP64) numbers, which more than doubles performance for HPC applications."
Source: EETimes @ Nvidia Reinvents GPU, Blows Previous Generation Out of the Water: A100 Big Ampere