Nvidia Blackwell rumor engine

Lakados

[H]F Junkie
Joined
Feb 3, 2014
Messages
10,442
So I was googling info about fitting a Telemon Heavy Dreadnought and I found this instead, the way the algorithm works astounds me.

https://wccftech.com/rumor-nvidia-r...-biggest-performance-leap-in-nvidias-history/

A dedicated ray tracing denoising accelerator, which would be a pretty big uplift over the generic CUDA accelerated one now.
Few other claims, obviously nothing concrete as Blackwell is like 18 months out but yeah. Something Nvidia probably leaked to steal AMD's thunder.
 
So I was googling info about fitting a Telemon Heavy Dreadnought and I found this instead, the way the algorithm works astounds me.

https://wccftech.com/rumor-nvidia-r...-biggest-performance-leap-in-nvidias-history/

A dedicated ray tracing denoising accelerator, which would be a pretty big uplift over the generic CUDA accelerated one now.
Few other claims, obviously nothing concrete as Blackwell is like 18 months out but yeah. Something Nvidia probably leaked to steal AMD's thunder.
White knighting for the loss.

Maybe it wasn't to steal AMD's thunder so much as to toot their own horn. Last I looked, most companies try to promote their own stuff.
 
Maybe it wasn't to steal AMD's thunder so much as to toot their own horn. Last I looked, most companies try to promote their own stuff.
Well, Nvidia just tends to have "leaks" for their products when a competitor in that space has new stuff in the news. If people are talking about something in a market Nvidia plays in that isn't Nvidia's, you can rest assured a leak concerning Nvidia is just around the corner to shift the conversation. Toot their horn, steal some thunder, shift some headlines, call it what you will; it seems to happen like clockwork.
 
Well, Nvidia just tends to have "leaks" for their products when a competitor in that space has new stuff in the news. If people are talking about something in a market Nvidia plays in that isn't Nvidia's, you can rest assured a leak concerning Nvidia is just around the corner to shift the conversation. Toot their horn, steal some thunder, shift some headlines, call it what you will; it seems to happen like clockwork.
Seriously, WGAF when all is said and done?
 
So I was googling info about fitting a Telemon Heavy Dreadnought and I found this instead, the way the algorithm works astounds me.

https://wccftech.com/rumor-nvidia-r...-biggest-performance-leap-in-nvidias-history/

A dedicated ray tracing denoising accelerator, which would be a pretty big uplift over the generic CUDA accelerated one now.
Few other claims, obviously nothing concrete as Blackwell is like 18 months out but yeah. Something Nvidia probably leaked to steal AMD's thunder.
AMD has 5 generations of chiplet CPUs released, and appears to have some growing pains with their chiplet GPU. I'll be very surprised if Nvidia pulls off their first chiplet product without a hitch.
 
Nvidia does do something that looks like chiplet interconnects in their Arm datacenter segment, and it is maybe closer to their GPUs than AMD's CPUs are to AMD's GPUs:
https://www.nvidia.com/en-us/data-center/nvlink-c2c/#:~:text=NVLink-C2C extends the industry,coherently interconnected with custom silicon.

https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/ They can connect them a la Apple's M1 Ultra:
https://hothardware.com/news/nvidia-144-core-grace-cpu-superchip-nvlink-c2c-interconnect

Close enough to feed the rumor mill about the next gen being chiplet-based, at least.
 
AMD has 5 generations of chiplet CPUs released, and appears to have some growing pains with their chiplet GPU. I'll be very surprised if Nvidia pulls off their first chiplet product without a hitch.

If I were them (and I bet they are doing this)... I would have a solid performance uplift goal for a Lovelace refresh. Aim 10-20% higher than normal... just in case.
If they can get a Lovelace refresh that is a 20-30% bump, that might give them what they need to basically skip a consumer generation. Do what they did with Volta: Volta added all the tensor cores but really wasn't a win for raster, so they just skipped it for consumer. Consumers at the time had zero need of expensive tensor cores.
This round they could easily release a Lovelace refresh that overperforms for a refresh and buy themselves time for a datacenter-only first chiplet part. Get second gen in the pipe a little faster than usual. If AMD has decided not to target Nvidia's flagship halo cards with their refresh of RDNA 3, NV might even get away with that without losing #1 bragging rights.
 
AMD has 5 generations of chiplet CPUs released, and appears to have some growing pains with their chiplet GPU. I'll be very surprised if Nvidia pulls off their first chiplet product without a hitch.
Well, Lovelace was originally going to be their first chiplet-based GPU, but they held it back because it wasn't ready.
Nvidia already does some chiplet things in enterprise (their Grace CPU, for instance) and those are rocking it.
https://www.nvidia.com/en-us/data-center/nvlink-c2c/
Nvidia's inter-chip communications platforms are top-of-the-line.
 
Well, Lovelace was originally going to be their first chiplet-based GPU, but they held it back because it wasn't ready.
Nvidia already does some chiplet things in enterprise (their Grace CPU, for instance) and those are rocking it.
https://www.nvidia.com/en-us/data-center/nvlink-c2c/
Nvidia's inter-chip communications platforms are top-of-the-line.

As I understand it, NVLink-C2C is for GPU->CPU interconnects... more along the lines of a data center APU. Splitting the actual GPU work out across the links is going to be a different kettle of fish.
Having said that, I agree Nvidia will probably get something up and running. At this point I'm just wondering if they will target datacenter and consumer with their first GPU-GPU interconnect, or just datacenter for gen 1.
We also don't know if Nvidia is going to do like AMD and try to split off the analog bits from the logic, or go a different route altogether. As I understand their "chiplets" right now, they are a mix of analog and logic still... they haven't split out caches or anything.
 
Nvidia does do something that looks like chiplet interconnects in their Arm datacenter segment, and it is maybe closer to their GPUs than AMD's CPUs are to AMD's GPUs:
https://www.nvidia.com/en-us/data-center/nvlink-c2c/#:~:text=NVLink-C2C extends the industry,coherently interconnected with custom silicon.

https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/ They can connect them a la Apple's M1 Ultra:
https://hothardware.com/news/nvidia-144-core-grace-cpu-superchip-nvlink-c2c-interconnect

Close enough to feed the rumor mill about the next gen being chiplet-based, at least.
The biggest problem with chiplets and GPUs is the movement of huge amounts of data and Nvidia has that down to an art form.
 
As I understand it, NVLink-C2C is for GPU->CPU interconnects... more along the lines of a data center APU. Splitting the actual GPU work out across the links is going to be a different kettle of fish.
Having said that, I agree Nvidia will probably get something up and running. At this point I'm just wondering if they will target datacenter and consumer with their first GPU-GPU interconnect, or just datacenter for gen 1.
We also don't know if Nvidia is going to do like AMD and try to split off the analog bits from the logic, or go a different route altogether. As I understand their "chiplets" right now, they are a mix of analog and logic still... they haven't split out caches or anything.
The C2C is a platform interposer, no different in concept from AMD's Infinity Fabric.
But where AMD's Infinity Fabric currently tops out at about 50 GB/s, Nvidia's C2C comes in at just north of 900 GB/s.
Granted, for a consumer release I expect them to dial that back significantly; I struggle to think of a consumer application that needs that sort of speed, or of a consumer market willing to pay for it.
The big change is that with Nvidia's C2C they can spread that interposer across multiple large dies, stretching it over up to four 858 mm² (roughly reticle-limit-sized) dies if need be.

AMD essentially split up their RDNA 3 and CDNA 2 platforms just as you suggested, with RDNA 3 only having one GPU compute chip and CDNA 2 supporting 2.
It would not surprise me to see Nvidia do the same but instead have a central GPU then have its various accelerator silicon spread around via their C2C fabric.

But Nvidia is going to need chiplets, or something like them, for TSMC's 3 nm, which TSMC has already announced carries a roughly 25% price increase over the N5 node and its variants they are using now.
https://www.siliconexpert.com/tsmc-3nm-wafer/
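That 3 nm cost point is the usual yield argument for chiplets. Rough illustration of the math, using a simple Poisson yield model with a made-up defect density (nothing from TSMC or Nvidia, purely illustrative):

```python
# Why pricier wafers push designs toward smaller dies: yield falls off fast
# with die area, so more of an expensive wafer ends up sellable as chiplets.
# The defect density below is assumed for illustration, not a real N3 figure.
import math

DEFECTS_PER_CM2 = 0.1  # assumed, illustrative only

def die_yield(area_mm2: float) -> float:
    """Simple Poisson yield model: Y = exp(-D * A)."""
    return math.exp(-DEFECTS_PER_CM2 * area_mm2 / 100.0)

print(f"600 mm^2 monolithic die yield: {die_yield(600):.0%}")   # ~55%
print(f"150 mm^2 chiplet yield (each): {die_yield(150):.0%}")   # ~86%
```

At a 25% higher wafer price, that kind of yield gap is real money per working part.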
 
AMD has 5 generations of chiplet CPUs released, and appears to have some growing pains with their chiplet GPU. I'll be very surprised if Nvidia pulls off their first chiplet product without a hitch.
Fortunately as end customers, part of the $900-$1600 we pay on a high-end GPU is for the privilege of not needing to care how the sausage is made; that's a them problem. From the end user perspective, chiplet design isn't a badge of honor or something to be excited by (yet), since at present it's mostly a cost reduction multiplier rather than a performance multiplier.

A transition to chiplet architecture is obviously nice for shareholders and profits, but at least with 7900XT/XTX, manufacturing cost reduction per unit of performance isn't primarily being passed on to the end customer. Granted, R&D cost is front loaded and so cost reductions might get passed through more in subsequent generations as processes improve and dev costs get amortized out, but meantime it's hard to sympathize with the idea of "give 'em a break it's their first chiplet GPU" in the context of a monolithic megacorp. One idiot's opinion.
 
As I research TSMC's various packaging technologies and their partnerships with Nvidia, it occurs to me that TSMC also has those fun partnerships with Intel, which could potentially allow Nvidia to have TSMC produce the various "chiplets" but then have Intel package them together using their Foveros tile packaging tech.
 
The C2C is a platform interposer, no different in concept from AMD's Infinity Fabric.
But where AMD's Infinity Fabric currently tops out at about 50 GB/s, Nvidia's C2C comes in at just north of 900 GB/s.
Granted, for a consumer release I expect them to dial that back significantly; I struggle to think of a consumer application that needs that sort of speed, or of a consumer market willing to pay for it.
The big change is that with Nvidia's C2C they can spread that interposer across multiple large dies, stretching it over up to four 858 mm² (roughly reticle-limit-sized) dies if need be.

AMD essentially split up their RDNA 3 and CDNA 2 platforms just as you suggested, with RDNA 3 only having one GPU compute chip and CDNA 2 supporting 2.
It would not surprise me to see Nvidia do the same but instead have a central GPU then have its various accelerator silicon spread around via their C2C fabric.

But Nvidia is going to need chiplets, or something like them, for TSMC's 3 nm, which TSMC has already announced carries a roughly 25% price increase over the N5 node and its variants they are using now.
https://www.siliconexpert.com/tsmc-3nm-wafer/

900 GB/s sounds impressive, but it's still not fast enough on its own for GPU workloads. RDNA 3, according to AMD, has 5.3 TB/s peak bandwidth between the MCDs and the GCD. 50 GB/s is CPU fabric speed, which isn't fast enough. Anyway, that puts Nvidia in the game on interconnect speed (assuming they aren't going more ambitious than AMD, as you suggest). If they are, they will probably have to up that connect speed a bit more.
I agree with you on the design layout; that sounds more like something Nvidia will try, to me. Split the tensors off from the raster bits. C2C is only slightly faster on paper than AMD's GPU interconnect... and AMD is using Infinity Cache to make it work. If Nvidia is really planning to pass logic back and forth between two or even three chips, their connection is going to need even more speed. (I believe, anyway; I could be wrong.)
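Napkin math on that, taking the numbers quoted in this thread at face value (900 GB/s per C2C link vs AMD's 5.3 TB/s aggregate figure), and assuming they are even comparable:

```python
# How many 900 GB/s C2C-class links it would take to match the 5.3 TB/s
# aggregate GCD<->MCD bandwidth AMD quotes for RDNA 3 (figures as cited above).
c2c_link_gbs = 900        # GB/s per NVLink-C2C link, as cited in this thread
rdna3_fabric_tbs = 5.3    # TB/s aggregate, per AMD's RDNA 3 numbers

links_needed = rdna3_fabric_tbs * 1000 / c2c_link_gbs
print(f"~{links_needed:.1f} links to match RDNA 3's fabric")  # ~5.9
```

So roughly half a dozen links' worth, if a single link is the right thing to compare against.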
The leaks for the next year should be fun.
 
Fortunately as end customers, part of the $900-$1600 we pay on a high-end GPU is for the privilege of not needing to care how the sausage is made; that's a them problem. From the end user perspective, chiplet design isn't a badge of honor or something to be excited by (yet), since at present it's mostly a cost reduction multiplier rather than a performance multiplier.

A transition to chiplet architecture is obviously nice for shareholders and profits, but as we're seeing with the 7900 XT/XTX, manufacturing cost reduction per unit of performance won't necessarily be passed on to the end customer. Granted, R&D cost is front-loaded, so maybe cost reductions will get passed through more in the next 1-3 generations as processes improve and dev costs get amortized out, but meantime it's hard to sympathize with the idea of "hey man, give 'em a break, it's their first chiplet GPU" in the context of a monolithic megacorp.

I think that is where the consumer benefit will come in: refreshes. Refreshes should be much easier if all they need to do is respin the logic chip. The cache chips shouldn't need to be changed for refreshes; perhaps even follow-up generations could get away with using last gen's cache parts. The A0 silicon fears people have been raising about the 7900 may well point to one big side benefit of AMD spinning logic-only parts: without all the analog bits to model, their chip simulation is probably a lot more accurate. No crosstalk or hard-to-model interference from memory controllers and cache might make design-to-production faster and cheaper.

We'll see; however, you are correct that there are potential advantages. So far no one is seeing those advantages other than perhaps AMD's back-end bottom line.
 
As I research TSMC's various packaging technologies and their partnerships with Nvidia, it occurs to me that TSMC also has those fun partnerships with Intel, which could potentially allow Nvidia to have TSMC produce the various "chiplets" but then have Intel package them together using their Foveros tile packaging tech.
Possible, yes... probably not likely though. I can't imagine Jensen would put his dick in Intel's hands. That was crude... but yeah, I don't see Nvidia touching anything Intel. I know business is business, but I suspect the bad blood runs deep.
 
That's 900 GB/s per link; the tech supports up to 256 links, so it tops out at 115.2 TB/s all-to-all.
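Which lines up if the 900 GB/s is the bidirectional total per link (i.e. 450 GB/s each direction), which is how NVLink figures are usually quoted:

```python
# 115.2 TB/s all-to-all from 256 links at 900 GB/s each, assuming 900 GB/s
# is the bidirectional total per link (450 GB/s in each direction).
per_link_bidir_gbs = 900
links = 256

aggregate_tbs = links * (per_link_bidir_gbs / 2) / 1000
print(f"{aggregate_tbs:.1f} TB/s aggregate")  # 115.2
```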

Well I stand corrected. I guess the question is then how much of that will they need to use on the consumer side. Obviously at those speeds and max link counts we are talking datacenter power draw. :)
 
Well I stand corrected. I guess the question is then how much of that will they need to use on the consumer side. Obviously at those speeds and max link counts we are talking datacenter power draw. :)
You replied before I could correct and clarify a few things, but yeah, I'll repost in a few minutes with links and such. Nvidia does some stupid things with interconnect speeds right now; we as consumers don't need anything near that fast.
But like most things I expect it to have the ability to scale down.
 
Fortunately as end customers, part of the $900-$1600 we pay on a high-end GPU is for the privilege of not needing to care how the sausage is made; that's a them problem. From the end user perspective, chiplet design isn't a badge of honor or something to be excited by (yet), since at present it's mostly a cost reduction multiplier rather than a performance multiplier.

A transition to chiplet architecture is obviously nice for shareholders and profits, but as we're seeing with the 7900 XT/XTX, manufacturing cost reduction per unit of performance won't necessarily be passed on to the end customer. Granted, R&D cost is front-loaded, so maybe cost reductions will get passed through more in the next 1-3 generations as processes improve and dev costs get amortized out, but meantime it's hard to sympathize with the idea of "hey man, give 'em a break, it's their first chiplet GPU" in the context of a monolithic megacorp.
You have to hand it to AMD's marketing department though. It somehow had end consumers thinking that chiplets meant more performance even back when they released Zen 1 and Zen 2, and those two were them just catching up to Intel. That's quite a feat for the marketing department. I dare say that if every other department worked as well as their marketing department, they would smoke everyone else.
 
Well I stand corrected. I guess the question is then how much of that will they need to use on the consumer side. Obviously at those speeds and max link counts we are talking datacenter power draw. :)
Looking at the C2C tech it is an extension of their NVLink tech so each lane supports 900 GB/s and the fabric supports up to 256 lanes, but I can't tell if that would mean each "chiplet" would need to essentially be built into a microscopic NVLink system like you would for connecting together individual GPU cores. The more I look into it the more I think it is the wrong tech for the job, and I am misunderstanding its practical limitations.

Furthermore, looking over Nvidia's published papers they focus on MCM not chiplet, and they have some interesting patents on how to cool the individual layers and it matches up with a few TSMC releases regarding their 3D fabric plans.

So the more I read the less convinced I am that Nvidia is going to pull an AMD-style chiplet out of their butts and instead make use of TSMC's CoWoS packaging technology, paired up with some of their existing tech to tie it all together.
Samsung's GDDR6W would allow them to integrate the memory right on the package like they do HBM on the server chips, but at a fraction of the cost.

The more I read the more excited I get for a few possibilities that AMD, Nvidia, and Apple could work on. TSMC and Intel are doing some crazy crap right now with Tiles and Stacking and all the buzzwords.
 
Looking at the C2C tech it is an extension of their NVLink tech so each lane supports 900 GB/s and the fabric supports up to 256 lanes, but I can't tell if that would mean each "chiplet" would need to essentially be built into a microscopic NVLink system like you would for connecting together individual GPU cores. The more I look into it the more I think it is the wrong tech for the job, and I am misunderstanding its practical limitations.

Furthermore, looking over Nvidia's published papers they focus on MCM not chiplet, and they have some interesting patents on how to cool the individual layers and it matches up with a few TSMC releases regarding their 3D fabric plans.

So the more I read the less convinced I am that Nvidia is going to pull an AMD-style chiplet out of their butts and instead make use of TSMC's CoWoS packaging technology, paired up with some of their existing tech to tie it all together.
Samsung's GDDR6W would allow them to integrate the memory right on the package like they do HBM on the server chips, but at a fraction of the cost.

The more I read the more excited I get for a few possibilities that AMD, Nvidia, and Apple could work on. TSMC and Intel are doing some crazy crap right now with Tiles and Stacking and all the buzzwords.
Maybe this is why NV really put in work on the 4080 & 4090 coolers, in preparation for having to cool a 2.5D wafer-level multi-chip package. Food for thought I guess.
 
Maybe this is why NV really put in work on the 4080 & 4090 coolers, in preparation for having to cool a 2.5D wafer-level multi-chip package. Food for thought I guess.
Here's a fun one where they patented a way to have two layers face to face instead of the normal top to bottom. It creates one hot spot rather than two, and it also looks to solve the resistance issue that kept the clock speeds where they were on AMD's stacked cache.

https://www.tomshardware.com/news/nvidia-patents-face-to-face-3d-die-stacking
 
Looking at the C2C tech it is an extension of their NVLink tech so each lane supports 900 GB/s and the fabric supports up to 256 lanes, but I can't tell if that would mean each "chiplet" would need to essentially be built into a microscopic NVLink system like you would for connecting together individual GPU cores. The more I look into it the more I think it is the wrong tech for the job, and I am misunderstanding its practical limitations.

Furthermore, looking over Nvidia's published papers they focus on MCM not chiplet, and they have some interesting patents on how to cool the individual layers and it matches up with a few TSMC releases regarding their 3D fabric plans.

So the more I read the less convinced I am that Nvidia is going to pull an AMD-style chiplet out of their butts and instead make use of TSMC's CoWoS packaging technology, paired up with some of their existing tech to tie it all together.
Samsung's GDDR6W would allow them to integrate the memory right on the package like they do HBM on the server chips, but at a fraction of the cost.

The more I read the more excited I get for a few possibilities that AMD, Nvidia, and Apple could work on. TSMC and Intel are doing some crazy crap right now with Tiles and Stacking and all the buzzwords.

Next few years should be interesting anyway. All the packaging tech... someone is going to hit a home run and someone is going to strike out hard. Could be about any of them. I suspect AMD won't strike out... but they might not swing for the fence for a while either.
I love a dark horse so I might just put my money on Intel. lol (half joking, Arc was crap and BM is probably too close behind... but you know, on paper the way they do RT and their compute design is solid)
 
Next few years should be interesting anyway. All the packaging tech... someone is going to hit a home run and someone is going to strike out hard. Could be about any of them. I suspect AMD won't strike out... but they might not swing for the fence for a while either.
I love a dark horse so I might just put my money on Intel. lol
Their Foveros tile stuff is weird and neat; they can work with chips from just about anywhere and cram them all together. Their existing stuff is using bits from Samsung, TSMC, and in-house Intel.
 
Here's a fun one where they patented a way to have two layers face to face instead of the normal top to bottom. It creates one hot spot rather than two, and it also looks to solve the resistance issue that kept the clock speeds where they were on AMD's stacked cache.

https://www.tomshardware.com/news/nvidia-patents-face-to-face-3d-die-stacking
Interesting... I guess that could be the homer or the strike out.
I would think there would be a ton of issues with crosstalk and interference. But then, if AMD's good silicon simulation modeling really is down to removing the noisy analog bits, maybe, just maybe, face-to-face stacking could work well on a 100% logic chip. Face-to-face logic parts... and use MCD chips via a fabric.
 
You have to hand it to AMD's marketing department though. It somehow had end consumers thinking that chiplets meant more performance even back when they released Zen 1 and Zen 2, and those two were them just catching up to Intel. That's quite a feat for the marketing department. I dare say that if every other department worked as well as their marketing department, they would smoke everyone else.
Well, marketing is just perception creation and management, and talking up your own product is part of the job, the way any mayor talks up their town. But in fairness, AMD did deliver a massive win-win with Zen 3, when 12 and 16 physical cores were suddenly available to filthy desktop-class peasants at absurdly low power requirements, and people were fist-fighting, fisting each other, and each other's mothers just to get their hands on one.

So there's precedent, and I have no doubt their GPU roadmap leads to the same convergence of manufacturing efficiency and performance multiplication by Radeon 9900 XTX, followed immediately by a 10900 XTX "WTF?" generation that's both more expensive and slightly lower performing than the competition, just to confuse everyone once again.
 
All the packaging tech... someone is going to hit a home run and someone is going to strike out hard. Could be about any of them. I suspect AMD won't strike out... but they might not swing for the fence for a while either.
And someone is going to send a perfect fly right into the outfield for an easy triple, drop the bat and run, but he's running the wrong way. He's running to third base instead of first, the teams are confused, the crowd is weeping.

And that batter's name, is Raja Koduri

 
So I was googling info about fitting a Telemon Heavy Dreadnought and I found this instead, the way the algorithm works astounds me.

https://wccftech.com/rumor-nvidia-r...-biggest-performance-leap-in-nvidias-history/

A dedicated ray tracing denoising accelerator, which would be a pretty big uplift over the generic CUDA accelerated one now.
Few other claims, obviously nothing concrete as Blackwell is like 18 months out but yeah. Something Nvidia probably leaked to steal AMD's thunder.
Could link directly to the source instead of WCCFTech.


Or link to The FPS Review, where our former [H] brethren are working, which provided a summary.
https://www.thefpsreview.com/2022/1...rs-surface-teasing-massive-performance-gains/
Going to chiplets being the big "rumor" here, I guess.
The last rumor was actually that Blackwell is still monolithic. Given that, I would think this new information about an "MCM monstrosity" is pointing to next-gen enterprise products like the current-gen H100, not "flagship" gaming models. The H100 is 814 mm², by the way, while AD102 is 608 mm².

https://www.hardwaretimes.com/nvidi...s-blackwell-to-leverage-monolithic-die-rumor/
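For scale, here is how those die sizes sit against the ~858 mm² reticle-limit figure mentioned earlier in the thread (just rough arithmetic on the numbers already quoted):

```python
# Die areas cited in this thread vs. the ~858 mm^2 reticle-limit figure,
# as a rough sense of how little monolithic headroom is left.
RETICLE_LIMIT_MM2 = 858  # figure used earlier in this thread

for name, area_mm2 in {"H100": 814, "AD102": 608}.items():
    print(f"{name}: {area_mm2} mm^2 = {area_mm2 / RETICLE_LIMIT_MM2:.0%} of the reticle limit")
```

An H100-class die is already around 95% of the way to the reticle limit, which is part of why the enterprise parts look like the obvious first candidates for going multi-chip.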
Nvidia does do something that looks like chiplet interconnects in their Arm datacenter segment, and it is maybe closer to their GPUs than AMD's CPUs are to AMD's GPUs:
https://www.nvidia.com/en-us/data-center/nvlink-c2c/#:~:text=NVLink-C2C extends the industry,coherently interconnected with custom silicon.

https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/ They can connect them a la Apple's M1 Ultra:
https://hothardware.com/news/nvidia-144-core-grace-cpu-superchip-nvlink-c2c-interconnect

Close enough to feed the rumor mill about the next gen being chiplet-based, at least.
With my reply above, I am guessing that NVIDIA is seeing that chiplets are not good for gaming performance. With RDNA 3 showing bad frame-time spiking in a lot of games, you have to wonder if that is why NVIDIA wouldn't move in that direction for their gaming parts.
 