NVIDIA Develops Tile-based Multi-GPU Rendering Technique Called CFR

erek

In CFR, the frame is divided into tiny square tiles, like a checkerboard. Odd-numbered tiles are rendered by one GPU, and even-numbered ones by the other. Unlike AFR (alternate frame rendering), in which each GPU's dedicated memory has a copy of all of the resources needed to render the frame, methods like CFR and SFR (split frame rendering) optimize resource allocation. CFR also purportedly offers lesser micro-stutter than AFR. 3DCenter also detailed the features and requirements of CFR. To begin with, the method is only compatible with DirectX (including DirectX 12, 11, and 10), and not OpenGL or Vulkan. For now it's "Turing" exclusive, since NVLink is required (probably its bandwidth is needed to virtualize the tile buffer). Tools like NVIDIA Profile Inspector allow you to force CFR on provided the other hardware and API requirements are met. It still has many compatibility problems, and remains practically undocumented by NVIDIA.

https://www.techpowerup.com/261357/...ased-multi-gpu-rendering-technique-called-cfr
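For what it's worth, here's a minimal sketch of how a checkerboard split like this could hand tiles to two GPUs. The tile size and resolution are my own guesses; NVIDIA hasn't documented CFR's internals, so treat this as purely illustrative:

```python
# Hypothetical checkerboard tile assignment between two GPUs.
# Tile size and resolution are assumptions; NVIDIA has not documented CFR.

TILE = 64                      # assumed square tile size in pixels
WIDTH, HEIGHT = 2560, 1440     # example frame resolution

def assign_tiles(width, height, tile):
    """Return {gpu_id: [(x, y), ...]} with tiles split like a checkerboard."""
    assignment = {0: [], 1: []}
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Alternate GPUs by tile row + column parity, like black/white squares.
            gpu = ((tx // tile) + (ty // tile)) % 2
            assignment[gpu].append((tx, ty))
    return assignment

tiles = assign_tiles(WIDTH, HEIGHT, TILE)
print(len(tiles[0]), "tiles on GPU 0,", len(tiles[1]), "tiles on GPU 1")
```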
 
In CFR, the frame is divided into tiny square tiles, like a checkerboard. Odd-numbered tiles are rendered by one GPU, and even-numbered ones by the other. Unlike AFR (alternate frame rendering), in which each GPU's dedicated memory has a copy of all of the resources needed to render the frame, methods like CFR and SFR (split frame rendering) optimize resource allocation. CFR also purportedly offers lesser micro-stutter than AFR. 3DCenter also detailed the features and requirements of CFR. To begin with, the method is only compatible with DirectX (including DirectX 12, 11, and 10), and not OpenGL or Vulkan. For now it's "Turing" exclusive, since NVLink is required (probably its bandwidth is needed to virtualize the tile buffer). Tools like NVIDIA Profile Inspector allow you to force CFR on provided the other hardware and API requirements are met. It still has many compatibility problems, and remains practically undocumented by NVIDIA.

https://www.techpowerup.com/261357/...ased-multi-gpu-rendering-technique-called-cfr
This isn't news. This is a desperate scramble for Nvidia to squeeze more performance out of their current lineup because they don't have shit launching on 7nm. They clearly don't plan on launching 7nm soon, because we all saw the mentions of the 2080 Ti Super... Nvidia doesn't take the AMD competition seriously... and why should they? AMD has shown they have nothing to match them with.

This might allow their cards to get a couple % more out of the current architecture to make certain the "2080Ti Killer" doesn't kill anything at all.
 
Playing king of the hill requires a lot of investment for very little reward. AMD doesn't have a lot of cash to throw at a title they'd just have to fight over constantly with their limited resources; the cash they do have, they're spending on fab capacity.

Seems like they're playing it monetarily safe, going after markets that have high return for relatively little risk. Starting an arms race with Nvidia at this time (while they've already started one with Intel) would probably be too much of a strain on them, and they're doing just fine without it. Let Nvidia invest in making $1,200 graphics cards for enthusiasts. Let them keep the top title; it'll make them lazy and unprepared for when AMD is ready, just like it has done with Intel. Being number 2 has its advantages: you don't lose significant reputation by not being the best, but you always gain reputation when you punch above your weight, even if only temporarily, and there's plenty of money in existing successfully as #2.
 
Playing king of the hill requires a lot of investment for very little reward. AMD doesn't have a lot of cash to throw at a title they'd just have to fight over constantly with their limited resources; the cash they do have, they're spending on fab capacity.

Seems like they're playing it monetarily safe, going after markets that have high return for relatively little risk. ...even if only temporarily, and there's plenty of money in existing successfully as #2.

True. They just let nVidia innovate and then add those ideas to their stuff in the next gen or 2. (SLI, PhysX, G-Sync, ray tracing come to mind).
 
Well, to be fair, they add a better version of those things: cross-platform, open source, and geared toward becoming an industry standard instead of a proprietary interface that has to be licensed through a single company for a high fee or is only available on their hardware.

That's why the AMD version finds its way into everything (like Vulkan, variable refresh rate on TVs, etc.) and Nvidia's version just ends up being an expensive beta test paid for by Nvidia users on Windows.

So I'd say AMD does its fair share of innovating on those innovations as well.
 
Largely irrelevant since SLI is on the decline in a big way.

Also not likely to save much in resources, since an odd and an even tile next to each other will likely still need access to the same geometry and textures, so they will still have to duplicate memory pools.
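To put a toy example behind that (invented coordinates, not measured data): with small tiles, pretty much any object that spans more than one tile lands on both "odd" and "even" tiles, so both GPUs need its geometry and textures anyway.

```python
# Toy illustration: a screen-space rectangle (e.g. a textured quad) almost always
# covers tiles of both parities when tiles are small, so both GPUs need its data.
# The 64-pixel tile size and the rectangle coordinates are invented.

TILE = 64

def tile_parities(x0, y0, x1, y1, tile=TILE):
    """Return the set of checkerboard parities covered by a screen-space rect."""
    parities = set()
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            parities.add((tx + ty) % 2)
    return parities

# A 200x150 pixel quad covers both parities, so both GPUs must hold its assets.
print(tile_parities(300, 500, 500, 650))   # {0, 1}
```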
 
CFR would solve a lot of the issues that result from gaming with an MCM-based GPU in titles that weren't designed for one, and it would give them longer to tune SLI profiles for existing games. This, paired with some of the info leaking about Hopper, makes me sort of excited to see where nVidia is going with this.
 
Largely irrelevant since SLI is on the decline in a big way.

Also not likely to save much in resources, since an odd and an even tile next to each other will likely still need access to the same geometry and textures, so they will still have to duplicate memory pools.
SLI is in decline but if the Hopper “leaks” are accurate then this would fix a lot of the compatibility issues should their next design be MCM based. Also doesn’t NVLink operate fast enough that the cards pool their memory and not duplicate it like the old SLI stuff did?
 
SLI is in decline but if the Hopper “leaks” are accurate then this would fix a lot of the compatibility issues should their next design be MCM based. Also doesn’t NVLink operate fast enough that the cards pool their memory and not duplicate it like the old SLI stuff did?

No, NVLink is NOT fast enough to pool memory (there was an interview with an NVIDIA guy who explained that performance would tank if you combined memory for games).

How does this fix compatibility issues? Right now the two main modes are AFR and SFR, SFR being the least supported and therefore likely more problematic. Checkerboard is more like SFR, so I suspect it would be more problematic than AFR.
 
SLI is in decline but if the Hopper “leaks” are accurate then this would fix a lot of the compatibility issues should their next design be MCM based. Also doesn’t NVLink operate fast enough that the cards pool their memory and not duplicate it like the old SLI stuff did?

NVLink is faster than the old SLI bridges, but it doesn't change the fact that in any split-screen mode both GPUs have to process the same geometry. That's a big waste of processing power.

That won’t be a problem on a proper MCM GPU.
 
No, NVLink is NOT fast enough to pool memory (there was an interview with an NVIDIA guy who explained that performance would tank if you combined memory for games).

How does this fix compatibility issues? Right now the two main modes are AFR and SFR, SFR being the least supported and therefore likely more problematic. Checkerboard is more like SFR, so I suspect it would be more problematic than AFR.
I thought NVLink was faster than it is...
SFR is just too situational to be good, and AFR often introduces long frame times and other such nonsense. From what I understand from gaming on MCM-based cards, many titles often only use one of the cores, not all of, say, four. So they need a method that works with a theoretical 4- or maybe 8-way GPU setup. AFR in that case is bad because you would be rendering too many frames in flight, each at, say, a third the speed of a monolithic core. SFR could work, but how do you decide which core handles which part of the screen? If you just break it into 4-8 sections, you end up with a very unevenly distributed workload. With a checkerboard approach, the scene could be broken into as many pieces as needed and distributed among the available cores.
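Rough sketch of that load-balancing argument, with a made-up per-tile cost model (heavier toward the bottom of the frame, as with a detailed foreground); it's only meant to show why coarse bands can end up lopsided while scattered tiles even out:

```python
# Toy load-balancing comparison for a hypothetical 4-core GPU. The cost model
# is invented; this only illustrates the band-vs-checkerboard argument.

import random
random.seed(0)

TILE, W, H, CORES = 64, 1920, 1024, 4
COLS, ROWS = W // TILE, H // TILE

tiles = [(tx, ty) for ty in range(ROWS) for tx in range(COLS)]
cost = {(tx, ty): (1.0 + 3.0 * ty / ROWS) * random.uniform(0.8, 1.2)
        for tx, ty in tiles}

# SFR-style split: four horizontal bands, one per core.
bands = [0.0] * CORES
for tx, ty in tiles:
    bands[ty * CORES // ROWS] += cost[tx, ty]

# Checkerboard-style split: small tiles dealt round-robin across cores.
checker = [0.0] * CORES
for i, (tx, ty) in enumerate(tiles):
    checker[i % CORES] += cost[tx, ty]

print("Band (SFR-like) loads:", [round(c) for c in bands])
print("Checkerboard loads:   ", [round(c) for c in checker])
```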
 
How does this fix compatibility issues? Right now the two main modes are AFR and SFR, SFR being the least supported and therefore likely more problematic. Checkerboard is more like SFR, so I suspect it would be more problematic than AFR.

If they've already split rendering up into tiles (which it seems I missed, but NVIDIA has been doing since Maxwell), it seems like distributing tile rendering among available GPUs shouldn't be too bad, as long as you have the bandwidth to send all the scene data to all the GPUs. Checkerboard probably gets pretty good balancing, but you could check in and redistribute towards the end of the frame with minimal overhead.
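A hedged sketch of that "check in and redistribute" idea, treating the remaining tiles as queues that a GPU which runs dry can steal from. The costs, speeds, and scheduling policy are mine, not anything NVIDIA has documented:

```python
# Hypothetical work-stealing variant of a checkerboard split: tiles start out
# split odd/even, but a GPU that empties its queue near the end of the frame
# steals pending tiles from the other. All numbers are invented.

from collections import deque

def render_frame(tile_costs, speeds=(1.0, 0.8)):
    """tile_costs: per-tile render cost; speeds: relative throughput of each GPU."""
    queues = [deque(tile_costs[0::2]), deque(tile_costs[1::2])]  # checkerboard split
    busy_until = [0.0, 0.0]
    while any(queues):
        gpu = 0 if busy_until[0] <= busy_until[1] else 1   # whichever GPU frees up first
        if not queues[gpu]:                                # its own queue is empty: steal
            queues[gpu].append(queues[1 - gpu].pop())
        busy_until[gpu] += queues[gpu].popleft() / speeds[gpu]
    return max(busy_until)                                 # frame is done when both finish

costs = [1.0] * 100 + [3.0] * 20                           # a frame with some heavy tiles
print("frame time:", round(render_frame(costs), 1))
```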
 
That's an awesome feature, but it forces both GPUs to share the same memory pool.

Though if they made a memory bridge like Zen 3 has and connected two GPUs on the same PCB - increasing latency but cutting the cost - that would mean the return of multiple GPUs on one PCB. NV is likely going the MCM route with this, though.
 
That's an awesome feature, but it forces both GPUs to share the same memory pool.

Though if they made a memory bridge like Zen 3 has and connected two GPUs on the same PCB - increasing latency but cutting the cost - that would mean the return of multiple GPUs on one PCB. NV is likely going the MCM route with this, though.
MCM is where they are going to have to go. GPU dies are already too large, and making them larger/denser is hitting a point of serious diminishing returns. CPUs hit that wall a long while ago, and it was about time for GPUs to do the same.
 
If they've already split rendering up into tiles (which it seems I missed, but NVIDIA has been doing since Maxwell), it seems like distributing tile rendering among available GPUs shouldn't be too bad, as long as you have the bandwidth to send all the scene data to all the GPUs. Checkerboard probably gets pretty good balancing, but you could check in and redistribute towards the end of the frame with minimal overhead.

Sure, it's statistically more likely to have an even load, but if anything it will require more work recombining frames and is unlikely to help compatibility in any way. You are still going to need SLI drivers and specific game support - AKA the normal SLI shit show that has everyone abandoning SLI/CF.
 
MCM is where they are going to have to go. GPU dies are already too large, and making them larger/denser is hitting a point of serious diminishing returns. CPUs hit that wall a long while ago, and it was about time for GPUs to do the same.

Sure, eventually. But that's MUCH easier said than done; it will be a pretty momentous milestone when someone gets seamless/transparent MCM that doesn't require the SLI/CF driver mess and has shared memory pools.

Whoever overcomes those obstacles first will have a nice competitive advantage, until everyone else starts doing it.

Opinions on who will do it first seem evenly split between AMD and NVidia, and who knows, Intel could be the dark horse that pulls it off.
 
Sure, eventually. But that's MUCH easier said than done; it will be a pretty momentous milestone when someone gets seamless/transparent MCM that doesn't require the SLI/CF driver mess and has shared memory pools.

Whoever overcomes those obstacles first will have a nice competitive advantage, until everyone else starts doing it.

Opinions on who will do it first seem evenly split between AMD and NVidia, and who knows, Intel could be the dark horse that pulls it off.
Well, NVidia is already doing some MCM stuff in their Tesla lineup, and it works really well. Getting it down to the consumer level is surely the goal; they could have this out by Nov 2020 if the various Hopper "leaks" are true. Though I use the word "leaks," I am pretty sure they are more strategic whispers to steal some of AMD's spotlight.
 
Well, NVidia is already doing some MCM stuff in their Tesla lineup, and it works really well. Getting it down to the consumer level is surely the goal; they could have this out by Nov 2020 if the various Hopper "leaks" are true. Though I use the word "leaks," I am pretty sure they are more strategic whispers to steal some of AMD's spotlight.

You keep saying "leaks".

It's ONE Twitter post, nothing more.
 
Some form of multi-GPU rendering seems to me the only way to drive performance toward where it needs to be for high-refresh-rate 4K gaming, and presumably even 8K at some point.
 
In terms of MCM, I think it's likely going to be the next thing after the 3000 series comes out. AMD and Intel are likely to be the first ones to use it (AMD most likely), mainly due to their history and experience with MCMs on Zen CPUs.

Graphics is a problem, though, because memory latency needs to be very low.
(Any current HBM implementation trades latency for bandwidth. Sure, latency will be low on either system if load is low, but HBM datacenter approaches like the Tesla one mentioned are built for datacenters, where wider bandwidth is what's required and nobody minds waiting a couple more ns or ms for the data.)
We'll likely see the same problem with an MCM that has one memory controller and multiple GPU chiplets connecting to it, so the next step would be a really low-latency memory controller; it would need to be about 20x lower latency than what we have on Zen 3 to make sense.
Another approach would be for each chiplet to have its own independent access lanes to memory (with asynchronous refreshes). Then each chiplet would need to understand which work is its own to execute, maybe via partitioning or pools. More GPU chiplets would make latency about 2x worse, though, unless each one reads the whole bank every time and discards the data that isn't its own (memory would run hot that way), and that could cause some tearing in certain parts of the screen, so FreeSync/G-Sync/adaptive-sync monitors would need to adjust refresh per pixel-clock sync with info from the GPU.
As an early approach, just 2-4 chiplets would be reasonable, but the PCB would get expensive: it's cheaper to print the chiplets, but much more expensive on the PCB and memory side (as the memory would have to be changed, again).
 
In terms of MCM, I think it's likely going to be the next thing after the 3000 series comes out. AMD and Intel are likely to be the first ones to use it (AMD most likely), mainly due to their history and experience with MCMs on Zen CPUs.

Graphics is a problem, though, because memory latency needs to be very low.
(Any current HBM implementation trades latency for bandwidth. Sure, latency will be low on either system if load is low, but HBM datacenter approaches like the Tesla one mentioned are built for datacenters, where wider bandwidth is what's required and nobody minds waiting a couple more ns or ms for the data.)
We'll likely see the same problem with an MCM that has one memory controller and multiple GPU chiplets connecting to it, so the next step would be a really low-latency memory controller; it would need to be about 20x lower latency than what we have on Zen 3 to make sense.
Another approach would be for each chiplet to have its own independent access lanes to memory (with asynchronous refreshes). Then each chiplet would need to understand which work is its own to execute, maybe via partitioning or pools. More GPU chiplets would make latency about 2x worse, though, unless each one reads the whole bank every time and discards the data that isn't its own (memory would run hot that way), and that could cause some tearing in certain parts of the screen, so FreeSync/G-Sync/adaptive-sync monitors would need to adjust refresh per pixel-clock sync with info from the GPU.
As an early approach, just 2-4 chiplets would be reasonable, but the PCB would get expensive: it's cheaper to print the chiplets, but much more expensive on the PCB and memory side (as the memory would have to be changed, again).
Photos of a 36-chiplet prototype are floating around; it looks a smidge Photoshopped or rendered, or both, can't tell.
 
Hopefully something or someone brings back multi-GPU technology. It seems like such wasted potential, and it would be great for vendors, especially considering they could sell more GPUs.
 
Largely irrelevant since SLI is on the decline in a big way.

Also not likely to save much in resources, since an odd and an even tile next to each other will likely still need access to the same geometry and textures, so they will still have to duplicate memory pools.


Basically. It still has the same overhead as SFR, where geometry must be computed before you render the scene. And yes, pretending textures don't overlap in tiles is pure crazy talk.

AFR is the only rendering technique that has the capability to scale 100%, even though it adds the most latency and stutter. But new engines have made it harder to do so.
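Quick worked example of that latency point (numbers invented): AFR can double throughput with two GPUs, but each individual frame still takes a full single-GPU frame time to render.

```python
# Illustrative AFR arithmetic (invented numbers): throughput scales with GPU
# count, but the time to render any single frame does not shrink.

frame_time_ms = 20.0            # assumed time for one GPU to render one frame
gpus = 2

single_fps = 1000.0 / frame_time_ms
afr_fps = gpus * single_fps     # ideal 100% scaling: GPUs interleave whole frames
afr_frame_latency_ms = frame_time_ms

print(f"1 GPU:        {single_fps:.0f} fps, ~{frame_time_ms:.0f} ms per frame")
print(f"{gpus} GPUs (AFR): {afr_fps:.0f} fps, still ~{afr_frame_latency_ms:.0f} ms per frame")
```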
 
This sounds wonderful, but if it is relying on NVLink on non-server hardware (i.e., anything GeForce or Quadro), then it's a shame that it's limited to only two GPUs. Love the throughput of NVLink, dislike the 2-GPU limit.
 
Maybe so, but Nvidia's 36-chip MCM is an actual working prototype called Simba that has been benchmarked on AI workloads.

https://research.nvidia.com/publication/2019-10_Simba:-Scaling-Deep-Learning

Still zero to do with gaming GPUs. AI chips, GPGPU, and CPUs can all already work across multiple individual chips without the need for weird drivers like SLI/CF, so it's a relative doddle to build an MCM version of any of them.

Gaming GPUs are a unique case. They have never worked without duplicated memory pools and SLI/CF drivers. This is kind of the holy grail of GPUs, one that has never been achieved even when they put two chips on the same card.
 
Largely irrelevant since SLI is on the decline in a big way.

It's in decline due to lack of support. Can't fault Nvidia for looking into solutions which might encourage bringing back support.

Also not likely to save much in resources, since an odd and an even tile next to each other will likely still need access to the same geometry and textures, so they will still have to duplicate memory pools.

I'm not sure that total GPU memory needs to be conserved as a precious resource in this use case. Looking at GPU-Z, most games seem to top out under 7GB even with RTX on. That leaves 4GB per xx80Ti card to spare.

Also doesn’t NVLink operate fast enough that the cards pool their memory and not duplicate it like the old SLI stuff did?

For many use cases, yes, this is correct. NVLink offers excellent throughput for use cases (i.e. rendering, ML training, FEA) where there is enough duplication in memory that it's easier to pool (i.e., for the textures) or where multiple GPUs need to access and edit the same memory pool to maintain sync - but not with an eye towards minimizing latency. When it comes to games - and even then pretty much only the FPS genre, and even then pretty much only for competition players - latency becomes a very big deal. In this use case, the latency of NVLink makes it no substitute for just having more VRAM on each GPU. For the remaining 99.99% of users running multiple GPUs, NVLink is wonderful.

I also anticipate a meaningful update to the bus between GPUs with Ampere. Maybe 200GB/s? Keep in mind that NVLink at only 100GB/s pretty well smokes PCIe 3.0 x16 at a paltry 32GB/s.
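Back-of-the-envelope on those link speeds, bandwidth only (latency ignored; the 200 GB/s figure is just my speculation from above):

```python
# Rough transfer times for the quoted link bandwidths. Ignores latency and
# protocol overhead; the 1 GB payload is an arbitrary example.

links_gb_per_s = {
    "PCIe 3.0 x16": 32,
    "NVLink (as quoted)": 100,
    "hypothetical future link": 200,
}
payload_gb = 1.0   # e.g. a chunk of shared frame resources

for name, bw in links_gb_per_s.items():
    print(f"{name:>26}: {payload_gb / bw * 1000:.1f} ms to move {payload_gb:.0f} GB")
```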
 
Still zero to do with gaming GPUs. AI chips, GPGPU, and CPUs can all already work across multiple individual chips without the need for weird drivers like SLI/CF, so it's a relative doddle to build an MCM version of any of them.

Gaming GPUs are a unique case. They have never worked without duplicated memory pools and SLI/CF drivers. This is kind of the holy grail of GPUs, one that has never been achieved even when they put two chips on the same card.
It's in decline due to lack of support. Can't fault Nvidia for looking into solutions which might encourage bringing back support.



I'm not sure that total GPU memory needs to be conserved as a precious resource in this use case. Looking at GPU-Z, most games seem to top out under 7GB even with RTX on. That leaves 4GB per xx80Ti card to spare.



For many use cases, yes, this is correct. NVLink offers excellent throughput for use cases (i.e. rendering, ML training, FEA) where there is enough duplication in memory that it's easier to pool (i.e., for the textures) or where multiple GPUs need to access and edit the same memory pool to maintain sync - but not with an eye towards minimizing latency. When it comes to games - and even then pretty much only the FPS genre, and even then pretty much only for competition players - latency becomes a very big deal. In this use case, the latency of NVLink makes it no substitute for just having more VRAM on each GPU. For the remaining 99.99% of users running multiple GPUs, NVLink is wonderful.

I also anticipate a meaningful update to the bus between GPUs with Ampere. Maybe 200GB/s? Keep in mind that NVLink at only 100GB/s pretty well smokes PCIe 3.0 x16 at a paltry 32GB/s.

The initial latency of HBM and NVLink is what hurts performance in games, and why they aren't really good (in NVLink's case, at all).
(Compare the Radeon VII and the Vega 64: it's almost the same GPU, but with 4 memory hubs versus 2 on the Vega 64. The Vega 64 is mainly hurt by the initial latency of each request, while on the Radeon VII that latency is almost cut in half.)
32 GB/s would be enough if latency were small, but PCIe Gen 3 latency adds around 900 ns for a round trip with a 128 B payload. I would assume NVLink is somewhat better in some cases... but it's fairly similar, easily going from 5-40 ns to over 1,000 ns.

[attached image]


But again, MCM resolves the problem of communication throughput and latency between the chips.

I would be more convinced if they looked like this:

[attached image]

or like this:

[attached image]

BTW, Puget tested NVLink in-house with 2080s and got 11-20 microseconds of latency... much worse than PCIe 3.0.
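To make those latency figures concrete (using the numbers quoted in this thread, which I haven't verified myself), here's the effective bandwidth you'd see if every small 128-byte access paid the full round-trip latency:

```python
# Effective bandwidth when each small access pays the full round-trip latency.
# Latency figures are the ones quoted in this thread, not my own measurements.

payload_bytes = 128

for name, latency_ns in [
    ("local VRAM (~40 ns)", 40),
    ("PCIe 3.0 round trip (~900 ns)", 900),
    ("NVLink per Puget test (~11 us)", 11_000),
]:
    effective_gb_s = payload_bytes / (latency_ns * 1e-9) / 1e9
    print(f"{name:>30}: {effective_gb_s:6.3f} GB/s for {payload_bytes} B accesses")
```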
 

So Nvidia is doing something PowerVR has been doing for a good 20 years now and calling it new. I mean splitting the tiles up between gpus is cool and all, but I'm pretty sure PowerVR was capable of this too.
 
So Nvidia is doing something PowerVR has been doing for a good 20 years now and calling it new. I mean splitting the tiles up between gpus is cool and all, but I'm pretty sure PowerVR was capable of this too.
And Microsoft was making smartphones before Apple. But you see, timing and execution, baby!

You see?
 
What if they were to turn the NVLink bridge into the memory board?
Two GPUs without memory (or with perhaps just a cache/buffer) would use the same memory on an even wider NVLink connector.
Multiple issues are bound to crop up, latency being a major one.
The tech exists to put optics on silicon, which could help immensely in keeping latency low to and from the NVLink memory board, but at a cost, it being early days for that tech.

It would make memory upgrades very simple as well.

Food for thought.
 
True. They just let nVidia innovate and then add those ideas to their stuff in the next gen or 2. (SLI, PhysX, G-Sync, ray tracing come to mind).

SLI: Acquired from 3dfx.
PhysX: Acquired from Ageia, then hacked to be hardware-locked.
G-Sync: VESA Adaptive-Sync hacked to be hardware-locked.
Raytracing: You can thank the DirectX team for the current lineup. Or more likely Caustic Graphics.


What about this innovation now?
 
This isn't news. This is a desperate scramble for Nvidia to squeeze more performance out of their current lineup because they don't have shit launching on 7nm. They clearly don't plan on launching 7nm soon, because we all saw the mentions of the 2080 Ti Super... Nvidia doesn't take the AMD competition seriously... and why should they? AMD has shown they have nothing to match them with.

This might allow their cards to get a couple % more out of the current architecture to make certain the "2080Ti Killer" doesn't kill anything at all.

Not to be crude, but you're sounding like every Intel shill out there who said there was no way Zen 2 could beat or even keep up with Intel. Umm, yeah, LOL at that now, aren't we?

Nvidia is not taking AMD seriously, just like Intel didn't, and Intel is getting surpassed quite quickly as AMD gains huge market-share momentum. AMD is going to release Big Navi soon and Jensen doesn't seem to give two shits. I'm gonna laugh when the next Navi obliterates Nvidia. Zen 3 is already being talked about by AMD and it looks like it's going to be utter nuclear overkill for Intel.

So you might want to rethink your claim.
 
Not to be crude, but you're sounding like every Intel shill out there who said there was no way Zen 2 could beat or even keep up with Intel. Umm, yeah, LOL at that now, aren't we?

Nvidia is not taking AMD seriously, just like Intel didn't, and Intel is getting surpassed quite quickly as AMD gains huge market-share momentum. AMD is going to release Big Navi soon and Jensen doesn't seem to give two shits. I'm gonna laugh when the next Navi obliterates Nvidia. Zen 3 is already being talked about by AMD and it looks like it's going to be utter nuclear overkill for Intel.

So you might want to rethink your claim.
I'm not really certain how this post makes me an Intel shill (or why you're drawing a comparison using that)... I never said that about Zen 2. I own a Zen 2 processor (3600), a 3400G, a GX200, a 2200G... as well as Intel processors. My 3600 performed as well as, if not better than, my 9600 in all things at a clock speed 700-900 MHz lower.

Big Navi exists; it is SPECULATED to be highly scalable and an Nvidia 2080 Ti "killer." How many times have we heard AMD talk that kind of shit and then deliver an underwhelming product? That was pretty much their marketing approach until the advent of the 5700 XT. Nothing like releasing the Radeon VII as a stopgap measure to slow their bleeding in the market and then dropping support for the card less than six months later. If that doesn't build confidence, I don't know what does...

Nvidia is taking everything seriously, because they're fucked and they know it. The future of graphics in the marketplace is in integrated, on-die GPUs, and AMD and Intel (shortly) are already market leaders in this segment. Integrated graphics are getting so good that their add-on card market will only last as long as RTX, their inefficient ray-tracing hardware propaganda, survives in the marketplace. They have failed to make a compelling argument as to why their hardware is essential. So I see this as a means to eke out a little more performance while they wait for their 7nm process to be ready. If they can stave off AMD's claims, they just might survive another couple of generations in the marketplace. If not, I'm fairly certain they're on the cusp of sliding into obscurity.

AMD has a bunch of ducks already in a row. While they don't have Intel's market share, they have "hardware maturity" (cough) on the GPU front compared to Intel, they own the console market, their GPUs are licensed to Samsung for future smartphones, and they are gaining market share in the CPU/server/datacenter spaces. Go AMD!

Who am I shilling for now?

;)
 
SLI: Acquired from 3dfx.
PhysX: Acquired from Ageia, then hacked to be hardware-locked.
G-Sync: VESA Adaptive-Sync hacked to be hardware-locked.
Raytracing: You can thank the DirectX team for the current lineup. Or more likely Caustic Graphics.


What about this innovation now?

nVidia put all of those to use in GPUs first, then AMD copied.

SLI is just a reused, well-known marketing term coined by 3dfx, but in modern GPUs nVidia did it first. It works completely differently from what the old original 3dfx did, which was scan-line interleave: each alternating refresh line on the display came from alternating Voodoos. Modern GPUs work completely differently, either rendering entire frames alternately (AFR) or splitting the screen into two halves and rendering those across the GPUs (SFR). nVidia did this first, AMD copied (CrossFire). (There's a toy sketch of the difference at the end of this post.)

PhysX: yes, acquired, so they could build physics capability into the GPU instead of relying on the CPU or an external PhysX processor. nVidia did this first, AMD copied.

G-Sync. nVidia did this first, AMD and VESA copied.

Ray tracing: so it's been around, but nVidia did it first in a consumer GPU. AMD is copying...


What has AMD done first?? Well, they did use HBM on a video card first. It didn't amount to anything innovative, just overpriced for the performance... Yeah, that's not an innovation.
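To make the scan-line interleave vs. AFR distinction above concrete, here's a toy sketch (obviously not vendor code): the old 3dfx approach split work per display line, modern AFR splits it per whole frame.

```python
# Toy sketch: 3dfx-style scan-line interleave assigns alternating display lines
# to each GPU; modern AFR assigns whole alternating frames. Not vendor code.

def scanline_owner(line):
    """3dfx SLI: even display lines on GPU 0, odd lines on GPU 1."""
    return line % 2

def afr_owner(frame):
    """Modern AFR: entire frames alternate between GPUs."""
    return frame % 2

print("lines 0-7 :", [scanline_owner(l) for l in range(8)])   # [0, 1, 0, 1, 0, 1, 0, 1]
print("frames 0-3:", [afr_owner(f) for f in range(4)])        # [0, 1, 0, 1]
```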
 