NVIDIA Develops Tile-based Multi-GPU Rendering Technique Called CFR

I'm not really certain how this post makes me an Intel shill... (or maybe you are drawing a comparison using that). I never said that about Zen 2. I own a Zen 2 processor (3600), a 3400G, a GX200, a 2200G... as well as Intel processors. My 3600 performed as well as, if not better than, my 9600 in everything, at a clock speed 700-900 MHz lower.

Big Navi exists; it is SPECULATED to be highly scalable and an Nvidia 2080 Ti "killer". How many times have we heard AMD talk that kind of shit and then deliver an underwhelming product? That was pretty much their marketing approach until the advent of the 5700 XT. Nothing like releasing the Radeon VII as a stopgap measure to slow their bleeding in the market and then dropping support for the card less than six months later. If that doesn't build confidence, I don't know what does...

Nvidia is taking everything seriously because they're fucked and they know it. The future of graphics in the marketplace is in integrated, on-die GPUs, and AMD and Intel (shortly) are already market leaders in that segment. Integrated graphics are getting so good that Nvidia's add-on card business will only last as long as RTX, their inefficient ray tracing hardware propaganda, survives in the marketplace. They have failed to make a compelling argument for why their hardware is essential. So I see this as a means to eke out a little more performance while they wait for their 7nm process to be ready. If they can stave off AMD's claims, they just might survive another couple of generations in the marketplace. If not, I am fairly certain they are on the cusp of sliding into obscurity.

AMD already has a bunch of ducks in a row. While they really don't have Intel's market share, they have "hardware maturity" (cough) on the GPU front compared to Intel, they own the console market, their GPUs are licensed to Samsung for future smartphones, and they are gaining market share in the CPU / server / datacenter spaces. Go AMD!

Who am I shilling for now?

;)

I was comparing, not calling you one, but you explained it, so never mind.
 
AMD is going to release Big Navi soon


Yeah, I don't think we'll see "Big" Navi. Where's the big Polaris?
 
nVidia put all of those to use in GPUs first, then AMD copied.

SLI is just a re-used, well-known marketing term coined by 3dfx, but in modern GPUs nVidia did it first. It works completely differently from what the original 3dfx scheme did, which was scan-line interleave: each alternating refresh line on the display came from alternating Voodoos... Modern GPUs work completely differently, either rendering entire frames in alternation (AFR) or splitting the screen into two halves and rendering those across the GPUs (SFR). nVidia did this first, AMD copied (Crossfire).

PhysX: yes, acquired, so they could build physics capability into the GPU instead of relying on the CPU or an external PhysX processor. nVidia did this first, AMD copied.

G-Sync. nVidia did this first, AMD and VESA copied.

Raytracing. So it's been around, but nVidia did it first in a consumer GPU. AMD is copying...


What has AMD done first?? Well, they did use HBM on a video card. Didn't amount to anything (innovative) except being overpriced for the performance... Yeah, that's not an innovation.

Well, if you're going to be that pedantic about it: the ATI Rage Fury MAXX. Two "modern" GPUs on the same video card. Sure, 3dfx did it first, but apparently that doesn't count either.
 
It's crazy there's debate about this when all this amounts to is another revenue stream that locks users into their brand.
 
I thought NVLink was faster than it is...
SFR is just too situational to be good, and AFR often introduces long frame times and other such nonsense. From what I understand, when gaming, many titles on MCM-based cards often only use one of the cores, not all of, say, four. So they need a method that works with a theoretical 4- or maybe 8-way GPU setup. AFR in this case is bad because you would be rendering too many frames in flight, each at, say, 1/3 the speed of a monolithic core. SFR could work, but how do you decide which core handles which part of the screen? If you just broke it into 4-8 sections, you would end up with a very unevenly distributed workload. With a checkerboard approach, the scene could be broken into as many pieces as needed and just distributed amongst the available cores.
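
To make that concrete, here is a minimal sketch of the checkerboard idea: split the frame into fixed-size tiles and hand them out across a hypothetical 4-GPU setup. The tile size, GPU count, and a simple round-robin interleave (rather than a truly random one) are my own assumptions for illustration, not anything NVIDIA has documented for CFR.

```python
# Hypothetical sketch: split a frame into fixed-size tiles and assign them
# to GPUs in a checkerboard / round-robin pattern. Tile size and GPU count
# are arbitrary assumptions, not NVIDIA's actual CFR parameters.

TILE = 64          # assumed tile edge in pixels
NUM_GPUS = 4       # assumed chiplet/GPU count

def checkerboard_assign(width, height, num_gpus=NUM_GPUS, tile=TILE):
    """Return {gpu_id: [(x, y), ...]} mapping each tile origin to a GPU."""
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for row, y in enumerate(range(0, height, tile)):
        for col, x in enumerate(range(0, width, tile)):
            # Offset each row so neighbouring tiles land on different GPUs,
            # giving a checkerboard interleave rather than big SFR halves.
            gpu = (col + row) % num_gpus
            assignment[gpu].append((x, y))
    return assignment

if __name__ == "__main__":
    tiles = checkerboard_assign(1920, 1080)
    for gpu, owned in tiles.items():
        print(f"GPU {gpu}: {len(owned)} tiles")
```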

Checkerboarding wouldn't necessarily have to distribute neighboring squares to different GPUs, and in all probability random distribution would also be suboptimal. It is more about load balancing, like SFR, but the split itself isn't a basic horizontal seam. Load balancing doesn't always have to mean pure compute either, as I'll explain down below.

There are a couple of other techniques that assist checkerboarding: partially resident textures, variable rate shading, variable resolution, and, as many have mentioned in this thread already, tile based rendering. Partially resident textures permit only the data actually being rendered on screen to be cached in GPU memory, which increases the effective capacity of GPU memory even for single-GPU systems; in a multi-GPU environment this increases the effective GPU memory size further. Variable rate shading can be leveraged to help enforce a cap on how many resources are used within a tile: by enforcing a shading compute cap of sorts, each tile can be regulated in how long it takes to render. Variable resolution is another technique to reduce rendering time if a tile is taking too long to generate; instead of, say, a 64 x 64 pixel tile, it is treated as a 48 x 48 tile and then scaled upward in size. For a single 48 x 48 scaled-up tile among 64 x 64 tiles, this would be difficult to notice in motion (especially after the load balance is recalibrated). Tile based rendering, as mentioned already in this thread, focuses geometry work onto what is mostly inside the tile being worked on: rendering out the geometry for the entire frame on each GPU is not necessary, although there is still going to be some overlap. On a single-GPU system this overlap is hidden, as the excess work in many cases can be carried forward to the next tile being worked on.
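
To make the variable resolution and variable rate shading ideas concrete, here is a hedged sketch of how a tile that blew its time budget on the previous frame could be re-rendered at 48 x 48 and upscaled, or shaded more coarsely. The 2 ms budget, the shading rates, and the function names are placeholder assumptions, not anything from an actual driver.

```python
# Illustrative sketch only: if a 64x64 tile took too long last frame,
# render it internally at 48x48 and upscale, or coarsen its shading rate.
# The budget and thresholds are made-up numbers.

FULL_RES = 64
REDUCED_RES = 48
TILE_BUDGET_MS = 2.0   # assumed per-tile time budget

def choose_tile_resolution(last_frame_ms):
    """Pick the internal render resolution for a tile based on its history."""
    if last_frame_ms > TILE_BUDGET_MS:
        return REDUCED_RES       # render 48x48, scale up to 64x64 afterwards
    return FULL_RES

def choose_shading_rate(last_frame_ms):
    """Coarsen variable rate shading as the tile approaches its budget."""
    if last_frame_ms > 1.5 * TILE_BUDGET_MS:
        return "2x2"             # one shading result per 2x2 pixel quad
    if last_frame_ms > TILE_BUDGET_MS:
        return "1x2"
    return "1x1"                 # full-rate shading

if __name__ == "__main__":
    for ms in (0.8, 2.3, 3.5):
        print(ms, choose_tile_resolution(ms), choose_shading_rate(ms))
```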

The load balancing algorithm for distributing the workloads is the tricky part. Merging the smallest tiles that make up the checkerboard together to create a coding-tree-unit structure would be more efficient: the coding tree structure still breaks down to individual tiles for rendering but permits groups of tiles to be assigned to a GPU together. There are four resources that drive how the trees are built: geometry complexity, shader complexity, memory usage, and caching. The first two are simply measured by the time it takes to process their tile, and memory capacity is pretty self-explanatory. Caching is a bit different, as it leverages data shared between tiles as much as possible for increased efficiency; the more caching, the greater the performance, but it has to yield to the other factors as well. The coding tree structure is reconciled for each resource and then work is distributed in an optimal fashion based upon those factors. This is a complex process, but GPUs already have some of the hardware for this in silicon to handle real-time HEVC encoding. The hardware encoder can attempt to predict what the coding tree structure will look like for the next key frame and issue changes predictively. GPUs get a bonus here in gaming, as the game engine can feed hints into this algorithm (looking left in game will shift the structure left). Data on how long previous tiles took to render can be used to predict how long the set of tiles for the next frame will take, and any major discrepancy can be addressed by altering variable rate shading or issuing a rebuild of the coding tree structure to re-balance the load prior to a scheduled redistribution.
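
A rough sketch of the load-balancing step described above: tiles carry an estimated cost from the previous frame, neighbouring tiles are merged into small groups (a crude stand-in for the coding-tree-unit idea), and groups are handed out greedily to whichever GPU currently has the least work. The cost model, group size, and names are my own assumptions, not the actual HEVC-derived hardware path.

```python
import heapq

def group_tiles(tile_costs, group_size=4):
    """Merge consecutive tiles into groups so a GPU gets contiguous work."""
    groups = []
    for i in range(0, len(tile_costs), group_size):
        chunk = tile_costs[i:i + group_size]
        groups.append((sum(c for _, c in chunk), [t for t, _ in chunk]))
    return groups

def balance(groups, num_gpus=4):
    """Greedy longest-processing-time style assignment of groups to GPUs."""
    heap = [(0.0, gpu, []) for gpu in range(num_gpus)]   # (load, id, tiles)
    heapq.heapify(heap)
    for cost, tiles in sorted(groups, reverse=True):     # biggest groups first
        load, gpu, owned = heapq.heappop(heap)           # least-loaded GPU
        owned.extend(tiles)
        heapq.heappush(heap, (load + cost, gpu, owned))
    return {gpu: (load, owned) for load, gpu, owned in heap}

if __name__ == "__main__":
    # tile id -> estimated cost in ms from the previous frame (made-up data)
    costs = [(t, 0.5 + (t % 7) * 0.3) for t in range(32)]
    for gpu, (load, owned) in sorted(balance(group_tiles(costs)).items()):
        print(f"GPU {gpu}: {load:.1f} ms estimated, {len(owned)} tiles")
```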


This scheme is not without its downsides. Already hinted at is that optimal caching becomes increasingly difficult as tiles are spread out across multiple GPUs. Memory coherency between GPUs will be important for scheduling. Bandwidth and latency between cards will have to tolerate the transport of completed tiles, and the more GPUs there are in a system, the more important the interconnect becomes. Tile sizing will be a factor here, as will the maximum permitted tile size in a coding tree structure. Tile based rendering still has a small overlap with other tiles, so scaling, while good, falls short of perfect. The coding-tree-unit structure re-purposes parts of the HEVC encoding pipeline, but that functionality would have to be located close to the chip's own tile and shader scheduling, and keeping track of the additional factors for load balancing may take more than what the existing HEVC encoder has. There is also obvious driver overhead, as ideally the multi-GPU setup would be seen as a single larger GPU by applications; even with the hardware between the GPUs handling the load balancing, the drivers still have to feed data to the applications and coordinate things at the system level, which is a non-trivial task. Visual quality can be compromised by the variable rate shading and dynamic resolution scaling (console gamers have been rather tolerant of the latter, though). The new generation of low-level APIs like DX12 and Vulkan were somewhat designed to permit more multi-GPU flexibility, but emulating two cards as a single bigger card could easily break them. Legacy APIs will of course have their fair share of compatibility issues, as has been seen with alternative multi-GPU techniques beyond AFR.
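
For a back-of-the-envelope feel for the interconnect point, here is a quick calculation of how much raw bandwidth moving finished tiles to the display GPU could take. The resolution, bit depth, frame rate, and GPU count are example assumptions, and this only counts final colour data, not intermediate buffers or synchronisation.

```python
# Rough, assumption-laden estimate of tile transport bandwidth in a 4-GPU
# checkerboard setup where 3 GPUs ship their finished tiles to the one
# driving the display. Real traffic would differ.

WIDTH, HEIGHT = 3840, 2160      # assumed 4K target
BYTES_PER_PIXEL = 4             # assumed RGBA8 output
FPS = 120                       # assumed frame rate
NUM_GPUS = 4

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
# Each non-display GPU owns roughly 1/NUM_GPUS of the tiles.
shipped_per_frame = frame_bytes * (NUM_GPUS - 1) / NUM_GPUS
gbps = shipped_per_frame * FPS / 1e9

print(f"~{gbps:.1f} GB/s of finished-tile traffic at {FPS} fps")
# ~3 GB/s here: well within NVLink-class links, but non-trivial over PCIe
# once intermediate data and synchronisation are added on top.
```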

Regardless, there are far more resources today to implement multi-GPU scaling than there was in the past.
 

Thanks for this detailed post -- how much of this scheduling should be happening at the CPU/driver level? (or, more appropriately, given recent scale-up in physical cores, is this something that can/should leverage that capability?)

My harebrained side wonders if this kind of scheduling is something that could be largely predicted heuristically a priori by training the system configuration (cough, AI, cough) on similar workloads and monitoring utilization for feedback. Perfect scaling? Absolutely not, but hopefully better than the rigid scheduling delivered by fixed-code drivers.
 