Demystifying Asynchronous Compute - V1.0

Ieldra

Demystifying Asynchronous Compute

As some of you may know, this is one of my favorite topics ( ;) ) and I thought I'd try to write a sort of "definitive guide" to dispel some of the misconceptions, rumors and hysteria over this feature.

So to start off, we're going to briefly look at what the DirectX 12 specification says, because a little background is needed in order to approach asynchronous compute.

DirectX 12; Command Lists and Multi-Engine


DirectX 12 employs a different command submission model using command lists, each of which is created by one CPU thread and submitted to one of three queues, corresponding to one of three "engines".

We have to be a little careful when reading Microsoft documentation, because their terminology often doesn't coincide with that used by hardware vendors.

Most modern GPUs contain multiple independent engines that provide specialized functionality. Many have one or more dedicated copy engines, and a compute engine, usually distinct from the 3D engine. Each of these engines can execute commands in parallel with each other. Direct3D 12 provides granular access to the 3D, compute and copy engines, using queues and command lists.

Synchronization and Multi-Engine (Windows)


So an "engine" is essentially a command processor, and each has it's own queue(s) with the API exposing signalling in the form of fences which are used to coordinate work across queues.

An "engine" is an API construct, to be clear, neither AMD/NV refer to any actual hardware blocks as engines - thanks razor1

The critical thing here is independence: the hardware is no longer forced to approach things sequentially. You can still execute compute and DMA (copy) commands through the graphics queue, but you also have the liberty to exploit the features exposed by the new API, which grant more granular control to the developer.
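To make that a bit more concrete, here is a minimal sketch (mine, not from any SDK sample) of what multi-engine submission looks like through the API. It assumes a valid ID3D12Device* and two already-recorded command lists; gfxList and computeList are made-up names, and error handling is omitted.

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: one "3D engine" queue, one "compute engine" queue, and a fence that
// makes the compute work wait for the graphics work on the GPU timeline.
void SubmitAcrossQueues(ID3D12Device* device,
                        ID3D12CommandList* gfxList,      // hypothetical, pre-recorded
                        ID3D12CommandList* computeList)  // hypothetical, pre-recorded
{
    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;     // the 3D "engine"
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // the compute "engine"
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Submit graphics work; the queue signals the fence when it has finished.
    gfxQueue->ExecuteCommandLists(1, &gfxList);
    gfxQueue->Signal(fence.Get(), 1);

    // The compute queue waits on the GPU timeline; the CPU thread never blocks.
    computeQueue->Wait(fence.Get(), 1);
    computeQueue->ExecuteCommandLists(1, &computeList);
}

The fence only exists because I chose to express a dependency here; drop the Wait and the two command lists are free to run concurrently if the hardware can manage it, which is the whole point of exposing the queues separately.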


This means you can have graphics, compute and DMA dispatches in flight in parallel, through independent dispatchers.

The compute queue, or rather the compute "engine" (the ACEs for AMD, the GMU for NVIDIA), does not have access to fixed-function hardware: no rasterizers, no geometry engines, etc. The compute queue is good for things that need ALU/FPU power and not much else.

The copy queue is self-explanatory, but don't forget about it; the key term really shouldn't be "async compute" but "multi-engine", though we'll get to that later.

Parallelism and Concurrency

Parallelism and concurrency can be defined in a multitude of ways, and a quick Google search will easily turn up several definitions for each.

Generally speaking, parallelism is a condition that stems from having multiple work units; the key here is independence. A task that can be broken up into a series of subtasks that can be executed simultaneously is said to be parallelizable.

Slapping people is an inherently parallelizable task for the vast majority of the population; if you split slapping up into left-handed slapping and right-handed slapping, you can now slap with both hands, simultaneously - you scoundrel.

You can't talk about parallelism in a void; you need to specify "where" the parallelism takes place. At the bit level? At the instruction level? At the thread level?

The transition from 32-bit to 64-bit CPU architectures brought increased bit-level parallelism.

The transition to micro-ops in Intel CPUs brought increased instruction-level parallelism.

The essence of parallelism is independence and task subdivision.

Concurrency can be seen as a more general notion than parallelism, with parallelism being a subset of concurrency upon which an additional condition is placed: independence.

The key to concurrency is interruptibility. In a concurrent execution model, multiple tasks move forward in their execution within the same time span on a single work unit. The start and end times of the running tasks overlap.

If you weren't content with slapping people with two hands and wanted to add insult to injury, you could spit in their eye; but you know how it is, you're a busy person, you don't have all day. It takes five seconds to go through one cycle of the Parallel Slap™; if you add a two-second spit routine after it, you'll only get to do it about 8 times a minute. Ain't nobody got time for that; we've all been there, time is money.

Instead, you can choose to spit right in their eye in the downtime between your palms making contact with their face and the moment you retract your arms. You can now perform two Parallel Slaps™ and a Concurrent Spit™ in 5 seconds.

Congratulations, you are now slapping in parallel and spitting concurrently - you monster.

Yeah... Ahem... Back on track...


Asynchronous Compute and DX12

Microsoft mentions a few example use cases for multi-engine:

Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks. Background tasks may include decompression of resources or updating simulations or acceleration structures. Background tasks should be synchronized on the CPU infrequently (approximately once per frame) to avoid stalling or slowing foreground work.
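As a small illustration of the background vs. high-priority split described above, queue priority is just a field you set when creating the queue. This is a sketch with made-up names, assuming a valid ID3D12Device*:

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: a normal-priority compute queue for background work and a
// high-priority compute queue for work whose results are needed early.
void CreateComputeQueues(ID3D12Device* device,
                         ComPtr<ID3D12CommandQueue>& backgroundQueue,
                         ComPtr<ID3D12CommandQueue>& highPriorityQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;  // background compute
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&backgroundQueue));

    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;    // high-priority compute
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&highPriorityQueue));
}

How strongly a given GPU and driver honour that priority hint is implementation-specific, which is exactly why the rest of this post focuses on the hardware side.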

Right off the bat, it's worth saying the term "asynchronous" is abused and misused all the time; if two events are asynchronous, they are time-independent of each other. Developers have made claims about their "async compute" implementations in the past that have been rather confusing, because the term is usually applied to rendering tasks, and it ain't asynchronous if there are data dependencies. To my knowledge, there is no title that actually makes use of ASYNCHRONOUS compute. Most use cases would fall under the third bullet point quoted above.

Asynchronous execution arises when you do not have to wait for a routine to return before dispatching additional work; at least, that's how it's conventionally defined, which is somewhat similar to how concurrency was defined earlier - and that's what many people find confusing.

I encourage you to go and Google "what is concurrency?" or "what's the difference between concurrency and parallelism?"; you'll find tons of answers, each slightly different from all the others.



Parallelism: Simultaneous execution of two or more tasks; they are executing at the same instant, and therefore on independent units.

Concurrency: Overlapping execution of two or more tasks; they are not necessarily executing at the same instant, but both tasks are progressing forwards in their execution within the same time-frame.

Asynchrony: Order-independent execution of two or more tasks; a routine can be called before the preceding routine returns.


People often have trouble reconciling asynchrony with concurrency and parallelism, in fact people confuse concurrency and parallelism all the time!

Parallel is the opposite of serial.

Concurrent is the opposite of sequential.

Asynchronous is the opposite of synchronous.
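If it helps, here's the same distinction as a toy C++ sketch; slap() and spit() are invented placeholders for any two independent pieces of work.

Code:
#include <chrono>
#include <future>
#include <thread>

// Placeholders for two independent pieces of work (invented for this post).
void slap() { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }
void spit() { std::this_thread::sleep_for(std::chrono::milliseconds(2)); }

void parallel()      // two tasks, two workers, executing at the same instant
{
    std::thread left(slap), right(slap);
    left.join();
    right.join();
}

void concurrent()    // two tasks, one worker, interleaved within the same span
{
    slap();          // run part of one task...
    spit();          // ...switch to the other during its "stall"...
    slap();          // ...then come back and finish the first
}

void asynchronous()  // dispatch and keep going; don't wait for the call to return
{
    auto pending = std::async(std::launch::async, spit);
    slap();          // forward progress continues immediately
    pending.wait();  // synchronize only when the result is actually needed
}

int main()
{
    parallel();
    concurrent();
    asynchronous();
}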

Let's go back to the previous example; let's say you have a particularly fiendish pet monkey perched on your shoulder. After slapping with both hands in parallel, you exploit the brief stall time before retracting your arms to order your monkey to spit. If we consider the task to be "order your monkey to spit", you are executing it concurrently with a parallel slap. Once the order is given, you can move forward in the SLAP task without waiting for the monkey to actually spit. This is the essence of asynchrony. While from your POV it was concurrent, if you consider the whole system, including the slapping maniac and his asshole-ish monkey, the spitting and the slapping are asynchronous, and possibly parallel: the monkey spits while you are retracting your arms from the slap.

So now, you're slapping with both hands in parallel ( two tasks executing at the same exact time, or a task subdivided into two subtasks which execute simultaneously, just depends on what you define the task to be), ordering your monkey to SPIT (task 2) concurrently, and the monkey is spitting asynchronously - you slapping, spitting lunatic.

You might be wondering why the spitting is considered asynchronous instead of concurrent, all that's changed is that we've offloaded the actual SPIT operation to another unit. Fair game, I understand your confusion and hopefully this will clear it up.

In the first example of parallel slapping and concurrent spitting, your slap and spit operations were not executing at the same time. You "paused" the slap when your hands made contact with skin, switched to spitting, finished spitting, then went back to finish your slapping task by executing the "retract arms" operation.

While concurrency and parallelism improve the throughput of a machine, asynchrony improves latency.

Using a more serious example: a CPU and I/O operations. If the CPU were to wait for I/O operations to return, it would spend a great deal of time waiting; instead, these operations are executed asynchronously, such that the CPU only sends the command(s) and then moves on to another task without waiting for a return. The I/O operations will still take the same amount of time, but the CPU has effectively hidden a big chunk of latency.
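A minimal sketch of that pattern in ordinary C++ (the file name and the busy-work loop are placeholders): the read is issued asynchronously, the CPU keeps doing useful work, and it only blocks when the data is actually needed.

Code:
#include <fstream>
#include <future>
#include <iterator>
#include <string>

// Placeholder I/O routine: read a whole file into a string.
std::string read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}

int main()
{
    // Kick off the I/O; the call returns immediately with a future.
    auto pending = std::async(std::launch::async, read_file, std::string("data.bin"));

    long busy = 0;
    for (int i = 0; i < 1000000; ++i)  // useful work overlapping the I/O
        busy += i;

    std::string data = pending.get();  // block only when the result is needed
    return (busy + static_cast<long>(data.size())) > 0 ? 0 : 1;
}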

So now that we've established what we mean by parallel, concurrent and asynchronous, we can finally move on to how NVIDIA and AMD are able to exploit this new command submission model.

NVIDIA and AMD - Two different approaches


Now, for the purposes of this example, let's assume we have two tasks on two different queues; let's call them A and B.

Task A is on the graphics queue, and it uses some fixed function hardware.

Task B is on the compute queue and uses only ALU/FPU resources.

We have two GPUs; one GCN and one Maxwell based, both containing 10 SMs/CUs.

Let's assume task A executes in 10 milliseconds on a single unit (an SM or a CU).

Task B executes in 3 milliseconds on a single unit.

When task A is assigned to a single unit, both GCN and Maxwell execute it in 10ms with no stall time on the unit whatsoever; CU/SM utilization is 100%.

However, if we spread the workload across all 10 units, instead of executing in 1ms it takes 1.25ms on both GCN and Maxwell; there's 0.25ms of stall time, so utilization is now 80%.

If we were to leave it at that and then execute task B sequentially (spread across all 10 units), we would have a total execution time of 1.25 + 0.3 = 1.55ms.

There's some stall time on the SMs/CUs, however, which means there's room for improvement.
 

GCN and Asynchronous Shaders

Task A is assigned to all 10 CUs. After the first part of the task is processed, the data must then be sent to a fixed-function unit (a rasterizer, for example) before returning to the CUs for completion.

When this happens, the CUs are idle, and the ACEs dispatch work from task B to each of the CUs.

In order to execute this newly assigned task a context switch must be performed.

A context switch is simply the transfer of all data relevant to a task in execution (registers, cache) to some form of temporary storage and the retrieval of context from another task so it can execute.

The effectiveness of this approach is contingent on the context switch latency being significantly smaller than the execution time of the task being swapped in. If this is not the case, then the context switching latency will have a measurable effect on the stall time on the CUs, which is the issue we are trying to solve.

So on GCN each of the ACEs can dispatch work to each of the CUs, and they enable very fast context switching thanks to a dedicated cache.

Now, operating under the assumption that context switch latency is negligible and that the 0.25ms of stall time from task A is contiguous, this is what happens:

Task A is dispatched to all 10 CUs.
Task A is in execution for 0.5ms.
Task A is dispatched to the fixed-function unit(s) using the intermediate result from each CU.
The ACEs assign parts of task B to each CU.
Task A's context is swapped to the dedicated cache within an ACE.
Task B is dispatched.
Task B executes on all 10 CUs for 0.3ms.
Task B is finished.
Task A's context is swapped back into each CU.
Task A executes for 0.5ms.
Task A is complete.

Total time = 1.3ms vs 1.55ms without exploiting multi-engine
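A quick sanity check on that arithmetic, using only the numbers from the example (context-switch cost assumed negligible, as stated above; these are illustrative figures, not measurements):

Code:
#include <cstdio>

int main()
{
    const double taskA_spread = 1.25;  // task A across 10 units, including the stall
    const double taskA_stall  = 0.25;  // time the units sit idle waiting on the FFUs
    const double taskB_spread = 0.30;  // task B across 10 units

    // Sequential: run A to completion, then run B.
    const double sequential = taskA_spread + taskB_spread;                 // 1.55 ms

    // Multi-engine: B fills the stall, but B (0.30 ms) outlasts the gap
    // (0.25 ms), so the extra 0.05 ms lands at the end instead.
    const double overlapped = (taskA_spread - taskA_stall) + taskB_spread; // 1.30 ms

    std::printf("sequential: %.2f ms, overlapped: %.2f ms\n", sequential, overlapped);
    return 0;
}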

So the CUs are executing graphics and compute tasks concurrently, and the execution of work using the FFUs is asynchronous: the work is dispatched and the CU moves on to task B without waiting for the FFU operation(s) to return.

The FFUs and the CUs are executing tasks A and B in parallel, but this parallelism does not occur within the CU; it occurs at a higher level: the shader engine level.

NVIDIA's architecture does not include a dedicated context swap cache; context swaps go off-die to VRAM. This is very slow. Context switch latency is orders of magnitude higher than on GCN. The approach outlined above is totally untenable on a Maxwell or Pascal GPU.

The Asynchronous Shaders whitepaper seems to have confused a lot of people because it talks about "simultaneous" execution:

[Excerpts from AMD's Asynchronous Shaders whitepaper]



This is just describing the general problems in implementing a scheme involving multiple independent command streams; context switches are generally only viable if the context switch latency is effectively hidden by the increased utilization.

[Excerpt from the whitepaper]


As you can see in the third paragraph:
...Independent command streams can be interleaved on a Shader Engine and execute simultaneously...

Note they are talking about simultaneous execution on the shader engine. What does this mean? What is a task?

This is how they define a task earlier in the whitepaper:

[Whitepaper excerpt: definition of a "task"]


So what's going on here is that each queue is independent; command lists within each queue execute synchronously, but command lists from different queues (different command streams) are asynchronous with respect to each other. Say Task A (command list A) is a shadowmap render; asynchronous shaders (which is just the name of this implementation involving the ACEs etc.) enables the ACEs to quickly context-swap however many CUs over to Task B while the FFUs in the Shader Engine are processing Task A. The two tasks are thus executing in parallel on the Shader Engine, asynchronously with respect to each other, and concurrently on the CUs. Entiende?

ACEs enable fast context switching, which means they can afford to context switch and run something else in much smaller gaps in utilization than before.



NVIDIA - Maxwell and Pascal

On Maxwell, what would happen is that Task A is assigned to 8 SMs, such that its execution time is 1.25ms and the FFU does not stall the SMs at all. Simple, right? However, we now have 20% of our SMs going unused.

So we assign task B to those 2 SMs, which will complete it in 1.5ms, in parallel with Task A's execution on the other 8 SMs.

Here is the problem: when Task A completes, Task B will still have 0.25ms to go, and on Maxwell there's no way of reassigning those 8 SMs before Task B completes. Partitioning of resources is static (unchanging) and happens at the draw call boundary, controlled by the driver.

So if the driver estimates the execution times of Tasks A and B incorrectly, the partitioning of execution units between them will lead to idle time, as outlined above.

Pascal solves this problem with "dynamic load balancing": the 8 SMs assigned to A can be reassigned to other tasks while Task B is still running, thus saturating the SMs and improving utilization.
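And the same exercise as before for the partitioning example (again just plugging in the numbers above; a sketch, not a measurement):

Code:
#include <algorithm>
#include <cstdio>

int main()
{
    const double workA = 10.0;          // task A: 10 ms of work on a single SM
    const double workB = 3.0;           // task B: 3 ms of work on a single SM
    const int smsA = 8, smsB = 2, smsTotal = 10;

    // Static partitioning (Maxwell-style): the 8/2 split holds for the whole
    // frame, so the frame ends when the slower partition finishes.
    const double staticTotal = std::max(workA / smsA, workB / smsB);       // 1.50 ms

    // Dynamic load balancing (Pascal-style): when A finishes at 1.25 ms, the
    // leftover B work (0.5 SM-ms) is spread across all 10 SMs.
    const double tA       = workA / smsA;                                  // 1.25 ms
    const double bLeft    = workB - tA * smsB;                             // 0.50 SM-ms
    const double dynTotal = tA + bLeft / smsTotal;                         // 1.30 ms

    std::printf("static: %.2f ms, dynamic: %.2f ms\n", staticTotal, dynTotal);
    return 0;
}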

For some reason many people have decided that Pascal uses preemption instead of async compute.

This makes no sense at all. Preemption is the act of telling a unit to halt execution of its running task. Preemption latency measures the time between the halt command being issued and the unit being ready for another assignment.

Pixel-level preemption is good for time-critical tasks like asynchronous timewarp for VR because it means you can delay issuing the halt command: the unit only needs to finish working on the current pixel before halting, dumping its context to VRAM and being ready for a new assignment.

Thankfully, NVIDIA's GTX 1080 whitepaper is pretty clear and divides the "Asynchronous Compute" section into two main points: overlapping workloads and time-critical workloads. Pixel-level preemption is relevant to the latter, while "dynamic load balancing" is relevant to the former.

I don't feel like I have to explain much here; I think I have already covered this above.
[Excerpt from NVIDIA's GTX 1080 whitepaper]



So if we stopped using the term "asynchronous compute" and focused on multi-engine, all our lives would be much more pleasant. DX12 only requires those "engines" and their queues to be exposed; it places no requirements whatsoever on how the independent command streams are executed. "Async Shaders" and "Dynamic Load Balancing" are just marketing terms used to introduce new features, I guess, and frankly it's unusual for such a profoundly architecture-intimate feature to be propelled to the forefront of a marketing campaign.

GCN has one geometry engine and one rasterizer in each Shader Engine (usually 9 CUs). NVIDIA employs a geometry engine per SM and a rasterizer shared by all the SMs in a GPC (4 or 5 of them). The balance of resources is radically different.

Wrapping things up

MS and Multi-Engine:

Synchronization and Multi-Engine (Windows)

Links to whitepapers:

http://international.download.nvidi...al/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf




If you spot any mistakes, or feel you could explain things better, or feel I have totally misinterpreted a whitepaper or a document, tell me in this thread and I will do my best to keep things fixed.

If you're going to contest any of the information here, please don't be an asshole, and source your claims where relevant.
 
Side note added to this, Ieldra: I will never expect the current NV architecture to gain as much as AMD's current architecture, just because of the utilization factor.

Good summary BTW.
 
Long story short: people care about Asynchronous Compute now because Nvidia cards can finally benefit from it.

People care less because NVIDIA supports it, not more. Before, it was a marketing win for AMD; now it's a question of how much there is to gain, and it's too complicated for a marketing spin.

Side note added to this leldra, I will never expect current nV architecture to gain as much as AMD's current architecture just because of the utilization factor.

Good summery BTW.

Yes, I hinted at that at the end with the balance of resources; it should be clear that with different ratios of units you'll have different issues when overlapping workloads.
 
Also, latency reduction, hiding latency, or using asynchronous paths (all of which pretty much amount to the same thing as an end result) doesn't always produce better performance. In theory it should, but if bottlenecks lie outside of the shader array, those will negate the possibility of seeing the end results (it's program dependent). I just ran across this issue without profiling what was going on; I was scratching my head on it for a day because I knew I didn't do anything wrong lol.

Latency doesn't equal performance.
 
You are only focused on parallel execution of tasks by the SMX or Compute Units; in this example, it is correct.

However, a TRUE Multi-Engine approach allows parallel or segmented (Pascal) assignment of SMXs (CUDA Cores, SPs), along with parallel execution in the DMAs and Rasterizer engines as well. This is something Maxwell & Pascal lack.

Pascal only supports partial Asynchronous Compute, and it does it via Dynamic Load Balancing on the SMXs.

There's a very easy to understand video that explains this concept clearly.



Sources used come from AMD, Sony, Gamedevs and Microsoft.
 
You are only focused on parallel execution of tasks by the SMX or Compute Units, in this example, it is correct.

However, a TRUE Multi-Engine approach allows parallel or segmentation (Pascal) assignment of SMX (CUDA Cores, SP), along with parallel execution in the DMAs and Rasterizer engines as well. This is something Maxwell & Pascal lacks.

Pascal only supports partial Asynchronous Compute and it does it via Dynamic Load Balancing on the SMXs.

There's a very easy to understand video that explains this concept clearly.



Sources used come from AMD, Sony, Gamedevs and Microsoft.


Async compute has nothing to do with other parts of the GPU outside of the shader array. The benefits of having async compute can spill over to other parts of the GPU though, hence why I mentioned the bottlenecks. But only up to a certain point: you don't want to be breaking down and segmenting your program to such a degree that the rasterizer, DMAs or caches take a hit. Yes, it will happen if you have too many fragments, especially when using async compute, because you end up reducing latency in the shader array but the side effect is adding stalls (increasing latency) into other parts of the pipeline that shouldn't have been there to begin with; this is exactly what happened to me.
 
You are only focused on parallel execution of tasks by the SMX or Compute Units, in this example, it is correct.

However, a TRUE Multi-Engine approach allows parallel or segmentation (Pascal) assignment of SMX (CUDA Cores, SP), along with parallel execution in the DMAs and Rasterizer engines as well. This is something Maxwell & Pascal lacks.

Pascal only supports partial Asynchronous Compute and it does it via Dynamic Load Balancing on the SMXs.

There's a very easy to understand video that explains this concept clearly.



Sources used come from AMD, Sony, Gamedevs and Microsoft.


A video isn't a source, and you can't expect me to watch it to try and figure out what you're saying. State your case here, don't give me more work to do >_>

If you look at a GPC, you have compute and graphics in parallel; I don't understand your point. "Partial async compute" isn't a thing, and neither is "async compute", really.

Asych compute has nothing to do with other parts of the GPU outside of the shader array, The benefits of having async compute can benefit other parts of the GPU though, hence why I mentioned the bottlenecks. But only up to a certain point, you don't want to be breaking down and segmenting your program to such a point where the rasterizer or DMA's or caches will take a hit. Yes it will happen if you have too many fragments specially when using async compute, because you end up reducing latency in the shader array but the side affect is adding stalls (increasing latency) into the other parts of the pipeline that shouldn't have been there to being with, this is exactly what happened to me.

Obviously if your partitioning of the SMs (assume Pascal) has a detrimental effect on performance of graphics tasks then you're starting to work backwards lol.
 
Obviously if your partitioning of the SMs (assume Pascal) has a detrimental effect on performance of graphics tasks then you're starting to work backwards lol.

Yeah pretty much what I did lol

And this was mentioned by DICE's lead guy in one of his presentations, which wasn't something I remembered, but stupid me thought I was being clever lol. So that video, although it touches on a good point, doesn't go into it deep enough. If you go through the entire B3D thread about concurrency, what that guy was saying is in there, and the optimal way of doing something is vastly different for each architecture. GCN will hit the cache much harder than Pascal or Maxwell will when segmenting an application. So although it's a good attempt at generalizing, it doesn't go deep enough to really show where the crux of the real problems arises from.

The rasterizer on Pascal, and even Maxwell, is also better suited to segmentation as it's tile-based, but Maxwell, as you noted, can't do async without certain things being done which increase under-utilization.

Now, the video also talks about DX11 and DX12 hardware and software and why DX12 is different. At a hardware level, both function similarly; DX12 is not a hardware problem. Having multiple queues and engines is another API construct, not a hardware rule, which the video fails to mention, Bahanime; that is why DX12 can run on older DX11 cards (NV Kepler can, AMD cards prior to GCN can't).
 
Just to be clear, the same argument applies to the copy queue; I didn't mention it specifically because this is about async COMPUTE. Using the COPY queue, you are free to slot in DMA commands when there is little load on I/O, and you don't enforce synchrony with graphics/compute tasks, which is nice.
 
@Bahimine BTW, the guy doesn't source what you stated; he came up with these ideas from their talks, and some of them are not really on the write track (pun intended lol). Using the copy queue for certain effects can be detrimental even to hardware that has good async compute capabilities; again, you need to see what is best for a given architecture, because the copy queue is memory access and cache access (depending on what your needs are).

He also got stuff about shadow maps wrong, regarding what a developer stated about them not using pixel shaders. Oh yes, they do use pixel shaders, if ya don't forget soft shadows lol. The preliminary pass uses vertex shaders to calculate the shadow map, and that is about the only time the vertex shader is involved; after that it's all pixel shaders (or compute now). But shadow maps take a huge hit to parallelization and performance, and that is another topic entirely...
 
the million $$ question is how important is Asynchronous Compute in terms of gaming going forward with DX12?...and why doesn't Nvidia seem to care as much about prioritizing it over AMD?
 
I don't know why or how but it's starting to look like a lot of people are convinced that NV can't use the rasterizer in parallel with the SMs and the DMA engine and other nonsense like that. I don't know where it's coming from lol.

Let's assume my Task A starts with vertex shading, then it's rasterized, then pixel shaders are run. If, while it is being rasterized, I have X SMs running a compute shader using the previous frame's data, then the rasterizer is processing graphics in parallel with the SMs processing compute. Simple.

the million $$ question is how important is Asynchronous Compute in terms of gaming going forward with DX12?...and why doesn't Nvidia seem to care as much about prioritizing it over AMD?

Because for AMD it can provide very tangible gains, particularly in rasterizer-limited scenarios where they lose heavily to NVIDIA, whereas NVIDIA has fewer holes to plug in the first place, so they focus on extracting more meaningful performance gains elsewhere.
 
Funny thing is, the rasterizer, in the traditional way PC graphics have been done, was all one step: the final step for whatever effects the app is going for, before the pixel shaders are involved. AMD hardware is not capable of multiple rasterization steps (it must have the components as the complete screen needs them before the pixel shaders take over); NV hardware using tiled rendering can...

I would not be surprised if this is where NV's utilization of its array goes much higher than AMD's... well, not really utilization; they can actually save on what needs to be done vs what doesn't need to be done, thus gaining efficiency.
 
funny thing is the rasterizer the traditional way PC graphics have been done was all one step, the final step for different effects the app is going for, before the pixel shaders are involved. AMD hardware is not capable of multiple rasterization steps (must have the components as the complete screen needs (pixel shaders take over)), nV hardware using tiled rendering can.......

I would not be surprised if this is where nV's utilization of its array goes much higher and AMD's........ well not really utilization, they can actually save on what needs to be done vs what doesn't need to be done, thus gaining efficiency.

Good point about rasterization ;) I hadn't thought of that.
 
So, talking about what async can provide to other parts of a GPU, you've gotta take into account the different needs and capabilities of that particular GPU, which the guy in the video clearly doesn't do.
 
I'll read this when I'm less drunk. 'Twas an interesting read with a few beers underneath the belt. I might understand it once the beers have been dispatched to the local Used Beer Department™.
 
funny thing is the rasterizer the traditional way PC graphics have been done was all one step, the final step for different effects the app is going for, before the pixel shaders are involved. AMD hardware is not capable of multiple rasterization steps (must have the components as the complete screen needs (pixel shaders take over)), nV hardware using tiled rendering can.......

I would not be surprised if this is where nV's utilization of its array goes much higher and AMD's........ well not really utilization, they can actually save on what needs to be done vs what doesn't need to be done, thus gaining efficiency.
Nvidia's rasterization stage is likely also their conservative rasterization and order independent transparency. No reason GCN wouldn't work with a similar approach beyond it not being implemented. Nvidia would have higher utilization here as their triangle throughput was simply higher by a good margin.

As for the tiled rasterization, some of the benefits of that approach are software dependent. The same approach could be performed on GCN simply by dividing the screen into a bunch of tiles. In the case of foveated rendering and some of the VR tricks this would be the approximate result.
 
Great editorial, much needed cheers!
It got me thinking so I wrote it down...


A relatively simple way to improve latency when context switching and reduce complexity of use:

Job data is written to one of 2 internally linked bit for bit data banks (registers and cache).
The 2nd bank auto populates from the 1st and has a read only switch.
Bank 1 is used for processing the current job. No different to the current setup.
Bank 2 is used for reading out the data from a finished job.

. When the context switch is invoked, for as long as it takes to read the data out, the 2nd bank is switched to "read only" to allow data to be read out at leisure (almost) while the next job starts because bank 1 is free to use.
. When the data has been read out, bank 2 switches back to writable and "auto" populates from bank 1, taking 1 cycle.

This would remove most of the delay caused by reading the registers and cache out. 1 to 2 cycles would be used.


A further improvement can be made with a 3rd bank (also bit-for-bit linked to bank 1) that populates with the incoming "register and cache" data ready for the next context switch.
Banks 1 and 3 will now need a disconnect switch between them.
This 3rd bank may also need a read only switch depending on the logic used to populate it.

. During the processing of a job, Bank 3 is populated with data for the next job.
. During the same cycle as the data is copied out from bank 2 at the end of the job, bank 3 connects to bank 1 and populates it with the data for the next job.
. Banks 1 and 3 disconnect.
. Bank 1 is (always) in a writable state for normal use during processing.


Overall this would use 2 to 4 cycles of latency and would eliminate all other latencies associated with the switch
(assuming it is possible to know what data to populate bank 3 with early enough; this mustn't be dependent on the result of the current job).

This is not the same as a normal cache. A cache is linked by a limited bandwidth connection that would take much more than 1 cycle to populate all the registers/cache for the next job and adds latency.
It is tripling the data and cache storage space using direct bit for bit connections through switches.

It will need more transistors for the bit to bit caches and switches + control logic.
Some of the currently used control logic can be dispensed with because the system will be less complex.
Availability of the engines will increase.
Overall complexity even to driver level will be greatly reduced with a side effect of further reduced latency and ease of use.

Context switches could be almost free.
The downside is that data must be prepared a little earlier for the next context switch.
If this is possible, voila.


The question is, would the extra transistors needed for the bit to bit caches and switching logic be more effectively used on CUs?
Or are the extra transistors a small enough amount in relative terms?
I am thinking on the surface here, a bit poorly, so I haven't dug into it.

edit: edited a few times to clarify parts
 

I suspect something like this is already happening. I haven't actually found any data for how fast the context switch is, but Mahigan said 1 clock, which is redonculous; I'm comfortable with saying "on the order of ~10 clocks" until I find a decent source of information regarding this.
 
Nvidia's rasterization stage is likely also their conservative rasterization and order independent transparency. No reason GCN wouldn't work with a similar approach beyond it not being implemented. Nvidia would have higher utilization here as their triangle throughput was simply higher by a good margin.

As for the tiled rasterization, some of the benefits of that approach are software dependent. The same approach could be performed on GCN simply by dividing the screen into a bunch of tiles. In the case of foveated rendering and some of the VR tricks this would be the approximate result.


Nah, the graphics pipeline is different for a TBR; after triangle setup, rasterization is done. This is not a programmable unit, these are fixed units, so AMD would need to change the setup, and by doing so, the front end changes and pixel shader (data) access would have to be different as well.

I remember AMD talking about this oh 5 or 6 years ago at GDC with their upcoming mobile chips. They were also talking about geometry binning lol, ironically.

Added to this, memory and cache access and control are also different for the fixed-function units.

Here it is:

http://amd-dev.wpengine.netdna-cdn....2/10/gdc2008_ribble_maurice_TileBasedGpus.pdf

Older than that, 9 years ago.
 
I suspect something like this is already happening, I haven't actually found any data for how fast the context switch is but Mahigan said 1 clock which is redonculous but I'm comfortable with saying "in the order of ~10 clocks" until I find a decent source of information regarding this

Interesting.
It can't be fully utilised yet; NVIDIA still has a penalty compared to AMD.
Perhaps the driver still needs work removing redundant elements.
Can you dig a bit deeper?
 
Yeah, 1 clock isn't right; with 1 clock there would pretty much be no latency involved. It's gotta be more than that.
 
Nah the graphics pipeline is different for a TBR, after triangle setup rasterization is done, this is not a programmable unit, these are fixed units, so AMD would need to change the setup and also by doing so, the front end changes and pixel shader (data) access will have to be different as well.

I remember AMD talking about this oh 5 or 6 years ago at GDC with their upcoming mobile chips. They were also talking about geometry binning lol, ironically.

Added to this memory and cache access and control are also different to the fixed function units.

here it is

http://amd-dev.wpengine.netdna-cdn....2/10/gdc2008_ribble_maurice_TileBasedGpus.pdf

older then that 9 years ago.
It would definitely require a hardware change. Should have been more specific about that. The software tiled approach also wouldn't perform as well if you had triangles covering multiple tiles as you would be processing them multiple times. However if geometry processing wasn't the bottleneck the impact would likely be minimal.

I suspect something like this is already happening, I haven't actually found any data for how fast the context switch is but Mahigan said 1 clock which is redonculous but I'm comfortable with saying "in the order of ~10 clocks" until I find a decent source of information regarding this
It should be roughly the time to dump the SIMD's local cache back to the CU's cache. The time likely varies by generation, as I believe Polaris has a larger SIMD cache, but it should be ~16 registers/clock I'd think. So 16-32 clocks would probably be reasonable.
 
Interesting.
It cant be fully utilised yet, NVidia still have a penalty compared to AMD.
Perhaps the driver still needs work removing redundant elements.
Can you dig a bit deeper?

Talking about GCN, not NV :) NV is swapping to VRAM, so latency is through the roof.

yeah 1 clock isn't right, 1 clock pretty much there will be no latency involved. Gotta be more than that.

Who'd have guessed Mahigan got something wrong, eh?

It would definitely require a hardware change. Should have been more specific about that. The software tiled approach also wouldn't perform as well if you had triangles covering multiple tiles as you would be processing them multiple times. However if geometry processing wasn't the bottleneck the impact would likely be minimal.


It should be roughly the time to dump the SIMD's local cache to back to the CU's cache. The time likely varies by generation as I believe Polaris has a larger SIMD cache, but should be ~16 registers/clock I'd think. So 16-32 clocks would probably be reasonable.
Good point that if geometry is already the bottleneck, it won't change much :)
 
talking about GCN not NV :) NV is swapping to vram so latency through the roof

Who'd have guessed mahigan got something wrong eh
Ah lol, missed the context switch :p

Shame NV don't take this approach.
Perhaps it's patented.
 
Ah lol, missed the context switch :p

Shame NV dont take this approach.
Perhaps its patented.

Meh, I think the transistor overhead for implementing such a scheme would outweigh the benefits; the whole point of NV's uarchs from Maxwell onwards is to enhance efficiency and do away with complex scheduling logic.

AMD's and NV's approaches are basically antithetical.
 
Meh I think the transistor overhead for implementing such a scheme would outweight the benefits, the whole point of the NV;s uarchs from Maxwell on are to enhance efficiency and do away with complex scheduling logic .

AMD and NV's approaches are basically antithetical
Yeah, I did wonder. Fairymuff :)
So this is where AMD spent the transistors that DX11 couldn't get its mitts on.
 
Meh I think the transistor overhead for implementing such a scheme would outweight the benefits, the whole point of the NV;s uarchs from Maxwell on are to enhance efficiency and do away with complex scheduling logic .

AMD and NV's approaches are basically antithetical


Yeah, that was a big difference from Fermi to Kepler I think it was; NV removed a good deal of the scheduling hardware.
 
There are definitely some benefits to the scheduling hardware, but they only really show with multi-engine. The software scheduling otherwise becomes tediously complex, with so many variables from the timing. The likely outcome is that Nvidia partitions the GPU for each engine, at which point some parts may have some downtime. It's the time-sensitive tasks which really throw a wrench into things. Nvidia's approach will be good for throughput, which is the case now with DX11, but far less ideal for time-sensitive tasks and variable workloads.
 
No, it's not purely a DX11-only thing either; CUDA isn't bound by any graphics API but can run alongside it, and compute runs just fine with software scheduling. Actually, even OpenCL does too, as Pascal-generation cards tend to do better there as well, even with fewer features than CUDA, in some bit-mining software.

Architectures differ in their scheduling; it's simpler to create optimal code for NV cards for now, and this was one of the reasons for the SM changes from Fermi to Kepler to Maxwell/Pascal.

And according to Andrew Lauritzen (Intel), there is no benefit unless the hardware requires it.

Added note: scheduling can be controlled by the application itself now with LLAPIs, but this also creates problems or conflicts with the driver. It can be good if done well, it can be bad if not done well, and it can be outright disastrous if the dev doesn't know what he/she is doing, hence the bad ports lol. It's always better to have the driver do its thing if inexperienced devs are working on an app.
 
There are some definitely benefits to the scheduling hardware, but they only really show with multiengine. The software scheduling otherwise becomes tediously complex with so many variables from the timing. The likely outcome is Nvidia partitions the GPU for each engine at which point some parts may have some downtime. It's the time sensitive tasks which really throw a wrench into things. Nvidia's approach will be good for throughput which is the case now with DX11, but far less ideal with time sensitive tasks and variable workloads.
For time-sensitive tasks you would preempt, at the pixel level for graphics and at the instruction level for compute, so this wouldn't be an issue. Maybe I don't follow?
 
for time sensitive tasks you would preempt, at the pixel level for graphics, instruction level for compute. so this wouldn't be an issue, maybe I don't follow ?
Preemption is still relatively slow, and the time-sensitive tasks could be coming in fast and furious. Roughly once per frame is acceptable, but not ideal. Think rigid-body collisions that may not be tested every frame. Another possibility would be bullet physics against an animated character without giant hitboxes.
 