Demystifying Asynchronous Compute - V1.0

ah cool, well I can tell ya one thing, PhoBoChai doesn't know what the f he is talking about, he doesn't even know what the graphics pipeline stages are, what order they must take place in, or how fixed function units work with submission. Certain things have to be done in a certain order, otherwise the pipeline fails to produce the desired visuals.

Don't let em get ya worked up, the door will hit him on his way out or actually probably in lol.

Ya have a set amount of fixed function units vs the shader array; if one is overtasked, there is absolutely no way any work gets done in a reasonable time, or in the time allotted, until the unit that is tied up is free.
 
PhoBoChai
I blocked him about 6 months ago. He lied about owning a 780 Ti on /r/nvidia so he could pretend to complain about Nvidia gimping drivers and use it as an excuse to praise AMD.
Absolute trash.

I actually forgot he even existed until your post just reminded me. He and a handful of other people are responsible for turning /r/AMD into a toxic waste dump over the last year. I could name all of them, too, but I won't since this thread is being directly linked from there.
 
I blocked him about 6 months ago. He lied about owning a 780 Ti on /r/nvidia so he could pretend to complain about Nvidia gimping drivers and use it as an excuse to praise AMD.
Absolute trash.

I actually forgot he even existed until your post just reminded me. He and a handful of other people are responsible for turning /r/AMD into a toxic waste dump over the last year. I could name all of them, too, but I won't since this thread is being directly linked from there.

Yeah, all jokes aside, the AMD reddit is strangely toxic. I know they're entombing the remains of Chernobyl in a giant concrete sarcophagus, and I honestly think the AMD subreddit should go with it :p

The AMD subreddit is where independent thought goes to die
 
Nice read, I appreciate your time and effort. But I'm confused... so I should slap my monkey, and spit on it... or
:)
From what my feeble mind can comprehend, if async saves time, why is this tech not being used?
 
that's me :) don't know who posts on r/AMD though
A lot of people in r/AMD don't know the difference between parallel and asynchronous, even though you spelled it out in your post. I guess that is AMD's marketing at work. People on Reddit also have a bad habit of attacking the poster instead of the post.
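Since the parallel vs asynchronous distinction keeps tripping people up, here's a tiny C++ sketch of the difference (purely illustrative, the function names are made up): asynchronous means you fire work off and don't sit around waiting for it; parallel means two pieces of work are literally executing at the same time. A task can be one, the other, or both.

```cpp
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int simulate_particles() { return 42; }   // stand-in for a compute job
void render_frame() {}                    // stand-in for graphics work

int main() {
    // Asynchronous: kick off the compute job and keep going without waiting.
    // Whether it actually runs in parallel is up to the runtime/hardware.
    std::future<int> result = std::async(std::launch::async, simulate_particles);
    render_frame();                                       // not blocked here
    std::cout << "particles: " << result.get() << "\n";   // the sync point

    // Parallel: two workers genuinely executing at the same time.
    std::vector<std::thread> workers;
    for (int i = 0; i < 2; ++i)
        workers.emplace_back([] { simulate_particles(); });
    for (auto& t : workers) t.join();
}
```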
 
If I were to take a shot in the dark at guessing why...

Nvidia can't figure out how to implement it properly and they have the largest share of the video card ecosystem on the PC. The only reason that you see it in PC games nowadays is because it is baked into console games to get the most out of those much less powerful APUs contained within them. Since it is already implemented within the games before they are ported to the PC, why not just leave it in there and do minimum maintenance for the DX12 port? Probably a lot cheaper to do it this way also.

Doesn't work that way; async is resource dependent (allocation and whatnot), and PC hardware, even AMD hardware, differs in resources across generations and even within each gen. This is the same naive thinking that got all the folks riled up in the first place. Async makes it more complex to port, more time consuming, more money. LLAPIs shift cost onto the developer and away from IHVs and driver teams, a big bonus for AMD since they don't have the resources......
 
If I were to take a shot in the dark at guessing why...

Nvidia can't figure out how to implement it properly and they have the largest share of the video card ecosystem on the PC. The only reason that you see it in PC games nowadays is because it is baked into console games to get the most out of those much less powerful APUs contained within them. Since it is already implemented within the games before they are ported to the PC, why not just leave it in there and do minimum maintenance for the DX12 port? Probably a lot cheaper to do it this way also.

Two points:

1) nVidia supports async just fine; it's just that AMD's and nVidia's implementations are quite different. Furthermore, in DX12, it's up to the developer to implement it correctly for both IHVs.

2) AMD APUs are not weak, but they are underutilized. If the APU was weak, async wouldn't make them magically better. The only reason that async gives AMD APUs (and GPUs) any performance benefits at all is because there's performance available that's not being used during 'normal' API function calls on AMD hardware.

Honestly, this has been discussed to death so many times on this forum (and others) that I think you're almost trolling for an indignant response (you can't intentionally be this obtuse after almost two years' worth of discussion).
 
If I were to take a shot in the dark at guessing why...

Nvidia can't figure out how to implement it properly and they have the largest share of the video card ecosystem on the PC. The only reason that you see it in PC games nowadays is because it is baked into console games to get the most out of those much less powerful APUs contained within them. Since it is already implemented within the games before they are ported to the PC, why not just leave it in there and do minimum maintenance for the DX12 port? Probably a lot cheaper to do it this way also.

To put it bluntly...
AMD used a lot of transistors making Async work better.
NVidia instead used those transistors to make more Compute Units, this has been termed brute force because it avoids having to make async more efficient.

While Async can be a tad faster, it needs a lot of work on the driver and coding side.
This requires a lot of knowledge and effort from devs.
More CUs give higher general performance without jumping through hoops; it's much easier to get high performance that way.

If AMD could match the clocks of NVidia and produce good libraries + dev support to make best use of async, they might be able to compete.
But DX11 won't be going anywhere soon, so this won't help them at all.
Only on DX12 do they have a chance to liberate the potential of their GPUs, and that's not simple. It's a very steep climb given the performance difference between the best GPUs from each.
 
Leldra, superb write up. Thank you. :)

If I were to take a shot in the dark at guessing why...

Nvidia can't figure out how to implement it properly and they have the largest share of the video card ecosystem on the PC. The only reason that you see it in PC games nowadays is because it is baked into console games to get the most out of those much less powerful APUs contained within them. Since it is already implemented within the games before they are ported to the PC, why not just leave it in there and do minimum maintenance for the DX12 port? Probably a lot cheaper to do it this way also.

Doesn't make sense. If Async is so important for compute, Nvidia would have implemented it for Tesla.
 
On consoles developers use every trick in the book to get more performance from those APUs. ASYNC is just one tool of many.


Time constraints on the life cycle of PC hardware are different and much faster than on consoles. Devs would need a full set of libraries they can pull from on the PC front, and even then it's impossible for them to code for every single hardware solution on the PC side.

PCs aren't bound by closed boxes. Devs do have to use every trick in the book on consoles to get what they want, but that too takes time, and you can see the differences between the first games that come out on a new console and the games that come out at the end of a console's life.

Now let's get back to async and its methodology and not this drivel, which we have discussed many times over.
 
It works best for nvidia if the PC API ecosystem stagnates. And since they command the high end, it's gunna take a long time before proper multi-engine based games become the standard.
 
back to topic

nice video I forgot about.



Goes into the differences between Maxwell 2 and Pascal.

And yes, this is talking about Vulkan. So Doom, as I stated before, is not doing what needs to be done for Pascal's hardware implementation of concurrency (not multi engine or async compute), and it all has to do with timing: Doom was close to release when Pascal was launched, so the devs had no access to the hardware and probably didn't even know about the launch until we did, as it was a surprise to all that it came out so soon.

For further understanding of multi engine you need to understand the command buffers and how they feed the pipelines.

 
Well, not taking sides here. But is the implementation of async meant to yield better performance, or to save time in a task, or both? Complexities aside, is the lack of implementation due to the tech being new, or is it just not worth the effort? It sounds impressive on one hand, but because the two GPU vendors have different approaches, it appears their expectations are different. I think programmers/designers are like engineers: they want the best, most advanced product ever created, but then the bean counters/sales and marketing butt-holes get a hold of it and say remove this, add that, and what's left is not what the engineer intended. So, you seem very passionate about async, but as a gamer/end user, all we get are buzz words like this. I remember the whole "MMX" instructions that were going to revolutionize the industry so you wouldn't even need a 3D accelerator. Still waiting on that one. I don't want to derail this thread, just trying to understand the tech and what its future will bring, from a layman's perspective. Thanks
 
actually your post is pretty much on topic ;)

Both: time saved in a task and performance, but only if done correctly for a given GPU.

DX and Vulkan don't stipulate how the hardware must do the task at hand; it just has to be able to do it, and this is where the divergence comes from.

Example: DX12 doesn't say async shaders (a marketing term for concurrency of graphics and compute) are a prerequisite for DX12 certification. It's actually outside the spec, and the MSDN pages clearly mention that certain hardware can do this and certain hardware cannot. But since it's being used, nV had to implement some version of it that fit its hardware design. Prior to Pascal, the only IHV that had it was AMD with GCN; Intel didn't have it and still doesn't.

The positives are there if the developer has the resources and time (time being part of resources, but the most important part, because devs are usually bound to a publisher's schedule).

New tech is always good; it pushes the industry forward even if it's not fully or easily usable yet, because future tech will get those two things sorted through standards or one approach winning out. If it's something that just gets superseded by something else, then that tech dies.
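To make the "DX12 doesn't stipulate how" point concrete, here's roughly all the API gives you: you create a 3D (direct) queue and a compute queue and submit to both. A minimal D3D12 sketch, assuming an ID3D12Device* already exists; whether the GPU overlaps the two queues is entirely down to the hardware and driver.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Assumes 'device' was created elsewhere (hypothetical context for the sketch).
void CreateEngineQueues(ID3D12Device* device,
                        ComPtr<ID3D12CommandQueue>& gfxQueue,
                        ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    // "3D" queue: takes graphics, compute and copy work.
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    // Separate compute queue. This is all the spec mandates: two queues you
    // can submit to. Concurrent execution on the shader array is an
    // implementation detail of the IHV, not a DX12 requirement.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}
```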
 
Thanks for the response.
So basically we are in a transitional phase here concerning the implementation of async, with all these DX9-11 systems/game engines out there still making money.
Another question: didn't the Oxide Star Swarm Mantle demo use async, and the AOTS game? Something about draw calls per frame, is that where the performance boost is?
 
Async and draw call performance aren't really connected; yeah, many have correlated them in the past because of poor information. Draw call limitations with older APIs are due to the API's inability to use more than one core of the CPU. When a draw call is issued (draw calls are used for different textures, meshes, etc., anything that isn't grouped together, so the renderer and CPU see them as separate objects because different shaders have different needs), the CPU must be involved. And because CPU IPC hasn't increased much, draw calls have become a bottleneck; 6-year-old CPUs still perform close to today's CPUs. This is where the new LL APIs come into play: since the application can now dedicate one core to GPU submission and use the other cores for the application's needs, it's a much more effective way of using the total available CPU resources.
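To illustrate the multi-core angle of the LL APIs described above, here's a rough D3D12-flavoured sketch of recording draw calls on several cores and submitting once (RecordChunk is a made-up helper, and the device/queue are assumed to exist already):

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: records one chunk of the frame's draw calls.
void RecordChunk(ID3D12GraphicsCommandList* cl, int firstObject, int count);

// Record draw calls on several CPU cores, then submit once.
// Assumes 'device' and 'queue' were created elsewhere.
void BuildFrame(ID3D12Device* device, ID3D12CommandQueue* queue, int numThreads)
{
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocs(numThreads);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(numThreads);
    std::vector<std::thread>                       workers;

    for (int i = 0; i < numThreads; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocs[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocs[i].Get(), nullptr,
                                  IID_PPV_ARGS(&lists[i]));
        // Each core records its own slice of the scene's draw calls.
        workers.emplace_back([&, i] {
            RecordChunk(lists[i].Get(), i * 1000, 1000);
            lists[i]->Close();
        });
    }
    for (auto& t : workers) t.join();

    // One submission of everything the worker threads recorded.
    std::vector<ID3D12CommandList*> raw;
    for (auto& l : lists) raw.push_back(l.Get());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}
```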

Star Swarm, I'm not sure if it had async compute, but it was on Mantle and it did show very high draw call levels, which I think was its primary test?

AOTS does use async compute, and recent benchmarks show that nV, even on Maxwell which isn't doing async, is doing fine in it, and Pascal is definitely doing better, with performance gains being shown. And of course GCN does very well with the implementation of async in that game too. But with AOTS you can't get a good look at async compute performance in most reviews, because they don't test for that exclusively; they do DX11 vs DX12 tests, which include both async compute and all the other speed-increasing features of LL APIs, of which relieving draw call limitations is one that helps performance.
 
Ah, I see. That explains a lot. I was under the assumption they were connected.

Hopefully with Zen, Vega and hUMA we could see some impressive performance gains with these new APIs. Not picking one team or the other, but competition creates innovation and dominance creates stagnation.
Thanks again for taking the time to answer my questions
 
To put it bluntly...
AMD used a lot of transistors making Async work better.
NVidia instead used those transistors to make more Compute Units, this has been termed brute force because it avoids having to make async more efficient.

This is exactly what I was looking for/trying to understand.
Thanks for clarification.
Great post(s) OP & everyone else that chimed in!
 
This is exactly what I was looking for/trying to understand.
Thanks for clarification.
Great post(s) OP & everyone else that chimed in!
I never understood the "NV uses brute force" argument. First of all, in a GPU, using brute force is perfectly acceptable; second... AMD has more ALU/FPU units, so if anything theirs is the brute force approach.
 
Yeah, there is no such thing as brute force on either architecture; nV does better at some things, AMD does better at others. Regarding async, nV's is not a brute force method at all: they have higher utilization to begin with, and a method of doing async in their own way. It should be looked at as more efficient, if apps are made the way Pascal needs them to be made.

AMD, on the other hand, has its merits: when doing preemption it is more efficient. When doing async it too is good, but again, same as Pascal, it needs to be programmed for in a specific way. Utilization is lower to begin with (harder to achieve); it's not really brute force, but more work needs to be done to get that utilization out, though with async that work isn't as necessary.
 
Doesn't work that way; async is resource dependent (allocation and whatnot), and PC hardware, even AMD hardware, differs in resources across generations and even within each gen. This is the same naive thinking that got all the folks riled up in the first place. Async makes it more complex to port, more time consuming, more money. LLAPIs shift cost onto the developer and away from IHVs and driver teams, a big bonus for AMD since they don't have the resources......
More complex to port to Nvidia hardware or DX11 maybe. Once SM6.0 hits it seems like we'll be fairly close to console code executing on PC. At least in regards to shaders and rendering paths. It might not be perfectly tuned to maximize PC hardware, but performance should still be really good. Simply getting the low level optimizations ported over will go a long way.

While Async can be a tad faster, it needs a lot of work on the driver and coding side.
This requires a lot of knowledge and effort from devs.
More CUs give higher general performance without jumping through hoops; it's much easier to get high performance that way.
Does it really take that much more work though? The whole point of async in any application is avoiding synchronization headaches. The coding they are doing is something that would be required for any multi-threaded synchronization. It's as simple as implementing a task graph (think flowchart). In the case of AOTS a single dev implemented async in a weekend to my understanding. While it doesn't show in benchmarks, I'd imagine there is some difficulty in getting it working effectively for different vendors.
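For the task graph idea, a toy C++ sketch (invented names): two passes with no edge between them get issued asynchronously, and the pass that consumes both simply waits on their results.

```cpp
#include <future>

struct ShadowMap {};
struct ParticleData {};

ShadowMap    render_shadow_map()  { return {}; }  // graphics-ish work
ParticleData simulate_particles() { return {}; }  // compute-ish work
void composite(const ShadowMap&, const ParticleData&) {}

int main() {
    // Independent nodes of the graph: no edge between them, so they may overlap.
    auto shadows   = std::async(std::launch::async, render_shadow_map);
    auto particles = std::async(std::launch::async, simulate_particles);

    // Dependent node: its incoming edges are the two get() calls (the sync points).
    composite(shadows.get(), particles.get());
}
```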

Doesn't make sense. If Async is so important for compute, Nvidia would have implemented it for Tesla.
It's not important for compute so much as graphics and compute. Most of the gains from async compute are coming from pairing compute tasks with trivial rasterization tasks (shadows and z prepasses). If Tesla, or a future compute design, had some sort of discrete coprocessor on board it might make sense. Even then for a HPC workload the programmer could attempt to schedule complementary tasks. Takes some extra work, but you're talking an environment approximating that of consoles. In theory a hardware scheduler handling async would make the programming model easier. It should be able to automatically tune the scheduling for whatever workload is running. It might not be perfect, but far easier for a developer.

And yes, this is talking about Vulkan. So Doom, as I stated before, is not doing what needs to be done for Pascal's hardware implementation of concurrency (not multi engine or async compute), and it all has to do with timing: Doom was close to release when Pascal was launched, so the devs had no access to the hardware and probably didn't even know about the launch until we did, as it was a surprise to all that it came out so soon.
Running a demo with Vulkan during Pascal's product launch without testing first kind of seems like playing with fire does it not?

I know we've discussed this privately. But concurrency isn't the only benefit of async. It's that whole asynchronous vs concurrent argument that still differentiates GCN from Pascal. While it's currently problematic for Nvidia, the results on GCN I think speak for themselves. Gamers seem generally happy with the results, and that's arguably the most important point. Providing exceptionally low input lag for a shooter is significant in my book.

Example: DX12 doesn't say async shaders (a marketing term for concurrency of graphics and compute) are a prerequisite for DX12 certification. It's actually outside the spec, and the MSDN pages clearly mention that certain hardware can do this and certain hardware cannot. But since it's being used, nV had to implement some version of it that fit its hardware design. Prior to Pascal, the only IHV that had it was AMD with GCN; Intel didn't have it and still doesn't.
Outside the spec because it wasn't worth it or politics dictated it not be included? Nvidia shouldn't have any trouble executing async shaders. They just need to partition the GPU a bit. Sure it hurts performance a bit and the timing won't be as good, but it should run just fine if necessary. Of course there is always the option to not use the feature.

Star Swarm, I'm not sure if it had async compute, but it was on Mantle and it did show very high draw call levels, which I think was its primary test?
It was all about overhead and lots of draw calls. Doubt async came into play there at all. It was more a tech demo for low level APIs.
 
More complex to port to Nvidia hardware or DX11 maybe. Once SM6.0 hits it seems like we'll be fairly close to console code executing on PC. At least in regards to shaders and rendering paths. It might not be perfectly tuned to maximize PC hardware, but performance should still be really good. Simply getting the low level optimizations ported over will go a long way.

The same issues exist with AMD hardware as well; intrinsic functions don't all map over, and AMD hardware has different features across its generations. Funny thing is, Vulkan actually has more in common with most GCN gens than DX12 does!


Does it really take that much more work though? The whole point of async in any application is avoiding synchronization headaches. The coding they are doing is something that would be required for any multi-threaded synchronization. It's as simple as implementing a task graph (think flowchart). In the case of AOTS a single dev implemented async in a weekend to my understanding. While it doesn't show in benchmarks, I'd imagine there is some difficulty in getting it working effectively for different vendors.

No, that is not the point of async shaders. Graphics + compute have to have sync points; there is no real way around it. Async compute is not async shaders......

Starting from code through to the final compositing of frames, certain steps have to be done in a certain way and things have to be synced up, otherwise you will get artifacts. Within each step is where async shaders and async compute do their thing, but even those are bound to dependencies based on the application and what the dev is going for.
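In DX12 those sync points between the graphics and compute queues are expressed with fences. A minimal sketch, assuming the device and both queues already exist:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Make the graphics queue wait until the compute queue has finished the work
// submitted before the Signal. Device and queues are assumed to exist already.
void SyncComputeIntoGraphics(ID3D12Device* device,
                             ID3D12CommandQueue* computeQueue,
                             ID3D12CommandQueue* gfxQueue)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    const UINT64 done = 1;
    computeQueue->Signal(fence.Get(), done);  // compute marks its work complete
    gfxQueue->Wait(fence.Get(), done);        // graphics won't start dependent
                                              // work until that value is reached
}
```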


It's not important for compute so much as graphics and compute. Most of the gains from async compute are coming from pairing compute tasks with trivial rasterization tasks (shadows and z prepasses). If Tesla, or a future compute design, had some sort of discrete coprocessor on board it might make sense. Even then for a HPC workload the programmer could attempt to schedule complementary tasks. Takes some extra work, but you're talking an environment approximating that of consoles. In theory a hardware scheduler handling async would make the programming model easier. It should be able to automatically tune the scheduling for whatever workload is running. It might not be perfect, but far easier for a developer.

Depends on how and what the hardware schedulers are doing and what the architecture needs. If the code being written can easily get max utilization out of the hardware, the need for hardware scheduling drops. And many people have stated it's easier to get better utilization out of nV hardware than AMD's GCN. It could be they are more used to nV hardware, since nV has been using a scalar architecture for much longer, which sets the foundation, but I don't think that is what they meant, because many of them also say AMD's GCN is much harder to get similar results out of because of bottlenecks, and avoiding those bottlenecks takes extra work. Right off the bat we are well aware of three bottlenecks without any insight into programming; with more insight into programming, I can think of a couple more areas where GCN lags behind.


Running a demo with Vulkan during Pascal's product launch without testing first kind of seems like playing with fire does it not?

I know we've discussed this privately. But concurrency isn't the only benefit of async. It's that whole asynchronous vs concurrent argument that still differentiates GCN from Pascal. While it's currently problematic for Nvidia, the results on GCN I think speak for themselves. Gamers seem generally happy with the results, and that's arguably the most important point. Providing exceptionally low input lag for a shooter is significant in my book.

nV had access to the game; that doesn't mean the devs of Doom had access to Pascal.

Concurrency isn't the only benefit of async, but it's probably the one that garners the most performance if done right. It isn't problematic on Pascal at all, and that is what that video was showing. We haven't seen any Vulkan application running concurrent execution on Pascal yet; Doom is not a good example, since the devs themselves stated they hadn't worked on it with Pascal at the point the Vulkan version released. But we have Time Spy and AOTS (the poster child for async shaders) both working fine on Pascal, with performance increases when turning on async shaders. Then we have the others that just seem to do weird things in different versions and different areas of the map; Hitman is one of those. Then ya have ROTR which is just blah, and we can throw Quantum Break in there.

If we set aside those DX12 games whose DX11 versions already show an advantage for AMD hardware, which the devs have obviously optimized specifically for AMD, well, the list of AMD advantages in DX12 gets pretty short.

I'm not sure about input lag, can you clarify?

Outside the spec because it wasn't worth it or politics dictated it not be included? Nvidia shouldn't have any trouble executing async shaders. They just need to partition the GPU a bit. Sure it hurts performance a bit and the timing won't be as good, but it should run just fine if necessary. Of course there is always the option to not use the feature.

Most likely politics, I agree. But when you have a company with 20% market share vs the rest, well, MS can't forget the rest. And nV doesn't have problems with it in the sense that Maxwell 2 can do it, just with a performance cost. Pascal solves that problem.
 
I would say "async compute" never existed in the first place and was just a marketing response to "async shaders" which is just the name of the hardware implementation on GCN. We get it, three independent engines, asynchronous wrt each other, no need to call everything asynchronous X :p

Generally speaking, would you guys consider something asynchronous if it has a dependency on another task? I was just discussing this with some of my CS friends and we had disagreements. I was arguing that, say you have a particle simulation and the render of the particles on two different streams: so long as at any given time the particle simulation is comfortably ahead of the render, we can consider the overlapping kernels (through an asynchronous call to the FFU, for example) asynchronous. But in general terms I would not consider the particle simulation and the render asynchronous, because if you did not exert control and let the particle sim run faster, you could get a mess. Now I realize you could also have non-blocking dependencies using atomics, but it's an interesting topic nonetheless.
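The "exert control" part is the crux; without it the sim just runs away from the render. A toy CPU-side C++ sketch of what I mean (purely illustrative, no GPU involved): the two run on their own threads, but the sim is capped at a fixed lead over the render.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The sim and the render each run on their own thread, but the sim is only
// allowed to get MAX_LEAD frames ahead; otherwise the "asynchronous" pair
// degenerates into a mess of stale or overwritten data.
constexpr int MAX_LEAD = 2;

int simFrame = 0;
int renderFrame = 0;
std::mutex m;
std::condition_variable cv;

void sim_thread() {
    for (int f = 0; f < 100; ++f) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return simFrame - renderFrame < MAX_LEAD; });
        ++simFrame;          // "simulate" frame f
        cv.notify_all();
    }
}

void render_thread() {
    for (int f = 0; f < 100; ++f) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return simFrame > renderFrame; });  // data must be ready
        ++renderFrame;       // "render" frame f
        cv.notify_all();
    }
}

int main() {
    std::thread s(sim_thread), r(render_thread);
    s.join();
    r.join();
}
```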
 
I'm not sure about input lag, can you clarify?
Tying the compute pass to blanks with reprojection. It should be possible

AOTS (poster child for async shaders)
Not sure AOTS actually used async shaders. Async compute yes, but shaders would imply timing sensitivity or a different engine accelerating something. I don't believe they had AI or anything else running on the GPU.

Generally speaking would you guys consider something asynchronous if it has a dependency on another task,
Most things will have dependencies, it's more about allowing some sort of scheduler the freedom to do what is most efficient. File IO for example might want to rearrange operations by whichever head is active. The key here isn't that there is synchronization, but that you could have fully asynchronous tasks.
 
Tying the compute pass to blanks with reprojection. It should be possible


Not sure AOTS actually used async shaders. Async compute yes, but shaders would imply timing sensitivity or a different engine accelerating something. I don't believe they had AI or anything else running on the GPU.


Most things will have dependencies, it's more about allowing some sort of scheduler the freedom to do what is most efficient. File IO for example might want to rearrange operations by whichever head is active. The key here isn't that there is synchronization, but that you could have fully asynchronous tasks.

How are async shaders functionally different from async compute? Async compute is just anything on the compute queue(s), which are asynchronous, no? Async shaders, like dynamic load balancing, are the hardware features that allow developers to extract performance using the independent asynchronous streams.
 
The shaders would be more akin to asynchronous async compute. I see async compute as "here are 4 independent tasks". Async shaders as "here are 4 tasks, and task 5 is late to the party and is a VIP". Sure, you can put that VIP at the back of the line, but it'll probably piss him off. It's more about unpredictability.
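The closest thing the API itself exposes to that VIP treatment is queue priority. A D3D12 sketch, assuming a device exists; how hard the hardware actually lets the high-priority work jump the line (partition vs preempt) is up to the vendor's scheduler, not the API.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// A compute queue flagged high priority: a loose analogue of the "VIP" task
// above. How aggressively the hardware honors it is vendor-specific.
// Assumes 'device' was created elsewhere.
ComPtr<ID3D12CommandQueue> CreateVipComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}
```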
 
I would say "async compute" never existed in the first place and was just a marketing response to "async shaders" which is just the name of the hardware implementation on GCN. We get it, three independent engines, asynchronous wrt each other, no need to call everything asynchronous X :p

Generally speaking, would you guys consider something asynchronous if it has a dependency on another task? I was just discussing this with some of my CS friends and we had disagreements. I was arguing that, say you have a particle simulation and the render of the particles on two different streams: so long as at any given time the particle simulation is comfortably ahead of the render, we can consider the overlapping kernels (through an asynchronous call to the FFU, for example) asynchronous. But in general terms I would not consider the particle simulation and the render asynchronous, because if you did not exert control and let the particle sim run faster, you could get a mess. Now I realize you could also have non-blocking dependencies using atomics, but it's an interesting topic nonetheless.

I'm into massively threaded programming, but less so in graphics specifically. I think my response is correct in CS terms, but feel free to correct me on any graphics specifics.

In a particle simulation, generally each particle could be considered an asynchronous task. They do not actually depend upon other particles, just their own position and the ambient conditions. Therefore in theory you could compute particles all completely simultaneously - given enough execution units.

Further, since the computation of the N+1 positions doesn't actually depend upon the visual rendering of frame N you can overlap the computation of the next frame with the one being rendered. Again, if there are enough computation units free to do so. Otherwise it still decomposes to a linear progression.

I'm specifically going to avoid vendor names here, but this can explain why things differ on APIs for differing vendors / architectures. If a GPU was not being utilized fully by either computation or rendering for a single frame, overlapping the two would provide tangible benefits. If however you have an architecture and driver which was filling the execution units effectively with either task, then doing them "asynchronously" (i.e. concurrently) doesn't help. You're still doing them sequentially as each task fully occupies the computation units.

So getting back to your hypothetical, I'd describe a particle sim as very asynchronous / very parallel in most respects. Unless I'm really missing something, which happens. :D
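A toy C++ version of that overlap (invented types, CPU threads standing in for the GPU's queues): the frame N+1 simulation is kicked off, frame N is drawn meanwhile, and the only sync point is the get() before the buffers swap.

```cpp
#include <functional>
#include <future>
#include <vector>

struct Particles { std::vector<float> pos; };

Particles step(const Particles& in) { return in; }  // stand-in for the sim
void draw(const Particles&) {}                      // stand-in for the render

int main() {
    Particles current{std::vector<float>(1024, 0.f)};

    for (int frame = 0; frame < 100; ++frame) {
        // Kick off the N+1 simulation step; it doesn't need frame N's image.
        auto next = std::async(std::launch::async, step, std::cref(current));

        draw(current);          // render frame N while the sim runs

        current = next.get();   // sync point before we start frame N+1
    }
}
```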
 
I'm into massively threaded programming, but less so in graphics specifically. I think my response is correct in CS terms, but feel free to correct me on any graphics specifics.

In a particle simulation, generally each particle could be considered an asynchronous task. They do not actually depend upon other particles, just their own position and the ambient conditions. Therefore in theory you could compute particles all completely simultaneously - given enough execution units.

Further, since the computation of the N+1 positions doesn't actually depend upon the visual rendering of frame N you can overlap the computation of the next frame with the one being rendered. Again, if there are enough computation units free to do so. Otherwise it still decomposes to a linear progression.

I'm specifically going to avoid vendor names here, but this can explain why things differ on APIs for differing vendors / architectures. If a GPU was not being utilized fully by either computation or rendering for a single frame, overlapping the two would provide tangible benefits. If however you have an architecture and driver which was filling the execution units effectively with either task, then doing them "asynchronously" (i.e. concurrently) doesn't help. You're still doing them sequentially as each task fully occupies the computation units.

So getting back to your hypothetical, I'd describe a particle sim as very asynchronous / very parallel in most respects. Unless I'm really missing something, which happens. :D

In a particle sim the ambient conditions of all particles include the influence of other particles, though, and I wasn't talking about dependency within one task; I was talking about the asynchrony of the compute shader (particle sim) and the render of the particles themselves. Say we're running frames using a fixed time-step in the engine, so each frame is 1 ms of simulation. Then if the simulation is ahead of the rendering you can say they are asynchronous wrt each other, because, for example, while the simulation for frame N + k is being executed on the work units (CU/SM), the render task can execute asynchronously (without communicating with the particle sim), since you're guaranteed that the simulation results for the frame N being rendered are ready.

Again I think this links back to defining the terms, because definitions are all over the place. To avoid confusion I just think of asynchrony as a condition used as a means to express/implement a concurrent execution model. If I send my butler to fetch wine and cheese from the pantry and sit there doing nothing but waiting for him, he and I are operating synchronously. If I write a long-ass post about async compute while he fetches wine and cheese, then we are operating asynchronously, because the task I need done is handled independently by him, and there is an implicit promise that he'll return and not run off with my fucking wine and cheese. Now, my drinking and eating the wine and cheese are not asynchronous wrt his fetching, because I have to wait for him to come back. I'd say the key thing is independent execution and lack of communication between 'tasks', however you choose to define them; now you could argue that using fences for signalling lies outside the scope of the asynchronous operation (it just determines when it is submitted).
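The butler analogy maps almost word for word onto a future, so here's a throwaway C++ sketch of it (names obviously made up):

```cpp
#include <future>
#include <iostream>
#include <string>

std::string fetch_wine_and_cheese() { return "wine & cheese"; }
void write_long_post_about_async() { /* ... */ }

int main() {
    // Send the butler off; I don't stand around waiting for him.
    auto butler = std::async(std::launch::async, fetch_wine_and_cheese);

    write_long_post_about_async();   // my work and his overlap: asynchronous

    // The eating, however, is NOT asynchronous with respect to the fetching:
    // get() is the implicit promise that he comes back before I start.
    std::cout << butler.get() << "\n";
}
```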

If you are doing AI compute work while rendering then that is a very good candidate for true asynchronous execution because there's no relation between the rendering and the AI sim

I'm not a graphics programmer, electronics student here - razor and antichrist are the graphics dudes I know of, and there's someone else who posted in this thread, I think michael something
 
Agreed with all your points. The real key is being able to execute tasks which have no dependencies on each other simultaneously on execution units which would otherwise be idle. There are many forms that can take, but the goal is to provide ways of keeping your shaders occupied with work and avoid gaps / wasted time.

I'm kinda a graphics dude, having written an OpenGL stack in the 90s. Things have progressed a lot since then!
 
In a particle sim the ambient conditions of all particles include the influence of other particles, though, and I wasn't talking about dependency within one task; I was talking about the asynchrony of the compute shader (particle sim) and the render of the particles themselves. Say we're running frames using a fixed time-step in the engine, so each frame is 1 ms of simulation. Then if the simulation is ahead of the rendering you can say they are asynchronous wrt each other, because, for example, while the simulation for frame N + k is being executed on the work units (CU/SM), the render task can execute asynchronously (without communicating with the particle sim), since you're guaranteed that the simulation results for the frame N being rendered are ready.

Again I think this links back to defining the terms, because definitions are all over the place. To avoid confusion I just think of asynchrony as a condition used as a means to express/implement a concurrent execution model. If I send my butler to fetch wine and cheese from the pantry and sit there doing nothing but waiting for him, he and I are operating synchronously. If I write a long-ass post about async compute while he fetches wine and cheese, then we are operating asynchronously, because the task I need done is handled independently by him, and there is an implicit promise that he'll return and not run off with my fucking wine and cheese. Now, my drinking and eating the wine and cheese are not asynchronous wrt his fetching, because I have to wait for him to come back. I'd say the key thing is independent execution and lack of communication between 'tasks', however you choose to define them; now you could argue that using fences for signalling lies outside the scope of the asynchronous operation (it just determines when it is submitted).

If you are doing AI compute work while rendering then that is a very good candidate for true asynchronous execution because there's no relation between the rendering and the AI sim

I'm not a graphics programmer, electronics student here - razor and antichrist are the graphics dudes I know of, and there's someone else who posted in this thread, I think michael something


LOL antichrist, anarchist I think you mean ;), but yeah, both of you guys are saying the right things. Async shaders aren't asynchronous in the CPU-programming sense. Graphics and compute have to work in parallel, but due to constraints, specifically the fixed function rendering pipeline, conditions have to be met. Even if those fixed function parts are replaced with programmable units, you still need to composite the final image, so there will always be dependencies. So, end result: asynchronicity on a GPU dealing with graphics isn't the conventional asynchronicity we think of in programming, and thus a better term is concurrency.

PS: not saying CPUs or general application code don't have constraints, but rendering out a frame/picture is quite different from what an application typically needs.
 
"Concurrency can be seen as a more general notion of parallelism, with parallelism being a subset of concurrency upon which an additional condition is placed; independence."

It's like when you're in class and the instructor asks "Are there any questions?" And no one raises their hands because everyone's brain froze up from the complexity.

Great examples though.
 
I like that you made it TOO technical for the usual trolls... but not too technical to drop all non-programmers on the floor... well played ;)
 
I like that you made it TOO technical for the usual trolls... but not too technical to drop all non-programmers on the floor... well played ;)

Unfortunately the same can't be said on reddit, and I have the usual brigade of 'experts' saying Pascal uses preemption instead of async compute, which is an inferior solution.

I become my own worst enemy at this point because my patience thins quickly :p

Wtf do you mean preemption is worse than async compute, that's like saying fuel injection is worse than cars.

Electric motors are worse than watercooling loops.

Traffic lights are worse than intersections.



*deep breathing exercises*

Latest bull honkey is that GP100 is somehow different from the other Pascal parts as far as async compute is concerned...
I don't even...
 
back to topic

nice video I forgot about.



Goes into the differences between Maxwell 2 and Pascal.

And yes, this is talking about Vulkan. So Doom, as I stated before, is not doing what needs to be done for Pascal's hardware implementation of concurrency (not multi engine or async compute), and it all has to do with timing: Doom was close to release when Pascal was launched, so the devs had no access to the hardware and probably didn't even know about the launch until we did, as it was a surprise to all that it came out so soon.

For further understanding of multi engine you need to understand the command buffers and how they feed the pipelines.


I've never heard of this 'khronos group' youtube guy, must be some kind of AdoredTV wannabe. Imitation is the greatest form of flattery.
 
The part you are missing, and what most of the debate was always about, is access to the GPU. I have yet to find anything on Nvidia's GigaThread engine and how it accepts commands/work. Better put: AMD claims its ACEs are visible to the API, hence the async nature, the ability to accept graphics queues and compute queues simultaneously. Nvidia did a good thing with the Pascal structure, as having 4 units independent of each other helps alleviate the issues with async on their end. But it looks as if it still has to accept commands/work one at a time; unlike Maxwell, though, it can initiate a compute queue while a graphics queue is active as long as there is an available unit.

This is the contested part, as it was before; with Maxwell the whole process was limited by its serial nature, further escalating the issue. With Pascal they fixed a portion of it, perhaps the biggest part as far as time/latency goes, since accepting a command/work item one at a time is likely less of a time constraint than having some units sit idle because the whole unit needs to context switch.

Also, this might be where the preemption issue is originating as well. Because of how it accepts commands/work, it must reassign orders based on time constraints, and this happens before anything is dispatched in the GPU. It was something alluded to in Nvidia's Pascal white paper.
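On the "visible to the API" point: from the software side you can at least see how much a vendor exposes by enumerating queue families in Vulkan and looking for a compute-only family (which is how GCN's ACEs show up). A sketch, assuming a valid VkPhysicalDevice is already in hand; what the hardware does behind those families is exactly the part that's contested.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Look for a queue family that supports compute but not graphics, i.e. a
// dedicated compute queue the API exposes directly. Assumes 'phys' is a
// valid VkPhysicalDevice obtained elsewhere. Returns UINT32_MAX if none.
uint32_t FindDedicatedComputeFamily(VkPhysicalDevice phys)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);

    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const bool compute  = families[i].queueFlags & VK_QUEUE_COMPUTE_BIT;
        const bool graphics = families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT;
        if (compute && !graphics)
            return i;   // compute-only family: submissions here are independent
                        // of the graphics queue as far as the API is concerned
    }
    return UINT32_MAX;
}
```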
 