Demystifying Asynchronous Compute - V1.0

Ieldra · Sep 2, 2016

I've been waiting for you.

JustReason said:
Nvidia did a good thing with the Pascal structure as having 4 units independent of each other helps alleviate the issues with async on the end.

Uhu... Which four units would those be?

But it looks as if it still has to accept commands/work one at a time, but unlike Maxwell it can initiate a compute queue while a graphics queue is active as long as there is an available unit.

Does it look that way? Based on what? Doesn't look that way to me.

Maxwell *can* dispatch from compute queues while graphics tasks are in execution. That's the whole point of independent queues.

This is the contested part as it was before, just with Maxwell the whole process was limited by its serial nature further escalating the issue. With Pascal they fixed a portion of it, perhaps the biggest part as far as time/latency being the accepting a command/work one at a time is likely less of a time constraint than having some idle units because the whole unit needs to context switch.

Contested by whom? Everything is contested by someone.

I've heard all about the 'serial nature' of Maxwell from you in the time I've been active on this forum, and I've never seen any evidence for it.

Did you read my post? The part where I said 'serial is the opposite of parallel, sequential is the opposite of concurrent' clearly didn't hit home.

The part about Pascal fixing latency and something about time constraints and one command at a time is basically gobbledygook.

Also this might be where the pre-emption issue is originating as well. Because of how it accepts commands/work it must reassign orders based on time constraints and this is before being dispatched in the GPU. It was something alluded to in Nvidia's Pascal white paper.

Yes, as opposed to GCN, which doesn't reassign based on time constraints because it operates outside our spacetime dimension.

razor1 · Sep 2, 2016

Ieldra said:
Nothing to fear! Captain obvious is here

Did I really need an /s there ?

LOL sarcasm just doesn't come across in posts!

JustReason said:
The part you are missing, and what most of the debate was always about, is access to the GPU. I have yet to find anything on Nvidia's GigaThread and how it accepts commands/work. Better put: AMD claims its ACEs are visible to the API, hence the ASYNC nature, the ability to accept graphics queues and compute queues simultaneously. Nvidia did a good thing with the Pascal structure as having 4 units independent of each other helps alleviate the issues with async on the end. But it looks as if it still has to accept commands/work one at a time, but unlike Maxwell it can initiate a compute queue while a graphics queue is active as long as there is an available unit.

This is the contested part as it was before, just with Maxwell the whole process was limited by its serial nature further escalating the issue. With Pascal they fixed a portion of it, perhaps the biggest part as far as time/latency being the accepting a command/work one at a time is likely less of a time constraint than having some idle units because the whole unit needs to context switch.

Also this might be where the pre-emption issue is originating as well. Because of how it accepts commands/work it must reassign orders based on time constraints and this is before being dispatched in the GPU. It was something alluded to in Nvidia's Pascal white paper.

Well, AMD's ACEs are programmable, and that comes into play with VR work and preemption, so they do have more flexibility in their hardware for this. But without the raw performance, well, we can see what happens with VR. The flexibility doesn't matter without the performance.

In the end, for concurrency, how the hardware does it shouldn't matter; programmers shouldn't have to worry about getting it to work if they do their homework on how to feed the GPU.

This is why all those people who started talking about ACEs, including AMD's marketing, were talking BS: they didn't understand where the problem really was with Maxwell. No one outside of nV knew. After that, AMD and their fans started talking about scheduling, which, NO, has nothing to do with this either.

I will reiterate: if a person starts talking about the number of ACEs or about scheduling, they aren't talking about concurrency! That hardware is outside of concurrency, but, in addition, concurrency needs that hardware for other things.

It's pretty bad sitting there watching people get fooled, out of ignorance, by the marketing docs that any company releases, while people who know what they are talking about get drowned out by a mob of fanboys.

Ieldra · Sep 2, 2016

razor1 said:
LOL sarcasm just doesn't come across in posts!

Well, AMD's ACEs are programmable, and that comes into play with VR work and preemption, so they do have more flexibility in their hardware for this. But without the raw performance, well, we can see what happens with VR. The flexibility doesn't matter without the performance.

In the end, for concurrency, how the hardware does it shouldn't matter; programmers shouldn't have to worry about getting it to work if they do their homework on how to feed the GPU.

This is why all those people who started talking about ACEs, including AMD's marketing, were talking BS: they didn't understand where the problem really was with Maxwell. No one outside of nV knew. After that, they started talking about scheduling, which, NO, has nothing to do with this either.

It's pretty bad sitting there watching people get fooled, out of ignorance, by the marketing docs that any company releases, while people who know what they are talking about get drowned out by a mob of fanboys.

We went through so many stages of Maxwell grief.

First it was context switching causing latency.

Then it was the GMU not being active in D3D.

Then it was RF size.

razor1 · Sep 2, 2016

Yep, it was like "find a weakness and point it out", lol. Pretty pathetic when AMD has its own shortcomings that nV could have taken advantage of in their dev program and didn't (outside of tessellation). They should feel lucky that nV didn't deliberately push more onto the CPU, or use shaders that stress cache and register space. Just one such shader would bring all AMD cards to their knees, whether the game is DX12, DX11 or Vulkan.

We have seen both AMD and nV use shaders that have affinity to their own hardware in the past with their dev programs, so this isn't anything new.

Back to concurrency: it's not nice to make fools of people with marketing, because it just makes everyone dumb.

It's all about sales at the end of the day, and AMD did its marketing well, much better than usual, with the help of AOTS, Oxide, Stardock, and AMD fanboys doing their dirty work.

JustReason · Sep 2, 2016

Neither of you answered the point of how it accepts commands/work. Seriously nothing of what I spoke of is based on marketing but on articles and documentation of architectures. I mean if you do not know then just say you don't, I don't hence the question.

And watching either of you claim smoke and mirrors when a great deal of the discussion in many forums is based on facts. Maxwell's inability to handle graphics and compute queues simultaneously is a fact, not some assumption. Now that we have Pascal, that fact has been verified and acknowledged with the GPC (not sure on this term, but it is the subset units within the Pascal architecture). It is able to run both simultaneously, which was the huge issue with Maxwell. But as I stated, they have yet to speak to how it accepts commands, or whether they are similar to ACEs and are visible to the API.

Now if either of you can be bothered and answer this question it would be great. I haven't seen anything on it as of yet. My understanding is greater with AMD hardware, not so much with Nvidia, which is where both of you seem to venture.

Ieldra · Sep 2, 2016

JustReason said:
Neither of you answered the point of how it accepts commands/work. Seriously nothing of what I spoke of is based on marketing but on articles and documentation of architectures. I mean if you do not know then just say you don't, I don't hence the question.

I've made a habit of answering questions, not points, and your post was devoid of either.

And watching either of you claim smoke and mirrors when a great deal of the discussion in many forums is based on facts. Maxwell's inability to handle graphics and compute queues simultaneously is a fact, not some assumption. Now that we have Pascal, that fact has been verified and acknowledged with the GPC (not sure on this term, but it is the subset units within the Pascal architecture). It is able to run both simultaneously, which was the huge issue with Maxwell. But as I stated, they have yet to speak to how it accepts commands, or whether they are similar to ACEs and are visible to the API.

It's neither a fact nor an assumption; the latter implies some kind of reasoning, and I haven't seen any reasoning from you.

Maxwell and Pascal both have independent dispatchers for graphics and compute. Even Kepler does, actually.

Kepler, Maxwell and Pascal have 32 execution slots for pending grids in the GMU.

On Kepler they were all reserved for the graphics processor except one. On Maxwell the reservation could be lifted dynamically.
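To make the slot reservation idea above concrete, here is a toy Python sketch. It only models the counting described in this post (32 slots, all but one reserved for graphics on Kepler, reservation liftable on Maxwell). The class `SlotPool`, its methods and the numbers passed to them are hypothetical names made up for illustration; nothing here reflects how the real hardware or driver is implemented.

```python
from dataclasses import dataclass

TOTAL_SLOTS = 32  # slot count as stated in the post above


@dataclass
class SlotPool:
    """Toy pool of execution slots shared by graphics and compute."""
    reserved_for_graphics: int
    graphics_in_use: int = 0
    compute_in_use: int = 0

    def compute_capacity(self) -> int:
        # Compute may only occupy slots that are not reserved for graphics.
        return TOTAL_SLOTS - self.reserved_for_graphics

    def try_dispatch_compute(self) -> bool:
        # Accept one compute grid if an unreserved slot is free.
        if self.compute_in_use < self.compute_capacity():
            self.compute_in_use += 1
            return True
        return False

    def lift_reservation(self, slots: int) -> None:
        # Release reserved slots that graphics is not currently using
        # (assumed behaviour, standing in for "lifted dynamically").
        free_reserved = max(self.reserved_for_graphics - self.graphics_in_use, 0)
        self.reserved_for_graphics -= min(slots, free_reserved)


def dispatched(pool: SlotPool, requests: int) -> int:
    """Count how many of `requests` compute grids the pool accepts."""
    return sum(pool.try_dispatch_compute() for _ in range(requests))


# Kepler-like: 31 of 32 slots reserved, so only one compute grid gets in.
kepler_like = SlotPool(reserved_for_graphics=31)
kepler_accepted = dispatched(kepler_like, 8)

# Maxwell-like: same starting point, but 16 idle reserved slots are released.
maxwell_like = SlotPool(reserved_for_graphics=31, graphics_in_use=10)
maxwell_like.lift_reservation(16)
maxwell_accepted = dispatched(maxwell_like, 8)
```

In this model the Kepler-like pool accepts one of the eight compute requests, while the Maxwell-like pool accepts all eight once part of the reservation is lifted.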

Now if either of you can be bothered and answer this question it would be great. I haven't seen anything on it as of yet. My understanding is greater with AMD hardware, not so much with Nvidia, which is where both of you seem to venture.

Again, what question? You made a bunch of claims out of nowhere and said no nvidia documentation explains this. Wrong on all counts.

How is 'my understanding is greater with AMD hardware' relevant? These relative terms could mean anything; anything is better than nothing.

I'd argue we understand GCN better, because until recently we had no confirmation of exactly why Maxwell never had the queues exposed in drivers. We speculated at some point that this was the reason, but we couldn't be sure. It makes a lot of sense: the SMs need a pipeline flush to switch mode, so you can't mix, and you need context swaps every time. There's no dedicated swap buffer, it goes to VRAM, so obviously Nvidia isn't relying on context swaps, because that would be dumb. Like I said in the OP, the effectiveness of such an approach is contingent on the task being swapped in hiding the latency incurred. So it only pays off for big tasks, which would never fit in any holes on Maxwell anyway. SM partitioning made sense and, guess what, Nvidia confirmed it.

ManofGod · Sep 2, 2016

Ieldra said:
I've made a habit of answering questions, not points, and your post was devoid of either.

It's neither a fact nor an assumption; the latter implies some kind of reasoning, I haven't seen any reasoning from you.

Wow man, way to defend Nvidia! So, are you actually going to answer what he is saying or just put him down like you usually do? You are one of the primary reasons I switched back to AMD hardware, thank you.

(Cannot stand closet fanboys.) Async compute is simply better on AMD hardware and, so far, Nvidia has not been able to answer in kind at the hardware level. Yes, I am an AMD fan and do not hide it, but that does not make me blind to the rest of the hardware out there.

Not being a fanboy is liberating and lets me enjoy what I want while also enjoying what else is out there.

lolfail9001 · Sep 2, 2016

ManofGod said:
Wow man, way to defend Nvidia! So, are you actually going to answer what he is saying or just put him down like you usually do? You are one of the primary reasons I switched back to AMD hardware, thank you. (Cannot stand closet fanboys.) Async compute is simply better on AMD hardware and, so far, Nvidia has not been able to answer in kind at the hardware level. Yes, I am an AMD fan and do not hide it, but that does not make me blind to the rest of the hardware out there.

Not being a fanboy is liberating and lets me enjoy what I want while also enjoying what else is out there.

The last sentence is quite ironic coming from you, but whatever. Anyways, he did answer him in the rest of the post you've quoted, so what is your problem exactly?

Ieldra · Sep 2, 2016

Where is the reasoning behind it? We've been having this back and forth for months; is it surprising I'm exasperated? He posted in *this thread* telling me my posts are smoke and mirrors, come on. It's always him, denying evidence, like when AMD clarified the scaling with mGPU in the AotS demo and he still insisted it wasn't true. That's just one of many inane arguments we've had, not to mention the percentages.

Async compute isn't better on AMD hardware; the implementation is more robust, more expensive, more flexible and operates at a finer level of granularity. The capacity to interweave threadgroups from different streams with little to no overhead enables GCN to saturate the shader array by filling smaller holes than Nvidia possibly could.

On the other hand, we have Nvidia working in almost the opposite direction and optimizing with a view to reducing holes in the first place. The main advantage of enabling dynamic partitioning isn't really throughput, it's more flexibility, which Maxwell had, but at the cost of latency if the partitioning wasn't balanced with respect to the load at the draw call.
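A small Python sketch of the "filling smaller holes" point, under a deliberately crude assumption: idle capacity comes in gaps of various sizes, and extra work can only be slotted in as whole chunks of a fixed granularity. The gap sizes and granularities are invented for illustration and are not measurements of any GPU.

```python
def filled_capacity(holes, granularity):
    """Idle capacity reclaimed when work can only be placed in whole
    chunks of `granularity` units (toy model, arbitrary units)."""
    return sum((hole // granularity) * granularity for hole in holes)


# Hypothetical idle gaps left in the shader array over some interval.
holes = [1, 3, 2, 8, 5]

fine_grained = filled_capacity(holes, granularity=1)    # fills every gap
coarse_grained = filled_capacity(holes, granularity=4)  # skips gaps < 4
```

With these made-up numbers, the fine-grained case reclaims all 19 units of idle capacity and the coarse-grained case reclaims 12, which is the sense in which finer granularity lets you fill smaller holes. It says nothing about how many holes each architecture leaves in the first place, which is the other half of the argument.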

I realize this is a longer answer than 'yes x is better than y', but a little nuance goes a long way if you're actually interested in understanding how things work at something more than a superficial level

razor1 · Sep 2, 2016

JustReason said:
Neither of you answered the point of how it accepts commands/work. Seriously nothing of what I spoke of is based on marketing but on articles and documentation of architectures. I mean if you do not know then just say you don't, I don't hence the question.

And watching either of you claim smoke and mirrors when a great deal of the discussion in many forums is based on facts. Maxwell's inability to handle graphics and compute queues simultaneously is a fact, not some assumption. Now that we have Pascal, that fact has been verified and acknowledged with the GPC (not sure on this term, but it is the subset units within the Pascal architecture). It is able to run both simultaneously, which was the huge issue with Maxwell. But as I stated, they have yet to speak to how it accepts commands, or whether they are similar to ACEs and are visible to the API.

Now if either of you can be bothered and answer this question it would be great. I haven't seen anything on it as of yet. My understanding is greater with AMD hardware, not so much with Nvidia, which is where both of you seem to venture.

Yes we did, lol. The hardware from Kepler onward has to have different queues, different dispatchers, and the ability to do more than one type of processing on the same CU; without any of these they CAN'T be DX12 compliant!

Is that spelled out for you? You are not reading what Ieldra and I have been posting, with sources and all, and are still falling back on AMD marketing.

Maxwell didn't have any problems with any of this. It was freakin' good at it, and better than GCN (more efficient).

The problem comes in with CONCURRENT processing of graphics and compute after the initial partitioning of the SMs is done on Maxwell. (This is what we picked up at B3D with the small async program, but we couldn't make sense of it because the driver was breaking down and using a context switch where it shouldn't have been.) It has to wait to repartition the SMs if the application doesn't keep the same or a similar ratio of compute to graphics commands, which the programmer has full control over by using fences and barriers. So on Maxwell, ya gotta do a lot more work for async shaders (concurrency) to run well.
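The cost of a partition that doesn't match the workload ratio can be sketched with a few lines of Python. This is a toy model, assuming work divides perfectly across SMs and that a static split cannot change until both sides finish; the function names, the 16-SM count and the work figures are all hypothetical.

```python
def frame_time_static(total_sms, graphics_sms, graphics_work, compute_work):
    """Time to finish both workloads when the SM split is fixed up front.

    Work is in arbitrary "SM-time" units and is assumed to divide evenly
    across the SMs of its own partition (a simplification).
    """
    compute_sms = total_sms - graphics_sms
    if graphics_sms <= 0 or compute_sms <= 0:
        raise ValueError("both partitions need at least one SM")
    return max(graphics_work / graphics_sms, compute_work / compute_sms)


def frame_time_ideal(total_sms, graphics_work, compute_work):
    """Lower bound: every SM stays busy until all the work is done."""
    return (graphics_work + compute_work) / total_sms


# Hypothetical numbers: 16 SMs, 120 units of graphics, 40 units of compute.
ideal = frame_time_ideal(16, 120, 40)
balanced = frame_time_static(16, 12, 120, 40)    # split matches the 3:1 ratio
mismatched = frame_time_static(16, 8, 120, 40)   # even split, wrong ratio
```

Here the balanced split hits the ideal of 10 time units, while the mismatched split takes 15 because the compute partition sits idle waiting for graphics. That gap is what keeping the compute-to-graphics ratio steady, or repartitioning on the fly, is meant to close.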

Now if you want to talk about async compute, and only async compute (not concurrency), nV's architectures are light years ahead of GCN. An added bonus of their architectures is the proprietary APIs.

razor1 · Sep 2, 2016

ManofGod said:
Wow man, way to defend Nvidia! So, are you actually going to answer what he is saying or just put him down like you usually do? You are one of the primary reasons I switched back to AMD hardware, thank you. (Cannot stand closet fanboys.) Async compute is simply better on AMD hardware and, so far, Nvidia has not been able to answer in kind at the hardware level. Yes, I am an AMD fan and do not hide it, but that does not make me blind to the rest of the hardware out there.

Not being a fanboy is liberating and lets me enjoy what I want while also enjoying what else is out there.

Err, I don't think you guys are able to read, or possibly even comprehend, what was typed in the past 2 pages, because anyone with a basic understanding of programming, even at a very high level, can see that what has been posted contains all the answers JustReason and you need. Your questions have been answered many times over in the past 2 pages.

ManofGod · Sep 2, 2016

razor1 said:
Err, I don't think you guys are able to read, or possibly even comprehend, what was typed in the past 2 pages, because anyone with a basic understanding of programming, even at a very high level, can see that what has been posted contains all the answers JustReason and you need. Your questions have been answered many times over in the past 2 pages.

Sorry, but when I see the response he gave to JustReason, I cannot help but think of him the way I posted. And to think, I was about to give thanks after reading some of what he posted; I think I'll withhold that for the time being. Nice work he did, but looking down upon someone because they do not think the way you do shows a distinct lack of respect.

ManofGod · Sep 2, 2016

lolfail9001 said:
The last sentence is quite ironic coming from you, but whatever. Anyways, he did answer him in the rest of the post you've quoted, so what is your problem exactly?

A fanboy and a fan of something are nowhere near the same thing. I do not hide my preference for a particular product; fanboys do, but whatever.

razor1 · Sep 2, 2016

ManofGod said:
Sorry, but when I see the response he gave to JustReason, I cannot help but think of him the way I posted. And to think, I was about to give thanks after reading some of what he posted; I think I'll withhold that for the time being. Nice work he did, but looking down upon someone because they do not think the way you do shows a distinct lack of respect.

What else can be said to a person who isn't reading what others have spent a good deal of time understanding and summarizing in such a way that even a layman can follow it?

His synopsis is one of the most concise and easiest explanations of the problems and solutions behind the async confusion.

I have stated this many times over: this topic is not easy to talk about when one side doesn't understand some things, from the hardware level to the software level. Ieldra just made it so that the reader doesn't need to know the nuances of either.

Stev3FrencH · Sep 2, 2016

Ieldra and razor1, thanks for all of the explaining in this thread. These two jokers are never going to get it and are not worth your time or breath. This thread should be locked, but maybe stickied, because I feel the information is all good and there is really nothing else to add. You both did a good job of explaining what is happening at the hardware level.

ManofGod · Sep 2, 2016

Ieldra said:
Where is the reasoning behind it, we've been having this back and forth for months, is it surprising I'm exasperated? He posted in *this thread* telling me my posts are smoke and mirrors, come on. It's always him, denying evidence, like when AMD clarified the scaling with mgpu in the AotS demo and he still insisted it wasn't true, just one of many inane arguments we've had, not to mention the percentages.

Async compute isn't better on AMD hardware; the implementation is more robust, more expensive, more flexible and operates at a finer level of granularity. The capacity to interweave threadgroups from different streams with little to no overhead enables GCN to saturate the shader array by filling smaller holes than Nvidia possibly could.

On the other hand, we have Nvidia working in almost the opposite direction and optimizing with a view to reducing holes in the first place. The main advantage of enabling dynamic partitioning isn't really throughput, it's more flexibility, which Maxwell had, but at the cost of latency if the partitioning wasn't balanced with respect to the load at the draw call.

I realize this is a longer answer than 'yes x is better than y', but a little nuance goes a long way if you're actually interested in understanding how things work at something more than a superficial level

So basically, what you are saying is that neither one is better than the other. However, AMD is being taken better advantage of because they have the hardware to do so in this instance. (When the time is put in to take advantage of it, that is.) So, in both instances, game makers have to optimize for the respective architecture and, if they do so, AMD has a tendency to do a lot better than they otherwise would.

Good thing is, we have choice, and what works for one may not work for the other. Case in point: AMD downscales better on my monitor and looks good, while the Nvidia hardware does not. (R9 290X, R9 380 and R9 Fury all look really good on my monitor, a Samsung 28 inch 4K, and the downscaling looks really good.) The 980 Ti, however, looked dull on the desktop and in games, and the downscaling looked like crap no matter the game.

No, I did not set up my monitor incorrectly, despite what some may try to have me believe. This is not the case on all monitors and all hardware configurations, but it is on mine. Oh, and the games would play herky-jerky, while, for whatever reason, AMD would play better and behave better on my particular system. The monitor is not FreeSync capable, but it could be that the Nvidia hardware just did not play nice with everything else I had, including the monitor.

However, I will go ahead and say thanks for posting what you originally posted anyways. I still consider you an Nvidia closet fanboy but that is fine, does not make you any less capable of doing whatever you do for a living. Programming is way over my head anyways but that is not my field of expertise anyways. It is why I love computers, all computers, operating systems, hardware and so on.

Ieldra · Sep 2, 2016

ManofGod said:
So basically, what you are saying is that neither one is better than the other. However, AMD is being taken better advantage of because they have the hardware to do so in this instance. (When the time is put in to take advantage of it, that is.) So, in both instances, game makers have to optimize for the respective architecture and, if they do so, AMD has a tendency to do a lot better than they otherwise would.

Good thing is, we have choice, and what works for one may not work for the other. Case in point: AMD downscales better on my monitor and looks good, while the Nvidia hardware does not. (R9 290X, R9 380 and R9 Fury all look really good on my monitor, a Samsung 28 inch 4K, and the downscaling looks really good.) The 980 Ti, however, looked dull on the desktop and in games, and the downscaling looked like crap no matter the game.

No, I did not set up my monitor incorrectly, despite what some may try to have me believe. This is not the case on all monitors and all hardware configurations, but it is on mine. Oh, and the games would play herky-jerky, while, for whatever reason, AMD would play better and behave better on my particular system. The monitor is not FreeSync capable, but it could be that the Nvidia hardware just did not play nice with everything else I had, including the monitor.

However, I will go ahead and say thanks for posting what you originally posted anyways. I still consider you an Nvidia closet fanboy but that is fine, does not make you any less capable of doing whatever you do for a living. Programming is way over my head anyways but that is not my field of expertise anyways. It is why I love computers, all computers, operating systems, hardware and so on.

I'm happy you're happy with your downscaling, and I'm happy you can accept this thread for what it is and look beyond how you perceive me. Closet or no closet, fanboy or not, it doesn't matter in the end when we're talking about stuff in whitepapers. I've never denied that AMD's Async Shaders setup is overall superior to NV's; what I've argued against is the notion that just because hardware X gains Y% from async and hardware Z does not, X > Z. You might think, well, nobody said that. Yeah, now... When async first launched with AotS, people were foretelling doom for Maxwell left, right and center, and conveniently ignored that even in DX11 (and DX12 with async disabled) the performance was matching AMD roughly flop for flop. I get that from your side you see me always arguing in NV's favor in these discussions, so you think (or choose to think) it's an inherent bias, but it isn't. If everyone was talking a load of BS about GCN, I'd be just as persistent and just as annoying when discussing it. I'm not calling for a boycott of async compute to take away AMD's advantage lol. I'm all for it: more competitive performance across the board.
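The "gains Y% does not imply X > Z" point is just arithmetic, shown here in Python with made-up frame rates. The baselines and the 10% figure are hypothetical and chosen only to make the comparison visible; they are not benchmark results.

```python
def with_gain(baseline_fps, gain_percent):
    """Frame rate after applying a relative gain to a baseline."""
    return baseline_fps * (1 + gain_percent / 100)


# Hypothetical baselines with async disabled.
x_after = with_gain(40.0, 10.0)  # hardware X gains 10% from async
z_after = with_gain(50.0, 0.0)   # hardware Z gains nothing

x_is_faster = x_after > z_after
```

With these numbers X ends up at roughly 44 fps against 50 fps for Z, so `x_is_faster` is `False`. The size of the gain tells you how much idle capacity async recovered on X; the final ranking still depends on where each card started.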

Anarchist4000 · Sep 2, 2016

Ieldra said:
Async compute isn't better on AMD hardware, the implementation is more robust, more expensive, more flexible and operates on a finer level of granularity. The capacity to interweave threadgroups from different streams with little to no overhead enables gcn to saturate the shader array by filling smaller holes than nvidia possibly could.

It is better IMHO just because GCN doesn't have to differentiate between graphics and compute. If the partitioning isn't perfectly accurate, which is likely, you are leaving performance behind.

razor1 said:
Now if you want to talk about async compute, and only async compute (not concurrency), nV's architectures are light years ahead of GCN. An added bonus of their architectures is the proprietary APIs.

Seems to me Nvidia are still a good deal behind and transitioning as quickly as possible. AMD can balance a number of streams in realtime with the highest possible utilization requiring minimal work by developers. It's why devs suggest coding for Nvidia because GCN just works. Nvidia's approach is along the lines of having separate vertex and pixel shaders with discrete hardware. There's a reason the entire industry moved to generic shaders because of commonality.

Ieldra said:
When async first launched with AotS it was foretelling doom for maxwell left right and center, and people conveniently ignored that even in DX11 (and DX12 with async disabled) the performance was matching AMD roughly flop for flop.

Problem with this argument is that the pricing structure wasn't equal. The 980 Ti was always advertised as being superior to Fury and marked up accordingly. Most of these recent games track fairly closely to the theoretical compute power of the various cards, and that's an area where AMD was generally superior in past generations. As for Polaris clock speeds and their effect on peak compute performance, I'm not sure we can chalk that up to an architectural problem given the recent foundry news. With an extra couple hundred MHz we may have a rather interesting race.

ManofGod · Sep 2, 2016

Ieldra said:
I'm happy you're happy with your downscaling, and I'm happy you can accept this thread for what it is and look beyond how you perceive me. Closet or no closet, fanboy or not, it doesn't matter in the end when we're talking about stuff in whitepapers. I've never denied that AMD's Async Shaders setup is overall superior to NV's; what I've argued against is the notion that just because hardware X gains Y% from async and hardware Z does not, X > Z. You might think, well, nobody said that. Yeah, now... When async first launched with AotS, people were foretelling doom for Maxwell left, right and center, and conveniently ignored that even in DX11 (and DX12 with async disabled) the performance was matching AMD roughly flop for flop. I get that from your side you see me always arguing in NV's favor in these discussions, so you think (or choose to think) it's an inherent bias, but it isn't. If everyone was talking a load of BS about GCN, I'd be just as persistent and just as annoying when discussing it. I'm not calling for a boycott of async compute to take away AMD's advantage lol. I'm all for it: more competitive performance across the board.

Hell has frozen over.

razor1 · Sep 2, 2016

Anarchist4000 said:
It is better IMHO just because GCN doesn't have to differentiate between graphics and compute. If the partitioning isn't perfectly accurate, which is likely, you are leaving performance behind.

Oh, it differentiates; there is no way around that. That's why it's multi-engine.

Seems to me Nvidia are still a good deal behind and transitioning as quickly as possible. AMD can balance a number of streams in realtime with the highest possible utilization requiring minimal work by developers. It's why devs suggest coding for Nvidia because GCN just works. Nvidia's approach is along the lines of having separate vertex and pixel shaders with discrete hardware. There's a reason the entire industry moved to generic shaders because of commonality.

Yes, for concurrency AMD had a sizable advantage with graphics + compute prior to Pascal; now it still has some advantage, but it's diminished. Remember the GDC talks about how to do concurrency across different architectures: what is good for Maxwell is not good for GCN, and what is good for GCN is not good for Maxwell. Pascal doesn't fit into this, but most of the dos and don'ts doc on nV's dev site still applies to Pascal, as well as GCN.

As far as async compute goes, though, nV is way ahead. It has fewer compute units and fewer theoretical flops, but the capability to do the same amount of work over a period of time (performance), or even more, much more when using CUDA, shows the capabilities of the architecture. If you want an even better picture of this, all you have to do is downclock any AMD GCN product until its theoretical flops match an equally tiered nV product and run an OpenCL program on both. The nV counterpart will kick GCN's butt to the curb. Of course, the end result is that GCN has more theoretical horsepower when you don't do this, so it will do better notwithstanding the APIs used, which, as of now, Pascal has remedied with higher clocks.
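For anyone wanting to try the flop-matching comparison described above, the clock to downclock to follows from the usual peak-throughput estimate of 2 FLOPs (one fused multiply-add) per shader ALU per clock. A minimal Python sketch follows; the shader counts and clocks are illustrative placeholders, not a claim about any specific card's shipping configuration.

```python
def peak_gflops(shader_alus, clock_mhz):
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per ALU per clock."""
    return 2 * shader_alus * clock_mhz / 1000


def matching_clock_mhz(shader_alus, target_gflops):
    """Clock at which a part with `shader_alus` reaches `target_gflops`."""
    return target_gflops * 1000 / (2 * shader_alus)


# Placeholder configurations for a wide part and a narrower part.
wide_alus, wide_mhz = 4096, 1050
narrow_alus, narrow_mhz = 2816, 1000

wide_peak = peak_gflops(wide_alus, wide_mhz)
narrow_peak = peak_gflops(narrow_alus, narrow_mhz)

# Downclock the wide part until both have the same theoretical peak.
downclock_mhz = matching_clock_mhz(wide_alus, narrow_peak)
```

With these placeholder figures the wide part would have to run at 687.5 MHz to match the narrow part's theoretical peak. Whatever difference remains in a real benchmark at that point would then reflect utilization, memory behaviour and the software stack rather than raw ALU count.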

This is how it goes: companies see weaknesses in their products and fix them as soon as the industry is ready to use the fix. If you do something well ahead of its time and it isn't used, there is no reason to waste resources on it, because in the end it only hurts you. AMD spent a good deal of die space on concurrency to cover up their underutilization problem, whereas nV took the other route and is now fixing the concurrency aspects one generation after devs started using it. All this time, though, for more than 4 years, DX11 and OGL could not show these aspects; AMD wasted silicon space and it went unused for all that time. By the time it started to be used, they needed new APIs, and nV had Pascal ready too. Timing is very important to maximize profitability. Timing on nV's side to counterpunch AMD is just as important for them too. This is something AMD is traditionally weak at. nV learned it the hard way by going against MS twice in their short life, with NV1 and of course NV30 (FX). They sure can't do it anymore, as it would play right into AMD's hands; at least before it wouldn't have been as bad, since nV had one console and AMD had none. Now it's all AMD's if they make a blunder like that.

Remember when I stated if nV and access to DX12 early on they knew what was coming and if there was an inherent problem with Maxwell with async it will be fixed in Pascal. This goes to show us that they saw the problem early on and had the time to include dynamic partitioning as a feature set.

AMD probably has the same type of system, or something similar, which wasn't talked about. And there really wasn't a need to, because they just had to point to something like the ACEs, which they already had and which were easy for people to latch on to. They learned from nV's marketing of the GigaThread Engine. Adding "engine" at the end makes everything sound powerful lol.

Even on nV's side it's not spelled out for us how they are doing it either.

But these are things end users shouldn't need to worry about; even devs shouldn't worry too much about them, as long as submissions are done correctly for each IHV.

Anarchist4000 · Sep 2, 2016

razor1 said:
As far as async compute goes, though, nV is way ahead. It has fewer compute units and fewer theoretical flops, but the ability to do the same amount of work over a period of time (performance), or even more, much more when using CUDA, shows the capabilities of the architecture.

Under a very tight set of conditions. ARM cores are a hell of a lot faster than x86... at certain tasks.

razor1 said:
All this time, though, DX11 and OGL could not expose these aspects for more than 4 years; AMD wasted silicon space and it went unused for all that time.

Hypothetical: what if Nvidia had supported async 4 years ago? APIs wouldn't have stagnated and sent devs running for a better solution. The only reason the silicon was "wasted" was because Nvidia screwed gamers and devs for a competitive advantage. I think there's a strong argument it's only happening now because VR required the extra performance and capabilities.

razor1 said:
But these are things end users shouldn't need to worry about; even devs shouldn't worry too much about them, as long as submissions are done correctly for each IHV.

Time involved in doing it correctly is a consideration. If Nvidia has to commit a lot of resources to getting optimal, functioning code written, as opposed to it just working, that's significant.

Ieldra · Sep 2, 2016

Fair points, Anarchist. The partitioning needs a very solid driver heuristic just to extract tiny performance advantages in the absence of on-the-fly repartitioning. I'm wondering if, for truly decoupled compute loads, you could simply wring performance out of Maxwell by overlapping a sequence of short compute kernels on a few SMs in parallel with something that would introduce inefficiency if spread over the full complement of SMs.

With Pascal, even if the number of SMs doesn't allow for perfect partitioning, you're not bound by draw call boundaries anymore, so you could just reconsider the partition state at every dispatch from the graphics slot.
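A toy model of the static vs. on-the-fly partitioning trade-off, as a rough sketch only. It assumes work is perfectly divisible across SMs, that repartitioning is free, and that a static split is held until both workloads finish. The function names, SM count and work figures are all invented for illustration.

```python
def static_partition_time(gfx_work, comp_work, total_sms, gfx_sms):
    """Finish time when the graphics/compute SM split is fixed up front."""
    if not 0 < gfx_sms < total_sms:
        raise ValueError("both partitions need at least one SM")
    comp_sms = total_sms - gfx_sms
    # Each side runs on its own SMs; the slower side sets the finish time.
    return max(gfx_work / gfx_sms, comp_work / comp_sms)


def dynamic_partition_time(gfx_work, comp_work, total_sms):
    """Idealized finish time if idle SMs are always handed to remaining work."""
    return (gfx_work + comp_work) / total_sms


def idle_fraction(gfx_work, comp_work, total_sms, elapsed):
    """Share of SM-time that did no work over `elapsed`."""
    return 1.0 - (gfx_work + comp_work) / (total_sms * elapsed)


TOTAL_SMS = 20                      # hypothetical GPU
GFX_WORK, COMP_WORK = 160.0, 80.0   # hypothetical work, in SM-time units

# A poor guess by the heuristic: 16 SMs graphics, 4 SMs compute.
static_t = static_partition_time(GFX_WORK, COMP_WORK, TOTAL_SMS, gfx_sms=16)
static_idle = idle_fraction(GFX_WORK, COMP_WORK, TOTAL_SMS, static_t)

# The best whole-SM static split, found by brute force.
best_split = min(
    range(1, TOTAL_SMS),
    key=lambda g: static_partition_time(GFX_WORK, COMP_WORK, TOTAL_SMS, g),
)
best_static_t = static_partition_time(GFX_WORK, COMP_WORK, TOTAL_SMS, best_split)

dynamic_t = dynamic_partition_time(GFX_WORK, COMP_WORK, TOTAL_SMS)

print(f"bad static split:  t={static_t:.2f}, idle={static_idle:.0%}")
print(f"best static split: {best_split} gfx SMs, t={best_static_t:.2f}")
print(f"ideal dynamic:     t={dynamic_t:.2f}")
```

In this model even the best whole-SM split finishes a bit later than the idealized dynamic case, because the ideal ratio lands between two SM counts. A real device would pay some cost for each repartition, which this sketch ignores.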

razor1 · Sep 2, 2016

Anarchist4000 said:
Under a very tight set of conditions. ARM cores are a hell of a lot faster than x86... at certain tasks.

Not so tight, actually. nV's programming models for compute and their architectures are much more forgiving, as long as you stick with their occupancy calculator. And yeah, you gotta do this for all architectures when talking about async compute to get the most performance out.
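To make the occupancy point concrete, here's a stripped-down sketch of the kind of arithmetic an occupancy calculator does. This is not nV's actual tool: it ignores register and shared memory allocation granularity, and the per-SM limits used as defaults are illustrative assumptions, so check the real limits for your GPU.

```python
import math


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=32, regs_per_sm=65536,
              smem_per_sm=98304, warp_size=32):
    """Return (resident blocks per SM, fraction of warp slots in use).

    Simplified: the resident block count is whichever per-SM resource
    (warp slots, registers, shared memory, block slots) runs out first.
    """
    warps_per_block = math.ceil(threads_per_block / warp_size)
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return blocks, blocks * warps_per_block / max_warps


# Hypothetical kernels, all launched with 256 threads per block.
light = occupancy(256, regs_per_thread=32, smem_per_block=0)
reg_heavy = occupancy(256, regs_per_thread=64, smem_per_block=0)
smem_heavy = occupancy(256, regs_per_thread=32, smem_per_block=49152)

print("light kernel:     ", light)
print("register heavy:   ", reg_heavy)
print("shared-mem heavy: ", smem_heavy)
```

The same kernel shape drops from full occupancy to half or a quarter purely from register or shared memory pressure, which is why this check matters on any architecture.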

Anarchist4000 said:
Hypothetical: what if Nvidia had supported async 4 years ago? APIs wouldn't have stagnated and sent devs running for a better solution. The only reason the silicon was "wasted" was because Nvidia screwed gamers and devs for a competitive advantage. I think there's a strong argument it's only happening now because VR required the extra performance and capabilities.

The APIs at the time didn't allow concurrency to work, so there would have been no use in doing that.

Anarchist4000 said:
Time involved in doing it correctly is a consideration. If Nvidia has to commit a lot of resources to getting optimal, functioning code written, as opposed to it just working, that's significant.

We haven't seen nV do anything like that yet. They aren't even really pushing DX12 to developers yet, and their game program has been stagnant for quite some time. Of course, that doesn't mean they aren't planning things, like they did with VR.

Ieldra · Sep 2, 2016

Anarchist4000 said:
Under a very tight set of conditions. ARM cores are a hell of a lot faster than x86... at certain tasks.

Hypothetical: what if Nvidia had supported async 4 years ago? APIs wouldn't have stagnated and sent devs running for a better solution. The only reason the silicon was "wasted" was because Nvidia screwed gamers and devs for a competitive advantage. I think there's a strong argument it's only happening now because VR required the extra performance and capabilities.

Time involved in doing it correctly is a consideration. If Nvidia has to commit a lot of resources to getting optimal, functioning code written, as opposed to it just working, that's significant.

Nvidia chose not to commit resources to getting it to work in Maxwell, though. Even in CUDA it isn't easy to get concurrent kernel execution; often one kernel is enough to saturate the arrays (I'm talking about SMs).

Much like what sebbbi was saying re GCN in the thread I made, the most common instances in which you get overlap come down to the same considerations as for GCN on a CU.
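A back-of-the-envelope way to see why one kernel usually saturates the SMs, as a rough sketch. It treats block scheduling as rigid "waves", which is a simplification (real hardware backfills blocks as they retire), and the SM count, resident-block limit and grid sizes are invented for illustration.

```python
import math


def wave_stats(blocks, sms, resident_blocks_per_sm):
    """Return (number of waves, free block slots during the last wave)."""
    capacity = sms * resident_blocks_per_sm   # blocks the GPU holds at once
    waves = math.ceil(blocks / capacity)
    tail_blocks = blocks - (waves - 1) * capacity
    return waves, capacity - tail_blocks


SMS, RESIDENT = 16, 8   # hypothetical GPU: 128 block slots in total

big = wave_stats(1000, SMS, RESIDENT)    # large grid
exact = wave_stats(1024, SMS, RESIDENT)  # grid that divides evenly
small = wave_stats(32, SMS, RESIDENT)    # short, narrow kernel

print("big kernel:   waves=%d, free slots in tail=%d" % big)
print("exact kernel: waves=%d, free slots in tail=%d" % exact)
print("small kernel: waves=%d, free slots in tail=%d" % small)
```

In this model the large grids only leave room for a second kernel in their final wave, if at all, while the small kernel leaves most of the machine empty. That lines up with overlap mostly paying off for short kernels or at the tail of long ones.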

Anarchist4000 · Sep 2, 2016

Ieldra said:
With Pascal, even if the number of SMs doesn't allow for perfect partitioning, you're not bound by draw call boundaries anymore, so you could just reconsider the partition state at every dispatch from the graphics slot.

You could, but that should still require at least a partial flushing of the device. I'd expect frames to slightly overlap on their scheduling. Otherwise you could have a single pixel holding up all the hardware waiting on a frame to complete. Unless your frame was dependent on the prior frame, the next frame should start scheduling once the current one has been fully dispatched. That would complicate the repartitioning, but the benefits may outweigh the costs to just have the gap. Point being you inevitably introduce bubbles doing it. In most cases significant swings shouldn't be necessary, but it's always possible. Especially with some of the multi-engine techniques without really partitioning the device in advance.

Ieldra said:
even in CUDA it isn't easy to get concurrent kernel execution; often one kernel is enough to saturate the arrays (I'm talking about SMs)

CUDA, I still feel, is a bad analogy because you are unlikely to have complementary workloads. And while there is overlap, to what degree? You might have complementary workloads, yet only in sufficient quantity to fill part of the gap. That quantity could vary by hardware generation as well.

noko · Sep 2, 2016

So after the two stroked each other here over and over again - just kidding.

Why in hell did AMD hardware see roughly a 40% increase in Doom using Vulkan, while Nvidia saw virtually nothing? Especially considering that AMD's and Nvidia's comparable generations were not too far apart in OpenGL.

Ieldra · Sep 2, 2016

noko said:
So after the two stroked each other here over and over again - just kidding.

Why in hell did AMD hardware see roughly a 40% increase in Doom using Vulkan, while Nvidia saw virtually nothing? Especially considering that AMD's and Nvidia's comparable generations were not too far apart in OpenGL.

AMD is hideously bottlenecked in OpenGL. Like comparing the speed of an Olympic runner when he's half awake to when he's literally on fire

DOOM performance changed with latest drivers supporting runtime 1.0.0.17

SweClockers performance tests Doom in Vulkan

AMD updated to support OGL 4.5, so that decreased the % gain from Vulkan.

Nvidia updated to support Vulkan 1.0.0.17, and that increased the % gain from Vulkan.
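Worth spelling out why a better OpenGL driver shrinks the Vulkan gain even when Vulkan performance itself doesn't move: the percentage is measured against the OpenGL baseline. A quick sketch with made-up frame rates (these are not benchmark results):

```python
def percent_gain(baseline_fps, new_fps):
    """Relative improvement of new_fps over baseline_fps, in percent."""
    return (new_fps - baseline_fps) / baseline_fps * 100.0


# Hypothetical numbers, for illustration only.
VULKAN_FPS = 70.0
OGL_FPS_OLD_DRIVER = 50.0
OGL_FPS_NEW_DRIVER = 56.0   # the OpenGL path got faster, Vulkan unchanged

gain_before = percent_gain(OGL_FPS_OLD_DRIVER, VULKAN_FPS)
gain_after = percent_gain(OGL_FPS_NEW_DRIVER, VULKAN_FPS)

print(f"gain vs old OGL driver: {gain_before:.0f}%")
print(f"gain vs new OGL driver: {gain_after:.0f}%")
```

So a big % gain says as much about how slow the baseline was as about how fast the new path is.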

razor1 · Sep 2, 2016

noko said:
So after the two stroked each other here over and over again - just kidding.

Why in hell did AMD hardware see roughly a 40% increase in Doom using Vulkan, while Nvidia saw virtually nothing? Especially considering that AMD's and Nvidia's comparable generations were not too far apart in OpenGL.

The same driver overhead issue persists in OGL, plus there are many other issues in AMD's OGL drivers.

Ieldra · Sep 2, 2016

Maxing out GPU usage in nBodyGravity - GPUOpen

Google Translate

These will end up somewhere when I update the thread. Gonna try and write some examples to explain different combinations of serial/parallel, concurrent/sequential and async/sync as well.

Please note the brief description of what is going on in the GPUOpen blog post, then see my discussion about particle sims a few posts up.

Anyway the data from computerbase is plotting percent gain (from async on) vs number of particles.

Maxwell gains nothing (presumably because the compute queue isn't even exposed, compatibility layer we mentioned) but it also doesn't lose anything.

Pascal gains tiny amounts.

Fury X gains huge amounts then loses performance as particle count goes up.

For the record, I'm always harangued with links to this godforsaken demo, so here it is. I'm posting it.

noko · Sep 2, 2016

Well, I think you two are right on the money with that explanation, and it's good to see Nvidia keeps improving performance in Vulkan too.

I think the bottom line for most folks is which design/card etc. will give me the most bang/buck in games I will play, or in general all games coming out or recent ones. Both AMD and Nvidia support DX12/Vulkan; how they do it may be somewhat different, with different advantages, but the bottom line is actual performance in the end.

Ieldra · Dec 1, 2016

The part about the ACEs enabling fast context switching through a dedicated cache is incorrect; the async shaders whitepaper mentioned fast context switching enabled by the ACEs in the sense that the active queue can be swapped.

This was pointed out to me a while back but I never bothered to change it because for the purposes of the example I used (assuming zero context switch latency) it works out just fine.

I should probably write up the correct explanation and post it at some point but far too lazy atm
