Async compute gets a 30% increase in performance. Maxwell doesn't support async.

I think this was specific to DP performance; that's why it applies to mid-to-high Kepler, not Titan.

And as for compute performance on Maxwell 2, pure compute is very good, more efficient than GCN.

What Maxwell 2 seems to be doing right now, and what looks to be broken for the most part in Dx12 while working like it's supposed to in Dx11, is concurrent kernel execution of both graphics and compute, which comes even before we get to async. Async is the ability to take the compute instructions as a kernel is being executed and interleave them into the graphics path.

And this is why we were getting crazy results: at times, very few times, it was doing concurrent execution and async was happening, but most of the time it was screwing up.

This is what raised the question of preemption and the context switch, which isn't accurate either. Well, an assessment can't really be made on that, because preemption seems to be being used when it shouldn't be.
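
To make the concurrent-vs-async distinction concrete, here's a minimal CUDA sketch (hypothetical kernel names; CUDA exposes no graphics queue, so a second compute kernel stands in for the graphics path). Work issued into separate streams *may* run concurrently if the hardware and driver allow it; nothing in the API guarantees the overlap, which is exactly the behavior in question.

```cpp
// Minimal sketch of concurrent kernel execution via CUDA streams.
// Kernel names are made up; a second compute kernel stands in for the
// graphics path, since CUDA has no graphics queue.
#include <cuda_runtime.h>

__global__ void graphicsWork(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;                 // stand-in for graphics work
}

__global__ void computeWork(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;    // stand-in for compute work
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t gfx, cmp;
    cudaStreamCreate(&gfx);
    cudaStreamCreate(&cmp);

    // Issued into different streams, so the scheduler MAY execute them
    // concurrently -- but nothing here guarantees the overlap.
    graphicsWork<<<n / 256, 256, 0, gfx>>>(a, n);
    computeWork <<<n / 256, 256, 0, cmp>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(gfx);
    cudaStreamDestroy(cmp);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```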

My take on that response was that it was looking at SP loads, not DP loads. David Kanter is discussing 32-bit addressing in those quotes.

As for pure compute performance, it would depend on how you're feeding both Maxwell 2 and GCN 1.1/1.2. If you're feeding them workgroups of 32 threads, then of course Maxwell 2 will be more efficient. If you're feeding Maxwell 2 workgroups of 32 threads and GCN 1.1/1.2 workgroups of 64-256 threads (varying)... then you'll likely tap both architectures' theoretical peak FLOP rates.

In that case Maxwell 2 would end up slightly outperforming a Hawaii-based GPU, but not so much a Fiji-based GPU. It's all in how you optimize your code, really. True, it is easier to optimize for Maxwell 2, but that doesn't mean GCN 1.1/1.2 is less efficient. In fact, the cache-to-FLOP ratio is also a limiting factor for Maxwell 2 (as it is for Fiji, come to think of it, as the TFLOPS rate has gone up without any changes to the overall GCN caching mechanisms).
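
To illustrate the workgroup-size point, here's a hypothetical sketch in CUDA syntax (GCN would actually be fed through OpenCL/HLSL, but the launch-geometry idea carries over): the kernel body stays the same, and only the workgroup size is tuned per architecture.

```cpp
// Hypothetical sketch: same kernel, different workgroup (block) sizes.
// 32 threads matches NVIDIA's warp size; 64-256 threads are multiples of
// GCN's 64-lane wavefront. Tuning this launch geometry, not the kernel
// body, is what feeds each architecture efficiently.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launchTuned(float a, const float *x, float *y, int n, bool warpSized) {
    int block = warpSized ? 32 : 256;     // 32 for Maxwell 2, 256 for GCN
    int grid  = (n + block - 1) / block;  // round up to cover all n items
    saxpy<<<grid, block>>>(a, x, y, n);
}
```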

Where AMD lacks is in the tools they provide to developers in order to code and optimize for their architectures. Their documentation is also sub-par.
 
Edited my post after that ;) Sorry, yes, I agree, but the problem isn't that it can or can't do it; it does concurrent just fine in Dx11, just not in Dx12. Don't confuse concurrent or parallel with async; they're two different things. Async needs concurrent kernel execution to function, but even if kernels are done concurrently it still might not be doing async.

Concurrent is where fine grained preemption comes in afaik.
 
Only if you're switching kernels completely, like from 32 compute to 1 + 31 graphics or vice versa. This is why preemption is NOT used in async at all; you shouldn't be switching kernels, both should be in flight at the same time.
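
As a rough analogy (a sketch, assuming CUDA's stream-priority API stands in for the async queues under discussion): with priorities, high-priority compute work gets interleaved into free cycles while the other kernel stays resident, rather than being context-switched out.

```cpp
// Sketch: two queues of different priority, both in flight at once.
// The scheduler can slot high-priority work into idle cycles instead of
// preempting (context-switching out) the kernel already running.
#include <cuda_runtime.h>

void makeAsyncQueues(cudaStream_t *gfxLike, cudaStream_t *computeLike) {
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Numerically lower values mean higher priority in CUDA.
    cudaStreamCreateWithPriority(gfxLike,     cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(computeLike, cudaStreamNonBlocking, greatest);
}
```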
 
Even if it's coded for one architecture or the other, and even though there will be drawbacks on the other IHV's hardware, it shouldn't matter to the degree we saw with AOS. Possibly, if it's too much strain on Maxwell 2, you will see it fall flat, but we really don't know what that limit could be without it working in the first place.
 
Context switching is used on AMD GCN when performing async (varying with dependencies). It does come at a 1-cycle loss in performance (due to fine-grained preemption).

Since you take a hit (sometimes on the order of several thousand cycles) with Maxwell 2 when performing a context switch, it could explain why Oxide was getting such horrible performance issues when they attempted to run their code on Maxwell 2. They ended up shutting it all down at the request of nVIDIA. AMD has a lot of experience coding close to the metal, from Mantle as well as from working with the console guys/gals. This is an area in which AMD has an edge over nVIDIA.

That's why we're seeing nVIDIA working with Oxide in order to fully implement async. We have no idea what the end result will be, though.
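
If you wanted to check this behavior yourself, the usual approach (like the test programs that circulated on Beyond3D) is to time each kernel alone and then both together. A rough, hypothetical sketch of the timing harness:

```cpp
// Sketch of an overlap test: time each kernel alone, then both in
// separate streams. A combined time near max(A, B) means they overlapped;
// a time near A + B means the driver serialized (or context-switched) them.
#include <cuda_runtime.h>
#include <chrono>

template <typename Issue>
double timeMs(Issue issue) {
    cudaDeviceSynchronize();                          // clean starting point
    auto t0 = std::chrono::steady_clock::now();
    issue();                                          // launch the kernel(s)
    cudaDeviceSynchronize();                          // wait for completion
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Here `issue` would be a lambda launching one or both kernels into their respective streams.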
 
Oh, I'm speaking of performance outside of Async compute here. I'm talking about pure compute.

PS: Some people have attributed some odd quotes to me over at Beyond3D. Someone claimed I stated that the Nitrous pipeline will be taking place 100% in compute. That's not what I stated. I said that some game developers are doing this on the consoles (see Dreams). The next iteration of the Nitrous engine will have 50% of its pipeline occurring in compute, though.
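
For what "X% of the pipeline in compute" means in practice, here's a toy illustration (not Nitrous code): a full-screen post-process pass written as a plain compute kernel instead of a pixel shader.

```cpp
// Toy example (hypothetical): a full-screen tone-mapping pass done as a
// compute kernel rather than a pixel shader. Moving passes like this off
// the graphics pipeline is what shifting work "into compute" refers to.
__global__ void tonemapPass(const float3 *hdrIn, uchar4 *ldrOut,
                            int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float3 c = hdrIn[y * width + x];
    // Simple Reinhard operator, c / (1 + c), then convert to 8-bit.
    ldrOut[y * width + x] = make_uchar4(
        (unsigned char)(255.0f * c.x / (1.0f + c.x)),
        (unsigned char)(255.0f * c.y / (1.0f + c.y)),
        (unsigned char)(255.0f * c.z / (1.0f + c.z)),
        255);
}
```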
 
Not really; if concurrent kernels aren't in flight, how would that even occur? A problem like this with concurrency is something the developer should not have missed at all. nV probably knew it: if it's working in Dx11, and it is (we can see the different queues for PhysX, compute, and graphics), that's why they didn't say anything publicly. The driver just wasn't ready for it yet, and I think it's going to take a while for them to fix it; it's not a small undertaking like a tweak or something. It's like initializing code and just leaving an empty spot.
Yes, compute will increase as things go on, but at a fairly slow rate. 50% for the next engine, but how long will we see before that comes out, two years?
 
He didn't specify how long, but he did state it would be on his next title. Whenever that is? I don't know.

I'm just saying that people have been attributing some crazy claims to me... that's all lol
 
Yeah, I don't take what people say for anything anyway; it's just hearsay.
 