[PCGH] Hitman DX12 Benchmarks -- 390 faster than 980 by 20%

This game has embarrassing graphics.
I can't understand how a software house developing on a brand like this can throw out such a piece of crap.
 
Let's get more detailed into this, shall we, Anarchist: can you tell me the difference in sorting when using DirectCompute vs. CUDA? This could be a reason why nV might be doing so much better with any game that uses CUDA along with DX12. Of course I think there are other reasons too, but this is one of the main factors.

Simple little things like this and you can't pick up on them? Instead you just post crap... it confuses the fuck out of me why you can't just ask a question.
 
Texture-bound and compute-bound workloads affect different parts of the GPU and shouldn't need to be interleaved unless the end output depends on both at the same time while the ROPs are fully utilized. Also, a ROP-limited scenario isn't looked into much in this day and age, as AA modes like SSAA and MSAA aren't used as much.
Just to clarify, texture bound is roughly analogous to memory bound. An ALU won't execute an instruction until the data has been fetched, so there is a benefit to scheduling it alongside compute-bound shaders. And yes, pixel, vertex, and compute shaders run on the same hardware for the actual shading. So while shader A is waiting on texture data to show up, shader B is active and doing math that likely doesn't have to fetch as much. It's latency hiding for when 100% occupancy isn't enough to cover your latency. As compute shaders don't use ROPs, they are free to start doing all sorts of interesting stuff. The ROPs MAY be an issue if you start forcing a ton of shadow maps through there. In most cases you will be constrained by how fast you can fetch data, which is what GPUs are designed to alleviate. Keep in mind the warp scheduler will be looking at every available warp to see which one is ready (data fetched) to be scheduled. This is where the interleaving happens. It's mixing shaders that aren't likely to be ready with ones that are.

Keep in mind that devs have been calling async free performance. You can do a bunch of stuff and it doesn't take any longer to complete. So increasing the proportion of async-capable threads by upping the resolution will have interesting scaling. It would have minimal impact on frame time until you saturated the ALUs. Overclocking the shaders and not the memory would yield an even larger pool of free performance. Yes, it's possible to do this on Nvidia hardware, but the implementation is a nightmare. You would literally have to start drawing everything in fractions to generate a limited number of warps.
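To make the overlap/latency-hiding idea above concrete, here is a minimal sketch using CUDA streams as stand-ins for a graphics queue and an async compute queue. The kernels, names, and sizes are made up purely for illustration (nothing from the game or from either poster), just the general pattern of pairing a fetch-heavy pass with an ALU-heavy one so the ALUs stay busy while loads are in flight.

[CODE]
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel: one load, one store, almost no math
// (stand-in for a texture-heavy graphics pass).
__global__ void memoryBoundPass(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

// Hypothetical ALU-bound kernel: lots of math per element
// (stand-in for an async compute job).
__global__ void aluBoundPass(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 256; ++k) v = v * 1.0001f + 0.0001f;  // busy work
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 22;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaStream_t gfx, compute;          // two queues, like a direct + compute queue
    cudaStreamCreate(&gfx);
    cudaStreamCreate(&compute);

    dim3 block(256), grid((n + 255) / 256);
    // Submitted to different streams, the two kernels are allowed to overlap:
    // while the memory-bound one stalls on loads, the ALU-bound one keeps the SMs busy.
    memoryBoundPass<<<grid, block, 0, gfx>>>(a, b, n);
    aluBoundPass<<<grid, block, 0, compute>>>(c, n);

    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
[/CODE]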

If you need to store how the data was arranged then you need to have it in memory, which defeats the purpose of not using the RT. Also, the data a CS can read is not unlimited; as I stated before, only certain inputs go into the CS. And this is the difference between forward and deferred: the depth data doesn't need to be stored in a forward renderer, there is no need to.
The final step of any scene will be taking all those samples of the RT, doing any simple per-pixel postprocessing, and outputting a simple one-sample-per-pixel image. This can be blur, tone-mapping, compositing, etc. As for the inputs, you could bind your entire packed memory pool to some sort of array you could index into if you really wanted. This is still limited by the APIs, but on GCN you could create a stack and use the entire memory pool as a heap if you wanted.

LOL, what, you're not using overdraw-limiting algorithms? You do know it's bad technique not to, right? You should never draw a pixel more than once unless it's absolutely necessary (like in a transparency situation), and this is why you do depth tests.
All I said was that overdraw will create even more threads than just your resolution. It should be avoided like you said, but it still occurs. You will almost never know how many pixel instances get created by the rasterizer, so neither will you know exactly how many threads, and in turn warps, get created.

Oh here ya go,

ROPs and GCN cores

So you're telling me I don't know my basics, but that's pretty much exactly what I stated!
Pretty much confirms what I was saying.

Anyways, this is already getting way off topic again so I'm done with this.

Let's get more detailed into this, shall we, Anarchist: can you tell me the difference in sorting when using DirectCompute vs. CUDA? This could be a reason why nV might be doing so much better with any game that uses CUDA along with DX12. Of course I think there are other reasons too, but this is one of the main factors.

Simple little things like this and you can't pick up on them? Instead you just post crap... it confuses the fuck out of me why you can't just ask a question.
CUDA has dynamic parallelism and DC does not, along with a number of other features that are missing. Any game that uses CUDA is probably using PhysX. So hardware vs. software acceleration.
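For anyone unfamiliar with the term, dynamic parallelism means a kernel can launch further kernels from the GPU itself, with no round trip through the CPU. A minimal hypothetical sketch (needs compute capability 3.5+ and nvcc -rdc=true; the kernel names are made up for illustration):

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(int parentId) {
    // Each child block does some work derived from its parent's data.
    printf("child block %d launched by parent thread %d\n", blockIdx.x, parentId);
}

__global__ void parentKernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Device-side launch: no CPU involvement needed to spawn more work.
    if (tid == 0) {
        childKernel<<<4, 1>>>(tid);
    }
}

int main() {
    parentKernel<<<1, 32>>>();
    cudaDeviceSynchronize();   // host waits for the parent grid and its children
    return 0;
}
[/CODE]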
 
Just to clarify, texture bound is roughly analogous to memory bound. An ALU won't execute an instruction until the data has been fetched, so there is a benefit to scheduling it alongside compute-bound shaders. And yes, pixel, vertex, and compute shaders run on the same hardware for the actual shading. So while shader A is waiting on texture data to show up, shader B is active and doing math that likely doesn't have to fetch as much. It's latency hiding for when 100% occupancy isn't enough to cover your latency. As compute shaders don't use ROPs, they are free to start doing all sorts of interesting stuff. The ROPs MAY be an issue if you start forcing a ton of shadow maps through there. In most cases you will be constrained by how fast you can fetch data, which is what GPUs are designed to alleviate. Keep in mind the warp scheduler will be looking at every available warp to see which one is ready (data fetched) to be scheduled. This is where the interleaving happens. It's mixing shaders that aren't likely to be ready with ones that are.

Texture bound is not analogous to memory bound; I have no idea where you got that from. The only time that's even remotely close is if you are memory thrashing. The rest of what you stated is correct.

Keep in mind that devs have been calling async free performance. You can do a bunch of stuff and it doesn't take any longer to complete. So increasing the proportion of async-capable threads by upping the resolution will have interesting scaling. It would have minimal impact on frame time until you saturated the ALUs. Overclocking the shaders and not the memory would yield an even larger pool of free performance. Yes, it's possible to do this on Nvidia hardware, but the implementation is a nightmare. You would literally have to start drawing everything in fractions to generate a limited number of warps.

It's never "free"; you get more performance available because compute shaders have less overhead in doing certain things. No, it is no longer possible to overclock just the shader units on nV hardware; the separate hot clocks were dropped after Fermi, with Kepler, if I'm not mistaken. Plus that doesn't make any sense, because the frequency of the shader units has nothing to do with the programming...

The final step of any scene will be taking all those samples of the RT, doing any simple per-pixel postprocessing, and outputting a simple one-sample-per-pixel image. This can be blur, tone-mapping, compositing, etc. As for the inputs, you could bind your entire packed memory pool to some sort of array you could index into if you really wanted. This is still limited by the APIs, but on GCN you could create a stack and use the entire memory pool as a heap if you wanted.

Yes

All I said was that overdraw will create even more threads than just your resolution. It should be avoided like you said, but it still occurs. You will almost never know how many pixel instances get created by the rasterizer, so neither will you know exactly how many threads, and in turn warps, get created.

There is no way to know the precise amount, but you can make estimations.

Pretty much confirms what I was saying.

No you weren't; I pointed out everything you stated that was incorrect. Do you want me to highlight it in red? I already pointed it out by quoting what you stated...



CUDA has dynamic parallelism and DC does not, along with a number of other features that are missing. Any game that uses CUDA is probably using PhysX. So hardware vs. software acceleration.

Seriously, lol. OK, I'll give you a hint: try to do a radix sort with DC and then with CUDA, and figure out why CUDA goes so much faster. It doesn't matter what DC version either; this is specifically down to the limitations of the shader language features. DC is hardware accelerated... why on earth would you think DC isn't hardware accelerated and only CUDA is? It just doesn't expose certain features. Even OpenCL has these features. As I stated, you know a little bit, but not all of it.
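To make the radix-sort hint concrete: in CUDA the per-digit ranking step can be done entirely with cross-lane intrinsics (ballot + popcount), whereas SM5.x DirectCompute has no equivalent wave intrinsics and has to route the same step through groupshared memory and barriers. A rough sketch of a warp-level 1-bit split, assuming one 32-key tile per warp (illustrative only, not a full radix sort):

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level 1-bit split step of a radix sort: each thread holds one key and
// computes its output position using cross-lane intrinsics only.
__global__ void warpSplitOneBit(const unsigned* keys, unsigned* sorted, int bit) {
    const unsigned lane = threadIdx.x & 31;
    unsigned key   = keys[lane];
    unsigned isOne = (key >> bit) & 1u;

    // Bitmask of all lanes whose current digit is 1 (one cross-lane op).
    unsigned ones  = __ballot_sync(0xFFFFFFFFu, isOne);
    unsigned below = (1u << lane) - 1u;               // lanes before this one

    unsigned zerosTotal = 32 - __popc(ones);          // 0-digit keys in the warp
    unsigned rank = isOne
        ? zerosTotal + __popc(ones & below)           // 1s go after all 0s, in order
        : lane - __popc(ones & below);                // 0s keep their relative order

    sorted[rank] = key;   // scatter within the warp's 32-element tile
}

int main() {
    unsigned h[32], *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h[i] = (i * 2654435761u) & 0xFF;   // arbitrary test keys
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    warpSplitOneBit<<<1, 32>>>(d_in, d_out, 0);       // split on the least significant bit
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%u ", h[i]);
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
[/CODE]

The point is simply that the ranking never leaves the register file; the SM5.x version of that same step needs LDS traffic and group barriers.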
 
Texture bound is not analogous to memory bound; I have no idea where you got that from. The only time that's even remotely close is if you are memory thrashing. The rest of what you stated is correct.
They're both related to fetching data from memory, respective to their shader types, graphics and compute. That has nothing to do with memory thrashing.

It's never "free"; you get more performance available because compute shaders have less overhead in doing certain things. No, it is no longer possible to overclock just the shader units on nV hardware; the separate hot clocks were dropped after Fermi, with Kepler, if I'm not mistaken. Plus that doesn't make any sense, because the frequency of the shader units has nothing to do with the programming...
So you can add a bunch of effects and still finish the frame in the same amount of time... Sounds like free to me. As for the clocks I'll have to go check the Nvidia control panel. Last I checked the core and memory clocks were still adjustable.
If you look around, there are hardware sites out there that discuss the fundamentals of overclocking and the effects it has on video cards. Increasing the core clock is unlikely to give a linear increase in performance because you lack the memory bandwidth to feed the processors. Therefore ALUs start to idle because the data isn't ready.

I'm done with this because all you're doing is throwing out bullshit trying to make a point. I've yet to see any evidence of where I'm mistaken outside your opinion of how you think things work, changing the context, or misunderstanding fundamental functions of the hardware.
 
They're both related to fetching data from memory, respective to their shader types, graphics and compute. That has nothing to do with memory thrashing.

Texture bound has nothing to do with memory bound; show me where you read that or where you're getting that info...

So you can add a bunch of effects and still finish the frame in the same amount of time... Sounds like free to me. As for the clocks I'll have to go check the Nvidia control panel. Last I checked the core and memory clocks were still adjustable.

It's not free; think about why it's not free and then you will get your answer...

If you look around, there are hardware sites out there that discuss the fundamentals of overclocking and the effects it has on video cards. Increasing the core clock is unlikely to give a linear increase in performance because you lack the memory bandwidth to feed the processors. Therefore ALUs start to idle because the data isn't ready.

What? That has been there since when? Do you know how many different clock domains there are on GPUs right now? Do you know what the memory bus speed is on GPUs? Do you know that shader clocks were only separate in the past, with the G80 and its ilk up through Fermi? Core is just what it is: the GPU. Right now, with Maxwell and Fiji, all you can adjust is the entire core. If you increase the memory frequency, the bus speed will also be increased by that amount.
I'm done with this because all you're doing is throwing out bullshit trying to make a point. I've yet to see any evidence of where I'm mistaken outside your opinion of how you think things work, changing the context, or misunderstanding fundamental functions of the hardware.

I'm pulling out BS? You know what, from now on I will highlight everything you say in BIG RED LETTERS just to show you the crap you pull out of your ass. Then when you try to backtrack, I will highlight those in BIG BLUE LETTERS so it's easy to track. Would you like that? It's pathetic that you deny what you wrote just a page and a half back.
 
:facepalm:

Can we please not have this back-and-forth for like the 6th or 7th time? Please, PLEASE?


Let's get back on topic, but this is far from done. Anarchist, I told you what I'm going to do, and I'm going to do it. Since you didn't see what you wrote wrong even when I quoted the specific portions where you were wrong, I think it has to be done, so this back-and-forth crap doesn't happen again.

And Anarchist, the "features" that are missing from DC are cross-lane operations, FYI.
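"Cross-lane operations" here means instructions that let threads in the same warp/wave read each other's registers directly, with no shared memory or barriers. A tiny illustrative example of one such op in CUDA, a warp-wide sum via shuffles (HLSL only picked up comparable wave intrinsics later, with Shader Model 6.0):

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

// Warp-wide sum using only cross-lane shuffles: no shared memory, no barriers.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset);  // read a register from another lane
    return v;   // lane 0 ends up holding the total
}

__global__ void sumKernel(const float* in, float* out) {
    float total = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

int main() {
    float h[32], *d_in, *d_out, result;
    for (int i = 0; i < 32; ++i) h[i] = 1.0f;           // expected sum: 32
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    sumKernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", result);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
[/CODE]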
 
This should be fun. I've now got a stalker highlighting his mistakes for everyone, a thread where half the posts aren't remotely related to the OP, and apparently it's going to continue. Luckily all my positions are supported, all predictions seem to be accurate so far, and I've yet to have to walk anything back. On the bright side I blocked him, but I doubt it will help.
 
Aww, too bad, lol, you blocked me. So you don't want that to happen? Pathetic. Same thing as last time: when I show you where you are wrong, you try to back out by saying I'm talking crap, when in fact it's a correction of what you posted, lol. Just the usual suspects. When you can't prove what you stated, the other person must be wrong...

It ain't stalking; I am not going to go out of my way. The highlights are instead of the quotes I have been doing, so that it's clear for you. You might be selectively blind to italics, I don't know, but there is definitely something wrong up there when you can't even admit to something you posted a page and a half ago, plain as text for everyone to see. Must be ego getting in the way of thinking. PS: yellow is better, but most likely I won't need to do it anyway; it would have been funny to just highlight the crap Anarchist spews, his name says it all.

You may consider it stalking and put me on ignore, since you don't want others to see that what you write is wrong, or only partially correct, to push whatever view you want. And that will be easy to see, since every single post you make has nothing of value as a whole. That will happen because very few of us here post anything in-depth about architecture and programming, and you won't either, so guess what: your exercise works to hide your inadequacy, which was easily seen when I asked you things. When we were on the topic of divergence and I gave two examples, I asked you which one was divergent; you never answered, and I still haven't said, but I know which one it is. And now this latest round about why CUDA is, for the time being, better than DC. You float around the questions as if they will sting you. And there is a reason they will: because you don't know.

Back to this recent mistake you made, and why CS don't scale with resolution...

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiv0oj4hL7LAhXGVD4KHS94BO8QFggdMAA&url=http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Efficient%20Compute%20Shader%20Programming.pps&usg=AFQjCNGpVaDTwzI79qJFXUITbzsUvHWOYw&sig2=YgXqr5VurqqXjPvHxGpQlQ

Resolution increases the CS count 1:1, and each PS thread can do 4-pixel bursts, so what is the end ratio? Then you can optimize that by doing the CS work at half resolution, and what happens to the ratio then? Guess what, tada, the CS doesn't scale with resolution anymore!

So with async shaders, if there were a bottleneck from async shaders on either IHV, it should diminish as resolution increases, provided the shader is properly optimized, not increase as you have stated at least twice so far.
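For anyone trying to follow the ratio argument, the thread-count arithmetic looks roughly like this (purely illustrative numbers; the fixed 960x540 RT is a made-up example of a lighting buffer that doesn't track render resolution):

[CODE]
#include <cstdio>

// Thread-count arithmetic behind the PS vs. CS ratio argument (illustrative numbers only).
int main() {
    const long long res[][2] = { {1920, 1080}, {2560, 1440}, {3840, 2160} };
    for (int i = 0; i < 3; ++i) {
        long long w = res[i][0], h = res[i][1];
        long long psThreads = w * h;               // ~1 pixel-shader thread per pixel
        long long csHalfRes = (w / 2) * (h / 2);   // CS on a half-resolution RT: still grows, at 1/4 the PS count
        long long csFixed   = 960 * 540;           // CS on a fixed-size RT: does not grow with render resolution
        printf("%lldx%lld  PS=%lld  CS(half-res)=%lld  CS(fixed)=%lld\n",
               w, h, psThreads, csHalfRes, csFixed);
    }
    return 0;
}
[/CODE]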
 
Jessshh... why don't you two just get a room? This constant back and forth has crossed several threads now, and while we may appreciate SOME insight, the garbage is enough. Make your own damn thread about DX12 vs DX11 and please keep it there. As much as I hate to, I'm going to start making ample use of the Report feature.
 
Jessshh... why don't you two just get a room? This constant back and forth has crossed several threads now, and while we may appreciate SOME insight, the garbage is enough. Make your own damn thread about DX12 vs DX11 and please keep it there. As much as I hate to, I'm going to start making ample use of the Report feature.
Yeah, it's bordering on ADHD or whatever combination of letters. :D

But good stuff. Keep it up! lol
 
Well, if you like it when people say things to you that are the total opposite of the truth, for whatever reason, in this case lack of knowledge, that is up to you. Personally, I don't.
 
yes...the truth is always preferred. But for us stupid people this boils down to "hey my understanding of shit is shit"
 
This game has embarrassing graphics.
I can't understand how a software house developing on a brand like this can throw out such a piece of crap.

[Troll-face meme GIF]
 
Some interesting things in this thread, but...
just go make a "razor1 vs Anarchist4000" thread in the gen. video section if you want to keep arguing!
 
yes...the truth is always preferred. But for us stupid people this boils down to "hey my understanding of shit is shit"

People generally aren't stupid. To tell you the truth, most of the people who post on tech forums, especially here, are far more knowledgeable than the average Joe, or even their counterparts at work or in their daily lives.
Some interesting things in this thread, but...
just go make a "razor1 vs Anarchist4000" thread in the gen. video section if you want to keep arguing!

In any case, the whole point of that ridiculous argument was that someone stated resolution would bottleneck the async pipeline for nV cards.

A) The numbers don't support that, because you will not see crazy changes across nV cards; all Maxwell 2 cards have similar capabilities with async. If resolution affected async performance in such a manner, it would scale with resolution along a similar path. Not to mention there is a substantial advantage for AMD cards in the DX11 rendering path in this particular game, and we all know async is not possible under DX11.
B) My point was that resolution doesn't have a direct relationship with async operations, which is what this latest crap was about, and I have shown links to prove my points.

My conclusion is that this game runs better on AMD hardware for reasons other than async. Of course, async can give an added benefit to certain AMD cards.
 
People generally aren't stupid. To tell you the truth, most of the people who post on tech forums, especially here, are far more knowledgeable than the average Joe, or even their counterparts at work or in their daily lives.


In any case, the whole point of that ridiculous argument was that someone stated resolution would bottleneck the async pipeline for nV cards.

A) The numbers don't support that, because you will not see crazy changes across nV cards; all Maxwell 2 cards have similar capabilities with async. If resolution affected async performance in such a manner, it would scale with resolution along a similar path. Not to mention there is a substantial advantage for AMD cards in the DX11 rendering path in this particular game, and we all know async is not possible under DX11.
B) My point was that resolution doesn't have a direct relationship with async operations, which is what this latest crap was about, and I have shown links to prove my points.

My conclusion is that this game runs better on AMD hardware for reasons other than async. Of course, async can give an added benefit to certain AMD cards.
^ Damage control is strong with this one. Go make a razor1 vs. Anarchist thread, considering you have your head so far up your ******. Gaming Evolved is the counterpart of The Way It's Meant To Be Played, not GameWorst.
 
^ Damage control is strong with this one. Go make a razor1 vs. Anarchist thread, considering you have your head so far up your ******. Gaming Evolved is the counterpart of The Way It's Meant To Be Played, not GameWorst.


Except for a select few, whom I mentioned here in the thread about GOWUE; this is one of them, and there are two more, lol... Did you notice it's always the same people? I can give you the list, but it's pretty easy to see.

Of course this one didn't understand the irony of what I stated; that's why he is trying to correct something that doesn't need correcting, lol. PS: when I mentioned GameWorks in an earlier post than the one you quoted, I was comparing two benchmarks, one of which was Rise of the Tomb Raider, so please pay attention to what was being talked about ;)


Going back to what I stated earlier about Fury not seeming to get the same performance advantages going from DX11 to 12 as its GCN counterparts:

[Chart: 7iM1CtQ.png – memory read/write comparison, Fury X vs. 390X]


Interesting that the read/writes of the Fury X are a bit lower than the 390X's; this might be one of the reasons why we don't see the same marked improvements with the API transition that we see on the 390X.

This is why you can't make assumptions based on a benchmark where values are going all over the place without any correlation between them, like the assumption Anarchist4000 made. If there are incongruous results, there have to be other parts of the GPU involved that we can't see, and this is where synthetics come in; they are the best thing to look at to nail down problem areas.
 
For factual evidence of what I was saying, take a look at the AOTS benchmarks with async on and off. For some reason there aren't a lot of async on/off comparisons out there by resolution, but you will see the benefits of async increase as the resolution increases, up until you saturate the ALUs or run out of async work. It's been explained a couple of times in various articles so far. As you become more GPU bound (often by increasing resolution, making you texture/memory bound) you get more potential benefit from concurrent execution of compute with async. I'd have used Hitman numbers, but there aren't any async on/off benches that I can find.

While they didn't break down the differences, that's between a 27-73% (low to crazy) improvement in the gains from async between 1080p and 4K.
AMD clobbers Nvidia in updated Ashes of the Singularity DirectX 12 benchmark | ExtremeTech

Here's a benchmark showing the effect as well. 6-20% improvement between 1440p and 2160p.
Ashes of the Singularity Revisited: A Beta Look at DirectX 12 & Asynchronous Shading

Another example, but some math is required. They're showing 30-50% gains between 1440p and 2160p, and their CPU is really holding them back at lower resolutions, with 1080p and 1440p giving nearly identical FPS.
Ashes of the Singularity DirectX-12-Leistungsexplosion (Seite 3)
  • These numbers are for a Fury X. Significant improvements as resolution increases. So turning on async, you complete the same amount of work in less time with better scheduling and concurrency. Free performance that increases with resolution.
  • A 390 gets roughly the same benefit (~10%) regardless of resolution. If anything it's actually losing its async gains as resolution increases. This is the opposite of what you'd expect from increasing the graphics workload, unless it ran out of compute to perform, which wasn't the case for a Fury, which is a compute powerhouse.
  • A 980 Ti, despite going backwards in all tests, also sees an improvement (or less of a hit) as resolution increases. At 4K/Crazy it roughly breaks even with async enabled. So even for a 980 Ti using its software async (however they actually implemented that in AOTS), there is a benefit from the async/concurrent execution, although we're talking single-digit absolute percentages and it's practically unplayable. At 5K resolution it may actually benefit from async while being even more unplayable.
This is from a game whose devs said it doesn't use a lot of async, and we're seeing 10-15% performance improvements with better results as resolution increases.
 
For factual evidence of what I was saying, take a look at the AOTS benchmarks with async on and off. For some reason there aren't a lot of async on/off comparisons out there by resolution, but you will see the benefits of async increase as the resolution increases, up until you saturate the ALUs or run out of async work. It's been explained a couple of times in various articles so far. As you become more GPU bound (often by increasing resolution, making you texture/memory bound) you get more potential benefit from concurrent execution of compute with async. I'd have used Hitman numbers, but there aren't any async on/off benches that I can find.

While they didn't break down the differences, that's between a 27-73% (low to crazy) improvement in the gains from async between 1080p and 4K.
AMD clobbers Nvidia in updated Ashes of the Singularity DirectX 12 benchmark | ExtremeTech

They don't break it down, but why does the scaling with async off look similar to the scaling with async on as resolution increases? What does that say, that async is what caused the difference, relative to each figure? On the contrary, it doesn't, and it only muddles the situation, since increasing the resolution has more effect on the ratio of PS than CS when using a lower-resolution RT for lighting.

Here's a benchmark showing the effect as well. 6-20% improvement between 1440p and 2160p.
Ashes of the Singularity Revisited: A Beta Look at DirectX 12 & Asynchronous Shading

Same as above.
Another example, but some math is required. They're showing 30-50% gains between 1440p and 2160p, and their CPU is really holding them back at lower resolutions, with 1080p and 1440p giving nearly identical FPS.
Ashes of the Singularity DirectX-12-Leistungsexplosion (Seite 3)
  • These numbers are for a Fury X. Significant improvements as resolution increases. So turning on async, you complete the same amount of work in less time with better scheduling and concurrency. Free performance that increases with resolution.
  • A 390 gets roughly the same benefit (~10%) regardless of resolution. If anything it's actually losing its async gains as resolution increases. This is the opposite of what you'd expect from increasing the graphics workload, unless it ran out of compute to perform, which wasn't the case for a Fury, which is a compute powerhouse.
  • A 980 Ti, despite going backwards in all tests, also sees an improvement (or less of a hit) as resolution increases. At 4K/Crazy it roughly breaks even with async enabled. So even for a 980 Ti using its software async (however they actually implemented that in AOTS), there is a benefit from the async/concurrent execution, although we're talking single-digit absolute percentages and it's practically unplayable. At 5K resolution it may actually benefit from async while being even more unplayable.
This is from a game whose devs said it doesn't use a lot of async, and we're seeing 10-15% performance improvements with better results as resolution increases.

As the bottleneck shifts, you can't determine that. Nor at 5K, where there is definitely more than one bottleneck competing; things get very crazy, and without a full breakdown sampling the different options it's very hard to determine what those bottlenecks are.

Again, when you take your point of view and try to prove that point of view, it will be true in your eyes. The way around that is to ask whether there are other possibilities, and to what degree, after you look at the math. The math pretty much says that as resolution increases there is a diminishing effect, since you don't use full-resolution calculations for lighting (the caveat being that it depends on the engine and shader, of course). If using an RT for lighting, you never use full-resolution RTs, as that has diminishing returns on visuals and wastes bandwidth.
 
this post WAS about Hitman...
Like I said, there aren't any async on/off benchmarks for Hitman I could use. I used those numbers just because they were available, to show what effect it would likely have in Hitman. How much of an effect depends on the amount of async used. I'd like to apply the same math to Hitman to see what it shows. Cranking the resolution and toggling async should give an idea of the total workload. That would be an interesting number to know.
 
You'd also have to limit the increase in PS use too, and you can't really do that in an in-game situation, can you?
 
That doesn't explain why nV cards lost performance at some resolutions and gained at others, though; if anything there should be more strain on the GTX 980 from a GPU standpoint vs. the GTX 980 Ti as resolution increases... So if it were only due to async and your explanation, then the GTX 980 should gain more.

No, but the fences involved in asynchronous compute + graphics do. You generally have one fence for the graphics context and a few fences (varying with how many concurrent compute jobs there are) for the compute context.

The GTX 980 Ti was likely CPU/API bottlenecked at 1080p and under DX11. DX12 alleviated the bottleneck and the GTX 980 Ti gained performance (more performance than the cost of the fences themselves).

As the resolution increases, the performance benefit the GTX 980 Ti gets from the alleviation of the API bottleneck going from DX11 to DX12 decreases as we move towards a GPU-bound scenario. The GPU stalls caused by the fences also start to take a toll, so performance regresses. Fences only cause a tiny stall (wait), but it's enough to inflict a 1-5% performance loss (as we see in AotS with async on/off).
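The fence pattern being described maps roughly onto the following sketch, with CUDA streams and events standing in for the DX12 queues and fences (illustrative only; the kernels are placeholders). Each cross-queue wait is exactly the kind of small stall mentioned above.

[CODE]
#include <cuda_runtime.h>

__global__ void graphicsLikePass(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}
__global__ void asyncComputePass(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));

    cudaStream_t gfxQueue, computeQueue;       // stand-ins for direct + compute queues
    cudaStreamCreate(&gfxQueue);
    cudaStreamCreate(&computeQueue);

    cudaEvent_t gfxDone, computeDone;          // stand-ins for the fences
    cudaEventCreateWithFlags(&gfxDone, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&computeDone, cudaEventDisableTiming);

    dim3 block(256), grid((n + 255) / 256);

    // Graphics queue produces data, then signals.
    graphicsLikePass<<<grid, block, 0, gfxQueue>>>(buf, n);
    cudaEventRecord(gfxDone, gfxQueue);

    // Compute queue waits on that "fence" before consuming, then signals back.
    cudaStreamWaitEvent(computeQueue, gfxDone, 0);
    asyncComputePass<<<grid, block, 0, computeQueue>>>(buf, n);
    cudaEventRecord(computeDone, computeQueue);

    // Graphics queue waits for the compute result before using it again.
    cudaStreamWaitEvent(gfxQueue, computeDone, 0);
    graphicsLikePass<<<grid, block, 0, gfxQueue>>>(buf, n);

    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
[/CODE]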

Dan Baker explained it all to me.

Looking at both AMD Gaming Evolved titles and NVIDIA GameWorks titles, it seems to me that AMD and NVIDIA benefit from different CPU multi-threaded rendering techniques, and that each IHV helps developers profile the game in a way which best benefits their own uArch. AotS was pretty neutral, as we see when we look at the DX11 numbers, but Hitman, Rise of the Tomb Raider, and Gears of War UE are quite biased.

Far Cry Primal is another biased title, having been developed for the Jaguar cores in the Xbox One. The Division also shows traces of AMD's Jaguar influence. It seems that, barring any sort of heavy engine rewrites, most games will likely go the AMD way going forward (console influence).
 
That should not be a long-lasting effect; you get that at the start with a new API, and as soon as 2nd- or 3rd-generation games roll out it should even out more (be that from the developer side or the hardware side).
 
As far as memory goes, you can't compare them straight out. Fury has slower memory in terms of timings and clock speed but far higher bandwidth by way of its bus. Memory copy, read, and write at small sizes are almost identical, or maybe a bit better on the 390X, showcasing its timings advantage. As the file/texture size increases you can easily start seeing the Fury take the advantage, and that is based solely on bus width: 4096-bit vs. 512-bit bus, 512 vs. ~336 GB/s bandwidth.
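For reference, those bandwidth figures fall straight out of bus width times per-pin data rate. A quick sketch of the arithmetic (Fury X's HBM1 runs 1 Gbps per pin over a 4096-bit bus; the GDDR5 rate shown is just the value implied by the ~336 GB/s figure quoted above, not a spec I'm asserting):

[CODE]
#include <cstdio>

// Peak bandwidth (GB/s) = (bus width in bits / 8) * per-pin data rate in Gbps.
double peakBandwidthGBs(int busWidthBits, double gbpsPerPin) {
    return (busWidthBits / 8.0) * gbpsPerPin;
}

int main() {
    // Fury X: 4096-bit HBM1 at 1 Gbps per pin -> 512 GB/s, matching the figure above.
    printf("Fury X : %.0f GB/s\n", peakBandwidthGBs(4096, 1.0));
    // 512-bit GDDR5 card: ~336 GB/s works out to roughly 5.25 Gbps per pin.
    printf("512-bit: %.0f GB/s at 5.25 Gbps\n", peakBandwidthGBs(512, 5.25));
    return 0;
}
[/CODE]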
 
No, but the fences involved in asynchronous compute + graphics do. You generally have one fence for the graphics context and a few fences (varying with how many concurrent compute jobs there are) for the compute context.

The GTX 980 Ti was likely CPU/API bottlenecked at 1080p and under DX11. DX12 alleviated the bottleneck and the GTX 980 Ti gained performance (more performance than the cost of the fences themselves).

As the resolution increases, the performance benefit the GTX 980 Ti gets from the alleviation of the API bottleneck going from DX11 to DX12 decreases as we move towards a GPU-bound scenario. The GPU stalls caused by the fences also start to take a toll, so performance regresses. Fences only cause a tiny stall (wait), but it's enough to inflict a 1-5% performance loss (as we see in AotS with async on/off).

Dan Baker explained it all to me.

Looking at both AMD Gaming Evolved titles and NVIDIA GameWorks titles, it seems to me that AMD and NVIDIA benefit from different CPU multi-threaded rendering techniques, and that each IHV helps developers profile the game in a way which best benefits their own uArch. AotS was pretty neutral, as we see when we look at the DX11 numbers, but Hitman, Rise of the Tomb Raider, and Gears of War UE are quite biased.

Far Cry Primal is another biased title, having been developed for the Jaguar cores in the Xbox One. The Division also shows traces of AMD's Jaguar influence. It seems that, barring any sort of heavy engine rewrites, most games will likely go the AMD way going forward (console influence).


That makes sense.
 
DirectX 12 Requires Different Optimization on Nvidia and AMD Cards, Lots of Details Shared | DualShockers

This is probably the most complete rundown of the differences between AMD and nV programming needs where async is concerned.

This is from GDC, from AMD and nV, so I'll leave this for the people stating that code can't be written per IHV... well, yeah, it can be done and has been done.
  • They’re not inherently faster on the GPU. The gain is all on the CPU side, so they need to be used wisely. Optimizing bundles diverges for Nvidia and AMD cards, and require a different approach. In particular, for AMD cards bundles should be used only if the game is struggling on the CPU side.
  • Compute queues still haven’t been completely researched on DirectX 12. For the moment, they can offer 10% gains if done correctly, but there might be more gains coming as more research is done on the topic.
  • Since those gains don’t automatically happen unless things are setup correctly, developers should always make sure whether they do or not, as poorly scheduled compute tasks can result in the opposite outcome.
  • The use of root signature tables is where optimization between AMD and Nvidia diverges the most, and developers will need brand-specific settings in order to get the best benefits on both vendors' cards.
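Side note on the first bullet (the gain being entirely CPU-side): the closest analogue to pre-recording and replaying command bundles outside D3D12 is probably CUDA graph capture, sketched below under the assumption of the CUDA 12-style three-argument cudaGraphInstantiate (the kernel is a placeholder):

[CODE]
#include <cuda_runtime.h>

__global__ void smallPass(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a fixed sequence of launches once (like building a bundle)...
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int pass = 0; pass < 8; ++pass)
        smallPass<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);

    // ...then replay it every frame with a single cheap call: the saving is CPU-side
    // submission overhead, not GPU execution time, same as the bundle point above.
    for (int frame = 0; frame < 100; ++frame)
        cudaGraphLaunch(graphExec, stream);

    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    return 0;
}
[/CODE]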
 
Completely off-topic, but it's kind of a bummer that Crytek has had (created) such a terrible series of financial and structural setbacks. Unless I'm missing something, they have been incredibly quiet regarding both DX12 and Vulkan in CryEngine. Say what you will about their games, but they have always created beautiful, envelope-pushing engines. It would be nice to see what they could have come up with; it could maybe even have been something worthy of a "Crysis 2.0" label for this new generation of hardware and VR. I don't think we'll be seeing any AAA titles from them for a while; they seem to be headed strictly into the freemium / free-market space.

Even if they just got the engine back up to snuff, we could end up with something awe-inspiring from a licensee. Star Citizen is going down a DX12/Vulkan path, and that is CryEngine, so maybe that'll get back-ported into mainline CE.

Looking forward to playing Hitman a little bit later tonight.

I take that Crytek part back, good to see.

CryEngine V releases today on a pay-what-you-want basis
 
Very interesting :)

Ty for sharing.

NP. Now, what I was getting at with Hitman (and of course we can look at it in reverse for ROTR or other games too) was this:

When deciding whether to use a pixel shader or a compute shader, there are "extreme" differences in pros and cons on Nvidia and AMD cards (as shown by the table in the gallery).

[Slide image: Direct-X12-Panel-Slides-59.jpg – pixel shader vs. compute shader pros/cons table]


There is a big difference between the AMD and nV architectures, and there was no way someone could state that compute by itself is the issue and the bottleneck due to async in this game...

See that last sentence in that slide, Anarchist? NOTHING IS EVER FREE. Highlighted in yellow for you!

[Slide image: Direct-X12-Panel-Slides-10.jpg]



This one is also for the people who were saying low utilization will be reduced by async after the fact. Yes, it will, but you still need to optimize before even thinking about using async. Utilization is very important even prior to async, as async won't help much when ALUs are waiting for work while the ALUs in their block are working on a thread.

[Slide image: Direct-X12-Panel-Slides-23.jpg]
 
DirectX 12 Requires Different Optimization on Nvidia and AMD Cards, Lots of Details Shared | DualShockers
Found this GDC presentation that does a good job laying out what I was explaining earlier.

  • Maintain a non-async compute path (for Nvidia and graphics-like tasks). Nvidia would benefit from async/concurrent execution, they just can't schedule it efficiently.
  • Pair memory/texture-bound graphics with ALU-heavy compute tasks for async (slides 23, 35, 27)
  • Nvidia: Avoid compute unless you must. AMD: Always use compute unless you need a rasterizer. (Slide 59)
  • They're still researching async queues and how to utilize them. 10% gains and possibly more coming.
Basically texturing is like video decoding. Doubling resolution won't double bitrate/bandwidth (increased cache efficiency from denser/similar pixels). So as resolution increases, so does the ability to schedule async tasks concurrently because the quantity of texture lookups remains proportional.

Hitman 2016: PC graphics performance benchmark review
Apparently they tried async in Hitman, but it appears to be broken. In addition to the MS Store 60fps thing.
 
Razor, I am a little confused. Who stated that you didn't need to code specifically for individual vendors in DX12? Maybe I missed it.

And I think Anarchist4000 has you blocked, lol.
 
Razor, I am a little confused. Who stated that you didn't need to code specifically for individual vendors in DX12? Maybe I missed it.

And I think Anarchist4000 has you blocked, lol.

Because razor is annoying; he bleeds green no matter what. He tried to argue with mohigan once. LOL! And he got served pretty [H]ard.
 
Razor, I am a little confused. Who stated that you didn't need to code specifically for individual vendors in DX12? Maybe I missed it.

And I think Anarchist4000 has you blocked, lol.


That wasn't Anarchist; that was wantapple, dethroned, and another.

I don't care if he blocked me; it doesn't matter. A guy who says I don't know my basics totally misconstrued a benchmark to show what he thought, and now he's backtracking; now he's saying Hitman is broken, lol. WTF, how many times is he going to do this?

Again, he read the slides wrong. Where in any of those slides does it say it's specific to nV not to use async?

Slide 59 doesn't say that; they are saying to use it for specific needs and to use the PS for other needs. Same thing for AMD; those needs are just different per architecture. I can't understand WTF he is reading.

Consider that the performance benefit is for both IHVs, not just nV.

I don't understand why people can't read his posts, then read what he linked to, and then judge what he is saying. It's so much BS he spouts, and you guys just feed into it like cattle going down a feed chain.
 