[RUMOR] Pascal in trouble with Asynchronous Compute

FrgMstr · Mar 29, 2016

zone74 said:
Whoever did that test is looking at entirely the wrong thing.
Minimum framerates have gone up significantly in DX12.
Benchmarks run on a 6600K with a 390X:

DX11. Overall Score: 55.82 FPS

DX12. Overall Score: 48.44 FPS

So is DX11 better than DX12 based on that benchmark?
Look at the minimum framerates. In DX11 it drops to 9.32 FPS, while in DX12 the lowest is 30.76 FPS!
That is far more important for a fluid gameplay experience than an average drop of ~7 FPS.
It could still be better optimized, because there shouldn't really have been a performance drop at all, but it's still a huge improvement.

Async Compute is a big deal looking forward, and NVIDIA is going to have problems if Pascal does not support it.
That said, I won't be switching to AMD until they sort out their DX11 performance.
Without driver command list support, their DX11 driver is single-threaded and performance is considerably worse than NVIDIA as a result.

So for me it really depends who gets there first: AMD with properly multi-threaded and well-optimized DX11 drivers, or NVIDIA with Async Compute.
Unfortunately I suspect that means I'll be holding onto my 960 for another generation instead of upgrading this year.

That test is complete shit and comparing a single result will tell you NOTHING. I have seen minimum frame rate values change over 30% from test to test.
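To make the average-versus-minimum argument concrete, here is a minimal sketch of how both numbers fall out of per-frame times, and how to compare their run-to-run spread. The frame times below are made up for illustration and are not taken from the benchmark being discussed; `fps_summary` and `relative_spread` are hypothetical helper names.

```python
from statistics import mean


def fps_summary(frame_times_ms):
    """Return (average FPS, minimum FPS) for one benchmark run.

    Average FPS is total frames over total time; minimum FPS is
    derived from the single slowest frame in the run.
    """
    total_seconds = sum(frame_times_ms) / 1000.0
    average_fps = len(frame_times_ms) / total_seconds
    minimum_fps = 1000.0 / max(frame_times_ms)
    return average_fps, minimum_fps


def relative_spread(values):
    """(max - min) / mean: a crude measure of run-to-run variation."""
    return (max(values) - min(values)) / mean(values)


# Hypothetical frame times (ms) for three runs of the same scene.
# Only the one slow frame differs between runs.
runs = [
    [16.0, 17.0, 16.0, 40.0, 16.0],
    [16.0, 17.0, 16.0, 55.0, 16.0],
    [16.0, 17.0, 16.0, 33.0, 16.0],
]

summaries = [fps_summary(run) for run in runs]
avg_spread = relative_spread([avg for avg, _ in summaries])
min_spread = relative_spread([low for _, low in summaries])
```

Because the minimum depends on one frame, it moves more between runs than the average does in this toy data, which is why a single minimum-FPS result is weak evidence either way.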

razor1 · Mar 29, 2016

Ocellaris said:
Must have meant QRQ (Quick Response Queue)

Ah thx, yeah that makes more sense, and back to what he stated: no, QRQs won't improve baseline frame rates, not to the degree that we have seen low-level APIs help. QRQs seem to help relieve another problem with VR, which isn't associated with low frame rates.... will need more info on this though. It's for dropped frames.

Yakk · Mar 29, 2016

razor1 said:
And was that due to compute functionality, or because of the ability to use more cores? I would think it is the latter; compute shader usage doesn't have much to do with frame rate spikes unless written badly, or not properly used.

I'm not sure what RQR is... Typo?

And there would be a theoretical maximum that you can get out of efficiency not to mention, a hard line maximum the scheduler can offer... So no it just doesn't keep going more.

The use of more cores was implied when using the Async scheduler; otherwise its usefulness becomes questionable with just 2 CPU cores, like DX11 for example. With more and more threads, pre-scheduling work for the GPU would become a nightmare, hence the use of the Async scheduler at the GPU level. Even with well-written code, with 4 to 8+ threads accessing the GPU almost simultaneously it would become easy to bottleneck the single GPU queue out of order, and a headache to debug with each thread potentially affecting 3+ other threads. Having multiple threads each accessing multiple ACE units even more so.

There is a maximum of efficiency to be gained of course, i.e. when the card is 100% used, and as the code matures we should be getting closer and closer to extracting the maximum efficiency of GPUs by using them with multi-threading for parallel tasks.

I wouldn't want to be a person tasked with doing this without the help of a gpu Async scheduler.

razor1 · Mar 29, 2016

true!

Zion Halcyon · Mar 29, 2016

evilsofa said:
In football terms, the Hail Mary is still in the air, the score is 40-12 not 10-3, and there's a large but unknown amount of time left on the clock.

In other news, DX12 performs slower than DX11 in Rise Of The Tomb Raider on both platforms because optimization is hard and DX12 is new.

That's just you being silly. On the former, not the latter.

I made that original analogy, not Moorish, and the reason for it is because with newer games being developed with DX12 in mind, and async providing some benefits, you will have a situation now where some games will run better on AMD and others on NVidia. Hence, a metaphorical tie. What you are trying to do is award "style" points - just because NVidia has been crushing it in DX11 doesn't mean that if AMD performs better in DX 12 once games fully developed in DX 12 come along, NVidia gets an award and a medal for "doing so great for so long".

The point of a metaphor, is to HAVE a point. It's not for fanboys of either side to manipulate in a "nuh-uhhh! - uh-huuuuh!" argument, which is essentially what your first sentence just did.

Have a point. Preferably an intelligent one. Otherwise, you sound smart to no one but yourself.

Zion Halcyon · Mar 29, 2016

Anyway, I still stand by what I said - we're really going to know the real deal once we get some "pure" DX 12 games later in the year.

Right now, DX 12 implementation, in Hitman, TombRaider, etc are "tack ons". It's like back in the day when DDO, a DX 9 game, said they at one point now supported DX11, when really, the only job DX 11 was given was to render better looking water - not the best use for it if you are claiming support.

The fact that there even are gains is interesting, although everything also depends on how the development team is implementing it - I'd imagine in Tomb Raider's case, it's a lot like DDO.

As it is, I don't think it is realistic to expect Pascal to have async. Volta might - it depends on where it is in the design process. Which means if async does end up killing it in pure DX 12 games, you are going to see some shit fly as MS and AMD lure developers to design games for DX 12 with open resources, while NVidia tries to strongarm devs into sticking with DX11 and their proprietary software where their cards are superior (not unlike the early days of the intel-AMD war in the 90s).

I can see that swing coming. Although I will say, if you are both an investor and a believer in NVidia's R&D, then when that downturn happens, if it happens, that would be a heck of a time to invest in NVidia with the expectation of a post Volta bounce back.

Yakk · Mar 29, 2016

Zion Halcyon said:
It's like back in the day when DDO, a DX 9 game, said they at one point now supported DX11, when really, the only job DX 11 was given was to render better looking water - not the best use for it if you are claiming support.

Nice to see someone remember DDO!

Actually, at the time they implemented DX11, AMD worked with Turbine to make it not only possible, but that water used *DirectCompute* to make interactive waves when many people ran through the water, and DDO was also the FIRST game (I believe) to use Ambient Occlusion Shadows!

Not only that, it was also one of, if not the first game, to use system RAM as a texture cache to supplement GPU VRAM which at 1GB at the time was woefully inadequate for the massive amounts of textures DDO uses.

AMD & Turbine were way ahead of their time, especially with the GPUs used in the day.

And then... Turbine was bought by WB and it all went to hell with practically no graphics improvements since.

Zion Halcyon · Mar 29, 2016

Yakk said:
Nice to see someone remember DDO!

Actually, at the time they implemented DX11, AMD worked with Turbine to make it not only possible, but that water used *DirectCompute* to make interactive waves when many people ran through the water, and DDO was also the FIRST game (I believe) to use Ambient Occlusion Shadows!

Not only that, it was also one of, if not the first game, to use system RAM as a texture cache to supplement GPU VRAM which at 1GB at the time was woefully inadequate for the massive amounts of textures DDO uses.

AMD & Turbine were way ahead of their time, especially with the GPUs used in the day.

And then... Turbine was bought by WB and it all went to hell with practically no graphics improvements since.

Not arguing that it wasn't some damn impressive water. Just that at the end of the day, it still really only gave the game prettier water...

Relayer · Mar 30, 2016

trandoanhung1991 said:
I thought DX12 was gonna be a silver bullet for AMD?

People already are recommending AMD cards over their counterparts based on benchmarks of games not even on sale yet.

Tomb Raider was gonna be one of those pro-AMD games, touted by AMD fans because of DX12. Turns out it's a dud.

So, do you think this game should be taken as representative of what we should expect from DX12? And please consider which IHV is sponsoring this game. When was the last time a Gameworks title was pro AMD?

ChosenUno · Mar 30, 2016

Relayer said:
So, do you think this game should be taken as representative of what we should expect from DX12? And please consider which IHV is sponsoring this game. When was the last time a Gameworks title was pro AMD?

OC3D :: Article :: "The vast majority of DX12 titles in 2015/2016 are partnering with AMD"

https://www.reddit.com/r/pcgaming/c...t_majority_of_dx12_titles_in_20152016/cutgmrb

So Tomb Raider was an AMD title right until it's released? Wasn't it developed with Mantle in mind? Suddenly now it's an Nvidia title?

And yes, I expect the early crops of games to be broken under DX12 for at least a few months, as happened with previous DX versions. Right now the score is 2-0 broken-working AFAIK, since both Gears and Tomb Raider's DX12 implementation is borked.

Zion Halcyon · Mar 30, 2016

trandoanhung1991 said:
OC3D :: Article :: "The vast majority of DX12 titles in 2015/2016 are partnering with AMD"

https://www.reddit.com/r/pcgaming/c...t_majority_of_dx12_titles_in_20152016/cutgmrb

So Tomb Raider was an AMD title right until it's released? Wasn't it developed with Mantle in mind? Suddenly now it's an Nvidia title?

And yes, I expect the early crops of games to be broken under DX12 for at least a few months, as happened with previous DX versions. Right now the score is 2-0 broken-working AFAIK, since both Gears and Tomb Raider's DX12 implementation is borked.

For the 3rd time - right now, the implementations are not well done. These are games developed with DX11 and with some 12 features tacked on.

Everything we know about DX 12 is that it is a low-level API that is close to the hardware, and therefore to get the maximum benefit from it, you need to be developing with it from the word go.

These latest DX12 titles are in name only, to cash in on hype, as the real DX 12 titles (DX12 only) are incoming for the latter half of this year.

I get some of you fanboys love to crow about premature victories at the first sniffle, but until we start getting in THOSE games, we really won't know one way or the other.

I might be wrong, but I believe Deus Ex: Mankind Divided will be the first game we see that has full DX 12 implementation from the beginning to see full release.

Yakk · Mar 30, 2016

AoTS is formally releasing tomorrow.

Zion Halcyon · Mar 30, 2016

Yakk said:
AoTS is formally releasing tomorrow.

That one should be interesting, but I still don't know if it's a full DX12 implementation. I do know they dove fully into the async shaders, so we shall see.

ChosenUno · Mar 30, 2016

Zion Halcyon said:
For the 3rd time - right now, the implementations are not well done. These are games developed with DX11 and with some 12 features tacked on.

Everything we know about DX 12 is that it is a low level that is close to the hardware, and therefore to get the maximum benefit from it, you need to be developing with it from the word go.

These latest DX12 titles are in name only, to cash in on hype, as the real DX 12 titles (DX12 only) are incoming for the latter half of this year.

I get some of you fanboys love to crow about premature victories at the first sniffle, but until we start getting in THOSE games, we really won't know one way or the other.

I might be wrong, but I believe Deus Ex: Mankind Divided will be the first game we see that has full DX 12 implementation from the beginning to see full release.

And yet people are saying that AMD's hardware will have a leg up in the future "because DX12". Nobody but the AMD fanboys have been beating that drum, ever since AotS started having benchmarks.

Also, what you're saying makes no sense. Should we label all games 'fake' DX12 because it's not DX12 exclusive? If so, how should we judge the implementation of DX12 in that particular game, since there's nothing else to compare to?

If a game is DX12 exclusive, came out and ran at 60FPS, is that a good implementation or a bad implementation?

razor1 · Mar 30, 2016

Zion Halcyon said:
That's just you being silly. On the former, not the latter.

I made that original analogy, not Moorish, and the reason for it is because with newer games being developed with DX12 in mind, and async providing some benefits, you will have a situation now where some games will run better on AMD and others on NVidia. Hence, a metaphorical tie. What you are trying to do is award "style" points - just because NVidia has been crushing it in DX11 doesn't mean that if AMD performs better in DX 12 once games fully developed in DX 12 come along, NVidia gets an award and a medal for "doing so great for so long".

The point of a metaphor, is to HAVE a point. It's not for fanboys of either side to manipulate in a "nuh-uhhh! - uh-huuuuh!" argument, which is essentially what your first sentence just did.

Have a point. Preferably an intelligent one. Otherwise, you sound smart to no one but yourself.

Well the main thing is, when doing async, you can have peak occupancy but that doesn't mean you have peak utilization. From a wavefront or wave point of view, and threads, the pipeline might be saturated, but that doesn't mean all ALUs are actively working, nor can they be, nor can async help in a situation like that, as the blocks are locked into the threads they are working on. So it really depends on what is going on....

This is what makes async hard (a single path for all cards is not possible without performance issues). We know each IHV has different affinities towards these, then on top of that you have scheduling, which is different based on each generation of an IHV's cards........

So porting over console games, we might see them working on some cards, at least where async is concerned, but there has to be much more work done on all IHVs' cards.

Polaris will stray away from what the Xbox and PS4 GPUs look like even more than Fiji, so we might see performance degradation on Polaris too, along with Pascal, with the current crop of "DX12" games.
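The occupancy-versus-utilization point can be illustrated with a toy model. This is only a sketch under loud assumptions: one ALU per compute unit, a fixed number of wave slots, each wave alternating between a burst of math cycles and a memory wait, and a simple lowest-index-first pick. None of it is meant to describe how GCN or any NVIDIA part actually schedules work.

```python
def occupancy(resident_waves, wave_slots):
    """Fraction of wave slots that hold a resident wave."""
    return resident_waves / wave_slots


def simulate_alu_utilization(waves, total_cycles):
    """Return the fraction of cycles in which the ALU issued work.

    waves: list of (alu_cycles, memory_cycles) per resident wave.
    Each wave runs `alu_cycles` of math, then waits `memory_cycles`.
    Every cycle the first wave that is not waiting gets the ALU.
    """
    remaining = [alu for alu, _ in waves]  # math cycles left in current burst
    ready_at = [0] * len(waves)            # cycle at which the wave can issue
    busy = 0
    for cycle in range(total_cycles):
        for i, (alu, mem) in enumerate(waves):
            if ready_at[i] <= cycle:
                remaining[i] -= 1
                busy += 1
                if remaining[i] == 0:
                    # Burst finished: start waiting on memory.
                    remaining[i] = alu
                    ready_at[i] = cycle + 1 + mem
                break
    return busy / total_cycles


WAVE_SLOTS = 4

# All four slots hold fetch-heavy waves: 1 math cycle, then 20 cycles waiting.
fetch_heavy = [(1, 20)] * 4
# Same slot count, but two of the waves are math-heavy.
mixed = [(1, 20), (1, 20), (10, 2), (10, 2)]

full_occupancy = occupancy(len(fetch_heavy), WAVE_SLOTS)
fetch_heavy_util = simulate_alu_utilization(fetch_heavy, 2100)
mixed_util = simulate_alu_utilization(mixed, 2100)
```

In the first case every slot is taken, so there is nowhere to put extra compute waves even though the ALU sits idle most of the time. The second case only helps because complementary work already occupies some of the slots.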

Anarchist4000 · Mar 30, 2016

trandoanhung1991 said:
And yet people are saying that AMD's hardware will have a leg up in the future "because DX12". Nobody but the AMD fanboys have been beating that drum, ever since AotS started having benchmarks.

Also, what you're saying makes no sense. Should we label all games 'fake' DX12 because it's not DX12 exclusive? If so, how should we judge the implementation of DX12 in that particular game, since there's nothing else to compare to?

If a game is DX12 exclusive, came out and ran at 60FPS, is that a good implementation or a bad implementation?

Good implementation should be the one that can utilize all CPU cores.

razor1 said:
Well the main thing is, when doing async, you can have peak occupancy but that doesn't mean you have peak utilization. From a wavefront or wave point of view and threads, the pipeline might be saturated, but that doesn't mean all ALU are actively working nor can they be nor can async help in a situation like that as the blocks are locked into which threads they are working on. So it really depends on what is going on....

This is what makes async hard (single path for all cards not possible without performance issues), we know each IHV have different affinities towards these then on top of that you have scheduling which is different based on each generation of IHV's cards........

That's one area where the queue priorities would probably help. There is likely some control, albeit in drivers, over work distribution. Limit graphics to 70% occupancy, reserving 30% for compute, for example. This will probably get exposed in Vulkan; DX12 is another matter (one compute queue). Those features should make async somewhat self-tuning based on hardware. I know ACEs are programmable (drivers). It would make sense that the work distributor could be configured as well. Score shaders by tex/memory:math ratio and attempt to balance all the compute units.

Zion Halcyon · Mar 30, 2016

trandoanhung1991 said:
And yet people are saying that AMD's hardware will have a leg up in the future "because DX12". Nobody but the AMD fanboys have been beating that drum, ever since AotS started having benchmarks.

Also, what you're saying makes no sense. Should we label all games 'fake' DX12 because it's not DX12 exclusive? If so, how should we judge the implementation of DX12 in that particular game, since there's nothing else to compare to?

If a game is DX12 exclusive, came out and ran at 60FPS, is that a good implementation or a bad implementation?

The minute you said "no one but AMD Fanboys" tells me all I need to know - you are too emotionally caught up in brand wars, and are being a bit silly. This is about winning a pissing contest for you; not about what's been reported thus far.

You oversimplified what was being said by saying "And yet people are saying that AMD's hardware will have a leg up in the future "because DX12"." You essentially broadly dismissed what's been written so far by overgeneralizing everyone as "AMD Fanboys."

It begs the question - why are you so defensive? Do you own stock in NVidia? Do you have a personal stake? Why so srs, brah?

I learned a long time ago after being on the wrong end of a particular brand war that brand loyalty is nothing but incredibly stupid. I've owned both AMD and Nvidia cards in the past. I've owned both AMD and Intel over the years. All that matters to me, and frankly, all that should matter to any sane individual is this:

"Were you happy with your purchase?"

If you were, great! No one can take that away from you unless you let them.

However, if you weren't, is there a chance that maybe, you bought something out of brand loyalty? And now regret it, and are trying to convince yourself you love it even though you don't really? Maybe to the point ego is involved? There's a special name for those types of people. I call them "Mac users."

All kidding aside, if you bought NVidia, and you love it, good on you. But stop acting like picking one brand over another makes you an expert on that brand, or means that you are now so married to that brand that you have to defend it for all you're worth like it's some misbehaving child who slugged another kid in a store or something.

There are sound discussions out there about AMD potentially seeing gains as DX 12 is adopted more and more, because of the Vulkan integration and the existing hardware in AMD cards that will start to be taken advantage of. This isn't the talk of AMD fanboys - this is LEGIT. The question we all have is whether we will see an increase in performance as AMD is stating. We all are in a wait and see pattern, and it makes sense that the early DX 12 adopters are doing so for marketing purposes, as it was far too late in those games' development lifecycle to fully integrate DX 12 properly. So yes, we are waiting to see, and not predicting anything - merely echoing reports already out there.

So really, why act so mad? Either there will be a benefit and AMD cards will rule the DX 12 roost for a while, or there won't and NVidia will still be in charge. But really, if the former happens, so what? You are not your video card. You are not a lesser human being because AMD got better (if it does).

If AMD suddenly is king when the latter DX 12 games come out, so what? As long as you are happy with whatever it is you bought, that's all that matters. Of course, if you aren't, you could always finally cast ego and pride to the side and just buy AMD - at least until Nvidia comes out on top again

razor1 · Mar 30, 2016

Anarchist4000 said:
Good implementation should be the one that can utilize all CPU cores.

That's one area where the queue priorities would probably help. There is likely some control, albeit in drivers, over work distribution. Limit graphics to 70% occupancy, reserving 30% for compute for example. This will probably get exposed in Vulkan, DX12 is another matter(one compute queue). Those features should make async somewhat self tuning based on hardware. I know ACEs are programmable(drivers). It would make sense the work distributor could be configured as well. Score shaders by tex/memory:math ratio and attempt to balance all the compute units.

Can't be done that way, at least not easily on the driver side. If you don't keep enough threads in flight, the pipeline and ALUs might stall too, so for that 70% graphics workload, the driver has to analyze it, know what the scheduler is doing, and compare them based on the workload it has. Drivers aren't that smart, nor will an IHV want to do this, as it changes based on the application. It's much more work than it will solve; it has to be done by the developer.

Anarchist4000 · Mar 30, 2016

razor1 said:
Can't be done that way at least not easily on the driver side, if you don't keep enough threads in flight, the pipeline and ALU's might stalls too, so that 70% graphics work load, the driver has to analyze it and know what the scheduler is doing and compare them based on what it has for work load. Drivers aren't that smart nor will an IHV want to do this as it changes based on the application. Its much more work than it will solve, it has to be done by the developer.

The driver wouldn't be involved in the decisions, beyond configuration settings, for AMD. It wouldn't be much different than checking for available resources prior to dispatching a wave or choosing a prioritized ACE over another. All of which happens in hardware largely without driver interference. For Nvidia, yeah it'd be the driver and a rather interesting implementation.

That 70% is just an arbitrary number and it wouldn't have to be a hard limit. If no compute was available more graphics could be scheduled. It'd be fairly simple logic to include with the scheduler and wouldn't need to be perfect. It would also make a developers life far more easy if it tuned itself with hardware.
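As a rough illustration of the soft limit described here, the sketch below reserves a share of wave slots for compute but hands unused slots back to graphics. The function name, its parameters and the 70% default are all hypothetical; nothing here reflects a real driver or hardware interface.

```python
def fill_slots(graphics_pending, compute_pending, total_slots, graphics_percent=70):
    """Return (graphics_waves, compute_waves) to schedule this round.

    Graphics is capped at `graphics_percent` of the slots only while
    compute work is available to use the remainder (a soft limit).
    """
    graphics_cap = total_slots * graphics_percent // 100
    graphics = min(graphics_pending, graphics_cap)
    compute = min(compute_pending, total_slots - graphics)
    # Soft limit: slots that compute did not claim go back to graphics.
    leftover = total_slots - graphics - compute
    graphics += min(graphics_pending - graphics, leftover)
    return graphics, compute
```

With 10 slots and plenty of both kinds of work this gives a 7/3 split; with no compute pending, graphics takes all 10. The hard part raised in the replies, knowing which workloads actually complement each other, is not addressed by this kind of counter.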

razor1 · Mar 30, 2016

AMD hardware doesn't analyze anything right now, and the hardware isn't "smart" either; this is why Fiji sometimes has issues with workloads that work better with Hawaii, when we look at async. You are describing a way through drivers to create something that AMD and nV have been trying to get away from as much as possible with scalar architectures. It is the opposite of what the architectural designs for GCN, Tesla and so on were intended to do.

So with that said why would they want to go backwards, by making drivers more complex?

The % of the workload doesn't matter in the sense of what the % is, but just trying to figure out which workload is what and splitting it up, does that seem like it's going to be easy for the driver to do?

It is very far fetched to even think this kind of system would be viable let alone even doable.

Anarchist4000 · Mar 30, 2016

razor1 said:
AMD hardware doesn't analyze anything right now,, the hardware isn't "smart" either this is why Fiji sometimes has issues with workloads that work better with Hawaii, when we look at async. You are describing a way through drivers to create something that AMD and nV has been trying to get away from as much as possible with scalar architectures. It is the opposite of what the architectural design's for GCN, and tesla and so on, intents were.

So with that said why would they want to go backwards, by making drivers more complex?

The % of the workload doesn't matter in the sense of what the % is, but just by trying to figure out which work load is what and splitting it up, does that seem like its going to be easy for the driver to do?

Again, this wouldn't happen through drivers. The drivers would simply provide hints. The logic of it seems relatively simple and not overly difficult to implement. I've programmed similar circuits myself. It wouldn't be a simple split, you'd just score waves by fetch:alu and try to keep the number low on each compute unit. Odds are compute is going to be low which is the whole complementary thing async is going for in the first place. Pair math heavy with texture/fetch heavy to better fill in gaps. It's not too different from network routing to be quite honest. Drivers likely wouldn't work because of the added latency, but hardware is a whole other ballgame as it could analyze values for every compute unit simultaneously and pick a task all in a single cycle.
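A minimal software sketch of the scoring idea, assuming each wave comes with an estimated count of fetch and ALU operations (an assumption; the post does not say where such estimates would come from). `Wave`, `ComputeUnit` and `dispatch` are hypothetical names, and a greedy loop in Python obviously says nothing about what this would cost in transistors.

```python
from dataclasses import dataclass, field


@dataclass
class Wave:
    name: str
    fetch_ops: int  # estimated texture/memory operations (assumed input)
    alu_ops: int    # estimated math operations (assumed input)


@dataclass
class ComputeUnit:
    waves: list = field(default_factory=list)
    fetch_ops: int = 0
    alu_ops: int = 0

    def ratio_with(self, wave):
        """fetch:alu ratio this unit would have after accepting `wave`."""
        return (self.fetch_ops + wave.fetch_ops) / max(1, self.alu_ops + wave.alu_ops)

    def accept(self, wave):
        self.waves.append(wave)
        self.fetch_ops += wave.fetch_ops
        self.alu_ops += wave.alu_ops


def dispatch(waves, unit_count, slots_per_unit):
    """Greedily place each wave on the unit that keeps fetch:alu lowest."""
    units = [ComputeUnit() for _ in range(unit_count)]
    for wave in waves:
        candidates = [u for u in units if len(u.waves) < slots_per_unit]
        if not candidates:
            raise RuntimeError("no free wave slots")
        # Ties on ratio go to the unit holding fewer waves.
        best = min(candidates, key=lambda u: (u.ratio_with(wave), len(u.waves)))
        best.accept(wave)
    return units


pending = [
    Wave("tex_a", 8, 2),
    Wave("tex_b", 8, 2),
    Wave("math_a", 1, 9),
    Wave("math_b", 1, 9),
]
units = dispatch(pending, unit_count=2, slots_per_unit=2)
placement = [sorted(w.name for w in u.waves) for u in units]
```

On this toy input each unit ends up with one fetch-heavy and one math-heavy wave, which is the pairing the post is after.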

The fiji issues are more likely the limited ROPs and color compression. Same thing that trips up tonga. Can't say I've seen any issues specifically with async on them.

Derfnofred · Mar 30, 2016

Zion Halcyon said:
That one should be interesting, but I still don't know if it's a full DX12 implementation. I do know they dove fully into the async shaders, so we shall see.

You keep using this term.

1.) Sounds a whole lot like "no true Scotsman" fallacy.
2.) When does an engine that's been evolving in its codebase from DX9 or earlier (depending on the vendor), become a full DX12 implementation? Since you know a developer isn't going to start from ground zero.

razor1 · Mar 30, 2016

Anarchist4000 said:
Again, this wouldn't happen through drivers. The drivers would simply provide hints. The logic of it seems relatively simple and not overly difficult to implement. I've programmed similar circuits myself. It wouldn't be a simple split, you'd just score waves by fetch:alu and try to keep the number low on each compute unit. Odds are compute is going to be low which is the whole complementary thing async is going for in the first place. Pair math heavy with texture/fetch heavy to better fill in gaps. It's not too different from network routing to be quite honest. Drivers likely wouldn't work because of the added latency, but hardware is a whole other ballgame as it could analyze values for every compute unit simultaneously and pick a task all in a single cycle.

The fiji issues are more likely the limited ROPs and color compression. Same thing that trips up tonga. Can't say I've seen any issues specifically with async on them.

So with async on and off, and we see no gain or a performance drop on Fiji vs Hawaii, it's ROP limited? And this happens at different resolutions like in Hitman? Nope, doesn't fit that theory. If something was ROP limited the frame rates should end up being close for Fiji and Hawaii at all times once that is hit. You wouldn't have performance swings and definitely not drops. Remember AMD stated Async done right.

Yeah, it would have to be hardware that takes care of that, not drivers, but what transistor budget will they allocate for that? I don't think it will be cheap. The methodology sounds like it's easy to do, but with the varying workloads and work types, by no means is it easy to do. Networking is nothing like this: work types are not varied this much, nor is there any type of analysis going on after the fact or prior to the fact.

Zion Halcyon · Mar 30, 2016

Derfnofred said:
You keep using this term.

1.) Sounds a whole lot like "no true Scotsman" fallacy.
2.) When does an engine that's been evolving in its codebase from DX9 or earlier (depending on the vendor), become a full DX12 implementation? Since you know a developer isn't going to start from ground zero.

Sounds like you haven't really been keeping up with DX12.

Part of the reason DX 12 has been much ballyhoo'ed, unlike previous DX versions, is that it ISN'T just what they did previously since DX started, which was DX + more junk. DX 12 involved a partial rewrite and optimization that a lot of people critical of the previous DX versions had been clamoring for. You listen to the developers trying to learn DX12 and it's got its own challenges unique to itself, and the developers are having to learn how to program all over again for a card, with the tradeoff being less overhead and more direct control over a graphics card without middleware or bloatware getting in the way and causing weird issues.

So I understand your question in #2 (since #1 was just a pissing contest statement), but frankly, it's not the type of question you ask when you are trying to prove you know something, or else it shows you aren't really up on what's going on. That question is better suited for someone trying to learn why the interest in DX12 given the history of DX, in the interest of actually gaining knowledge, which, based on your first statement, I would say isn't what you were doing.

TLDR version: You make 2 bad assumptions due to a lack of education on DX12.

Let me help educate you:

The primary feature highlight for the new release of DirectX was the introduction of advanced low-level programming APIs for Direct3D 12 which can reduce driver overhead. Developers are now able to implement their own command lists and buffers to the GPU, allowing for more efficient resource utilisation through parallel computation. Lead developer Max McMullen stated that the main goal of Direct3D 12 is to achieve "console-level efficiency on phone, tablet and PC". The release of Direct3D 12 comes alongside other initiatives for low-overhead graphics APIs including AMD's Mantle for AMD graphics cards, Apple's Metal for iOS and OS X and Khronos Group's cross-platform Vulkan.

That's from the DX12 Wiki.
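For anyone wondering what "implement their own command lists" buys you, here is a toy model in plain Python. It makes no real Direct3D calls; `record_command_list` and `build_frame` are hypothetical stand-ins. The only point is the shape of the work: several threads record their own lists independently, then the lists are submitted to one queue in a fixed order.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(chunk):
    """Stand-in for recording one command list from a slice of the scene."""
    return [f"draw {item}" for item in chunk]


def build_frame(draw_items, worker_count):
    """Record command lists on several threads, then submit them in order."""
    # One contiguous chunk of draw items per worker (ceiling division).
    chunk_size = max(1, -(-len(draw_items) // worker_count))
    chunks = [draw_items[i:i + chunk_size]
              for i in range(0, len(draw_items), chunk_size)]
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() yields results in chunk order, whichever thread finishes first.
        command_lists = list(pool.map(record_command_list, chunks))
    # "Submission": concatenate the lists in a deterministic order.
    return [command for command_list in command_lists for command in command_list]


frame = build_frame(["a", "b", "c", "d", "e"], worker_count=2)
```

An engine built around a single immediate context has to be restructured to split its recording work like this, which is one reason a late DX12 port may show little gain.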

It should be noted that some of Vulkan was also baked into DX12. We're looking at a very different animal in a lot of respects than previous DX versions, and developers are still learning how best to use those low level APIs to maximum effect in their games. Games like Tomb Raider and Hitman implemented at best an aspect of it, but they didn't program those games from scratch to take advantage of the low overhead DX12 offers, and based on comments from the developers who currently ARE doing this, this is something that needs to be done on an engine level at the beginning when the game is being constructed for maximum benefit. Which pretty much discredits your "no true Scotsman" snide comment.

We will know for certain about DX 12 when the developers who have spent the time and effort programming for the low level API and low overhead release their games whether we will see real world performances, which is what we are waiting on.

Derfnofred · Mar 30, 2016

Thanks for the patronizing and condescending response.

So, in short, you don't have a clear distinction about what makes up a "full DX12 implementation", but you'll make it up along the way because it's an arbitrary choice. Cool.

In case you missed my point, when a developer takes their DX11 engine (which was probably evolved from their DX10 engine, and on) and rewrites swaths of it to implement low level features, it's a DX12 engine. Simple. End of story. It may be a badly optimized one, but it's still a "full DX12 implementation". A non full DX12 engine masquerading as a DX12 implementation crashes.

Zion Halcyon · Mar 30, 2016

Derfnofred said:
Thanks for the patronizing and condescending response.

So, in short, you don't have a clear distinction about what makes up a "full DX12 implementation", but you'll make it up along the way because it's an arbitrary choice. Cool.

In case you missed my point, when a developer takes their DX11 engine (which was probably evolved from their DX10 engine, and on) and rewrites swaths of it to implement low level features, it's a DX12 engine. Simple. End of story. It may be a badly optimized one, but it's still a "full DX12 implementation". A non full DX12 engine masquerading as a DX12 implementation crashes.

I felt a condescending and patronizing response was warranted given the condescending and patronizing comments, and the lack of education you have on the subject. Although now I see it wasn't a lack of education, but one of willful ignorance, where you are trying to "lawyer" your way through commentary to win a pissing contest. And given that you need to go to such lengths to try to justify yourself, the only reasonable conclusion a smart person can come away with is that I was indeed crystal clear, but you are trying to muddy things up for the sake of a pissing contest.

Which, to be honest, you just unintentionally copped to when you decided to make this an argument over semantics instead of the technology, and confirmed, by acknowledging "bad optimization" in your recent commentary, that you knew exactly what I was referring to.

But hey, I'll humor you. Replace "full DX12 Implementation" with "fully optimized DX 12 Implementation" if it makes you feel better. The point still remains the same, and you are bickering over wordplay.

Zion Halcyon · Mar 30, 2016

Anyway, getting us back on track, Ashes of the Singularity releases tomorrow. Did some digging, and some are claiming its Nitrous engine is DX12-native. May give us our first real look into what I was talking about, in seeing how the video cards perform with a native DX12 engine.

Anarchist4000 · Mar 30, 2016

razor1 said:
So with async on and off we see no gain, or a performance drop, from Fiji vs Hawaii, and it's ROP limited? And this happens at different resolutions, like in Hitman? Nope, that doesn't fit the theory. If something was ROP limited, the frame rates should end up being close for Fiji and Hawaii at all times once that limit is hit. You wouldn't have performance swings and definitely not drops. Remember AMD stated Async done right.

Async seems to work just fine for other games out there. ROPs were one of the significant differences, but in this case it's probably the compression, which doesn't work with compute. Color compression as a stumbling block was pointed out in some of the GDC presentations. The original release kept crashing when they tried to use async, so it's likely disabled, and Fiji/Tonga (GCN 1.2) were the cards that had the ability. The only other real difference is the 4 vs 8 ACEs.

Yeah, it would have to be hardware that takes care of that, not drivers, but what transistor budget will they allocate for that? I don't think it will be cheap. The methodology sounds like it's easy to do, but with the varying workloads and work types, it is by no means easy. Networking is nothing like this: work types are not varied this much, nor is there any type of analysis going on after the fact or prior to the fact.

Honestly that wouldn't take a whole lot of transistors and the workload is always a bunch of waves so it doesn't really vary that much. The logic is simply "will it fit", "is fetch:alu > x", and "grab highest priority compute, else graphics". Fit and priority are already documented and the comparison could be as basic as comparing two 2-bit numbers. The priority values are 0-3 as I recall. The added transistor budget for that may not even break 100.
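To make that three-step logic concrete, here is a rough software sketch of it. This is only an illustration under my own assumptions, not how the GCN hardware actually does it: the `Wave` and `ComputeUnit` types, the register-count "fit" test, the `memory_bound` flag, and the 0.5 threshold are all made up for the example, and how the fetch:alu check feeds into the decision is a guess.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Wave:
    kind: str               # "compute" or "graphics"
    priority: int           # 0-3, i.e. a 2-bit value; higher wins
    vgprs: int              # registers needed (hypothetical resource unit)
    fetch_alu_ratio: float  # fetch ops per ALU op (assumed known per shader)


@dataclass
class ComputeUnit:
    free_vgprs: int


def fits(wave: Wave, cu: ComputeUnit) -> bool:
    """Step 1, "will it fit": the CU must have enough free resources."""
    return wave.vgprs <= cu.free_vgprs


def pick_wave(pending: List[Wave], cu: ComputeUnit,
              memory_bound: bool, ratio_threshold: float = 0.5) -> Optional[Wave]:
    candidates = [w for w in pending if fits(w, cu)]

    # Step 2, "is fetch:alu > x": if the CU is already stalled on memory
    # (assumption), prefer waves that are not fetch heavy, when any exist.
    if memory_bound:
        light = [w for w in candidates if w.fetch_alu_ratio <= ratio_threshold]
        candidates = light or candidates

    # Step 3, "grab highest priority compute, else graphics".
    compute = [w for w in candidates if w.kind == "compute"]
    if compute:
        return max(compute, key=lambda w: w.priority)
    graphics = [w for w in candidates if w.kind == "graphics"]
    return graphics[0] if graphics else None
```

In hardware each of these steps would be a comparator or a small mux rather than a list scan, which is the point about the transistor count being small.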

Derfnofred · Mar 30, 2016

Zion Halcyon said:
Anyway, getting us back on track, Ashes of the Singularity releases tomorrow. Did some digging, and some are claiming its Nitrous engine is DX12-native. May give us our first real look into what I was talking about, in seeing how the video cards perform with a native DX12 engine.

Sigh. This is exactly what I'm talking about. The Nitrous engine is an evolution over several generations that has been aggressively rewritten for DX12. There are at least 700 threads on AOTS, the Nitrous engine, and its respective benchmarking. Unless there's some new magic, I'd think the latest betas are pretty close to what to expect from the release, no?

Anarchist--while not out of the realm of possibility, small changes like that would have long since been done if that's all it takes.

ChosenUno · Mar 30, 2016

Zion Halcyon said:
The minute you said "no one but AMD Fanboys" tells me all I need to know - you are too emotionally caught up in brand wars, and are being a bit silly. This is about winning a pissing contest for you; not about what's been reported thus far.

You oversimplified what was being said by saying "And yet people are saying that AMD's hardware will have a leg up in the future "because DX12"." You essentially broadly dismissed what's been written so far by overgeneralizing everyone as "AMD Fanboys."

So what has been written so far?

When AotS benchmarks first came out, until very recently, people have been pretty much using only that to base all claims regarding DX12 performance for NVIDIA vs AMD. And then when the large number of potential DX12 games seem to line up with AMD, what did people say? Next gen belongs to AMD, even though we had 0 information regarding Polaris and Pascal. All this noise out of 1 alpha benchmark.

And when the first 2 DX12 games were released, Tomb Raider and Gears of War Ultimate, people dismissed them because they performed poorly, saying exactly what you're saying: "It's not true DX12 brah".

You keep bringing up "fully optimized for DX12" as a term as if it has any meaning. Do tell, what metrics do you use to judge optimization quality for DX12?

There are sound discussions out there about AMD potentially seeing gains as DX 12 is adopted more and more, because of the Vulkan integration and the existing hardware in AMD cards that will start to be taken advantage of. This isn't the talk of AMD fanboys - this is LEGIT. The question we all have is whether we will see an increase in performance as AMD is stating. We all are in a wait and see pattern, and it makes sense that the early DX 12 adopters are doing so for marketing purposes, as it was far too late in those games' development lifecycle to fully integrate DX 12 properly. So yes, we are waiting to see, and not predicting anything - merely echoing reports already out there.

The term here is potential. And where did people get this assumption from? A single benchmark of an alpha/beta game called AotS. That's all they had until rather recently. Vulkan integration is largely irrelevant as of this moment, so I don't have a clue why you're bringing that up.
Also, waiting to see is different from repeating from the echo chambers that DX12 will be this and that. Because so far it isn't. You can sweep it aside with whatever rationale you want, the cold harsh truth is against you as of this moment. DX12 is hard, is easy to get wrong, and developers are bound to get it wrong in their first couple of attempts.

Also please refrain from trying to sound smart and witty, because you really aren't. Just stay with facts.

Let's get this straight. Most, if not all developers, will not attempt a DX12 native engine for their AAA game right now. It's just financially unwise and suicidal. You're betting a huge amount of money on unstable, unproven new technology. So all we will have is a bunch of DX11 engines retrofitted to run under DX12 API. This won't change in the short term, potentially not even the medium term. In the long term we might see this change, but that's a long way away.

Derfnofred · Mar 30, 2016

trandoanhung1991 said:
Let's get this straight. Most, if not all developers, will not attempt a DX12 native engine for their AAA game right now. It's just financially unwise and suicidal. You're betting a huge amount of money on unstable, unproven new technology. So all we will have is a bunch of DX11 engines retrofitted to run under DX12 API. This won't change in the short term, potentially not even the medium term. In the long term we might see this change, but that's a long way away.

Whenever anyone talks about making a "native" this or "clean slate" that, I'm always reminded of: Programming Sucks

Which is good, cuz if essays like that didn't exist, we'd all go mad.

Oh! Going back to the original post--yes, NVidia (to be seen) not having flexible async capabilities will make it harder on programmers to squeeze the most out of the hardware. Guess what? We'll find out when the hardware is released!

razor1 · Mar 30, 2016

Anarchist4000 said:
Async seems to work just fine for other games out there. ROPs were one of the significant differences, but in this case it's probably the compression, which doesn't work with compute. Color compression as a stumbling block was pointed out in some of the GDC presentations. The original release kept crashing when they tried to use async, so it's likely disabled, and Fiji/Tonga (GCN 1.2) were the cards that had the ability. The only other real difference is the 4 vs 8 ACEs.

This phenomenon isn't purely due to color compression. When talking about async performance, the developer stated that it is hard to do and that they had to do different things for different hardware.

Honestly that wouldn't take a whole lot of transistors and the workload is always a bunch of waves so it doesn't really vary that much. The logic is simply "will it fit", "is fetch:alu > x", and "grab highest priority compute, else graphics". Fit and priority are already documented and the comparison could be as basic as comparing two 2-bit numbers. The priority values are 0-3 as I recall. The added transistor budget for that may not even break 100.

It's not that simple, as you still need branch prediction, which is very complex in silicon. Just look at Intel and all their patents for it and how it encompasses hyperthreading. "Will it fit" is the end stage; there have to be many more things that go on before it can get to that stage. I'm sure there must be other things than just branch prediction, but I know for a fact that it is very complex in Intel's CPUs and takes up a good amount of die space. This is why, for a while, when Intel went down one node size (I think it was from the 30s to the 20s), the size of their CPU dies didn't change and the number of cores didn't change either, with very limited performance differences in most scenarios, but hyperthreading performance increased quite a bit.

If I remember correctly, branch prediction took about 8-10% of the die space (per core) for Intel CPUs *but they got 15-20% performance increases in hyperthreading*, so this trade-off was good, but it's a pretty hefty chunk of a CPU. And then you still have other things to look at.

A GPU, with all the ALUs it has, will probably take up more. This is out of my realm so I'm guessing here, but it wouldn't surprise me at all. Let's say for argument's sake it takes up 10% of die space: will losing 10% of your ALUs/ROPs/TMUs and all the other fixed-function units be offset by the performance increase in specific tasks? That is hard to say, but I don't think it will be.....

There is a rumor that Navi won't be increasing shader count, so maybe they are going to do it, but that is putting a rumor with a guess and with another guess along with a chip that is coming next year....... I don't believe in the rumor about the shader count to begin with.

Anarchist4000 · Mar 30, 2016

razor1 said:
It's not that simple, as you still need branch prediction, which is very complex in silicon. Just look at Intel and all their patents for it and how it encompasses hyperthreading. "Will it fit" is the end stage; there have to be many more things that go on before it can get to that stage. I'm sure there must be other things than just branch prediction, but I know for a fact that it is very complex in Intel's CPUs and takes up a good amount of die space. This is why, for a while, when Intel went down one node size (I think it was from the 30s to the 20s), the size of their CPU dies didn't change and the number of cores didn't change either, with very limited performance differences in most scenarios, but hyperthreading performance increased quite a bit.

If I remember correctly, branch prediction took about 8-10% of the die space (per core) for Intel CPUs *but they got 15-20% performance increases in hyperthreading*, so this trade-off was good, but it's a pretty hefty chunk of a CPU. And then you still have other things to look at.

A GPU, with all the ALUs it has, will probably take up more. This is out of my realm so I'm guessing here, but it wouldn't surprise me at all. Let's say for argument's sake it takes up 10% of die space: will losing 10% of your ALUs/ROPs/TMUs and all the other fixed-function units be offset by the performance increase in specific tasks? That is hard to say, but I don't think it will be.....

There's no branch prediction involved in this. Branch prediction is fetching data to start doing math. The scheduler's only job is to take a wave from the GCP/ACE and assign it to a CU with available resources. It should be doing this in a single cycle. This should be a very simple operation. It's probably programmable logic like the ACEs too. Point being, the workload is probably already self-tuning if given appropriate work. The programmer's job is making sure fences and barriers are positioned so work can be scheduled together or executed in the proper order.

As for the async performance, it's something unique to Tonga/Fiji, as both seem to have issues with Hitman. My money is still on fences/decompression being out of order.

razor1 · Mar 30, 2016

I'm going to post this on B3D, 99% sure what I stated is accurate, will post back the information I get back.

Without branch prediction, hyperthreading fails.... btw.

Anarchist4000 · Mar 31, 2016

razor1 said:
I'm going to post this on B3D, 99% sure what I stated is accurate, will post back the information I get back.

Without branch prediction, hyperthreading fails.... btw.

Sure. But keep in mind GPUs don't really branch predict, that's the latency they're always hiding. You'd only branch predict for single threaded performance. They start fetching and move on to the next wave, returning when the data shows up. And by programmable, I'm not talking about C style code either, it would be hardware stuff where the entire program executes synchronously and relatively simple by comparison. Hardware scheduling is pointless if you have to go to the CPU/driver every time you dispatch a wave. At the very least it's going to figure out which compute units aren't actually doing anything and send work there first. Load balancing is just an extension of that. As for any analysis, you could record the percentage of fetching operations per shader when you compile it.
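A quick sketch of the two pieces mentioned above: sending work to the least busy compute unit first, and recording the percentage of fetch operations in a shader at compile time. Everything here is illustrative. The opcode names in `FETCH_OPS` and the flat list-of-opcodes shader representation are hypothetical, not a real ISA or compiler output.

```python
from typing import List

# Hypothetical opcode names standing in for memory/texture fetch instructions.
FETCH_OPS = {"load", "sample"}


def fetch_percentage(instructions: List[str]) -> float:
    """Share of fetch ops in a compiled shader, as a percentage."""
    if not instructions:
        return 0.0
    fetch_ops = sum(1 for op in instructions if op in FETCH_OPS)
    return 100.0 * fetch_ops / len(instructions)


def pick_compute_unit(active_waves: List[int]) -> int:
    """Index of the CU with the fewest waves in flight.

    Idle CUs (0 waves) are picked first; load balancing is the same
    comparison applied to non-zero counts. Ties go to the lowest index.
    """
    return min(range(len(active_waves)), key=lambda i: active_waves[i])
```

For example, `fetch_percentage(["load", "mul", "add", "sample"])` gives 50.0, and `pick_compute_unit([3, 0, 2, 0])` gives 1, the first idle unit.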

razor1 · Mar 31, 2016

Anarchist4000 said:
Sure. But keep in mind GPUs don't really branch predict, that's the latency they're always hiding. You'd only branch predict for single threaded performance. They start fetching and move on to the next wave, returning when the data shows up. And by programmable, I'm not talking about C style code either, it would be hardware stuff where the entire program executes synchronously and relatively simple by comparison. Hardware scheduling is pointless if you have to go to the CPU/driver every time you dispatch a wave. At the very least it's going to figure out which compute units aren't actually doing anything and send work there first. Load balancing is just an extension of that. As for any analysis, you could record the percentage of fetching operations per shader when you compile it.

first response

1.0 certainly isn't, the ACEs had not been programmable back then at all.
1.1 is programmable, but the space is limited, and the space is required for the queue decoding logic.
1.2 might be able to do such a thing. But I'm not entirely sure what the HWS/ "new" ACE units on Tonga and Fiji are actually doing right now.

We have no idea what the HWS units do. What Dave Baumann (who works at AMD) told me was that each HWS unit works like 2 ACE units. So I'm going with: the queue decoding logic isn't there, and they need a lot of space for that.

I did PM a couple of Intel engineers as well, waiting on their responses, but I'm pretty sure I'll get a similar response.

Zion Halcyon · Mar 31, 2016

Creig said:
Thanks for that. I was surprised to learn that the DX11 version was mainly run on a single CPU core whereas the DX12 version supposedly spread the load over all available CPU cores. I would expect the DX12 version to be of much greater benefit for people running the game at lower resolutions as they are more likely to be CPU bound than those people at higher resolutions.

I still suspect that games written expressly with DX12 in mind will see better performance than those with DX12 enhancements added later.

Your last sentence is all I was trying to say before some people got bent out of shape over semantics, lol.

And despite what one person in particular thinks about it being "suicide", there are developers already developing expressly with DX12 in mind, Doom being one such recent example. It's not suicide, because AMD played this smart by making everything open source and free, as well as being ahead of the game in terms of Mantle which evolved into Vulkan, which has components melded into DX 12.

It might be harder to program for, but gaming developers are finally being given more direct control over the video card, and the tools are being given to them for free by AMD to do so, and AMD is helping them with understanding how to use those tools to get the most out of their cards and whatever they want to do...

Basically, AMD finally has a strategy that is working. I have no idea if they planned it this way or just lucked into it, but they have people there now who are smart enough to take full advantage.

We just now have to see what kind of benefits come from that, and if they are anything to write home about. But developers are starting to embrace DX12 a lot more quickly than some previous DX versions.

Anarchist4000 · Mar 31, 2016

razor1 said:
first response

1.0 certainly isn't, the ACEs had not been programmable back then at all.
1.1 is programmable, but the space is limited, and the space is required for the queue decoding logic.
1.2 might be able to do such a thing. But I'm not entirely sure what the HWS/ "new" ACE units on Tonga and Fiji are actually doing right now.

We have no idea what the HWS units do. What Dave Baumann (who works at AMD) told me was that each HWS unit works like 2 ACE units. So I'm going with: the queue decoding logic isn't there, and they need a lot of space for that.

I did PM a couple of Intel engineers as well, waiting on their responses, but I'm pretty sure I'll get a similar response.

What they do would be in the microcode, and likely not public. Might be able to look at the scheduling for any hints, but it could just as easily be labeled reserved. They're programmable circuits, so they could literally do anything that would fit. Should have been a series of them with two ACEs fitting within one. I think I read it in some open source Linux driver documentation or something. Might have been about HSA as well. Anyways, it could be re-purposed for any number of things. So if you had 4 units (2 ACEs each) you could make 4 really big ACEs, 3x2 ACEs and something interesting, or anything else. The actual work scheduler would be reading from the ACEs, or anything else you created, along with all the compute units to track progress. Keep in mind this has nothing to do with ALU scheduling. It's just handing off waves to a local CU scheduler. Actual ACEs should just be tracking a compute context and progress.

razor1 · Mar 31, 2016

Yeah that was my impression of the HWS's as well. The software queues vs. hardware queues giving the GPU more flexibility over more blocks.

One of the Intel engineers responded: it's very complex to do, and there is no way of making sure it will work all the time across different hardware sets, even within the same generation, let alone across different generations, when it comes to balancing work over different workload types....... Well, I guess that kinda summed up what I stated.

Anarchist4000 · Apr 1, 2016

razor1 said:
Yeah that was my impression of the HWS's as well. The software queues vs. hardware queues giving the GPU more flexibility over more blocks.

One of the Intel engineers responded: it's very complex to do, and there is no way of making sure it will work all the time across different hardware sets, even within the same generation, let alone across different generations, when it comes to balancing work over different workload types....... Well, I guess that kinda summed up what I stated.

Yeah, all the flexibility of Nvidia's async driver support.
