Ashes of the Singularity Day 1 Benchmark Preview @ [H]

So little mention that Ashes of the Singularity is straight out of the AMD development loop. Prior to this implementation, the engine was used by Oxide for Star Citizen, which was created from the ground up to be the very first showcase for AMD's Mantle technology. Predictably, Ashes of the Singularity does indeed favor AMD cards and promotes the one and only aspect of DirectX 12 where AMD has an advantage. AMD optimizations have been heavily baked into this game engine from the beginning, so the results need to be filtered through that lens.

To place so much emphasis and to prognosticate over the future based on performance in these one or two AMD Gaming Evolved titles is taking it a bit too far. In fact, the mere mention of Mantle begins to remind me of how this whole discussion could end up seeming like deja vu. If only a few AMD titles end up adopting Async Compute, the results will likely be the same as Mantle.

Star Citizen uses the Crysis 3 engine and is made by RSI.
 
The question is how underutilized AMD cards are. If it's 30% or more, that is pretty bad coding, don't you think? Even on consoles, it's very hard to believe that the GPUs are that underutilized because of bottlenecks or poor code. Especially when you look at the crap GPU they have in there... (not saying it's crap as in bad, just crap compared to what we've got on PCs).

Yeah, CR can be done on older hardware with other techniques, but there are penalties in both quality and performance; the quality loss is hard to see, but it's there.

I don't see that at all, to be honest. Many parallel operations - highways, roads - will beat a single big road at moving stuff around. Games have many things happening at once, so having parallel paths will give better results - hence 30% seems like a good number. If that carries over to PC games, today's AMD hardware will be blowing away Nvidia hardware in a big way even if Nvidia does everything they can to make it run as fast as possible. Nvidia spent more time with AotS than AMD did on this, resulting in respectable performance from their hardware.

CR can give a very noticeable IQ improvement if used; good, accurate triangle edge detection is needed to accurately calculate shadows, lights, fog effects, mist, etc. Nvidia should be developing additional libraries to take advantage of this (Intel should benefit as well). So what should happen - developers should not take advantage of available hardware? As in Async or CR - why, of course not; use them both and have options to allow most hardware to play the game to the maximum extent possible.

Now I cannot imagine that Nvidia will not develop a more parallel-operating GPU; as you add more shaders, keeping them occupied becomes harder and harder. I expect Pascal to do Async much better than Maxwell, probably with some pleasant surprises.
 
So little mention that Ashes of the Singularity is straight out of the AMD development loop. Prior to this implementation, the engine was used by Oxide for the Star Swarm demo as the very first showcase for AMD's Mantle technology. Predictably, Ashes of the Singularity does indeed favor AMD cards and promotes the one and only aspect of DirectX 12 where AMD has an advantage. AMD optimizations have been heavily baked into this game engine from the beginning, so the results need to be filtered through that lens. This is a best case scenario for AMD.

To place so much emphasis and to prognosticate over the future based on performance in these one or two AMD Gaming Evolved titles is taking it a bit too far. In fact, the mere mention of Mantle begins to remind me of how this whole discussion could end up seeming like deja vu. If only a few AMD titles end up adopting Async Compute, the results will likely be the same as Mantle.

Doesn't seem like it's only going to be a few. Already you can add the upcoming Deus Ex and Doom titles to the async train, and the Doom devs are already commenting on how much better performance they are getting with Async. If the id name still carries weight in the industry, then even more developers will follow suit.
 
I don't see that at all, to be honest. Many parallel operations - highways, roads - will beat a single big road at moving stuff around. Games have many things happening at once, so having parallel paths will give better results - hence 30% seems like a good number. If that carries over to PC games, today's AMD hardware will be blowing away Nvidia hardware in a big way even if Nvidia does everything they can to make it run as fast as possible. Nvidia spent more time with AotS than AMD did on this, resulting in respectable performance from their hardware.

CR can give a very noticeable IQ improvement if used; good, accurate triangle edge detection is needed to accurately calculate shadows, lights, fog effects, mist, etc. Nvidia should be developing additional libraries to take advantage of this (Intel should benefit as well). So what should happen - developers should not take advantage of available hardware? As in Async or CR - why, of course not; use them both and have options to allow most hardware to play the game to the maximum extent possible.

Now I cannot imagine that Nvidia will not develop a more parallel-operating GPU; as you add more shaders, keeping them occupied becomes harder and harder. I expect Pascal to do Async much better than Maxwell, probably with some pleasant surprises.

In theory it doesn't. Scheduling is based on how shaders are written, and there are good ways to manage occupancy and utilization to make sure a GPU is saturated and always working.

AMD's GPUs, on the other hand, have bottlenecks where the ALUs are sitting around waiting. It's not the size of the shader array that matters, it's how a shader is written, broken down and laid out for the GPU to use, so the driver does have some things to do here.

We have seen nV's ALU utilization increase from the G80 onward with every single architecture they have created, and this has simplified scheduling as well (warp lengths have been changed too for better granularity, so shaders that worked well on a previous architecture aren't as well suited to the newer one, but performance might still be better because of the horsepower of the newer architecture). GCN is much better at utilization and scheduling than the VLIW architectures AMD had prior. This is the normal way of doing architectures; every single new architecture strives for this. I can understand that when an architecture is stretched for too long some of these types of problems creep up, but not to this extent.

I find it hard to believe AMD's GPU arrays are 30% underutilized, even on their top end. 10%? Yeah, that is more realistic. If it was truly 30% underutilized, something is just not right. That means it's either a really horrible design for getting max performance, or people don't know what they are doing when writing code for it. Either way you slice it, someone is doing something wrong.

Remember, async will only give you a certain amount of performance based on the ALUs not being utilized at that time (and that is a maximum), and on whether the compute instructions being queued by a specific ACE can be placed in the block that has the free ALUs. If they can't - say block 4 has free ALUs but the compute instructions belong with block 1 for the time that is given - you take the one-cycle loss of going through the L2 cache, since L1 is local and not global, and you can end up losing performance in this scenario. So writing shaders that are scheduled properly is also very important.

PS: added to this, a program that saturates a GPU shader array now might not saturate a GPU shader array in the future, even if the architecture is the same, because it just won't occupy or utilize the increased ALU count. In cases like that, async won't help at all, as there is no work to push to the free ALUs.
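To put some completely made-up numbers on that last point (hypothetical wave counts, not any real GPU): a dispatch sized to fill today's shader array simply runs out of waves on a wider future part, and async only claws anything back if there is other work around to fill the hole.

```cpp
// Toy occupancy math with made-up numbers - not any specific GPU.
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    const int wavesInDispatch = 2048;   // fixed workload from an existing game
    const int wavesPerCU      = 64;     // hypothetical wave slots per CU
    for (int cus : {32, 64}) {
        const int capacity = cus * wavesPerCU;   // waves the part can hold in flight
        const double occupancy =
            std::min(1.0, static_cast<double>(wavesInDispatch) / capacity);
        std::printf("%2d CUs: occupancy %3.0f%%\n", cus, occupancy * 100.0);
    }
    return 0;   // prints 100% for 32 CUs, 50% for 64 CUs
}
```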
 
Reran tests with the new drivers and the ironman guys' glitching is gone, but I still get the big patch of "snow" in the bottom right corner. primetime, do you still get it? Numbers between 16.3.2 and 16.4.1 are almost identical, less than 1 FPS difference. Also, put my OC back to 1175/1750 and it runs perfectly fine, so far! Numbers went up just slightly with the OC, 3-4 FPS, but the "GPU bound" numbers went way down.
 
Well again, Deus Ex is an AMD Gaming Evolved title. But DOOM, now that one's interesting. It is well known that id and Bethesda prefer Nvidia hardware, even to the point of antipathy (anyone remember Rage?). That may be the first time we get a different result out of DirectX 12. If the results still favor AMD with that one, then that would be worth paying attention to.
 
In theory it doesn't. Scheduling is based on how shaders are written, and there are good ways to manage occupancy and utilization to make sure a GPU is saturated and always working.

AMD's GPUs, on the other hand, have bottlenecks where the ALUs are sitting around waiting. It's not the size of the shader array that matters, it's how a shader is written, broken down and laid out for the GPU to use, so the driver does have some things to do here.

We have seen nV's ALU utilization increase from the G80 onward with every single architecture they have created, and this has simplified scheduling as well (warp lengths have been changed too for better granularity, so shaders that worked well on a previous architecture aren't as well suited to the newer one, but performance might still be better because of the horsepower of the newer architecture). GCN is much better at utilization and scheduling than the VLIW architectures AMD had prior. This is the normal way of doing architectures; every single new architecture strives for this. I can understand that when an architecture is stretched for too long some of these types of problems creep up, but not to this extent.

I find it hard to believe AMD's GPU arrays are 30% underutilized, even on their top end. 10%? Yeah, that is more realistic. If it was truly 30% underutilized, something is just not right. That means it's either a really horrible design for getting max performance, or people don't know what they are doing when writing code for it. Either way you slice it, someone is doing something wrong.

Remember, async will only give you a certain amount of performance based on the ALUs not being utilized at that time (and that is a maximum), and on whether the compute instructions being queued by a specific ACE can be placed in the block that has the free ALUs. If they can't - say block 4 has free ALUs but the compute instructions belong with block 1 for the time that is given - you take the one-cycle loss of going through the L2 cache, since L1 is local and not global, and you can end up losing performance in this scenario.
The question is how underutilized AMD cards are. If it's 30% or more, that is pretty bad coding, don't you think? Even on consoles, it's very hard to believe that the GPUs are that underutilized because of bottlenecks or poor code. Especially when you look at the crap GPU they have in there... (not saying it's crap as in bad, just crap compared to what we've got on PCs).

Yeah, CR can be done on older hardware with other techniques, but there are penalties in both quality and performance; the quality loss is hard to see, but it's there.

I think you are leaving out other advantages of Async Compute on AMD hardware compared to Nvidia hardware's capability, or you have it backwards. You don't need all the stream processors for compute operations in many cases - AMD hardware can do graphics at the same time as using just the number of shaders needed for compute. Nvidia uses all the shaders for compute, even when all of them are not needed or used. AMD is more efficient doing both, hence AMD has better utilization doing both, while Nvidia tanks as it takes on more and more compute operations. For example, just doing PhysX on Nvidia GPUs tanks performance with CUDA. If AotS had more compute operations than it does now, I do believe Nvidia hardware would have a much harder time than AMD-designed hardware.
 
I think you are leaving out other advantages of Async Compute on AMD hardware compared to Nvidia hardware's capability, or you have it backwards. You don't need all the stream processors for compute operations in many cases - AMD hardware can do graphics at the same time as using just the number of shaders needed for compute. Nvidia uses all the shaders for compute, even when all of them are not used or needed. AMD is more efficient doing both, hence AMD can use everything with better utilization, while Nvidia starts to tank as it takes on more and more compute operations. For example, just doing PhysX on Nvidia GPUs tanks performance with CUDA.

No, PhysX doesn't tank performance on a GPU - well, it can, depending on what physics simulations are being done and the type of simulations. For the most part, the basic stuff in GPU PhysX won't take more than 5% of today's GPUs. If you are doing fluid simulations and things like that, yeah, that hits it hard, but those are like hundreds of thousands of simulations per clock.

And architectures strive to use all of their units as much as possible. Idle time is not a good thing. That's like hiring someone and telling him, "It's OK, we will pay you for now, but when work comes we expect you to do it." That's good for the hire, not good for the employer. Why would you want to keep the guy around? I would cut him loose and save the money.

In a GPU sense: get rid of those ALUs if possible, save the space, and reap more profit by selling a smaller die at the same price.
 
I have not dealt with Maxwell 2, so I am not sure there; Maxwell 1 tanks using PhysX (750 Ti and 860M GPUs). I do believe Maxwell 2 and CUDA can actually do a compute and a graphics operation at the same time, so it should do better there, but start throwing in other compute workloads and I do believe it will start to tank.

Yes, you're right - strive to keep all units active at once - and that is the problem with Nvidia when you have both graphics and compute operations going on. As game workloads increase, the number of compute operations is going up along with graphics operations, and Nvidia hardware stopping the graphics operation to do compute operations starts to hurt that utilization. Nvidia hardware is highly efficient and designed around DX11 with limited threading. Hopefully future designs will be built more for DX12's parallel and more threaded operations. As more DX12 titles hit the streets, and depending on the number of compute operations going on, AMD hardware will be more effective than Nvidia hardware if Async Compute is used. I hope you see the distinction: it is not so much that AMD is going up in performance with Async Compute, it is more that Nvidia's performance is tanking with the increased number of compute operations, while AMD hardware is more effective at handling it and keeping all units working.
 
Reran tests with the new drivers and the ironman guys' glitching is gone, but I still get the big patch of "snow" in the bottom right corner. primetime, do you still get it? Numbers between 16.3.2 and 16.4.1 are almost identical, less than 1 FPS difference. Also, put my OC back to 1175/1750 and it runs perfectly fine, so far! Numbers went up just slightly with the OC, 3-4 FPS, but the "GPU bound" numbers went way down.
Only fix for now is to disable AA, and with VSR at 1440p it's not really needed anyway... I remember hearing about this kind of glitch in a few games like Grand Theft Auto. I'm not sure if people ever found a fix. Truth be told, I DON'T think AA in this game works at all... 8x-2x AA gives me the exact same performance as disabled and it feels like it does nothing anyway.

Edit: seems like AA works fine overriding through the Radeon menu... at least there are no glitches... I tried 4x and only lost 1 FPS compared to off.
 
Only fix for now is to disable AA, and with VSR at 1440p it's not really needed anyway... I remember hearing about this kind of glitch in a few games like Grand Theft Auto. I'm not sure if people ever found a fix. Truth be told, I DON'T think AA in this game works at all... 8x-2x AA gives me the exact same performance as disabled and it feels like it does nothing anyway.

Edit: seems like AA works fine overriding through the Radeon menu... at least there are no glitches... I tried 4x and only lost 1 FPS compared to off.

Yeah, I've had it before in other games so I turned it off there too. I've just been running without AA. I'll try the Radeon settings and see how that goes.
 
If Nvidia is so incompetent that they didn't see the benefits of async until async-gate started (~August 2015) then they have no business being in the GPU space at all. Just beyond ridiculous.
I would hope Nvidia is at least slightly more observant than random forum posters (Mahigan) and sites like WCCFTech. How much is their R&D budget exactly?

And how long was the DX12 API in development for Nvidia to see this coming? Clearly even AMD knew about it circa 2011. Plus we had the same signs with Mantle, back in 2013.

I think that it's a business decision more than it is a technical decision. They were probably hoping to delay tech that increases efficiency like this until they couldn't squeeze any more performance out of the silicon, then start introducing the efficiency increases with future generations if the need arose to continue sales while waiting for a new substrate. I think that business model ended when AMD won over the console market (across the board with MS, Sony and Nintendo). AMD is now holding all the cards (no pun intended, but after re-reading it, I thought it was chuckle worthy), and they're the ones leading the show in the consumer market. If it weren't for Nvidia having its hand in massively parallel supercomputing, I think that this move could have ended them.

And I honestly can't blame Nvidia for trying to play their hand this way (even though it affects the performance bottom line for me). If I were viewing the limited advancements into new technologies and substrate while silicon's on life support, I'd be wondering how I was going to pay the bills too for however long a new semi-conductor tech took to come out. Holding onto efficiency boosters for as long as possible before releasing them would be a viable option.
 
Yeah and that aligned with what I stated right lol

Maybe I misunderstood what you stated but it seemed to me that you stated that Asynchronous Compute + Graphics was not enabled in the drivers at first, when it was.

The feature was available in NVIDIA's driver. Asynchronous Compute + Graphics worked. When I say worked, I mean that Maxwell was executing Compute and Graphics kernels in parallel. The issue is that performance tanked and the driver crashed (hence the use of the term Conformance).

NVIDIA requested that the feature be turned off for their hardware and in its place Oxide worked with NVIDIA in order to implement Vendor ID specific shader paths.

NVIDIA then removed the Asynchronous Compute support from their drivers.

This was all during the alpha stage of development for AotS.

When the BETA released, NVIDIA had now introduced a workaround in their driver. Basically their driver now intercepted Asynchronous Compute + Graphics commands and converted them into synchronously executing kernels. The kicker is that this didn't remove the fences. So with Async turned on, NVIDIA's Maxwell takes a hit from the added idle time the fences cause.

Oxide stated that by release time, they'd choose the quickest path for NVIDIA's Maxwell. That path is with Asynch turned off. So now the game defaults with Async turned off for NVIDIA hardware.

I'm not sure if that's what you meant.
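(For anyone who hasn't poked at DX12 directly, this is roughly the submission pattern being described - not Oxide's actual code, just a minimal D3D12 sketch with error handling omitted: a graphics queue and a compute queue tied together by a fence. That GPU-side Wait is exactly the kind of synchronization point that stays in place even if the driver ends up running the two queues back to back.)

```cpp
// Minimal D3D12 multi-engine sketch: one direct (graphics) queue, one compute
// queue, synchronized with a fence. Command-list recording is elided; this only
// shows the queue/fence plumbing. Windows + d3d12.lib assumed.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")
using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};               // zero-init = TYPE_DIRECT
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;     // separate compute engine
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // gfxQueue->ExecuteCommandLists(...);      // graphics work for the frame
    // computeQueue->ExecuteCommandLists(...);  // compute work submitted alongside it

    // The compute queue may run concurrently with graphics, but anything submitted
    // after this Wait won't start until the graphics queue has reached the Signal.
    gfxQueue->Signal(fence.Get(), 1);
    computeQueue->Wait(fence.Get(), 1);         // GPU-side wait, no CPU stall
    return 0;
}
```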
 
Although this is made more complicated by the fact that Brent's recent Ashes benchmark shows the AMD Fury X is faster than the 980 Ti even in DX11, and likewise the 390X compared to the 980.
Considering how poor AMD's DX11 driver implementation is compared to NVIDIA's, it raises some questions about what exactly is going on; whether it relates to issues with NVIDIA driver development, the Oxide developers (IMO they have more of an interest with AMD), or a mix of both.
Another consideration is just how fast NVIDIA cards were with the earlier version of the Oxide Nitrous engine used for the Star Swarm benchmark, so something is up IMO with the Ashes development in some way.
And yeah, I appreciate Ashes would have a different post processing/rendering/shader solution, but one would not expect NVIDIA to really lose out in DX11 if a game/benchmark is equally well optimised for both card manufacturers.

So I think this may limit somewhat any overall conclusions, whether the fault lies with NVIDIA/Oxide/or both.
Cheers

AMDs DX11 performance is a recent development for AotS. It is due to AMDs driver work and not due to anything on Oxide's end. What it shows is that AMD can optimize DX11 drivers but the manpower it takes to do so, and AMDs financials, don't really allow for AMD to do this routinely.

The Star Swarm demo contained no post processing lighting, shadows or any compute effects. It was a pure rendering demo which tested rendering capabilities relative to draw calls. There's no conspiracy here, AMD GCN is a vastly superior compute architecture and NVIDIA Maxwell is a vastly superior Graphics Rendering architecture.

AMD can optimize for DX11 but their financial situation makes the efforts tenuous at best.

Oxide has repeatedly answered your accusation of bias - worded differently, but you're still trying to play NVIDIA apologetics. Stardock has a marketing agreement with AMD and they're Oxide's parent company. Dan Baker has repeatedly denied and invalidated your accusations of bias.

Both NV and AMD have access to the source code and can make change requests. Both had 3 on site visits with Oxide optimizing code. There's no AMD bias in the code. AotS is a prime example of a vendor agnostic title optimized properly for both vendors.

What AotS is, is heavy on compute and this benefits AMDs GCN more so than NVIDIA's architectures.
 
Well again, Deus Ex is an AMD Gaming Evolved title. But DOOM, now that one's interesting. It is well known that id and Bethesda prefer Nvidia hardware, even to the point of antipathy (anyone remember Rage?). That may be the first time we get a different result out of DirectX 12. If the results still favor AMD with that one, then that would be worth paying attention to.

Doom is an OpenGL/Vulkan title.
 
AMDs DX11 performance is a recent development for AotS. It is due to AMDs driver work and not due to anything on Oxide's end. What it shows is that AMD can optimize DX11 drivers but the manpower it takes to do so, and AMDs financials, don't really allow for AMD to do this routinely.

The Star Swarm demo contained no post processing lighting, shadows or any compute effects. It was a pure rendering demo which tested rendering capabilities relative to draw calls. There's no conspiracy here, AMD GCN is a vastly superior compute architecture and NVIDIA Maxwell is a vastly superior Graphics Rendering architecture.

AMD can optimize for DX11 but their financial situation makes the efforts tenuous at best.

Oxide has repeatedly answered your accusation of bias - worded differently, but you're still trying to play NVIDIA apologetics. Stardock has a marketing agreement with AMD and they're Oxide's parent company. Dan Baker has repeatedly denied and invalidated your accusations of bias.

Both NV and AMD have access to the source code and can make change requests. Both spent 3 months on site with Oxide optimizing code. There's no AMD bias in the code. AotS is a prime example of a vendor agnostic title optimized properly for both vendors.

What AotS is, is heavy on compute and this benefits AMDs GCN more so than NVIDIA's architectures.
Err what accusation?
I think my post was pretty fair and without bias; my view is that all parties are at fault, and sorry, but I tend to disagree that AMD can be that efficient at DX11 so as to be more competitive than NVIDIA.
The trend is that we now do see AMD performing better in the most recent games at DX11, but funnily enough those games are being optimised to the nth degree for consoles first (apart from Ashes), which is good for AMD; a case in point is the alpha state of Doom, which performs much better on AMD than on NVIDIA even in its unoptimised state on PC.
Anyway, I mentioned that Star Swarm does not use the same post processing/rendering/compute capabilities, although it DOES use DX12 features, and it shows just how competitive NVIDIA's DX11 is when looking at the breakdown of batch performance - yes, AMD made no effort for DX11 on Star Swarm, and this is an accusation now, probably because they wanted to showcase the benefits of Mantle over DX11.
Mahigan, if anyone is being biased between us two here, sorry, but it is you, and I am not making excuses for NVIDIA (they have a part to play in Ashes and its optimisation that goes beyond just compute).
As a proportion of performance optimisation, how much can be attributed to compute?
And if you want to be truly impartial, why did they not consider CR/ROV for aspects beyond compute?
Intel has developers looking to implement this in a couple of games.

Anyway there is more to DX12 than just using compute effects, unless one wants to make a narrative of course ::)
I think NVIDIA were idiots and made a bad decision not to try to be involved in at least one of the consoles, as this now gives AMD a real edge IMO, given the weight it provides in game development and optimisation...

Edit:
I just also want to say, you do agree that Star Swarm is still at core the Nitrous engine without the post processing-compute effects?
Oxide do state:
Star Swarm is designed as a total game engine test, not just a graphics test. As such, we have complete AI, flocking, physics, etc. being simulated instead of a pre-canned demo.
Q. This is just a marketing tool for AMD; you’ve obviously crippled the DirectX version!
A. We really haven’t; to be perfectly honest we’ve spent more time optimizing for DirectX than we have on Mantle. The fact is that DirectX is conceived and implemented as a single-threaded API, and so a lot of the more significant gains we see thanks to the Nitrous engine’s aggressive multithreading are badly limited by API overhead when we’re using it.

That said I do appreciate this does not mean it is apples-to-apples with Ashes, but still important IMO in the context of what has been said by a few.
And importantly, as Razor and I have said, NVIDIA probably spent more effort optimising and collaborating on Star Swarm than on Ashes (which sucks, I agree, if one is an NVIDIA owner and likes this game), but the fault cannot lie just with NVIDIA IMO in terms of performance optimisation and which feature-functions were implemented and how (the performance of the alpha Doom is an example where both parties will need to work to get it to a well-optimised state for NVIDIA on PC, which currently lags behind AMD due to its console focus to date).
Cheers
 
This was from Dan Baker...
[screenshot of Dan Baker's comment]
 
What I have seen makes me think this:

1. Consider the time of a single moment of compute and a graphics operation.
2. Let's give this single operation a time of 14 for compute and 36 for graphics.
3. In the traditional serial sense, total completion time is 50.
4. Now with asynchronous/parallel completion, the best time would be the 36 of the graphics operation, of course assuming there were hardware resources available to run the compute operation without affecting the graphics operation.
5. Now, with regard to the saturation of ALUs: what if sharing resources slows the graphics operation to, say, 42 to allow the compute operation to complete in parallel?
6. That would still be faster than the original 50, and would allow for those gains that may have seemed a bit high.

It just seems that with the coding - especially with tasks and all the out-of-order talk - there is quite a bit of flexibility; not saying it wouldn't be complex or difficult.
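Running the same made-up 14/36/42 numbers through a quick calculation: the ideal overlap lands just under a 30% gain, and even the contended case still beats serial.

```cpp
// Toy timing math using the hypothetical numbers from the post above.
#include <algorithm>
#include <cstdio>

int main() {
    const double compute = 14.0, graphics = 36.0;
    const double serial     = graphics + compute;            // 50: one after the other
    const double idealAsync = std::max(graphics, compute);   // 36: perfect overlap on spare ALUs
    const double contended  = 42.0;                          // graphics slowed by resource sharing
    std::printf("serial %.0f | ideal overlap %.0f | contended overlap %.0f\n",
                serial, idealAsync, contended);
    std::printf("gain vs serial: ideal %.0f%%, contended %.0f%%\n",   // 28% and 16%
                (1.0 - idealAsync / serial) * 100.0,
                (1.0 - contended / serial) * 100.0);
    return 0;
}
```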
 
Just to add,
seems the UE4 engine with asynchronous shaders may see a 10% to at best 20% performance gain, although it's worth pointing out it does not have the same scope/focus of design as Ashes.
Mahigan, you would agree that for part of the post processing/rendering/collision/etc. it makes sense in future to also consider CR/ROV?
This is not a defence of NVIDIA, because ideally that needs improving (maybe it will happen for Pascal, but it's a pain for current Maxwell 1&2 owners); I'm just highlighting one aspect of the DX12 technology Oxide seem to be omitting, not just now but in the future, according to the quote you linked.
CR/ROV are feature-functions that seem to be of great interest to Intel and not just NVIDIA, and to be fair also to some developers.

Cheers
 
Err what accusation?
I think my post was pretty fair and without bias; my view is that all parties are at fault, and sorry, but I tend to disagree that AMD can be that efficient at DX11 so as to be more competitive than NVIDIA.

You disagree on what basis? Please share with us the thorough knowledge of the topic that leads you to disagree. What specific aspect of AMD's architecture inhibits it from applying driver optimizations in order to alleviate API bottlenecks under DX11?

Folks keep shifting the goal posts around as an attempt to dabble in partisan apologetics. When I suggested that AMDs Command Processor might be at fault for AMDs poor DX11 performance I was met with many NVIDIA users swearing that what I was saying wasn't true (Including Razor1) and that the issues were entirely driver related.

So now I point to AMD's surprising DX11 performance optimizations for AotS, and it seems that the goal post has shifted to AMD being unable to rectify this API overhead in their drivers, alluding instead to a hardware fault.

Pardon me but I've had it with disingenuous partisan fans who act like Hillary Clinton.


The trend is that we now do see AMD performing better in the most recent games at DX11, but funnily enough those games are being optimised to the nth degree for consoles first (apart from Ashes), which is good for AMD; a case in point is the alpha state of Doom, which performs much better on AMD than on NVIDIA even in its unoptimised state on PC.

It's not that they're optimized per se, but more that the rendering capabilities of the Xbox One and PS4 APUs have been exhausted, and as such developers are tapping into the available compute capabilities of those APUs as a means of performing traditionally graphics-rendered tasks over the compute queue. This changes the Graphics:Compute ratio, which evidently benefits GCN.


Anyway, I mentioned that Star Swarm does not use the same post processing/rendering/compute capabilities, although it DOES use DX12 features, and it shows just how competitive NVIDIA's DX11 is when looking at the breakdown of batch performance - yes, AMD made no effort for DX11 on Star Swarm, and this is an accusation now, probably because they wanted to showcase the benefits of Mantle over DX11.

What DX12 features does Star Swarm use? I bet you don't know the answer, yet you assert that it uses "features", with an S, thus alluding to many. Star Swarm does use multi-threaded rendering, which alleviates CPU bottlenecks. What else?


Mahigan, if anyone is being biased between us two here, sorry, but it is you, and I am not making excuses for NVIDIA (they have a part to play in Ashes and its optimisation that goes beyond just compute).
As a proportion of performance optimisation, how much can be attributed to compute?
An enormous amount. AotS started off with 20% of its pipeline occurring in the compute pipeline (Alpha). According to Razor1, this increased to 40% starting with the Beta, based on slides he's seen. Star Swarm has 0 (or near that).


And if you want to be truly impartial, why did they not consider CR/ROV for aspects beyond compute?
Intel has developers looking to implement this in a couple of games.
What good is Conservative Rasterization for a real-time strategy game? I mean, you can use CR to efficiently calculate shadow effects (pixel/triangle coverage), but the effect itself would bring down performance. Considering the number of shadows and lights the numerous units on screen in AotS emit, CR is pointless unless you like gaming in single-digit land. Look at the DX11 CR implementation for VXAO in Tomb Raider - an enormous performance hit.

ROV is a performance-hit-inducing feature. It can be used to properly emulate smoke effects. Pointless for AotS.

Intel has a far more robust CR implementation than NVIDIA. Intel is Tier 3 whereas NVIDIA is Tier 1.

There's no reason to use either for AotS.


Anyway there is more to DX12 than just using compute effects, unless one wants to make a narrative of course ::)
I think NVIDIA were idiots and made a bad decision not to try to be involved in at least one of the consoles, as this now gives AMD a real edge IMO, given the weight it provides in game development and optimisation...

And AotS makes use of a robust amount of DX12 features. Moreso than any other engine to date.

And no, it's entirely NVIDIA's fault. Their lack of features, their lack of willingness to get involved in the consoles because they wanted more money from MS. From GameWorks, now leading to push back, to the 3.5GB RAM debacle, etc., etc.

It's entirely NVIDIA's fault. Neither AMD nor NVIDIA deserve our empathy. If both are playing on an even playing field (open standards), then there should be no sympathizing with either company. My 2 cents.
 
I can set this game to crazy/4x MSAA at 3440x1440 with my Titan X and I get around 47 fps.

So much for "future proof" to challenge hardware....

It looks good. Not amazing, but as an RTS, cosmetics are more of a plus for me. It looks just like the screenshots. Some games look better. This one is pretty much spot on.
 
Pardon me if I come across as frustrated but I am. Since around July of last year I've been discussing these topics and all along the way I've received massive amounts of push back only to be proven right every time.

I've corrected my knowledge, learned quite a lot along the way and have been able to bring a lot of issues to light surrounding DX12/Vulkan.

Seems to me that when you speak truthfully, and the truth itself is biased towards one architecture over another, you're consistently accused of being a partisan hack by partisan hacks.

Face it, Maxwell (Aside from the brute force of GM200) is an inferior DX12 architecture. That's the truth. Not a truth derived out of partisanship but an actual truth (objective truth).

This may not matter to some who upgrade every product cycle, but it does matter to folks who keep their GPUs for 2 years or more.

The GTX 970 is going to really suck at DX12. The GTX 980 is going to take its place. The re-branded Hawaii (Grenada) is going to supplant the GTX 980 and the FuryX is going to trade blows with the GM200 cards.

Pascal (GP104) and Polaris will arrive. Polaris will likely target the mainstream market whereas Pascal will target the high end. That's what I think reading the info available so far. Vega and GP100 will trade blows later on this year or Q1 2017.

Unless you upgrade to Pascal, expect to see GCN cards gaining across the board. That's just what's going to happen.
 
You disagree on what basis? Please share with us the thorough knowledge of the topic that leads you to disagree. What specific aspect of AMD's architecture inhibits it from applying driver optimizations in order to alleviate API bottlenecks under DX11?

Folks keep shifting the goal posts around as an attempt to dabble in partisan apologetics. When I suggested that AMDs Command Processor might be at fault for AMDs poor DX11 performance I was met with many NVIDIA users swearing that what I was saying wasn't true (Including Razor1) and that the issues were entirely driver related.

So now I point to AMD's surprising DX11 performance optimizations for AotS, and it seems that the goal post has shifted to AMD being unable to rectify this API overhead in their drivers, alluding instead to a hardware fault.

Pardon me but I've had it with disingenuous partisan fans who act like Hillary Clinton.




It's not that they're optimized per se, but more that the rendering capabilities of the Xbox One and PS4 APUs have been exhausted, and as such developers are tapping into the available compute capabilities of those APUs as a means of performing traditionally graphics-rendered tasks over the compute queue. This changes the Graphics:Compute ratio, which evidently benefits GCN.




What DX12 features does Star Swarm use? I bet you don't know the answer, yet you assert that it uses "features", with an S, thus alluding to many. Star Swarm does use multi-threaded rendering, which alleviates CPU bottlenecks. What else?



An enormous amount. AotS started off with 20% of its pipeline occurring in the compute pipeline (Alpha). According to Razor1, this increased to 40% starting with the Beta, based on slides he's seen. Star Swarm has 0 (or near that).



What good is Conservative Rasterization for a real-time strategy game? I mean, you can use CR to efficiently calculate shadow effects (pixel/triangle coverage), but the effect itself would bring down performance. Considering the number of shadows and lights the numerous units on screen in AotS emit, CR is pointless unless you like gaming in single-digit land. Look at the DX11 CR implementation for VXAO in Tomb Raider - an enormous performance hit.

ROV is a performance-hit-inducing feature. It can be used to properly emulate smoke effects. Pointless for AotS.

Intel has a far more robust CR implementation than NVIDIA. Intel is Tier 3 whereas NVIDIA is Tier 1.

There's no reason to use either for AotS.




And AotS makes use of a robust amount of DX12 features. Moreso than any other engine to date.

And no, it's entirely NVIDIA's fault. Their lack of features, their lack of willingness to get involved in the consoles because they wanted more money from MS. From GameWorks, now leading to push back, to the 3.5GB RAM debacle, etc., etc.

It's entirely NVIDIA's fault. Neither AMD nor NVIDIA deserve our empathy. If both are playing on an even playing field (open standards), then there should be no sympathizing with either company. My 2 cents.

Wow now that is an attack.
My only response is yet again the focus on post processing compute from you and then when I raise CR/ROV you bring it back to a real-time strategy game...
DX12 is more than RTS, but you seem very vocal purely about DX12 as used by Oxide; although aspects of shading/collision would still make sense even in an RTS with CR/ROV?
And regarding Star Swarm, well, it was meant to be the Nitrous engine before its evolution and use with Ashes, so maybe you can clarify in detail what it had and what was added or changed for Ashes, as I do not know.
The point is that even if, as you say, the only DX12-related aspect of the Nitrous engine back then was multi-threaded rendering, NVIDIA actually performed better with the DX12 option than with DX11, and by a lot.
Oxide mention they disable async compute for NVIDIA cards in Ashes, so please can you explain what has happened to the multi-threaded rendering and whatever else the previous iteration of the Nitrous engine did that improved NVIDIA performance, and why it could not be implemented in Ashes?

And you're really saying Xbox One games are not optimized, and it's just the compute capabilities implemented now that give them their DX11 PC performance improvement over the past?
In some ways they have closed the gap quite a lot, and in some very recent releases AMD actually has better DX11 performance than NVIDIA.
I appreciate you say AMD has improved their DX11 driver, but that has nothing to do with, as a clear example, the performance difference between NVIDIA and AMD in the PC alpha of DOOM - and neither does compute.
BTW, I agree AMD has improved their DX11 drivers, but not to the extent we see in performance gains from the latest game releases due to console-focused optimisation - even the developer has commented that the PC alpha of Doom was still only optimised for console.

Cheers
 
Too sleepy to find the quote, but someone here said that if it weren't for NVIDIA's hand in supercomputing this could have ended them:
I think that it's a business decision more than it is a technical decision. They were probably hoping to delay tech that increases efficiency like this until they couldn't squeeze any more performance out of the silicon, then start introducing the efficiency increases with future generations if the need arose to continue sales while waiting for a new substrate. I think that business model ended when AMD won over the console market (across the board with MS, Sony and Nintendo). AMD is now holding all the cards (no pun intended, but after re-reading it, I thought it was chuckle worthy), and they're the ones leading the show in the consumer market. If it weren't for Nvidia having its hand in massively parallel supercomputing, I think that this move could have ended them.

And I honestly can't blame Nvidia for trying to play their hand this way (even though it affects the performance bottom line for me). If I were viewing the limited advancements into new technologies and substrate while silicon's on life support, I'd be wondering how I was going to pay the bills too for however long a new semi-conductor tech took to come out. Holding onto efficiency boosters for as long as possible before releasing them would be a viable option.

A circumstantial 10% gain would have ended the company with an 80% desktop market share?

Pretty sure I can think of at least ONE major chip designer that has survived several years with completely non-competitive products.
Pardon me if I come across as frustrated but I am. Since around July of last year I've been discussing these topics and all along the way I've received massive amounts of push back only to be proven right every time.

I've corrected my knowledge, learned quite a lot along the way and have been able to bring a lot of issues to light surrounding DX12/Vulkan.

Seems to me that when you speak truthfully, and the truth itself is biased towards one architecture over another, you're consistently accused of being a partisan hack by partisan hacks.

Face it, Maxwell (Aside from the brute force of GM200) is an inferior DX12 architecture. That's the truth. Not a truth derived out of partisanship but an actual truth (objective truth).

This may not matter to some who upgrade every product cycle, but it does matter to folks who keep their GPUs for 2 years or more.

The GTX 970 is going to really suck at DX12. The GTX 980 is going to take its place. The re-branded Hawaii (Grenada) is going to supplant the GTX 980 and the FuryX is going to trade blows with the GM200 cards.

Pascal (GP104) and Polaris will arrive. Polaris will likely target the mainstream market whereas Pascal will target the high end. That's what I think reading the info available so far. Vega and GP100 will trade blows later on this year or Q1 2017.

Unless you upgrade to Pascal, expect to see GCN cards gaining across the board. That's just what's going to happen.

Mahigan, I'm replying to a few of the posts you made on this page; apologies if I misinterpret anything, I'm half asleep, but I have a few problems with what you're saying:

1. As you mentioned AotS is an example of a compute bound game, and as you mentioned different games will likely perform differently overall

2. There's no indication kepler/maxwell would even benefit to the same extent as GCN derivatives with 'async'

3. It's a 10%, best case scenario, performance improvement, if you ever see a 30/40% boost it's because the non 'async' path was a pile of shit

4. GM204 still matches/outperforms the stock 390/390X in Ashes (with async on for GCN) when you overclock it to match 390/390X fp32 throughput;

Conversely, at fp32 parity, with the 980 Ti and Fury X both at 8.6 TFLOPS, the 980 Ti will outperform it if async isn't enabled; if it is, it will match it.

I expect this will be exactly the same with overclocked 970 results and I'm waiting on them, do you see what i mean?

8 TFLOPS will always be 8 TFLOPS. If you want to argue that the ease and flexibility of the ACEs are a huge bonus (from the developer perspective), I'm inclined to agree.

You want to argue AMD's finer preemption will give them an upper hand in VR ? I agree

You want to argue ASYNC will make or break the dx12 generation and I genuinely have no idea where you're coming from with this

I remember reading a post of yours regarding the command processor, I think it was about it not having a large cache for commands ? Can you link ? :)

I understand you getting frustrated, Mahigan; I always appreciate the thought you put into your posts!

Face it, Maxwell (Aside from the brute force of GM200) is an inferior DX12 architecture. That's the truth. Not a truth derived out of partisanship but an actual truth (objective truth).

This does *sound* pretty partisan...

A 980 Ti does 6.75 TFLOPS at its stock boost of around 1200 MHz, and still outperforms the 8.6 TFLOPS water-cooled monster we call Fury X.

Even though AMD performance has increased significantly ever since their Omega driver release, this is also why I'm interested in reading your comments regarding the GCP again.
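Quick sanity check on those throughput figures, using the standard peak-FLOPS formula (shader count x 2 FLOP per clock for FMA x clock); shader counts are the public specs, and the 1200 MHz figure is the boost clock quoted above.

```cpp
// Peak fp32 throughput for the two cards being compared.
#include <cstdio>

int main() {
    const double gtx980ti = 2816.0 * 2.0 * 1.20e9;   // 2816 cores at 1200 MHz -> ~6.76 TFLOPS
    const double furyX    = 4096.0 * 2.0 * 1.05e9;   // 4096 SPs at 1050 MHz  -> ~8.60 TFLOPS
    std::printf("980 Ti: %.2f TFLOPS\nFury X: %.2f TFLOPS\nFury X / 980 Ti: %.2fx\n",
                gtx980ti / 1e12, furyX / 1e12, furyX / gtx980ti);
    return 0;
}
```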
 
I have not dealt with Maxwell 2, so I am not sure there; Maxwell 1 tanks using PhysX (750 Ti and 860M GPUs). I do believe Maxwell 2 and CUDA can actually do a compute and a graphics operation at the same time, so it should do better there, but start throwing in other compute workloads and I do believe it will start to tank.

Yes, you're right - strive to keep all units active at once - and that is the problem with Nvidia when you have both graphics and compute operations going on. As game workloads increase, the number of compute operations is going up along with graphics operations, and Nvidia hardware stopping the graphics operation to do compute operations starts to hurt that utilization. Nvidia hardware is highly efficient and designed around DX11 with limited threading. Hopefully future designs will be built more for DX12's parallel and more threaded operations. As more DX12 titles hit the streets, and depending on the number of compute operations going on, AMD hardware will be more effective than Nvidia hardware if Async Compute is used. I hope you see the distinction: it is not so much that AMD is going up in performance with Async Compute, it is more that Nvidia's performance is tanking with the increased number of compute operations, while AMD hardware is more effective at handling it and keeping all units working.


If you are talking about Maxwell 1 and Kepler - which you are, since you mention the 750 Ti and 860M - they don't have the same capabilities as Maxwell 2 when you are talking about multiple queues running concurrently. Also, being less performant to begin with, it stands to reason that the same PhysX effects will have a bigger performance hit on them too.

I have stated that async doesn't work well on Maxwell 2 under DirectCompute; this is a problem that needs to be fixed. It is not an issue under any other circumstances. Why do you think AMD hasn't done any type of demo with fluid dynamics yet? I mean, to create a quick demo like what nV has done with CUDA, you would think it wouldn't take more than a week? There are limitations in the compute shader language for DirectCompute and HLSL which would make it go slower... and that isn't even about async.

Threaded operations are nice, but not a necessity if the design is more efficient to begin with. The amount of efficiency to be gained is the amount of ALUs not utilized. You are not understanding that there is a maximum amount of gain. Now do you see why devs are saying it is "hard" to get performance from async? The number of idle ALUs is scattered among the different generations in each IHV's line-up, and of course between IHVs. Think about the numerous shaders a game has and how all of them have to be scheduled properly to get full advantage of async. That means every single shader might have to be rewritten for optimal performance on each IHV's cards! That is a lot of work. Preferably there would be a better approach, as async creates a maelstrom, and that better approach would be a better utilization baseline without too much developer interaction. If that is not a possibility, then yeah, there are no other options; we have to go with what is there.

I don't think anyone here can say that AMD's GPUs are 30% or more underutilized. Does that even sound remotely real? What have they been doing this far? What have developers been doing on consoles? What I think that 30% number comes from is the transition to a low-level API: Sony has two APIs, a low-level one and a high-level one, and I think they are comparing the two to make that assessment. Not just the async shaders.
 
This was from Dan Baker...
[screenshot of Dan Baker's comment]

See, he did it again too: async compute is part of the DX spec, async shaders are not.

AMDs DX11 performance is a recent development for AotS. It is due to AMDs driver work and not due to anything on Oxide's end. What it shows is that AMD can optimize DX11 drivers but the manpower it takes to do so, and AMDs financials, don't really allow for AMD to do this routinely.

The Star Swarm demo contained no post processing lighting, shadows or any compute effects. It was a pure rendering demo which tested rendering capabilities relative to draw calls. There's no conspiracy here, AMD GCN is a vastly superior compute architecture and NVIDIA Maxwell is a vastly superior Graphics Rendering architecture.

AMD can optimize for DX11 but their financial situation makes the efforts tenuous at best.

Oxide has repeatedly answered your accusation of bias - worded differently, but you're still trying to play NVIDIA apologetics. Stardock has a marketing agreement with AMD and they're Oxide's parent company. Dan Baker has repeatedly denied and invalidated your accusations of bias.

Both NV and AMD have access to the source code and can make change requests. Both had 3 on site visits with Oxide optimizing code. There's no AMD bias in the code. AotS is a prime example of a vendor agnostic title optimized properly for both vendors.

What AotS is, is heavy on compute and this benefits AMDs GCN more so than NVIDIA's architectures.
And this contradicts what you have stated before that AMD's architecture is inherently held back by DX11 because of driver overhead.
 
If you are talking about Maxwell 1 and Kepler - which you are, since you mention the 750 Ti and 860M - they don't have the same capabilities as Maxwell 2 when you are talking about multiple queues running concurrently. Also, being less performant to begin with, it stands to reason that the same PhysX effects will have a bigger performance hit on them too.

I have stated that async doesn't work well on Maxwell 2 under DirectCompute; this is a problem that needs to be fixed. It is not an issue under any other circumstances. Why do you think AMD hasn't done any type of demo with fluid dynamics yet? I mean, to create a quick demo like what nV has done with CUDA, you would think it wouldn't take more than a week? There are limitations in the compute shader language for DirectCompute and HLSL which would make it go slower... and that isn't even about async.

Threaded operations are nice, but not a necessity if the design is more efficient to begin with. The amount of efficiency to be gained is the amount of ALUs not utilized.
You are not understanding that there is a maximum amount of gain. Now do you see why devs are saying it is "hard" to get performance from async? The number of idle ALUs is scattered among the different generations in each IHV's line-up, and of course between IHVs.

I don't think anyone here can say that AMD's GPUs are 30% or more underutilized. Does that even sound remotely real? What have they been doing this far? What have developers been doing on consoles? What I think that 30% number comes from is the transition to a low-level API: Sony has two APIs, a low-level one and a high-level one, and I think they are comparing the two to make that assessment. Not just the async shaders.

This is my main gripe with all the arguments, split into two main points
1. Assumption that async performance increase is universal, not circumstantial, and not contingent on specific per GPU tuning
2. Assumption that inability to perform asynchronous multi-engine is detrimental to performance. It is not.

There's a difference between performance gain and performance loss, and outside the scope of relativism a 10% gain for AMD isn't a 10% loss for Nvidia; as we've seen, at fp32 parity Nvidia hardware can match AMD hardware without the use of async. What does that tell you?

This stinks of huge PR hype.

AMD lauds Hitman as the best use of async shaders yet!

LOOK AT THOSE HUGE 30% INCREASES UNDER DX12 WITH ASYNC

[Hitman DX11 vs DX12 benchmark charts]


Call me partisan if you wish, but had NVIDIA hardware demonstrated these same exact results (the 970 getting a boost in DX12 and the 980 Ti not), there'd have been a shitstorm, with people claiming NVIDIA had abandoned the 980 Ti.

You also have a funny situation in which the game lauded by AMD for having the best implementation of their game-changing *chuckle* technology has a 390X performing like a Fury.
 
I have not dealt with Maxwell 2, so I am not sure there; Maxwell 1 tanks using PhysX (750 Ti and 860M GPUs). I do believe Maxwell 2 and CUDA can actually do a compute and a graphics operation at the same time, so it should do better there, but start throwing in other compute workloads and I do believe it will start to tank.

Yes, you're right - strive to keep all units active at once - and that is the problem with Nvidia when you have both graphics and compute operations going on. As game workloads increase, the number of compute operations is going up along with graphics operations, and Nvidia hardware stopping the graphics operation to do compute operations starts to hurt that utilization. Nvidia hardware is highly efficient and designed around DX11 with limited threading. Hopefully future designs will be built more for DX12's parallel and more threaded operations. As more DX12 titles hit the streets, and depending on the number of compute operations going on, AMD hardware will be more effective than Nvidia hardware if Async Compute is used. I hope you see the distinction: it is not so much that AMD is going up in performance with Async Compute, it is more that Nvidia's performance is tanking with the increased number of compute operations, while AMD hardware is more effective at handling it and keeping all units working.
My thinking is Nvidia can schedule asynchronously/interleaved in hardware with compute only. Ideally you could dual issue warps/waves or at least work towards that effect. So hardware async for them is limited to cases where the number of warps/threads is known. For graphics estimating the number of threads from rasterization would be difficult. Graphics and compute can execute concurrently on Maxwell2, but that's simply software scheduling graphics and compute serially. The beginning and end would overlap to some degree. This wouldn't really achieve the pairing of disparate jobs that async is wanting without a ton of draw calls and luck.

I'm still thinking GCN's work dispatcher, at least on newer iterations, can do the balancing and scheduling you would want to get a performance benefit. Choosing graphics or compute based on current CU load. In theory this would allow full utilization with low occupancy across a wide range of loads. This pairing would benefit both vendors, but compute aside I don't think Nvidia can schedule it. Full utilization at low occupancy should benefit everyone, which is where I think things are going.

I'm also reasonably sure the ACEs actually handle the synchronization events which is a benefit for DX12.

And if you want to be truly impartial, why did they not consider CR/ROV for aspects beyond compute?
Like Mahigan said, I'm not sure those features are really suited to that style of game. Both are much more suited to lighting in FPS environments. Even for FPS games they'd be nice features to tack on, but I highly doubt there is enough hardware support in the market place to make them core features. Features currently limited to Intel and Maxwell2 still seem like extras, as a fallback method would still be required for a lot of hardware. If a dev has the resources sure, but I don't see either of those as a baseline.

A circumstantial 10% gain would have ended the company with an 80% desktop market share?
We've seen greater than 10%, though. Secondary to the performance gains is the devs' ability to actually optimize the paths. If it takes devs longer, that's significant. There have been guidelines saying to code for Nvidia because AMD just works.

I don't think anyone here can say that AMD's GPUs are 30% or more underutilized. Does that even sound remotely real?
It does if you consider that comparable cards between vendors have a rather significant peak performance gap. If an AMD card has 40% higher peak math performance and the Nvidia card is used as a baseline, achieving that 30% underutilization wouldn't be that hard.
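Quick back-of-the-envelope using that hypothetical 40% figure: a card with 1.4x the baseline's peak math rate that only matches the baseline's delivered performance is running at about 1 / 1.4 ≈ 71% of its own peak, which leaves roughly 29-30% of its throughput idle. The 30% number isn't exotic at all under that assumption.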
 
See, he did it again too: async compute is part of the DX spec, async shaders are not.


And this contradicts what you have stated before: that AMD's architecture is inherently held back by DX11 because of driver overhead.


I can't say that AMD was held back by driver overhead. I admit that I am not entirely knowledgeable on that subject. But it would be my assumption, given that async has been a part of AMD's offerings since the 79xx cards, that the issue was that until DX12, AMD engineered technology that they couldn't for the life of them get game developers to utilize, meaning that parts of their cards remained completely unused, at least from a gaming standpoint. Combine that with AMD's history of bad drivers and Nvidia blocking them repeatedly with GameWorks, and it's not hard to see why Nvidia has ruled the roost.

However, if anything, and like someone said earlier, I think that if Nvidia falls behind here it's due to a business decision rather than a technical one. They probably didn't take AMD as serious a threat anymore and had all their eggs in the lucrative DX11 basket. Hubris can affect anyone, even companies. Hence, based on their reactions, I do think async took them completely by surprise. For that, I think all credit needs to go to the head of the Radeon Technologies Group, which has been handling this smartly from the beginning - drivers have improved and are released on a far more regular cadence, and the strategic choices they are making are spot on, such as their whole open-source stance on developing for their cards - while Nvidia charges to use GameWorks.

AMD actually did the same thing to Intel when they first debuted the Athlon - for all their brilliant engineers, Intel's hubris and business decisions left them fighting a war for several years against a much smaller upstart, and it took them a while to right the ship. I just have to say I can see history potentially repeating itself here, and I hope it does, because it means we benefit in the form of better cards at cheaper prices.
 
My thinking is Nvidia can schedule asynchronously/interleaved in hardware with compute only. Ideally you could dual-issue warps/waves, or at least work towards that effect. So hardware async for them is limited to cases where the number of warps/threads is known. For graphics, estimating the number of threads from rasterization would be difficult. Graphics and compute can execute concurrently on Maxwell 2, but that's simply software scheduling graphics and compute serially; the beginning and end would overlap to some degree. This wouldn't really achieve the pairing of disparate jobs that async is aiming for without a ton of draw calls and luck.

Nah, when profiling DirectCompute games there is no concurrency whatsoever, and added to that, we see weird things going on where instructions aren't in the right queues.

It does if you consider that comparable cards between vendors have a rather significant peak performance gap. If an AMD card has 40% higher peak math performance and the Nvidia card is used as a baseline, achieving that 30% underutilization wouldn't be that hard.


The 30% isn't desktop, we are talking about consoles; that is where that number came from.
 
I can't say that AMD was held back by driver overhead. I admit that I am not entirely knowledgeable on that subject. But it would be my assumption, given that async has been a part of AMD's offerings since the 79xx cards, that the issue was that until DX12, AMD engineered technology that they couldn't for the life of them get game developers to utilize, meaning that parts of their cards remained completely unused, at least from a gaming standpoint. Combine that with AMD's history of bad drivers and Nvidia blocking them repeatedly with GameWorks, and it's not hard to see why Nvidia has ruled the roost.

However, if anything, and like someone said earlier, I think that if Nvidia falls behind here it's due to a business decision rather than a technical one. They probably didn't take AMD as serious a threat anymore and had all their eggs in the lucrative DX11 basket. Hubris can affect anyone, even companies. Hence, based on their reactions, I do think async took them completely by surprise. For that, I think all credit needs to go to the head of the Radeon Technologies Group, which has been handling this smartly from the beginning - drivers have improved and are released on a far more regular cadence, and the strategic choices they are making are spot on, such as their whole open-source stance on developing for their cards - while Nvidia charges to use GameWorks.

AMD actually did the same thing to Intel when they first debuted the Athlon - for all their brilliant engineers, Intel's hubris and business decisions left them fighting a war for several years against a much smaller upstart, and it took them a while to right the ship. I just have to say I can see history potentially repeating itself here, and I hope it does, because it means we benefit in the form of better cards at cheaper prices.


AMD bought out NexGen to come out with the Athlon; yeah, I am a huge NexGen fan...
 
"Code for Nvidia, AMD will do fine " comes from Ext3h afaik, if you look at gdc presentation there's a list of circumstances in which it is profitable for devs to use compute queue on nvidia, for amd it just says 'everything that doesn't need geometry' basically

Conversely, however, you can say that if a dev doesn't dedicate the time to implement and tune async, Nvidia will have an advantage because their hardware simply doesn't give a shit.
 
Optimizing for async reminds me of hand-tuning C++ code with ASM: getting that last 10% of performance takes 90% of the time.

Old is new, new is old.......
 
Nah, when profiling DirectCompute games there is no concurrency whatsoever, and added to that, we see weird things going on where instructions aren't in the right queues.
Which still might be related to the problem I was getting at: if DC was running with any graphics, it might not have been exclusively compute. I'll need to find more evidence on this, but it seems to track with the issues we've seen. Like you've said before, Nvidia should be able to do it, but after all this time it's not working for whatever reason.

The 30% isn't desktop, we are talking about consoles; that is where that number came from.
In that case I'm off a bit; however, if you consider PC and console effects to be somewhat transferable across platforms, it still holds. Current shader techniques likely share bottlenecks, leaving some resources underutilized.

"Code for Nvidia, AMD will do fine " comes from Ext3h afaik, if you look at gdc presentation there's a list of circumstances in which it is profitable for devs to use compute queue on nvidia, for amd it just says 'everything that doesn't need geometry' basically

Conversely, however, you can say that if a dev doesn't dedicate the time to implement and tune async, Nvidia will have an advantage because their hardware simply doesn't give a shit.
I've seen that chart. As for the tuning, I'm waiting for more examples. If the work dispatcher is doing what I suspect for GCN (at least newer versions), the tuning should happen automatically. A dev would likely shoot themselves in the foot with a lot of sync events trying to explicitly tune it when they just needed to make sure compute was available to be scheduled. The goal should just be to have both compute and graphics ready to schedule at the start of a frame. For AOTS I recall Dan Baker saying the shadows are running a frame behind, which would achieve this. He's also got the highest async gains we've seen so far.
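That frame-behind trick maps pretty directly onto D3D12 queues and fences. A rough sketch of the idea (just an illustration, not AOTS's actual code; the queues, fence, and shadow command list are assumed to already exist, and frameIndex starts at 1):

    #include <d3d12.h>

    // End of frame N: kick the shadow compute work on the compute queue
    // and mark its completion with a fence value.
    void SubmitAsyncShadowWork(ID3D12CommandQueue* computeQueue,
                               ID3D12GraphicsCommandList* shadowCompute,
                               ID3D12Fence* shadowFence,
                               UINT64 frameIndex)
    {
        ID3D12CommandList* lists[] = { shadowCompute };
        computeQueue->ExecuteCommandLists(1, lists);
        computeQueue->Signal(shadowFence, frameIndex);
    }

    // Frame N+1: the graphics queue only waits (GPU-side) on last frame's
    // shadow results right before the pass that consumes them, so the
    // compute work has had nearly a whole frame to overlap with graphics.
    void WaitForLastFramesShadows(ID3D12CommandQueue* graphicsQueue,
                                  ID3D12Fence* shadowFence,
                                  UINT64 frameIndex)
    {
        graphicsQueue->Wait(shadowFence, frameIndex - 1);
    }

The graphics queue only stalls if last frame's compute hasn't finished, so the shadow work gets nearly a full frame of overlap without any per-draw sync juggling - which fits the "just make compute available to schedule" point.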
 
AMD bought out NexGen to come out with the Athlon; yeah, I am a huge NexGen fan...


Yet what I said still happened.

If you want to argue that AMD's leadership has often been overmatched, or that they lucked into buying NexGen or some such, sure, that's a credible argument, especially given AMD's history since. But that still doesn't negate the competition that actually happened between AMD and Intel, or the fact that Intel was caught with its pants down for a while there.
 
My thinking is Nvidia can schedule asynchronously/interleaved in hardware with compute only. Ideally you could dual-issue warps/waves, or at least work towards that effect. So hardware async for them is limited to cases where the number of warps/threads is known. For graphics, estimating the number of threads from rasterization would be difficult. Graphics and compute can execute concurrently on Maxwell 2, but that's simply software scheduling graphics and compute serially; the beginning and end would overlap to some degree. This wouldn't really achieve the pairing of disparate jobs that async is aiming for without a ton of draw calls and luck.

I'm still thinking GCN's work dispatcher, at least on newer iterations, can do the balancing and scheduling you would want to get a performance benefit. Choosing graphics or compute based on current CU load. In theory this would allow full utilization with low occupancy across a wide range of loads. This pairing would benefit both vendors, but compute aside I don't think Nvidia can schedule it. Full utilization at low occupancy should benefit everyone, which is where I think things are going.

I'm also reasonably sure the ACEs actually handle the synchronization events which is a benefit for DX12.


Like Mahigan said, I'm not sure those features are really suited to that style of game. Both are much more suited to lighting in FPS environments. Even for FPS games they'd be nice features to tack on, but I highly doubt there is enough hardware support in the marketplace to make them core features. Features currently limited to Intel and Maxwell 2 still seem like extras, as a fallback method would still be required for a lot of hardware. If a dev has the resources, sure, but I don't see either of those as a baseline.


We've seen greater than 10% though. Secondary to performance gains would be devs' ability to actually optimize the paths. If it takes devs longer, that's significant. There have been some guidelines saying code for Nvidia because AMD just works.


It does if you consider that comparable cards between vendors have a rather significant peak performance gap. If an AMD card has 40% higher peak math performance and the Nvidia card is used as a baseline, achieving that 30% underutilization wouldn't be that hard.

A lot of games aren't necessarily compute bound; however, the share of compute has been increasing over time, and I think this also contributes to AMD performance improving. I categorically reject the notion that this was an intentional play by AMD - nobody in their right mind releases expensive hardware on the bet that the software will only catch up to it years later.
Which still might be related to the problem I was getting at: if DC was running with any graphics, it might not have been exclusively compute. I'll need to find more evidence on this, but it seems to track with the issues we've seen. Like you've said before, Nvidia should be able to do it, but after all this time it's not working for whatever reason.


In that case I'm off a bit; however, if you consider PC and console effects to be somewhat transferable across platforms, it still holds. Current shader techniques likely share bottlenecks, leaving some resources underutilized.


I've seen that chart. As for the tuning, I'm waiting for more examples. If the work dispatcher is doing what I suspect for GCN (at least newer versions), the tuning should happen automatically. A dev would likely shoot themselves in the foot with a lot of sync events trying to explicitly tune it when they just needed to make sure compute was available to be scheduled. The goal should just be to have both compute and graphics ready to schedule at the start of a frame. For AOTS I recall Dan Baker saying the shadows are running a frame behind, which would achieve this. He's also got the highest async gains we've seen so far.
Yeah, AotS hits that 10% goal, Hitman doesn't, yet Hitman is lauded by AMD. Weird, huh?

IO Interactive mentioned difficulty in tuning, even across hardware from the same vendor.

Console performance gains are bigger because of the CPU bottleneck, I believe.
 
Yeah, AotS hits that 10% goal, Hitman doesn't, yet Hitman is lauded by AMD. Weird, huh?

IO Interactive mentioned difficulty in tuning, even across hardware from the same vendor.

Console performance gains are bigger because of the CPU bottleneck, I believe.
It did release before AOTS, so from a marketing standpoint it would have been the best (only?) example of async to date... at the time. Still, it stands to reason the engine wasn't a prime example of DX12, or designed in a way to properly use the effects. That would likely go along with the tuning and other issues. They didn't mention which hardware was difficult to tune, either.

DX12 being new, devs having some difficulties doesn't seem that unreasonable. All the GDC presentations I've seen on async so far just say what you want to do, not how to go about it. GPUOpen has had a lot of useful information on GCN optimizations, but I haven't really seen anything on async yet - just '10% boost, maybe more, more research required'. It's entirely possible their dev relations team didn't know how to do it well. From Nvidia, on the other hand, I'm not expecting any async optimization strategies for a bit.
 
It did release before AOTS, so from a marketing standpoint it would have been the best (only?) example of async to date... at the time. Still, it stands to reason the engine wasn't a prime example of DX12, or designed in a way to properly use the effects. That would likely go along with the tuning and other issues. They didn't mention which hardware was difficult to tune, either.

DX12 being new, devs having some difficulties doesn't seem that unreasonable. All the GDC presentations I've seen on async so far just say what you want to do, not how to go about it. GPUOpen has had a lot of useful information on GCN optimizations, but I haven't really seen anything on async yet - just '10% boost, maybe more, more research required'. It's entirely possible their dev relations team didn't know how to do it well. From Nvidia, on the other hand, I'm not expecting any async optimization strategies for a bit.

I say this again, and again, and again.

If NVIDIA categorically excludes 'async' support, not just for Pascal but for Volta, Einstein, and Minkowski too, yet guarantees performance will match or exceed that of competing GPUs that use async shaders, why should we, or anyone for that matter, give a shit?
 