AMD Zen Rumours Point to Earlier Than Expected Release

The XB1's CPU is about as powerful as the 360's was, and in the case of the PS4, the CPU is a downgrade over the PS3's CPU.

citation_needed.jpg
 
No, but there are plenty of Mantle Star Swarm results which suggest that you can take your IPC and stick it ...
Didn't know that DX12 = Mantle, or that Mantle was about to be used in some games in the future.
i3s beating 8 weak cores just shows how important cores are in the current DX12 benchmarks, and we can't deny it.

The DirectX 12 Performance Preview: AMD, NVIDIA, & Star Swarm
This just shows that you don't need more cores. Four is what you need, what you have always needed, and what you will need.

They don't compare different CPUs, just the number of cores, which is interesting. You would always want 4 stronger cores over 8 moderate cores if AMD has to release something based on that.
 
Ha ha. I just took the SteamVR test and my FX-9370 and R9 290 passed with flying colors. I even outscored most of the Intel CPU / GTX 970 combos. As much as I want a new CPU, this 5 year old build keeps on keeping on with minor upgrades yearly. Just on the video card front I have gone from GTX 460 SLI, to HD 7950, to HD 7950 Crossfire, to R9 290.

I can't complain, but damn, I want something new every so often. :)
Well, those must be rather slow 970s, since I got 8.2 (Very High). Okay... my 970 runs at 1.5 GHz, but still (3770K @ 4.4 GHz). That test doesn't appear to be very CPU bound, and even some i3s seem to be enough.


I could definitely find uses for 8 real cores with SMT. It's also very obvious that new games are actually benefiting from extra cores, and that's going to be the future. If Zen has above-Ivy Bridge IPC and a ~4 GHz clock speed (8-core variant), it will sell, and it will benchmark higher than Intel's quad-cores in upcoming games.
 
The XB1's CPU is about as powerful as the 360's was, and in the case of the PS4, the CPU is a downgrade over the PS3's CPU.
No. The 360 had a triple-core, in-order PPC chip @ 3.2 GHz with a peak of 115.2 GFLOPS of performance (if Wikipedia is to be believed). Divide by three and you get 38.4 GFLOPS per core.
Xbox 360 technical specifications - Wikipedia, the free encyclopedia
The Jaguar chip in the X1 comes in at 1310 GFLOPS (again via wiki). Divide by eight and you get 163.75 GFLOPS.
Jaguar (microarchitecture) - Wikipedia, the free encyclopedia

That's more per core on the X1 than the 360 had. The PS3 only had one general-purpose core, where the PS4 has eight.


On your other point about people not programming in assembly, you're correct, but common uarchs make for easier ports by way of fewer bugs showing up during the port. High-level code can compile to the same assembly, and the same uops are available on both platforms.
 
Ha ha. I just took the SteamVR test and my FX-9370 and R9 290 passed with flying colors. I even outscored most of the Intel CPU / GTX 970 combos. As much as I want a new CPU, this 5 year old build keeps on keeping on with minor upgrades yearly. Just on the video card front I have gone from GTX 460 SLI, to HD 7950, to HD 7950 Crossfire, to R9 290.

I can't complain, but damn, I want something new every so often. :)

I know what you mean. I don't really feel like I need an upgrade because my rig's performance is still doing everything I need. Granted, I'm pretty much using this thing strictly for gaming, but it's handling everything so far with ease. This 8350 was the first golden chip I've ever had. Every other chip I've ever owned, I had to work and work for days to wring out every little extra MHz. This thing just cranked up to 4.6 on stock voltage, and I had it running stable with Cool'n'Quiet enabled in just a couple hours with not that much tweaking. I hate to get rid of it just for that reason.

But I have had it a couple years now and the itch is starting to get to me. Even though I don't "need" a new CPU/motherboard, I want one just for something new to play with.
 
No. The 360 had a triple-core, in-order PPC chip @ 3.2 GHz with a peak of 115.2 GFLOPS of performance (if Wikipedia is to be believed). Divide by three and you get 38.4 GFLOPS per core.
Xbox 360 technical specifications - Wikipedia, the free encyclopedia
The Jaguar chip in the X1 comes in at 1310 GFLOPS (again via wiki). Divide by eight and you get 163.75 GFLOPS.
Jaguar (microarchitecture) - Wikipedia, the free encyclopedia

That's more per core on the X1 than the 360 had. The PS3 only had one general-purpose core, where the PS4 has eight.

You combined the CPU and GPU, which isn't fair. Jaguar as a chip is about 1.3 TFLOPs worth of performance, but 1.2 TFLOPs of that is the GPU. The CPU on the XB1 can only get about 105 GFLOPs.

The PS4 CPU is a clear downgrade in absolute performance; the Cell had a theoretical peak rate of 240 GFLOPs, though the typical usage rate was only about 170 GFLOPs due to the Cell's design.

So re-doing your math, also throwing in desktop chips for comparison's sake:

360: 115 GFLOPs / 3 = 38.3 GFLOPs per core [disregarding SMT effects]
PS3: 170 GFLOPs / 6 = 28.3 GFLOPs per core [Only 6 cores were available for developer use on the PS3]
XB1/PS4: 105 GFLOPs / 8 = 13.125 GFLOPs per core [Highlights why the XB1/PS4 are basically required to go out of their way to thread well]
i7 4770k: 177 GFLOPS / 4 = 29.25 GFLOPs per core [Disregarding HTT effects]
FX-8350: 147 GFLOPS / 8 = 18.375 GFLOPS per core [Disregarding CMT effects; this result is probably 20% too high as a result]

Point being, a single Haswell core is worth about two cores on a console. This highlights why PCs really don't benefit from additional core scaling; the cores are already powerful enough on Intel platforms.
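
For anyone who wants to sanity-check those per-core numbers, single-precision peak is just cores x clock x FLOPs-per-cycle. Rough sketch below; the clocks and the 8 FLOPs/cycle figure for Jaguar are my own ballpark assumptions, and I'm leaving the desktop chips out because their totals swing wildly depending on whether you count FMA.

Code:
#include <cstdio>

// Theoretical single-precision peak: cores * GHz * FLOPs issued per cycle per core.
// Clocks and per-cycle throughput here are ballpark assumptions, not vendor-confirmed.
static double peak_gflops(int cores, double ghz, double flops_per_cycle) {
    return cores * ghz * flops_per_cycle;
}

int main() {
    printf("XB1 Jaguar (8c @ 1.75 GHz): %.1f GFLOPS total, %.1f per core\n",
           peak_gflops(8, 1.75, 8), peak_gflops(1, 1.75, 8));
    printf("PS4 Jaguar (8c @ 1.60 GHz): %.1f GFLOPS total, %.1f per core\n",
           peak_gflops(8, 1.60, 8), peak_gflops(1, 1.60, 8));
    return 0;
}

That lands within spitting distance of the ~105 GFLOPs CPU figure above, so the per-core table holds up.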
 
That is why Vulkan and DX12 are here to take it closer to what it should be. And if a single CPU core is fast enough to feed the GPU, then the program isn't doing much. When you move bottlenecks, you can't complain about it not working like it used to; that book is closed. If your game is not using the GPU enough, or does not need more cores to do the same thing, it just means your game (engine) is not demanding enough, and it would have sufficed to do that stuff in an older API where all the old bottlenecks are still there.

The problem is that game developers pretty much sat on their hands, with some exceptions. (No one would have thought a decade ago that EA of all companies would play such a part in this.)

Everything not having to do with rendering can be trivially done on a single CPU core. The actual "game" is maybe 50% of a single core on modern processors; it's not that hard. It's the rendering process, specifically the DX/OGL API management, that eats up performance, which is exactly what DX12/Vulkan are addressing. Still, that doesn't change the fact that, on modern CPUs, the GPU is almost always the bottleneck, so the only real effect for most people here will be a reduction in CPU load, not an increase in performance.

As for game engines "not being demanding enough", the issue is that adding new effects gets really expensive really quickly. Take physics, which is very dumbed down in games so it can be computed fast. Want multi-object dynamic physics effects calculated in realtime? Fully destructible environments [and I'm not talking the limited implementations we have now]? You'll need at least 20x the CPU horsepower you have now; physics gets really complex really quickly. Same problem with AI; anything much more advanced than we have now requires orders of magnitude more CPU power than we have.

Rendering has similar problems; notice how new effects give minor improvements in graphics but eat a ton of performance? That's because all the easy stuff has already been done, and all the advanced graphical effects are an order of magnitude too complex for current GPUs to handle. So we get incremental improvements, new AA modes, and dynamic refresh rates so NVIDIA/AMD can justify continuing to put out new products.

Now if you really want a "demanding" game engine that badly, I can always throw a few sleep(10) commands through the code. Look, it's demanding! [/sarcasm]
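
Tongue in cheek, but here's literally all it takes (a joke sketch, obviously; please don't ship this):

Code:
#include <chrono>
#include <thread>

// "Look, it's demanding!" -- burning frame time on purpose doesn't make an engine
// more advanced, just slower. Joke sketch only.
void UpdateWorld(float /*dt*/) {
    // ... the actual game logic that finishes in a fraction of a millisecond ...
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // instant "demanding" engine
}

int main() { UpdateWorld(0.016f); }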
 
Gotta love how you guys each argue that you're right when you are actually arguing different sides of how it is done. Sorry Gamerk2, you just seem to be stuck in the DX11-and-earlier era. You talk about how it is currently done and assume, and this is the problem with your argument, that what DX12 changes would in reality be minimal. You are somewhat correct. But the change that DX12 allows is far more per-frame information and batch calls, enabled simply by lower CPU overhead.
Introduction to Direct 3D 12 by Ivan Nevraev
upload_2016-2-24_6-37-6.png

upload_2016-2-24_6-39-43.png


These next slides are my favorite; they are what Peter is referring to, and they show why what he is stating is in fact correct.
upload_2016-2-24_6-40-54.png

upload_2016-2-24_6-41-16.png

upload_2016-2-24_6-46-32.png


And the ignorance of the many is the problem. Mantle was quite the boon, especially to those playing BF4 with CF. Unfortunately, those ignorant masses assumed that meant higher frame rates, and when averages and maximums weren't a great deal higher, most assumed it was a failure, more so as Nvidia was real close to those same arbitrary numbers on DX11. But the fact was that Mantle, just as DX12 will, greatly helps minimums. BF4 CF Mantle frame time graphs were the most even-looking graphs ever seen. Maximums generally won't change, as there is little to no bottleneck there. Minimums will be greatly affected, as you can see in the graphs above.

Now here is the part Peter is alluding to, I assume: because we can now reduce that CPU overhead, we can increase batches, AI, objects, and draw distance and still get playable results, maintaining the frame rates we had in DX11 but with far greater immersion. So arguing that a game gains little from DX12 when none of those above-mentioned additions are made is asinine and intentionally misleading.
 
40% IPC over Excavator is like Ivy or Haswell, right? Excavator is about 25% more than Piledriver; Sandy is 40-50% over Piledriver.
 
I've specifically stated that new APIs will benefit lower-tier CPUs, because that's where those latency delays kill you. For higher-tier CPUs, there's quite literally no difference.

You also ignore the fact that lots of small batches is VERY inefficient, especially when you consider the problem of transferring data across the PCI-E bus into a GPU's VRAM. A single large batch is easier to optimize for on current GPU architectures, though APUs in particular will benefit from the new model.
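
To put a rough number on that: every submission carries a fixed cost (driver validation, bus/transfer setup) on top of the per-triangle work, so the overhead only amortizes once batches get large. Toy model below; the 50 us and 0.01 us constants are made up purely to show the shape of the curve, not measured from any real driver.

Code:
#include <cstdio>

// Toy model: cost of drawing 'total_tris' triangles split into batches of 'batch_size'.
// fixed_cost_us = per-submission overhead, per_tri_us = useful work. Both invented.
double frame_cost_us(int total_tris, int batch_size,
                     double fixed_cost_us = 50.0, double per_tri_us = 0.01) {
    int batches = (total_tris + batch_size - 1) / batch_size;
    return batches * fixed_cost_us + total_tris * per_tri_us;
}

int main() {
    const int tris = 1000000;
    const int batch_sizes[] = {100, 1000, 10000, 100000};
    for (int batch : batch_sizes) {
        printf("batch size %6d -> %.1f ms of submission+draw cost\n",
               batch, frame_cost_us(tris, batch) / 1000.0);
    }
}

With those made-up constants, the 100-triangle batches burn roughly 500 ms of pure overhead per frame, while the big batches are dominated by actual triangle work, which is the point being made here.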
 
40% IPC over Excavator is like Ivy or Haswell, right? Excavator is about 25% more than Piledriver; Sandy is 40-50% over Piledriver.
I have seen some alluding to Haswell performance using math on current offerings. It looks reasonable, but for a first iteration, getting close to or just below Haswell would be a good win; Ivy would be a better expectation. The Zen thread over on OCN has good information, having been active every day for the last few months of its existence. Go there and look at the last few weeks to see this debate based on the recently released Carrizo.
 
And the ignorance of the many is the problem. Mantle was quite the boon, especially to those playing BF4 with CF. Unfortunately, those ignorant masses assumed that meant higher frame rates, and when averages and maximums weren't a great deal higher, most assumed it was a failure, more so as Nvidia was real close to those same arbitrary numbers on DX11. But the fact was that Mantle, just as DX12 will, greatly helps minimums. BF4 CF Mantle frame time graphs were the most even-looking graphs ever seen. Maximums generally won't change, as there is little to no bottleneck there. Minimums will be greatly affected, as you can see in the graphs above.

Now here is the part Peter is alluding to, I assume: because we can now reduce that CPU overhead, we can increase batches, AI, objects, and draw distance and still get playable results, maintaining the frame rates we had in DX11 but with far greater immersion. So arguing that a game gains little from DX12 when none of those above-mentioned additions are made is asinine and intentionally misleading.

Just--honest question? How much programming do you do versus how much flipping through slides and regurgitating marketing bits? Like, actually writing code/engines and appreciating things as they really happen versus what potentially *could* happen? As Gamerk2 writes, DX12/Vulkan benefit the CPU side on the low end (which is great for us with older hardware; it's a new lease on life) and allow for better GPU utilization from finer granularity. Amdahl's law strikes again. Just because you can make a trillion batches doesn't mean you should (cache/memory optimization).
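
Since Amdahl keeps coming up, here's the whole law in a few lines; the 70% parallel fraction is just an illustrative assumption, not a measurement of any engine.

Code:
#include <cstdio>

// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), p = parallel fraction of the work.
double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    // If 70% of the frame parallelizes, infinite cores still cap out at 1/0.3 = 3.33x.
    const int core_counts[] = {2, 4, 8, 16};
    for (int n : core_counts)
        printf("p = 0.7, %2d cores -> %.2fx speedup\n", n, amdahl(0.7, n));
}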

Please cut out the judgmental tone--it's not doing you any good.
 
You combined the CPU and GPU, which isn't fair. Jaguar as a chip is about 1.3 TFLOPs worth of performance, but 1.2 TFLOPs of that is the GPU. The CPU on the XB1 can only get about 105 GFLOPs.
You're right, I used the GPU FLOPS, not the CPU FLOPS. My bad. It took some digging, but I found a Japanese reviewer that did a FLOPS test.
ホイール欲しい ハンドル欲しい » BayTrail vs Kabini (Celeron J1900 vs Athlon 5350)
Seems the GFLOPS per core (single precision) works out to 65.6 GFLOPS / 4 = 16.4.
The total for eight cores would be 131.2 GFLOPS.

Well spotted.

The PS4 CPU is a clear downgrade in absolute performance; the Cell had a theoretical peak rate of 240 GFLOPs, though the typical usage rate was only about 170 GFLOPs due to the Cell's design.

OK, but only one core of the PS3 could be used for general-purpose code, and one of the mini cores did nothing but networking. Giving that chip its full aggregate is nonsensical. I'd be willing to bet that in most game cases the PS3 was effectively a single-core machine.
Cell (microprocessor) - Wikipedia, the free encyclopedia

Point being, a single Haswell core is worth about two cores on a console. This highlights why PCs really don't benefit from additional core scaling; the cores are already powerful enough on Intel platforms.
Yes, and? It's quite well known that a single Haswell core is more powerful than a single Jaguar core (see Athlon 5350 vs. i3 whatever).




Unrelated Edit:
I hate this new forum software :mad:
 
You combined the CPU and GPU, which isn't fair. Jaguar as a chip is about 1.3 TFLOPs worth of performance, but 1.2 TFLOPs of that is the GPU. The CPU on the XB1 can only get about 105 GFLOPs.

The PS4 CPU is a clear downgrade in absolute performance; the Cell had a theoretical peak rate of 240 GFLOPs, though the typical usage rate was only about 170 GFLOPs due to the Cell's design.

So re-doing your math, also throwing in desktop chips for comparison's sake:

360: 115 GFLOPs / 3 = 38.3 GFLOPs per core [disregarding SMT effects]
PS3: 170 GFLOPs / 6 = 28.3 GFLOPs per core [Only 6 cores were available for developer use on the PS3]
XB1/PS4: 105 GFLOPs / 8 = 13.125 GFLOPs per core [Highlights why the XB1/PS4 are basically required to go out of their way to thread well]
i7 4770k: 177 GFLOPS / 4 = 29.25 GFLOPs per core [Disregarding HTT effects]
FX-8350: 147 GFLOPS / 8 = 18.375 GFLOPS per core [Disregarding CMT effects; this result is probably 20% too high as a result]

Point being, a single Haswell core is worth about two cores on a console. This highlights why PCs really don't benefit from additional core scaling; the cores are already powerful enough on Intel platforms.

The PS4 CPU is not a downgrade; actual performance can't be compared by "raw theoretical floating point performance". E.g., a 5870 has 2720 GFLOPS. A CPU is more than just its GFLOPS. :)
 
Just--honest question? How much programming do you do versus how much flipping through slides and regurgitating marketing bits? Like, actually writing code/engines and appreciating things as they really happen versus what potentially *could* happen? As Gamerk2 writes, DX12/Vulkan benefit the CPU side on the low end (which is great for us with older hardware; it's a new lease on life) and allow for better GPU utilization from finer granularity. Amdahl's law strikes again. Just because you can make a trillion batches doesn't mean you should (cache/memory optimization).

Please cut out the judgmental tone--it's not doing you any good.
First, I will start with this: those slides were from a senior software engineer at MS, not AMD or Nvidia, so I'm not sure how that is regurgitating marketing bits when it is an introduction to DX12 and programming with it, in contrast to DX11. And I made no allusions to infinite or a trillion batches, just to using the saved CPU time to increase other aspects, allowing for greater immersion in games.

Second, the judgmental part, which I assume is in reference to my remarks on Mantle and BF4, was not aimed at anyone here but at the common notion during that time and how it was wrong. I was also trying to draw a parallel to DX12's benefits and how we should look at them as being more than just frame rate increases, which likely won't be the primary benefit of moving from DX11 to DX12.

Now, I am all for debate and an exchange of ideas, but seeing as I seem to be the only one linking the articles and papers I base my theories and findings on, I would ask that any further rebuttal come with links to papers and articles. Sorry, but I have no reason to take your word for it, just as I don't expect you to take mine, hence why I post sources for my side of the discussion.
 
This is an interesting test with a press version of AOTS
AotS-DX12-Benchmarks.png


The DX11 results are clear for the top-end Intel 5960X and the low-end AMD Phenom II X4 955. Then you get to see what is happening in DX12...

from Looking at DirectX 12 Performance in Ashes of the Singularity - SemiAccurate
OK, I have to admit I thought this would have been something from last year, but I was surprised that it is recent and very interesting, especially this part:
As a point of reference we ran benchmarks on a number of different systems using the public version of AotS. The important things to note here is that with GCN-based AMD GPUs DirectX12 basically doubles performance. Additionally, eight core CPUs both from Intel and AMD see a doubling in performance as well when tested with an infinitely fast GPU. The truly strange data point here is the 3.5x performance increase that AMD’s ancient Phenom II X4 955 sees between its unplayable DirectX 11 performance and its passable DirectX 12 performance. This is a pretty neat observation in that it gives us some hope that DirectX 12 will improve the relative standing of older multicore CPUs like this almost seven-year-old chip.

Now this is interesting, because even fast Intels are seeing increases, which, according to the vastly knowledgeable posters here, apparently won't happen. But I guess it can. (End of salt in wound...)

In the end what have we learned about DirectX 12 from AotS? CPUs that were weak in DirectX 11 perform much better in DirectX 12. Frame rates pretty much double across the board in DirectX 12. The impact of Async Compute is significant but it pales in comparison to the impact of using DirectX 12 in the first place.

So it looks like DX12 will be quite the boon to processors. Now how will companies push upgrades when what you have works far longer than before?
 
Looks like DX12 doubles the top end Intel CPU's performance and does the same for the top end AMD CPU. The little Phenom gets a mighty boost to performance under DX12. That's pretty awesome! Thanks for the link!
 
So it looks like DX12 will be quite the boon to processors. Now how will companies push upgrades when what you have works far longer than before?

More cores in general, and more efficient cores, seem to be the current trend. Dual cores won't be cutting it for high-end gaming in the future, as the shift will be towards sending more of the load to the GPU. The more information you can send to the GPU, the better, so I expect to see more of this trend going forward. We might soon see those 10+ core Intel CPUs on the desktop if Zen can perform within a stone's throw of an 8-core Intel.
 
Where can I get one of these purely synthetic, infinitely fast GPUs? How is it modeled? Did they simply treat every outbound call to the GPU as instantly done? (So not only an infinitely fast GPU, but an infinitely fast bus.) There's a great little line in the bench itself about that.

Also, before we fawn over that chart, what did he actually test? (Hint: go to the benches themselves and look)
 
Where can I get one of these purely synthetic, infinitely fast GPUs? How is it modeled? Did they simply treat every outbound call to the GPU as instantly done? (So not only an infinitely fast GPU, but an infinitely fast bus.) There's a great little line in the bench itself about that.

Also, before we fawn over that chart, what did he actually test? (Hint: go to the benches themselves and look)
The answer is in your own post. Most games today are GPU bound. Keeping that in mind, it means the CPU can take on more work and tasks, as I stated before. An infinitely fast GPU allows you to see how much, or rather how far, the CPU can go in DX12. It's akin to running benchmarks at 800x600 to see what the CPU's max/avg framerate is; not very practical in an age where 1080p is generally the baseline for anyone playing games today.
 
No, it's an estimation/approximation/extrapolation. To be trusted as far as it can be thrown.

Long and short of it: 5960X + Fury X, 8370 + R290, X4 955 + 7950 all see impressive jumps going from DX11 to DX12. That's all. Given that no crossing of CPUs and GPUs was even done, trying to tease out GPU from CPU effects is a non-starter.
 
I've specifically stated that new APIs will benefit lower-tier CPUs, because that's where those latency delays kill you. For higher-tier CPUs, there's quite literally no difference.

Sorry but you're kinda wrong about that as well: Looking at DirectX 12 Performance in Ashes of the Singularity - SemiAccurate

In this more updated version of Ashes of the Singularity, even the 5960x sees a literal doubling of performance in DX12. How is that no difference? I know that everybody saw the formative early benches for DX12 and wrote the API off but this is the reality of what you can get with proper care taken in regards to optimization with DX12 and Vulkan. Ryan himself notes how the Phenom II quad gets better performance in DX12 than the 5960x did in DX11. That speaks volumes not only about what can be achieved with high-end processors but also older, weaker processors.

Also shows Fury X beating the 980 Ti, especially with Async compute, and other sites had the same findings.
 
Sorry but you're kinda wrong about that as well: Looking at DirectX 12 Performance in Ashes of the Singularity - SemiAccurate

In this more updated version of Ashes of the Singularity, even the 5960x sees a literal doubling of performance in DX12. How is that no difference? I know that everybody saw the formative early benches for DX12 and wrote the API off but this is the reality of what you can get with proper care taken in regards to optimization with DX12 and Vulkan. Ryan himself notes how the Phenom II quad gets better performance in DX12 than the 5960x did in DX11. That speaks volumes not only about what can be achieved with high-end processors but also older, weaker processors.

Also shows Fury X beating the 980 Ti, especially with Async compute, and other sites had the same findings.
That literal doubling, though, is with an infinitely fast GPU. If you are GPU bound, a lesser CPU could be equal. It'd be nice to see some CPU benchmarks of that game just to see how well they're loading all the cores. As they mentioned in the article, the upper limit of FPS seems to be ~70 from processing gameplay. I have to admit that a CPU test to see the upper limit of the framerate is rather nice.

The interesting part of that was that the Fury was winning WITHOUT async. The benchmarks seem to show gains from async only amounting to 5-10% which would go along with the very limited use of the feature Oxide mentioned. The rest of the gains were simply DX12 over DX11.
 
OK, I have to admit I thought this would have been something from last year, but I was surprised that it is recent and very interesting, especially this part:
Now this is interesting, because even fast Intels are seeing increases, which, according to the vastly knowledgeable posters here, apparently won't happen. But I guess it can. (End of salt in wound...)

I remember seeing that first AoS benchmark, and I was the only person to understand what it was really showing. What you forget to mention is that while AMD's performance doubled, it was still slightly slower than NVIDIA's 900 series GPUs in the same benchmark... running DX11. What you're seeing, both in Mantle and DX12 benchmarks, is how bad AMD's DX11 driver is compared to NVIDIA's.

That's what no one ever got, because no one really understands benchmarking anymore, let alone understands what's going on under the hood. Mantle never made AMD GPUs faster; it just avoided their crap DX11 driver. Same deal with DX12; it allows AMD to catch up to NVIDIA in cases where their DX11 driver is suboptimal.
 
I remember seeing that first AoS benchmark, and I was the only person to understand what it was really showing. What you forget to mention is that while AMD's performance doubled, it was still slightly slower than NVIDIA's 900 series GPUs in the same benchmark... running DX11. What you're seeing, both in Mantle and DX12 benchmarks, is how bad AMD's DX11 driver is compared to NVIDIA's.

That's what no one ever got, because no one really understands benchmarking anymore, let alone understands what's going on under the hood. Mantle never made AMD GPUs faster; it just avoided their crap DX11 driver. Same deal with DX12; it allows AMD to catch up to NVIDIA in cases where their DX11 driver is suboptimal.
Let's try this: what is it that Nvidia did that reduced CPU overhead or made their DX11 drivers so much better?

Can't remember if it was this thread or another, but I did post a report from AnandTech that went into detail on why GCN and DX11 were having issues and why DX12 alleviated them.
Here it is:

AMD Dives Deep On Asynchronous Shading
Why? Well the short answer is that in another stake in the heart of DirectX 11, DirectX 11 wasn’t well suited for asynchronous shading. The same heavily abstracted, driver & OS controlled rendering path that gave DX11 its relatively high CPU overhead and poor multi-core command buffer submission also enforced very stringent processing requirements. DX11 was a serial API through and through, both for command buffer execution and as it turned out shader execution.

It's that last part that I was mentioning in whatever thread I was debating this in. Hence why I wasn't entirely sure drivers would alleviate the issue.

Granted, I have not gotten far enough along in my reading and understanding of what happens in a driver with regard to handling the OS and middleware such as DX11, but... my concern has always been with what was actually being done. For example, in last year's games tessellation was a big thing, which we know AMD has some performance issues with, at least in comparison to Nvidia's Maxwell. So when they release a driver with a game profile, they manually set the tessellation factor to a lower number, which in return increases their framerate to a more competitive level. Without arguing whether certain levels of tessellation are necessary, the problem here is that the tests then become apples to oranges and never again apples to apples. Of course this doesn't say Nvidia has done or isn't doing the same thing; I'm just making a point about why I am quite skeptical of such drivers.
 
OK, I know some don't care much for him, but Mahigan does give an explanation on OCN for this very subject:

[Anand] Ashes of the Singularity Revisited: A Beta Look at DirectX 12 & Asynchronous Shading - Page 8
AMD do support multi-threaded command listing as well as deferred rendering. That's not really the issue, the issue is the command buffer size. It is tailored to match the queue size in the Command Processor Q0. That means 64 threads wide. So basically, GCN is constantly fetching commands from the DirectX Command Buffer (kept in system memory), hitting thread0 of the CPU hard.

GCN is very parallel and thus needs a lot of work items being scheduled in order to get full utilization. So if the CPU is instructed to run some complex simulation, the GCN hardware scheduler sends a fetch command, but the CPU is busy, so you get a GPU hardware stall. The GPU pretty much waits there for more commands. This, of course, hits AMD's draw call rate hard under GCN and DX11.

With DX12, the DirectX driver is spread amongst many cores. This helps ensure that if one CPU core is busy with other work, another will be able to process the fetch commands and transfer new commands to the Command Processor.

This is a mix of an API overhead and hardware issue (Queue size is too small on the Command Processor).

NVIDIA have an edge here because they've been using a static scheduler ever since Kepler (GTX 680). This means that a large segment of NVIDIA's scheduler is in the driver (software). Their scheduler is multi-threaded (what AMD call a hidden CPU thread). So when the GPU signals for more work, another thread will process the command. On top of that, NVIDIA's Gigathread Engine can hold many more threads than AMD's command processor, so the hardware doesn't need to fetch commands as often.

So if CPU thread 0 is stuck doing other work, CPU thread 1 will be used by NVIDIA's scheduler. So NVIDIA's Kepler and newer architectures never skip a beat.

Evidently, if you move to DX12, NVIDIA won't see much in the way of a performance boost over DX11 but AMD will.

The other side of things is that under DX12 it is now NVIDIA who incur a larger CPU overhead, as their scheduler is in software and takes up CPU cycles that AMD's GCN doesn't.

Like a role reversal, but not nearly as bad as what AMD suffered under DX11 in draw-call-intensive titles.
Now it looks like some could be right about the driver part, or at least that AMD could have done what Nvidia did. The only question then is why not? I mean, if they had, would they be far ahead of Nvidia like they are in DX12? And if so, that leads back to why they didn't. Seriously, does anyone actually believe AMD can't write drivers or software? So I am left still looking for info as to why.
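
To make the stall scenario Mahigan describes a bit more concrete, here's a toy model: one "GPU" draining a small command queue that a single CPU thread refills, and that thread is periodically off running the simulation. Every number in it (queue depth, tick rates) is invented purely for illustration; it is not how any real driver is written.

Code:
#include <cstdio>

// Toy model: GPU drains a command queue that one CPU thread refills. The CPU is
// busy with simulation for 'busy_ticks' out of every 'period_ticks'. All numbers
// are invented for illustration only.
int starved_ticks(int total_ticks, int period_ticks, int busy_ticks,
                  int queue_depth, int refill_per_tick, int drain_per_tick) {
    int queued = queue_depth, starved = 0;
    for (int t = 0; t < total_ticks; ++t) {
        bool cpu_busy = (t % period_ticks) < busy_ticks;   // thread 0 off doing simulation
        if (!cpu_busy) queued += refill_per_tick;          // free: translate + submit commands
        if (queued > queue_depth) queued = queue_depth;    // the queue is only so deep
        if (queued >= drain_per_tick) queued -= drain_per_tick;
        else ++starved;                                    // GPU has nothing left to execute
    }
    return starved;
}

int main() {
    // Same CPU busyness, same drain rate: the shallow queue starves regularly,
    // the deeper one rides out the busy periods.
    printf("shallow queue (16) : starved %d of 1000 ticks\n",
           starved_ticks(1000, 10, 6, 16, 12, 4));
    printf("deeper queue (128) : starved %d of 1000 ticks\n",
           starved_ticks(1000, 10, 6, 128, 12, 4));
}

Either a deeper queue or a second thread that can refill while thread 0 is busy gets you out of the hole, which is basically the difference being argued about.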
 
It doesn't work the same way in DX11 as it does in DX12; he is assuming the queues are set up the same way in both APIs, which they aren't, because the CPU cores aren't accessed the same way at the same time. I don't understand why this is so hard to understand, and why Mahigan seems to miss this entirely in his statement. Even before the driver or hardware is a factor, the API restricts this.

So in essence, what happens with AMD hardware is that you won't be able to fill up all the queues simultaneously, since each ACE can only interact with a single active CPU core at a time, which is the same limit nV would have.

So the CPU overhead is theoretically the same, but nV seems to be able to lower their CPU overhead when doing draw calls in DX11. Why can't AMD do the same if the limitation is the same? They should be able to, even if the ACEs are being fully utilized, which they shouldn't be anyway, as the queues are set up differently. And this shouldn't cause extra CPU overhead, since the DX11 API will not allow it.

Now he also misses the fact that if CPU 0 is utilized, CPU 1 can't be accessed until CPU 0 is free... So his statement is not correct there: the scheduler can't be doing any work in DX11 in the situation he just stated, even if it's "software" based, until CPU 0 is done, as in DX11 it can't communicate with more than one core at once.

Also, this is what I was saying earlier: threads and queues don't line up, and you can't compare the two in DX11 or DX12, because it's not a straightforward "we have A, B, C, D in a queue and these threads are running, so A will go into thread 1, B will go into thread 2," etc. It doesn't work that way. So talking about threads, and trying to focus on how the queues and threads align with the way the ACEs function in different APIs, is just what it is: more confounding posts that don't make any sense.

PS: I think we posted this in the wrong thread ;)
 
It doesn't work the same way in DX11 as it does in DX12; he is assuming the queues are set up the same way in both APIs, which they aren't, because the CPU cores aren't accessed the same way at the same time. I don't understand why this is so hard to understand, and why Mahigan seems to miss this entirely in his statement. Even before the driver or hardware is a factor, the API restricts this.

So in essence, what happens with AMD hardware is that you won't be able to fill up all the queues simultaneously, since each ACE can only interact with a single active CPU core at a time, which is the same limit nV would have.

So the CPU overhead is theoretically the same, but nV seems to be able to lower their CPU overhead when doing draw calls in DX11. Why can't AMD do the same if the limitation is the same? They should be able to, even if the ACEs are being fully utilized, which they shouldn't be anyway, as the queues are set up differently. And this shouldn't cause extra CPU overhead, since the DX11 API will not allow it.

Now he also misses the fact that if CPU 0 is utilized, CPU 1 can't be accessed until CPU 0 is free... So his statement is not correct there: the scheduler can't be doing any work in DX11 in the situation he just stated, even if it's "software" based, until CPU 0 is done, as in DX11 it can't communicate with more than one core at once.

Also, this is what I was saying earlier: threads and queues don't line up, and you can't compare the two in DX11 or DX12, because it's not a straightforward "we have A, B, C, D in a queue and these threads are running, so A will go into thread 1, B will go into thread 2," etc. It doesn't work that way. So talking about threads, and trying to focus on how the queues and threads align with the way the ACEs function in different APIs, is just what it is: more confounding posts that don't make any sense.

PS: I think we posted this in the wrong thread ;)

You misunderstood what I was saying,

Leave ACEs out of it; I didn't bring them up, as they're not accessible under DX11.

I'm specifically talking about the Command Processor and its queue (Q0). The Command Processor queue holds pending work items to be executed. These items are executed sequentially. These work items include both compute and graphics commands.
Before we look into command lists in detail, we’ll take a look at what a command actually is. Every time you issue an API call like DrawIndexedInstanced, the driver has to translate this into one or more hardware-specific commands. In the case of GCN, the internal command language is called PM4. The majority of the PM4 commands we send tend to be hardware register writes; others run small programs on our embedded command processor to encapsulate more complicated functionality into a simple, terse packet. This is one part of a draw call – the translation. The next part is buffering: for maximum efficiency, you want to batch up many commands so the GPU can blast through them while you prepare new commands on the CPU.
Source: Performance Tweets Series: Command lists – GPUOpen

Now at first, this sounds pretty basic, until you ask the question "How many work items fit into the queue?" A single batch, or 64 commands. These can be graphics and compute commands.

Now, the hint as to the culprit of AMD GCN's single-threaded woes came from an AMD engineer quote in a Polaris video on AMD's YouTube channel. This engineer stated that AMD increased the size of their command buffer in order to boost single-threaded performance (DX11 performance). Source:

The Command Buffer is a set of commands kept in System Memory (RAM). These commands are fetched by the GPU. Fetching these commands requires the CPU.
Increasing the size of the buffer wouldn't help if that's all AMD did; you'd still be fetching a maximum of 64 commands. You see, the more commands you can fetch at once, the fewer times you hit the CPU. This reduces the number of times the GPU attempts to fetch commands while the CPU is busy with a complex simulation or something else. If the CPU is busy, then you get a stall: the GPU waits for the CPU to be done with its simulation before it can receive new commands. By fetching larger numbers of commands at once, you alleviate this issue. Increasing the size of the command buffer therefore alludes to changes being required in the Command Processor in order to accommodate more queued commands (like NVIDIA can do with its Gigathread Engine).

Question is, did AMD change its Command Processor for Polaris? The answer is yes, they did.
Polaris_block_diagram-617x406.jpg


So the idea is that AMD likely increased the queue size or queue count of the Command Processor for Polaris.

This is likely why AMD recommended threading the application rather than using deferred contexts:
D4TM1uW.jpg


AMD drivers do support deferred contexts and multi-threaded command listing; people with AMD cards can test that here: Direct3D Multithreaded Rendering Win32 Sample in C++ for Visual Studio 2012
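
For what it's worth, the practical upshot is easiest to see in how a DX12/Vulkan-style engine records work: each worker thread fills its own command list with no shared lock, and only the final submission is serialized. Generic sketch with std::thread; the Command/CommandList types are invented stand-ins, not real D3D12 or driver API.

Code:
#include <cstdio>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Generic sketch of multithreaded command recording: one list per worker thread,
// no contention while recording, a single serialized submit at the end.
// 'Command' and 'record_chunk' are invented stand-ins, not a real API.
struct Command { std::string desc; };
using CommandList = std::vector<Command>;

static void record_chunk(CommandList& list, int first, int count) {
    for (int i = first; i < first + count; ++i)
        list.push_back({"draw object " + std::to_string(i)});
}

int main() {
    const int workers = 4, per_worker = 1000;
    std::vector<CommandList> lists(workers);      // one command list per thread
    std::vector<std::thread> threads;

    for (int w = 0; w < workers; ++w)
        threads.emplace_back(record_chunk, std::ref(lists[w]), w * per_worker, per_worker);
    for (auto& t : threads) t.join();

    // Single submission point: the queue sees the lists in a well-defined order.
    size_t total = 0;
    for (const auto& list : lists) total += list.size();
    printf("submitted %zu recorded commands from %d threads\n", total, workers);
}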
 
The queues are still set up, and AMD can have 64 total instructions across the 8 queues, which is greater than the 32 that nV has; they just have multiple queues. So it is driver-related work to refill the 8 queues, which might hint at more CPU cycles, but it should be minimized through the driver: as the instructions in the queues are being processed, the driver will have access to which queues need to be filled back up. This actually gives AMD a benefit in DX11, as they can get more performance due to the flexibility of their design on the GPU side. So, a small hit on the CPU (which can definitely be minimized by driver optimization on a per-application basis) versus greater flexibility and performance on the GPU side. Now, the second thread in DX11 on nV hardware doesn't make any sense; if they were able to do that and go around the DX11 API, AMD would be able to do the same.

Multithreaded rendering is an issue in DX11, as it still has the same limitation: only one core can be accessed at any given time.
 
But again, that leads to my point: do you really think AMD isn't capable? I would say no, so then why haven't they? It could be that there is some inherent issue with how the drivers are set up, or with their architecture (using the term loosely here). I haven't had time to search for what Nvidia is believed to be doing in the driver to alleviate CPU stress. Mahigan is the first to state what it is, so I only have that, and it's unfortunately unsubstantiated. However, given the first few lines of this post, there must be some truth to it. I spent most of last night reading about, or at least attempting to (crappy search engines when you really want something), DirectFlip and DWM compositing as they relate to FCAT and AotS in Guru3D's latest article. Hopefully I will get time to delve into searching for CPU overhead and DX11 sometime this weekend.
 
The problem with that is you don't see those issues with FCAT in all games; in some games GCN has lower FCAT latency. Why is it only happening in certain games in DX11? If it's truly an issue that can't be fixed through drivers due to architecture limitations, i.e. a design poorly suited to DX11, it would happen in all DX11 games, which it does not.
 
The problem with that is you don't see those issues with FCAT in all games; in some games GCN has lower FCAT latency. Why is it only happening in certain games in DX11? If it's truly an issue that can't be fixed through drivers due to architecture limitations, i.e. a design poorly suited to DX11, it would happen in all DX11 games, which it does not.

Depends on where the bottlenecks are. Some titles are more dependent on memory bandwidth than pure computational performance, and coincidentally, those are the games AMD tends to beat NVIDIA in. Some games, just by virtue of their design, are going to be biased toward either AMD or NVIDIA.
 
The problem with that is you don't see those issues with FCAT in all games; in some games GCN has lower FCAT latency. Why is it only happening in certain games in DX11? If it's truly an issue that can't be fixed through drivers due to architecture limitations, i.e. a design poorly suited to DX11, it would happen in all DX11 games, which it does not.
Actually, the FCAT part was unrelated; it's just what I spent the bulk of the night on.
And about the driver: I meant that maybe making the change NVIDIA did would require a bigger change to the whole driver, so maybe it's too big a project. Just assumptions and guesses.
 
Depends on where the bottlenecks are. Some titles are more dependent on memory bandwidth than pure computational performance, and coincidentally, those are the games AMD tends to beat NVIDIA in. Some games, just by virtue of their design, are going to be biased toward either AMD or NVIDIA.


I agree, but with different settings the FCAT numbers should change as the bottleneck changes, and this would show up, but it doesn't. Also, the CPU bottleneck doesn't seem to shift much with different settings either, which really reads as a driver issue.
 