Vega Rumors

Actually, on the way home I realized it wouldn't likely help with CPU restrictions, aka Skyrim and such. Although Fallout 4 has an adaptive shadow distance feature meant to maintain a frame rate in Concord and Boston, where shadows hit very hard. Seems to work.


Sorry, I didn't mean the CPU limitation applies to all games that show a bigger performance difference between Pascal and Vega as resolution goes up; bad grammar on my part, I should have separated the two into paragraphs.

Forza seems to be the only game that is CPU limited on nV hardware. Older DX12 games, specifically AMD-sponsored titles, showed this too (Hitman was notorious for it, as was AOTS; both have since been fixed through driver updates). Now, I don't think it's AMD telling devs to do something a certain way; it's just the way intrinsic shaders are set up. If you use intrinsic shaders for AMD hardware and port them over to nV hardware, it's not straightforward: the same or similar shader will abuse draw calls on nV hardware by not using all cores, and this comes down to drivers. The hardware is inherently different and this is why code has to be tailored per hardware.

Having said that, I still do expect Vega to make some gains anyhow, but in this specific game it's not Vega looking good; it's more likely Pascal just looking bad by being held back.

Back to resolution changes and performance differences: this has more to do with fillrates (pixel and texture). Vega's fillrates didn't increase over Fiji as much as Pascal's did over Maxwell, and raw shader capability also didn't increase as much, percentage-wise, for Vega as Pascal did over Maxwell. Where Pascal was able to shift the pixel shader bottleneck, Vega's bottlenecks shifted towards parts which traditionally scale with a node shrink, and that didn't happen with Vega. This is probably due to the die size being so large; after decoupling all the units, they could only do so much.
 
Pretty sure it doesn't use RPM; I was looking at the cfg in the demo and haven't seen anything on RPM. Not only that, scaling from low-end Polaris all the way up to Vega doesn't show what you are saying.
RPM wouldn't show up in any config as it doesn't require explicit coding. The dev would just need to use the half types, which is highly likely on console, and use a recent shader compiler for the compute.

Unless heavily ALU bound, the RPM wouldn't necessarily make a huge difference. It would increase performance by using less power and letting the card clock higher. Usage of just FP16 would be a relatively constant boost affecting nearly all cards benchmarked.

I am pretty certain one or two cores when using nV hardware are getting pushed to the limits, instead of using more cores evenly, which was fairly common on early DX12 games too.
It's DX12 which spreads the load well. From some videos, CPU usage was hovering 30-50% somewhat evenly with one core pegged. A Forza community manager said that was just the result of frequent input polling to reduce latency. So just the result of one thread not sleeping as opposed to serious load. Coming from Jaguar cores, the CPU load from consoles should be rather light.

Nvidia's issue is likely choking on async based on frame times. Game likely using decoupled rendering as Vega's times were too consistent. That 12ms +/- a few percent shouldn't be possible as geometry varies around the track. Tuned sure, but Nvidia's performance should show similar results.

I'd also hazard a guess this is a Microsoft VR title in the making. That's a really solid 90FPS with details or MSAA dialed down just a bit. Of course XBVR doesn't really exist currently, but a racing game is a prime candidate for it.

And if you start looking at performance between Xbox One S and Xbox One X with this game, it has deltas similar to other games too, so RPM is not being used.
Those deltas may not be tied to FLOPs though. It does line up well to clockspeeds. As I said above, RPM would have the result of lowering temps, and in turn increasing clocks, when not the bottleneck. Vega seemed to manage stock clocks at 4k, which is at odds with most recent testing I've seen without some modifications.

Vega may not have a power issue, it just wasn't designed with maxing all the hardware simultaneously, RPM being part of that equation. In compute, ROPs, TMUs, etc. obviously aren't using much power.
 
RPM wouldn't show up in any config as it doesn't require explicit coding. The dev would just need to use the half types, which is highly likely on console, and use a recent shader compiler for the compute.


RPM requires specific code, and on that note it will show up in a cfg; they will have fallbacks to 32-bit for older AMD cards and Polaris. They have to have different paths and that will be there.

Unless heavily ALU bound, the RPM wouldn't necessarily make a huge difference. It would increase performance by using less power and letting the card clock higher. Usage of just FP16 would be a relatively constant boost affecting nearly all cards benchmarked.

Bottlenecks are in every single part of the GPU depending on the scene; each frame can have multiple bottlenecks, it's not an all-or-none principle.... If it can come in handy they will use it. If they felt it wasn't that important then they won't use it.

It's DX12 which spreads the load well. From some videos, CPU usage was hovering 30-50% somewhat evenly with one core pegged. A Forza community manager said that was just the result of frequent input polling to reduce latency. So just the result of one thread not sleeping as opposed to serious load. Coming from Jaguar cores, the CPU load from consoles should be rather light.

DX12 doesn't spread the load, where did you learn programming from? It's how the shaders are coded; the engine takes that and with DX12 splits the load. So if you have bad or incorrect code to begin with it will screw up.

Nvidia's issue is likely choking on async based on frame times. Game likely using decoupled rendering as Vega's times were too consistent. That 12ms +/- a few percent shouldn't be possible as geometry varies around the track. Tuned sure, but Nvidia's performance should show similar results.

Still talking about async? Async compute doesn't cause this with frame times, man; first off you still seem to neglect the fact there is no problem with async compute on Pascal. And if you want to look at this, all you have to do is look at frame times of various games with async on and off. That problem was still there, frame times going all over the place, with async compute off in earlier DX12 games; it had nothing to do with async compute.

I'd also hazard a guess this is a Microsoft VR title in the making. That's a really solid 90FPS with details or MSAA dialed down just a bit. Of course XBVR doesn't really exist currently, but a racing game is a prime candidate for it.

Those deltas may not be tied to FLOPs though. It does line up well to clockspeeds. As I said above, RPM would have the result of lowering temps, and in turn increasing clocks, when not the bottleneck. Vega seemed to manage stock clocks at 4k, which is at odds with most recent testing I've seen without some modifications.

Vega may not have a power issue, it just wasn't designed with maxing all the hardware simultaneously, RPM being part of that equation. In compute, ROPs, TMUs, etc. obviously aren't using much power.

What are you talking about? You think using FP16 vs FP32 will drop power usage? It doesn't work that way, RPM uses the same ALUs, and all the ALUs still need to be powered.

ROPs and TMUs are still being used the same way as before, Forza's engine isn't that compute heavy lol.......

You just stated it yourself, the deltas line up with clockspeed.

Oh btw

Forza 7 runs mostly on one CPU core lol.

What did I say about CPU usage and frame times.

straight from the dev

Some users may notice that the game utilizes nearly 100% of one of their processor cores. This is expected behavior; we intentionally run in this manner so we can react as fast as possible in order to minimize input latency. Users on power-constrained devices, such as laptops and tablets, might want to use a Performance Target of “30 FPS (V-SYNC),” which will reduce processor usage and minimize power consumption.

So it looks like a dev choice that is screwing around with nV hardware; let's see if they can fix it via drivers, shall we. As I stated before in this post, DX12 doesn't spread the load automagically; in this case the devs are using shaders that will not spread draw calls across all the cores, because it looks like they did it to reduce input lag.
 
Here are the newest Forza Motorsport 7 benchmarks. noko This might make you happy. Personally I hope that Nvidia releases a new driver or works with the developer to get more performance out of Pascal. The 1080Ti shouldn't be that far behind in performance running on "Game Ready" drivers released specifically for the game.



Forza 7 Benchmark: Vega has more gasoline in the blood than Pascal.
https://www.computerbase.de/2017-09/forza-7-benchmark/2/#diagramm-forza-7-1920-1080

Great! Hope other titles perform well. So far I am having a very good experience with the Vega 64 at 4K and VR.
 
Looks like the AMD parody account is back.
The "parody" account that suggested Vega would end up 20-30% faster than 1080ti? Just because the math and facts make sense. Because the graph above, with Game Ready drivers from Nvidia, doesn't show exactly that? Frankly I feel sorry for you. Getting even the simplest of things wrong in life.

RPM requires specific code, and on that note it will show up in a cfg; they will have fallbacks to 32-bit for older AMD cards and Polaris. They have to have different paths and that will be there.
Why would any sort of vectorization require specific code? Compilers can do that work trivially. It can of course be done manually, but simply replacing float with half on any sort of vector would be exceedingly easy to vectorize for a compiler.

RPM and FP16 are distinctly different here. Only FP16 is required to be coded for and it would be one of the first console optimizations the dev would have used. It works for AMD and Nvidia as well saving registers, cache, and bandwidth.

DX12 doesn't spread the load, where did you learn programming from? It's how the shaders are coded; the engine takes that and with DX12 splits the load. So if you have bad or incorrect code to begin with it will screw up.
Writing engines in high school and college, then engineering at one of the top ten schools in the US, and some grad work with clusters and tools from national labs.

DX12 was designed to have multiple threads submitting work. Multiple threads by its very definition spreads the load on the CPU. Shader coding has very little to nothing to do with CPU load unless all shaders need to be fully compiled. Which shouldn't be happening very often, or at the very least should be cached in some intermediate state, requiring only changes to addresses and minor tweaks.
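
Rough sketch of the submission model I mean, using plain C threads rather than actual D3D12 calls; all the names here are made up, it's just to show how recording spreads across cores with one cheap submit at the end.

Code:
/* Conceptual sketch only: not the D3D12 API, just the threading model.
 * Each worker records its own "command list" in parallel; one thread
 * then submits the whole batch, so no single core carries the recording cost. */
#include <pthread.h>
#include <stdio.h>

#define WORKERS 4

typedef struct { int id; int commands_recorded; } cmd_list_t;  /* stand-in for a command list */

static void *record_commands(void *arg) {
    cmd_list_t *list = (cmd_list_t *)arg;
    for (int i = 0; i < 1000; ++i)        /* pretend to record draw/dispatch calls */
        list->commands_recorded++;
    return NULL;
}

int main(void) {
    pthread_t threads[WORKERS];
    cmd_list_t lists[WORKERS];

    for (int i = 0; i < WORKERS; ++i) {   /* recording spread across cores */
        lists[i].id = i;
        lists[i].commands_recorded = 0;
        pthread_create(&threads[i], NULL, record_commands, &lists[i]);
    }
    for (int i = 0; i < WORKERS; ++i)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < WORKERS; ++i)     /* single cheap submit of all lists */
        printf("submit list %d with %d commands\n", lists[i].id, lists[i].commands_recorded);
    return 0;
}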

In the case of FP16, simply code FP16 and the compiler will promote to FP32 if the compiler target requires it. That configuration could be in a file, but the engine could also just preselect paths based on detected hardware. That's assuming Vega doesn't have hardware to handle the packing either, leaving the compiler to reorder FP16 instructions and the hardware to detect similar registers for packing. I can't think of any situations where coding FP16 and adjusting the target to promote to FP32 would break something.
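
A minimal sketch of that idea, assuming a target-selected half type; HALF_SUPPORTED and shader_scalar are names I made up, and _Float16 needs a compiler that supports it. The point is the "shader" source doesn't change, only the type it compiles to.

Code:
/* Sketch of target-based promotion: the same code compiles as 16-bit on
 * packed-math capable targets and silently promotes to 32-bit elsewhere. */
#include <stdio.h>

#ifdef HALF_SUPPORTED
typedef _Float16 shader_scalar;   /* 16-bit on targets that support it */
#else
typedef float    shader_scalar;   /* older targets promote to 32-bit */
#endif

/* same "shader" source either way */
static shader_scalar blend(shader_scalar a, shader_scalar b, shader_scalar t) {
    return a + (b - a) * t;
}

int main(void) {
    shader_scalar r = blend((shader_scalar)1.0f, (shader_scalar)3.0f, (shader_scalar)0.5f);
    printf("%f\n", (double)r);    /* prints 2.000000 in either precision */
    return 0;
}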

Still talking about async? Async compute doesn't cause this with frame times, man; first off you still seem to neglect the fact there is no problem with async compute on Pascal. And if you want to look at this, all you have to do is look at frame times of various games with async on and off. That problem was still there, frame times going all over the place, with async compute off in earlier DX12 games; it had nothing to do with async compute.
Inefficient scheduling won't cause bad frame times? Along with increased CPU load? Perhaps you would care to explain why the AMD cards are doing so well then. The frame times for GCN are nearly perfect here and Nvidia included a game ready driver.

I pointed this out to you years ago at this point. Current async implementations have been very limited on PC. Sticking to just overlap of graphics and compute. Barely touching on the "async shading" aspect of async that is used for multi-engine, decoupled rendering, and the ideal async techniques for VR. This is the very stuff I've been suggesting is required for modern games and especially VR. Forza 7 would seem a perfect candidate for VR, which Microsoft hasn't really pushed yet. Going by the extremely tight frame times measured for Forza, AMD should be doing rather well in comparison.

This title was developed for consoles with comparatively weak CPUs. All sorts of tasks then farmed off to the GPU for acceleration and even leading towards GPU driven rendering. That leans heavily on the multi-engine side of things. Monday Computerbase should have CPU tests, and I doubt much changes. The game just isn't designed for powerful GPUs, but I suppose an eight core might help a bit. Regardless AMD already has near perfect performance based on those consistent frame times.

What are you talking about? You think using FP16 vs FP32 will drop power usage? It doesn't work that way, RPM uses the same ALUs, and all the ALUs still need to be powered.

ROPs and TMUs are still being used the same way as before, Forza's engine isn't that compute heavy lol.......

You just stated it yourself, the deltas line up with clockspeed.

Oh btw

Forza 7 runs mostly on one CPU core lol.

What did I say about CPU usage and frame times.
You're setting up some really easy rebuttals here. RPM specifically can influence power. Same ALUs yes, but what happens when you finish the work in half the time and they go idle? Barring an insane amount of packed math, the bottleneck easily falls elsewhere. Less energy spent on ALUs means higher boost clocks affecting other parts of the chip. Not to mention an idle ALU isn't contending for cache and memory bandwidth. It's really as simple as idle vs loaded processor power usage with some throttling thrown in.

The deltas line up with clockspeed, but how often has an air cooled Vega been maintaining those clocks? Power numbers on Forza would be interesting to see here. With RPM I fully expect performance to track ROP and TMU usage, barring obvious CPU and memory bottlenecks. It would almost always remove ALUs from the equation. At least until someone goes crazy with HDR.

So it looks like a dev choice that is screwing around with nV hardware; let's see if they can fix it via drivers, shall we. As I stated before in this post, DX12 doesn't spread the load automagically; in this case the devs are using shaders that will not spread draw calls across all the cores, because it looks like they did it to reduce input lag.
Nvidia already released a game ready driver according to Computerbase and performance is as expected. As your quote stated, and I was referencing that quote earlier, the CPU load isn't an issue. Put a sleep or wait in that input polling thread and I doubt any core is more than 50%. I've seen people parroting that it's a poorly implemented game, but they have no idea what they are talking about. As the dev said, they are just polling input as frequently as possible to reduce any perceived latency. React to keyboard input as fast as possible. That would have little bearing on GPU dispatch. As I said, it's designed for console and relatively weak Jaguar cores. Testing demonstrated well-balanced loads across all cores. Save the one polling input repeatedly.
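
Toy example of the polling point, in plain C; poll_input() is a made-up stand-in for whatever the game actually reads. A loop with no wait pegs its core at 100%, add a 1 ms sleep and usage collapses while latency only grows by about a millisecond.

Code:
/* Why one core sits at 100%: a polling loop with no wait spins flat out. */
#include <stdbool.h>
#include <time.h>

static volatile bool running = true;

static bool poll_input(void) { return false; }   /* placeholder: read pad/keyboard state */

static void input_thread_busy(void) {
    while (running)
        poll_input();                            /* never yields: core sits at 100% */
}

static void input_thread_relaxed(void) {
    struct timespec ts = { 0, 1000000 };         /* 1 ms */
    while (running) {
        poll_input();
        nanosleep(&ts, NULL);                    /* POSIX sleep: core usage drops way down */
    }
}

int main(void) {
    running = false;           /* flipped here just so this demo exits immediately;  */
    input_thread_busy();       /* in the game each loop would run on its own thread  */
    input_thread_relaxed();
    return 0;
}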

This looks to be an extremely well optimized title in comparison to past releases, so I doubt much changes. This game would be representative of the DX12/async landscape of the future. It's what devs have been wanting and precisely what I've been saying all this time.

CPU load is limited, memory bandwidth less of a concern (going off Fury and Vega), and performance scaling nicely by GPU throughput. It's just getting started.
 
Sorry, but a game ready driver doesn't mean that there will be no improvements. I think Hitman was one of those games that received a massive boost later, for example.
 
The "parody" account that suggested Vega would end up 20-30% faster than 1080ti? Just because the math and facts make sense. Because the graph above, with Game Ready drivers from Nvidia, doesn't show exactly that? Frankly I feel sorry for you. Getting even the simplest of things wrong in life.


Why would any sort of vectorization require specific code? Compilers can do that work trivially. It can of course be done manually, but simply replacing float with half on any sort of vector would be exceedingly easy to vectorize for a compiler.

RPM and FP16 are distinctly different here. Only FP16 is required to be coded for and it would be one of the first console optimizations the dev would have used. It works for AMD and Nvidia as well saving registers, cache, and bandwidth.


Writing engines in high school and college, then engineering at one of the top ten schools in the US, and some grad work with clusters and tools from national labs.

DX12 was designed to have multiple threads submitting work. Multiple threads by its very definition spreads the load on the CPU. Shader coding has very little to nothing to do with CPU load unless all shaders need to be fully compiled. Which shouldn't be happening very often, or at the very least should be cached in some intermediate state, requiring only changes to addresses and minor tweaks.

In the case of FP16, simply code FP16 and the compiler will promote to FP32 if the compiler target requires it. That configuration could be in a file, but the engine could also just preselect paths based on detected hardware. That's assuming Vega doesn't have hardware to handle the packing either, leaving the compiler to reorder FP16 instructions and the hardware to detect similar registers for packing. I can't think of any situations where coding FP16 and adjusting the target to promote to FP32 would break something.


Inefficient scheduling won't cause bad frame times? Along with increased CPU load? Perhaps you would care to explain why the AMD cards are doing so well then. The frame times for GCN are nearly perfect here and Nvidia included a game ready driver.

I pointed this out to you years ago at this point. Current async implementations have been very limited on PC. Sticking to just overlap of graphics and compute. Barely touching on the "async shading" aspect of async that is used for multi-engine, decoupled rendering, and the ideal async techniques for VR. This is the very stuff I've been suggesting is required for modern games and especially VR. Forza 7 would seem a perfect candidate for VR, which Microsoft hasn't really pushed yet. Going by the extremely tight frame times measured for Forza, AMD should be doing rather well in comparison.

This title was developed for consoles with comparatively weak CPUs. All sorts of tasks then farmed off to the GPU for acceleration and even leading towards GPU driven rendering. That leans heavily on the multi-engine side of things. Monday Computerbase should have CPU tests, and I doubt much changes. The game just isn't designed for powerful GPUs, but I suppose an eight core might help a bit. Regardless AMD already has near perfect performance based on those consistent frame times.


You're setting up some really easy rebuttals here. RPM specifically can influence power. Same ALUs yes, but what happens when you finish the work in half the time and they go idle? Barring an insane amount of packed math, the bottleneck easily falls elsewhere. Less energy spent on ALUs means higher boost clocks affecting other parts of the chip. Not to mention an idle ALU isn't contending for cache and memory bandwidth. It's really as simple as idle vs loaded processor power usage with some throttling thrown in.

The deltas line up with clockspeed, but how often has an air cooled Vega been maintaining those clocks? Power numbers on Forza would be interesting to see here. With RPM I fully expect performance to track ROP and TMU usage, barring obvious CPU and memory bottlenecks. It would almost always remove ALUs from the equation. At least until someone goes crazy with HDR.


Nvidia already released a game ready driver according to Computerbase and performance is as expected. As your quote stated, and I was referencing that quote earlier, the CPU load isn't an issue. Put a sleep or wait in that input polling thread and I doubt any core is more than 50%. I've seen people parroting that it's a poorly implemented game, but they have no idea what they are talking about. As the dev said, they are just polling input as frequently as possible to reduce any perceived latency. React to keyboard input as fast as possible. That would have little bearing on GPU dispatch. As I said, it's designed for console and relatively weak Jaguar cores. Testing demonstrated well-balanced loads across all cores. Save the one polling input repeatedly.

This looks to be an extremely well optimized title in comparison to past releases, so I doubt much changes. This game would be representative of the DX12/async landscape of the future. It's what devs have been wanting and precisely what I've been saying all this time.

CPU load is limited, memory bandwidth less of a concern (going off Fury and Vega), and performance scaling nicely by GPU throughput. It's just getting started.


So you have no idea what you are talking about. I am not going to get into the details with you; nV's scheduling is years ahead of AMD's, and we can see that with nV's scaling and AMD's scaling as their chips get bigger. The only time we see nV's scaling fail is when the developer screws up.

Async compute FP16 and 32, go back to class.

FP16 and FP32 are not vectorization lol. Different shaders are necessary and the data in the engine is different. Yeah, math makes sense when you know how the math works, not when you think FP16 is vectorization of FP32; what crock is that?

The rest of your post is gibberish, as I stated you have no clue of what you are talking about, you are not a programmer and you keep making things up to fit your "theories". STOP DOING THAT! This is why you are the laughing stock of Vega supporters here, even the die hard fans know you are spouting BS.

The DEVELOPER STATED something, and you can't seem to get it through your head why it was done and that would affect Pascal. They were specifically talking about desktop paths, that is why they mentioned v-sync, laptops and others. You take that and put it towards consoles WTF. It was right there in the same sentence.

This is like the same crap you came up with that Vega can do tensor functions with a swizzle, and what else, that MPS is like async compute. MPS is a service that distributes loads across different applications and monitors them, just like Windows has for CPU cores across different CPUs, not within the same application; it functions like NUMA. So two or more programs can run efficiently and not hurt each other's performance. You were a programmer? That is a NO, you can't understand these simple things. And you were an engine programmer? I remember your first few posts at B3D; you were trying to be an engine programmer, asking pretty rudimentary questions, and you didn't seem to understand the answers given to you. Guess you didn't get very far. Back in 2004 you dabbled in engine programming; don't call that experienced, or even capable of making an engine.
 
The "parody" account that suggested Vega would end up 20-30% faster than 1080ti?


Months before Vega launched, using Fury X's teraflops vs. the leaked (which were true) Vega 64 teraflops I predicted it right around the 1080. I figured it was based in reality considering AMD's history and it was pretty close.

What assumptions did you make to get 20-30% faster than a 1080ti? That's substantially faster than where it's actually at. You predicted it to be 45-55% faster than it is....

You have to understand how off that prediction was and you keep coming back with equally crazy predictions. That's why some people react the way they do.
 
Months before Vega launched, using Fury X's teraflops vs. the leaked (which were true) Vega 64 teraflops I predicted it right around the 1080. I figured it was based in reality considering AMD's history and it was pretty close.

What assumptions did you make to get 20-30% faster than a 1080ti? That's substantially faster than where it's actually at. You predicted it to be 49% faster than it is....

You have to understand how off that prediction was and you keep coming back with equally crazy predictions. That's why some people react the way they do.


The guy doesn't even believe what the developer said and spins it to what he says is true, WTF is he doing?

Forza 3 had the same issues when it was launched too, it took a couple of patches later to fix it on nV hardware.......

Writing FP16 and FP32 in the same path (NOT EVEN THE SAME SHADER) is very difficult and needs a lot of experience in how the pipelines work. It's not a novice task; it's even hard for experienced programmers to do, but we have Anarchist just saying it's vectorization. Any programmer that knows a lick of what they are talking about knows this. There will be ALU contention problems that stem from multiple precisions on separate ALU blocks that will cause under-utilization. Using FP16 and FP32 in the same shader path (FP16 vertex shader and FP32 pixel shader) introduces FP calculation errors that must be removed, and this is not a trivial task either. You need to know how FP math behaves when going from FP16 to FP32 to remove the errors.

Sebbi even mentioned this, and he is an extremely experienced programmer.
 
He is going to hop around every time some lone game does better on Vega, while ignoring the scores of games where Vega is behind the regular GTX 1080, or even barely above the GTX 1070.


of course that is the only way to go when he doesn't understand what he is saying

Now that I have more time: Anarchist, outside of the hardware differences (what I stated before about needing different shaders), converting FP16 to FP32, like using FP16 vertex processing and FP32 pixel shaders in the same shader path, involves a number of steps. It's not just vectorization:

Zero to Zero mapping

Normalized numbers mapping

Infinites mapping

Exponent all ones and mantissa zeros mapping

Denormalized numbers mapping

Sign Bits

All of these components must be mapped correctly from FP16 to FP32, otherwise YOU WILL get errors. After you do all this, then you have to make sure you don't get any double rounding errors! Without having experience in each of these components and how they will be visually represented on a pixel level, there is no way a novice programmer will be able to get the results he wants on the first try; even experienced programmers will have difficulty with this.
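
To make those cases concrete, here is a plain C classification of FP16 bit patterns, just the case analysis from the list above, none of the actual mapping or double-rounding handling:

Code:
/* Where each FP16 bit pattern lands (zero, denormal, normal, infinity, NaN)
 * before you even think about mapping it onto FP32's 8-bit exponent and
 * 23-bit mantissa. */
#include <stdint.h>
#include <stdio.h>

static const char *classify_fp16(uint16_t h) {
    uint16_t sign     = (h >> 15) & 0x1;
    uint16_t exponent = (h >> 10) & 0x1f;   /* 5 exponent bits */
    uint16_t mantissa =  h        & 0x3ff;  /* 10 mantissa bits */
    (void)sign;                             /* sign bit is carried over the same in every case */

    if (exponent == 0)
        return (mantissa == 0) ? "zero" : "denormal";
    if (exponent == 0x1f)
        return (mantissa == 0) ? "infinity" : "NaN";
    return "normalized";
}

int main(void) {
    /* 0x3c00 is 1.0 in FP16, 0x7c00 is +inf, 0x8000 is -0 */
    uint16_t samples[] = { 0x0000, 0x0001, 0x3c00, 0x7c00, 0x7c01, 0x8000 };
    for (int i = 0; i < 6; ++i)
        printf("0x%04x -> %s\n", (unsigned)samples[i], classify_fp16(samples[i]));
    return 0;
}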

Let's see, Anarchist, will you still sit here and BS with me about everything? Vectorization my ass lol. Come on, you think Sebbi would have said it's not easy to do if he didn't mean it? First off most junior programmers won't even know how the different pipelines work; as noted by you already, experienced programmers will know how the pipelines work on a per-architecture level, but they still need the experience of doing the same shaders over and over again to know where the problems can be when going from FP16 to FP32 in the same shader (all of this mapping will change based on different shaders, and how many shaders do games employ? Hundreds, maybe even thousands, depending on how the game's shaders are set up.)
 
The first RPM game coming appears to be Wolfenstein, this month. The videos of the rendering look rather nice, and I am getting more excited about this title. This game may show how useful Rapid Packed Math is in games. My Vega 64 is ready . . .
 
The first RPM game coming appears to be Wolfenstein, this month. The videos of the rendering look rather nice, and I am getting more excited about this title. This game may show how useful Rapid Packed Math is in games. My Vega 64 is ready . . .

If it's using the same engine as Doom (I don't know personally), won't it already run great on pretty much everything?

I mean, I'd like to see what all the fuss about these new features AMD is putting in their GPUs are about too, but are they going to be demonstrable in such a game?
 
If it's using the same engine as Doom (I don't know personally), won't it already run great on pretty much everything?

I mean, I'd like to see what all the fuss about these new features AMD is putting in their GPUs are about too, but are they going to be demonstrable in such a game?

Well the purpose of the new features is to make gaming smoother and better for everyone; not just AMD users. ;) That's why they released so much documentation of their features on their website. In other words Doom doesn't run so well because the visuals are trashy, low texture crap; it runs so well because the engine is optimized and has well thought out features.
 
If it's using the same engine as Doom (I don't know personally), won't it already run great on pretty much everything?

I mean, I'd like to see what all the fuss about these new features AMD is putting in their GPUs are about too, but are they going to be demonstrable in such a game?
Really don't know, we will see. The visuals, at least to me, look more complex, with higher-resolution textures and more natural-looking lighting. It will be using AMD Intrinsics and RPM; not sure how well it will be optimized for Nvidia. From the videos it really looks pretty awesome, but those can be misleading. I have this game pre-ordered since it came with the Vega 64.

In FutureMark Serra (not released and may never be released) RPM gave a 25% boost in certain things (does not sound like an overall 25% boost from using RPM). Hopefully RPM will be able to be turned on and off from inside Wolfenstein to see performance benefits and any IQ loss (which should not be the case).
 
What assumptions did you make to get 20-30% faster than a 1080ti? That's substantially faster than where it's actually at. You predicted
I didn't make any assumptions. I took the results presented by developers and IHVs in various papers for different features Vega added and combined them conservatively: culling and binning gains largely equalizing performance, then applying measured gains from RPM. I didn't even account for async hammering Nvidia's CPU performance in comparison, and left out many features that were hard to quantify.

A title shows up with those critical features running as expected and landed almost perfectly with my predictions. Still waiting on the RPM confirmation, but coming from console it seems highly probable. Bottom line, both cards are performing in line with theoretical numbers.

You have to understand how off that prediction was and you keep coming back with equally crazy predictions. That's why some people react the way they do.
How was it off? Forza7 shows EXACTLY what I predicted. Throw in a liquid cooled Vega64 and even that 30% is really close. Forza likely isn't making the most of all the features either. I really wouldn't be surprised to see Vega go well beyond 20-30% as time progresses. Pascal just doesn't have the hardware to efficiently run some paths. Not unlike a 780ti over time, where even midrange parts outperform it.

My "crazy" predictions were spot on and almost everyone around here missed it. Whether that's bias or lack of critical thinking ability by many I couldn't say. Every site I've seen has confirmed not all Vega features are enabled and that will change the picture. Keep in mind, I was predicting parity with Titan and eventually surpassing it as RPM, primitive shaders, and GPU driven approaches land. Already there are Bethesda devs quoting 80%+ async compute workloads on upcoming games, so it's starting.

The first RPM game coming appears to be Wolfenstein, this month. The videos of the rendering look rather nice, and I am getting more excited about this title. This game may show how useful Rapid Packed Math is in games. My Vega 64 is ready . . .
The first they are marketing anyways. That doesn't mean other games like Forza couldn't use it already. It really is just a matter of using FP16 in compute shaders, which should be common in recent games.

I mean, I'd like to see what all the fuss about these new features AMD is putting in their GPUs are about too, but are they going to be demonstrable in such a game?
Most are transparent and of more use for devs. Maybe some of the Tier 3 features for conservative raster, etc.; however they are likely a few years off and I wouldn't expect huge visual differences there. As I said above, a Bethesda guy stated 80%+ async compute, so RPM could be hitting really hard.
 
I didn't make any assumptions. I took the results presented by developers and IHVs in various papers for different features Vega added and combined them conservatively: culling and binning gains largely equalizing performance, then applying measured gains from RPM. I didn't even account for async hammering Nvidia's CPU performance in comparison, and left out many features that were hard to quantify.

A title shows up with those critical features running as expected and landed almost perfectly with my predictions. Still waiting on the RPM confirmation, but coming from console it seems highly probable. Bottom line, both cards are performing in line with theoretical numbers.


How was it off? Forza7 shows EXACTLY what I predicted. Throw in a liquid cooled Vega64 and even that 30% is really close. Forza likely isn't making the most of all the features either. I really wouldn't be surprised to see Vega go well beyond 20-30% as time progresses. Pascal just doesn't have the hardware to efficiently run some paths. Not unlike a 780ti over time, where even midrange parts outperform it.

My "crazy" predictions were spot on and almost everyone around here missed it. Whether that's bias or lack of critical thinking ability by many I couldn't say. Every site I've seen has confirmed not all Vega features are enabled and that will change the picture. Keep in mind, I was predicting parity with Titan and eventually surpassing it as RPM, primitive shaders, and GPU driven approaches land. Already there are Bethesda devs quoting 80%+ async compute workloads on upcoming games, so it's starting.


The first they are marketing anyways. That doesn't mean other games like Forza couldn't use it already. It really is just a matter of using FP16 in compute shaders, which should be common in recent games.


Most are transparent and of more use for devs. Maybe some of the Tier 3 features for conservative raster, etc.; however they are likely a few years off and I wouldn't expect huge visual differences there. As I said above, a Bethesda guy stated 80%+ async compute, so RPM could be hitting really hard.

Forza isn't even out yet and the "leaked" benches show something really wrong with the nVidia side not something really great on the AMD side... it's not confirmation of what you've been saying. It's a typical Microsoft launch.
 
FP16 and FP32 are not vectorization lol
Arguably AMD’s marquee feature from a compute standpoint for Vega is Rapid Packed Math. Which is AMD’s name for packing two FP16 operations inside of a single FP32 operation in a vec2 style.
https://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/4
If you say so. Included a less technical explanation for you. Yes FP16/32 alone isn't vectorization, but registers are generally standardized around 32 bits. So FP16 results in packing 2:1.

The rest of your post is gibberish, as I stated you have no clue of what you are talking about, you are not a programmer and you keep making things up to fit your "theories". STOP DOING THAT! This is why you are the laughing stock of Vega supporters here, even the die hard fans know you are spouting BS.
Gibberish is just about everything you post. You copy in technical material in an attempt to show you know wtf you are talking about, but with no understanding of what you're saying and hoping nobody else can or will bother to actually parse it. This shit isn't difficult to understand either. So how is it I'm the laughing stock, yet you're the one that f'd up? If I'm the laughing stock it reflects rather poorly on those laughing, as it means they're too dumb to actually understand. Essentially easily manipulated sheep lacking the ability to think critically. Frankly I find this hilarious that you followed the same theory as everyone else and it fell flat. While I took a unique view and was spot on. Unless you don't think the Forza benchmark showing 22% faster when I predicted parity to 20-30% faster.

The DEVELOPER STATED something, and you can't seem to get it through your head why it was done and that would affect Pascal. They were specifically talking about desktop paths, that is why they mentioned v-sync, laptops and others. You take that and put it towards consoles WTF. It was right there in the same sentence.
Err, the developer said exactly what I did. They mentioned vsync for battery life. Anyways, they had to release a PR statement clarifying it because apparently people didn't understand it.
Forza Motorsport 7 is not limited to running on one core. There seems to have been a miscommunication along the way. “Forza Motorsport 7” uses as many cores as are available on whatever system it runs on, whether that is a 4- to 16-core PC or the 7 cores available on Xbox One.
http://wccftech.com/turn-10-forza-motorsport-7-one-core/

Let's see, Anarchist, will you still sit here and BS with me about everything? Vectorization my ass lol. Come on, you think Sebbi would have said it's not easy to do if he didn't mean it? First off most junior programmers won't even know how the different pipelines work; as noted by you already, experienced programmers will know how the pipelines work on a per-architecture level, but they still need the experience of doing the same shaders over and over again to know where the problems can be when going from FP16 to FP32 in the same shader (all of this mapping will change based on different shaders, and how many shaders do games employ? Hundreds, maybe even thousands, depending on how the game's shaders are set up.)
I'm unsure where he would have said it was hard. Any reasonably educated programmer would understand floating point math. Especially anyone writing shaders. Going off the papers devs keep presenting, they seem to understand well enough. The less educated leaning on the big engines and skilled devs.

Already linked you the vectorization, but "packing" rapid "packed" math into Vec2 is pretty simple. Not all that difficult for a compiler either. As vec3/4 is somewhat common in 3D space, that mapping is rather straightforward. That's already common for anyone familiar with compiling on PC. Junior programmers don't even need to understand the pipelines to convert to FP16, and casting isn't difficult to figure out if needed. Consoles obviously easier as everything suitable is FP16 already. Even without that step, limiting to Polaris and Maxwell2(I think) would be sufficient for FP16 support.
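
A quick software model of the 2:1 packing I mean, in plain C; it just shows how two FP16 lanes share one 32-bit register image, not the actual ISA, and the names are mine. Hardware like RPM would operate on both lanes in one instruction.

Code:
/* Two FP16 lanes packed into one 32-bit "register": the packing/unpacking
 * only, no FP16 arithmetic. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t half2_reg;   /* stand-in for a 32-bit register holding 2 x FP16 */

static half2_reg pack_half2(uint16_t lo, uint16_t hi) {
    return (half2_reg)lo | ((half2_reg)hi << 16);
}

static uint16_t lane(half2_reg r, int i) {      /* i = 0 or 1 */
    return (uint16_t)(r >> (16 * i));
}

int main(void) {
    /* 0x3c00 = 1.0 and 0x4000 = 2.0 as FP16 bit patterns */
    half2_reg r = pack_half2(0x3c00, 0x4000);
    printf("reg = 0x%08x, lane0 = 0x%04x, lane1 = 0x%04x\n",
           (unsigned)r, (unsigned)lane(r, 0), (unsigned)lane(r, 1));
    return 0;
}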

Forza isn't even out yet and the "leaked" benches show something really wrong with the nVidia side not something really great on the AMD side... it's not confirmation of what you've been saying. It's a typical Microsoft launch.
"Leaked" may not be the right word when provided to the site, a driver released specifically for it from Nvidia, and performance looking rather solid. Low CPU usage and spread across all cores, good FPS and frametimes, and stable from everything I've seen. Those Vega frame times were nearly flawless. The only Nvidia issue I can see is choking on the async submission, but without the "performance critical" MPS hardware or ACEs of GCN, that will take a lot of work to even out. There is a reason devs always disable async on Nvidia.
 
Forza isn't even out yet and the "leaked" benches show something really wrong with the nVidia side not something really great on the AMD side... it's not confirmation of what you've been saying. It's a typical Microsoft launch.
IT IS OUT FOR THE LOVE OF ALL THAT IS HOLY!!!

It released last week for those that purchased the ULTIMATE EDITION. It releases this week for the base game purchasers.
 
IT IS OUT FOR THE LOVE OF ALL THAT IS HOLY!!!

It released last week for those that purchased the ULTIMATE EDITION. It releases this week for the base game purchasers.

Oh, my bad. When I checked the store it said "preorder."

Still doesn't change the point that it's shit optimization for nVidia and not a magical what can be for AMD.
 
If you say so. Included a less technical explanation for you. Yes FP16/32 alone isn't vectorization, but registers are generally standardized around 32 bits. So FP16 results in packing 2:1.

Oh, you are only talking about the packed math portion? That isn't where the problem lies, as I told you just how difficult it is to convert from an FP16 vertex shader to an FP32 pixel shader........
Gibberish is just about everything you post. You copy in technical material in an attempt to show you know wtf you are talking about, but with no understanding of what you're saying and hoping nobody else can or will bother to actually parse it. This shit isn't difficult to understand either. So how is it I'm the laughing stock, yet you're the one that f'd up? If I'm the laughing stock it reflects rather poorly on those laughing, as it means they're too dumb to actually understand. Essentially easily manipulated sheep lacking the ability to think critically. Frankly I find this hilarious that you followed the same theory as everyone else and it fell flat. While I took a unique view and was spot on. Unless you don't think the Forza benchmark showing 22% faster when I predicted parity to 20-30% faster.

LOL yeah, let's wait and see how it turns out in a month or two. If drivers or game updates end up doing what happened with Forza 3 or 6 on nV architecture, what will you say then? I will point it out.

Nothing is difficult for you, you are the pinnacle of arrogance and everyone that does this for a living sucks at it because you say so......

Err, the developer said exactly what I did. They mentioned vsync for battery life. Anyways, they had to release a PR statement clarifying it because apparently people didn't understand it.

That was the ENGINE Developer, the game developer said what I quoted. THIS WAS THE exact same problem with Forza 3 and 6 on nV cards! I play the Forza series and have noticed it in those two games too when released; shortly after, a game update solved the problem. Go look up those games and you will see the problem existed for nV cards! Tons of info on their forums, stuttering, frame times all over the place etc.

I'm unsure where he would have said it was hard. Any reasonably educated programmer would understand floating point math. Especially anyone writing shaders. Going off the papers devs keep presenting, they seem to understand well enough. The less educated leaning on the big engines and skilled devs.

It's not just the math; the math part is the easier of the two. It's what the math does and how the errors show up at a pixel level that takes experience.

Tell me, if I give you a fp 32 shader, pixel and vertex example we can take one from the web, like for normal maps or better yet AO, you want to show me how to port it over to vertex shader FP 16, I can guarantee you will not know how to do it, from the math all the way through. Cause you didn't know the steps till I pointed it out. Give ya 3 hours to do this.

Already linked you the vectorization, but "packing" rapid "packed" math into Vec2 is pretty simple. Not all that difficult for a compiler either. As vec3/4 is somewhat common in 3D space, that mapping is rather straightforward. That's already common for anyone familiar with compiling on PC. Junior programmers don't even need to understand the pipelines to convert to FP16, and casting isn't difficult to figure out if needed. Consoles obviously easier as everything suitable is FP16 already. Even without that step, limiting to Polaris and Maxwell2(I think) would be sufficient for FP16 support.

I don't give a shit about the packed portion of the problem, because that is easily done through drivers, no programmer intervention. When you have different FP precisions in the same block and shader it creates headaches.


"Leaked" may not be the right word when provided to the site, a driver released specifically for it from Nvidia, and performance looking rather solid. Low CPU usage and spread across all cores, good FPS and frametimes, and stable from everything I've seen. Those Vega frame times were nearly flawless. The only Nvidia issue I can see is choking on the async submission, but without the "performance critical" MPS hardware or ACEs of GCN, that will take a lot of work to even out. There is a reason devs always disable async on Nvidia.

Yeah, why not profile the game and see if that is what it is? The demo is available, and it's not that......... If you have any problems profiling the game let me know, I'll walk you through it......

I haven't done this shit in close to 20 years, but you don't think I've talked to my programmers working on complex shaders and know what problems they are coming across? That's my job to know. That is what you don't understand. People here aren't dumb or simpletons posting about things they don't know. We know it, and we know what is hard and what is easy. That is why, when you say something like a swizzle gives tensor functionality or MPS is basically async compute, it's laughable. You are trying to make GCN the be-all of architectures, which by NO MEANS it is. It doesn't have all the capabilities of Pascal, nor does Pascal have all the capabilities of GCN; both have their strengths and weaknesses, but when it comes to what is being used today and the end metrics of the architectures, GCN is pretty much going the way of the Dodo.

So.....

How much time do you need to run the profiler, will 15 mins suffice? If you can't run a profiler, I would like to see you talk about programming FP16 to FP32 again......

You think what everyone does for a living is so damn easy to do; you minimize the work that is needed and make yourself sound like you are the best without knowing anything about what you post. Not cool man, because now I am going to challenge you to put your brain where your mouth is, because I just asked you to do things that you think are simple. If you can't do them even after they're spelled out for you, what should we think of you then?

Here is some example code just for you, with parts missing, so you can start off.

Code:
#include <stdint.h>

// float32
// Martin Kallman
//
// Fast half-precision to single-precision floating point conversion
// - Supports signed zero and denormals-as-zero (DAZ)
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~6 clock cycles on modern x86-64
void float32(float* __restrict out, const uint16_t in) {
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = in & 0x7fff;                      // Non-sign bits
    t2 = in & 0x8000;                      // Sign bit
    t3 = in & 0x7c00;                      // Exponent

    t1 <<= 13;                             // Align mantissa on MSB
    t2 <<= 16;                             // Shift sign bit into position

    t1 += 0x38000000;                      // Adjust bias

    t1 = (t3 == 0 ? 0 : t1);               // Denormals-as-zero

    t1 |= t2;                              // Re-insert sign bit

    *((uint32_t*)out) = t1;
};

// float16
// Martin Kallman
//
// Fast single-precision to half-precision floating point conversion
// - Supports signed zero, denormals-as-zero (DAZ), flush-to-zero (FTZ),
//   clamp-to-max
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~10 clock cycles on modern x86-64
void float16(uint16_t* __restrict out, const float in) {
    uint32_t inu = *((uint32_t*)&in);
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = inu & 0x7fffffff;                 // Non-sign bits
    t2 = inu & 0x80000000;                 // Sign bit
    t3 = inu & 0x7f800000;                 // Exponent

    t1 >>= 13;                             // Align mantissa on MSB
    t2 >>= 16;                             // Shift sign bit into position

    t1 -= 0x1c000;                         // Adjust bias

    t1 = (t3 > 0x38800000) ? 0 : t1;       // Flush-to-zero
    t1 = (t3 < 0x8e000000) ? 0x7bff : t1;  // Clamp-to-max
    t1 = (t3 == 0 ? 0 : t1);               // Denormals-as-zero

    t1 |= t2;                              // Re-insert sign bit

    *((uint16_t*)out) = t1;
};

This is not my code, but lo and behold, everything I talked about is mentioned or is in the actual code. This needs to be done for every single FP32 pixel shader that would need to use the FP16 vertex shaders, pretty much all in-game shaders. (This is in reverse too.)

Even better

https://fgiesen.wordpress.com/2012/03/28/half-to-float-done-quic/

Yeah it mentions everything I did! In the correct order!

Wow I must be an idiot for thinking this stuff is quite complex, same with Sebbi right?

There is an example in that link that will take you through the entire process, so it should be easy for you to take an AO shader now and rewrite it. Is 3 hours enough for you?
 
How was it off? Forza7 shows EXACTLY what I predicted. Throw in a liquid cooled Vega64 and even that 30% is really close. Forza likely isn't making the most of all the features either. I really wouldn't be surprised to see Vega go well beyond 20-30% as time progresses. Pascal just doesn't have the hardware to efficiently run some paths. Not unlike a 780ti over time, where even midrange parts outperform it.
This is laughable man. THIS HAS NOTHING TO DO WITH VEGA, even RX 580 is giving better fps than 1080Ti, this is an anomaly, something is massively holding NVIDIA back in this game. the 1080Ti is only 7% faster than 1080, there is an obvious issue here. Rest assured it will be fixed, just like Hitman and Ashes of the singularity.

Oh and stop grasping at straws to prove your failed predictions, if this is really the best example you can come up with then you are truly desperate!
 
Oh, my bad. When I checked the store it said "preorder."

Still doesn't change the point that it's shit optimization for nVidia and not a magical what can be for AMD.
Nvidia released a game ready driver for it already and said it performed as expected in the benchmark. The lead developer straight up said work was distributed and steady frame times show it's well optimized for DX12. So why is Nvidia having so much difficulty with a relatively simple game? It's using the DX11 feature set, following DX12 submission rules, and far from CPU limited.

As for AMD's "magic", the game has been released, benchmarked and played. The results are readily apparent for anyone that bothers and verified by IHVs as accurate. I don't see how you can call it magic when the evidence is right there. Or is the argument now that Vegas drivers are ahead of Pascal's? Forza isn't even partnered with AMD it using any exclusive features as far as I'm aware. For the Bethesda titles maybe there's an argument, but Forza isn't advertising intriniscs, deals, or heavy optimization specific to AMD.
 
Nvidia released a game ready driver for it already and said it performed as expected in the benchmark. The lead developer straight up said work was distributed and steady frame times show it's well optimized for DX12. So why is Nvidia having so much difficulty with a relatively simple game? It's using the DX11 feature set, following DX12 submission rules, and far from CPU limited.

As for AMD's "magic", the game has been released, benchmarked and played. The results are readily apparent for anyone that bothers and verified by IHVs as accurate. I don't see how you can call it magic when the evidence is right there. Or is the argument now that Vegas drivers are ahead of Pascal's? Forza isn't even partnered with AMD it using any exclusive features as far as I'm aware. For the Bethesda titles maybe there's an argument, but Forza isn't advertising intriniscs, deals, or heavy optimization specific to AMD.

I actually prefer Harmeeeedo's response:

This is laughable man. THIS HAS NOTHING TO DO WITH VEGA, even RX 580 is giving better fps than 1080Ti, this is an anomaly, something is massively holding NVIDIA back in this game. the 1080Ti is only 7% faster than 1080, there is an obvious issue here. Rest assured it will be fixed, just like Hitman and Ashes of the singularity.

Oh and stop grasping at straws to prove your failed predictions, if this is really the best example you can come up with then you are truly desperate!
 
Nvidia released a game ready driver for it already and said it performed as expected in the benchmark. The lead developer straight up said work was distributed and steady frame times show it's well optimized for DX12. So why is Nvidia having so much difficulty with a relatively simple game? It's using the DX11 feature set, following DX12 submission rules, and far from CPU limited.

As for AMD's "magic", the game has been released, benchmarked and played. The results are readily apparent for anyone that bothers and verified by IHVs as accurate. I don't see how you can call it magic when the evidence is right there. Or is the argument now that Vegas drivers are ahead of Pascal's? Forza isn't even partnered with AMD it using any exclusive features as far as I'm aware. For the Bethesda titles maybe there's an argument, but Forza isn't advertising intriniscs, deals, or heavy optimization specific to AMD.


How do you explain AMD cards being CPU bottlenecked in this very game? Not as much as nV's, but they are too; they consistently drop to the 75 to 85% usage range where nV's drop to the 60% range!

Maybe both Vega and Pascal have problems with async compute and it's putting too much load on the CPU. That % difference is about where the difference between Vega and Pascal should be.



Come on man make sense. Too much to the contrary to what you say or believe.

Do you see the work being distributed across the CPU cores evenly here?

I don't, I see one core at 100% and the others much less.........

So which one was right the engine dev or the game dev? The game dev stated they use one core predominately, that is what this is showing right here.

I can pull up many videos of Forza 7 that show the same exact things.

So were you able to profile Forza 7 yet?

PS: before you think that is 2 cores at 100%, this is an 8-core, 16-thread CPU; those two 100%'s are 2 threads, that is 1 core.
 
This is laughable man. THIS HAS NOTHING TO DO WITH VEGA, even RX 580 is giving better fps than 1080Ti, this is an anomaly, something is massively holding NVIDIA back in this game. the 1080Ti is only 7% faster than 1080, there is an obvious issue here. Rest assured it will be fixed, just like Hitman and Ashes of the singularity.

Oh and stop grasping at straws to prove your failed predictions, if this is really the best example you can come up with then you are truly desperate!


The 1050 Ti matches the GTX 1060 in this game too, so it's easy to see something is really fubared in this game.
 
Nothing is difficult for you, you are the pinnacle of arrogance and everyone that does this for a living sucks at it because you say so......
Stuff isn't difficult once you understand it. That's not really arrogance, and if you can't explain something in simple terms, you probably don't understand it.

That was the ENGINE Developer, the game developer said what I quoted. THIS WAS THE exact same problem with Forza 3 and 6 on nV cards! I play the Forza series and have noticed it in those two games too when released; shortly after, a game update solved the problem. Go look up those games and you will see the problem existed for nV cards! Tons of info on their forums, stuttering, frame times all over the place etc.
The context of the original statement was that a single core being loaded wasn't the same problem from prior games. Everyone assumed one core being the reason for Nvidia's performance and the guy refuted that notion. Stating it was just a thread polling input. Which was further taken as Nvidia only using one thread, at which point the lead developer further clarified it wasn't the case. That problem could have very easily been fixed a while ago. Hell, Nvidia's game ready driver could have picked a different core even if it was. Both devs said loading of a single core wasn't the issue, it's that simple.

Tell me, if I give you a fp 32 shader, pixel and vertex example we can take one from the web, like for normal maps or better yet AO, you want to show me how to port it over to vertex shader FP 16, I can guarantee you will not know how to do it, from the math all the way through. Cause you didn't know the steps till I pointed it out. Give ya 3 hours to do this.
Sure, however I'm not sure those are the best areas for packed math. FP16 is largely in compute, which isn't pixel and vertex shaders. If using vertex normals sure, but the better use would be a wholesale conversion to FP16 of the vertices for an early culling pass. That's been the recent async compute approach anyways.
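Rough CPU-side sketch in C++ of what I mean, just for readability (my own toy code and made-up numbers, nothing from the game or any console SDK): quantize positions to FP16 for a cheap, conservative culling test, and only the survivors go down the full FP32 path.

Code:
// Toy illustration of FP16 vertices used for a coarse culling pass.
// Compile with: g++ -mf16c cull_sketch.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kRound = _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC;

struct HalfVertex { uint16_t x, y, z; };   // 6 bytes instead of 12

static HalfVertex toHalf(float x, float y, float z) {
    return { _cvtss_sh(x, kRound), _cvtss_sh(y, kRound), _cvtss_sh(z, kRound) };
}

// Coarse test against a padded box; the padding hides the FP16
// quantization error so the test stays conservative.
static bool roughlyInside(const HalfVertex& v, float lo, float hi, float pad) {
    const float x = _cvtsh_ss(v.x), y = _cvtsh_ss(v.y), z = _cvtsh_ss(v.z);
    return x > lo - pad && x < hi + pad &&
           y > lo - pad && y < hi + pad &&
           z > lo - pad && z < hi + pad;
}

int main() {
    std::vector<HalfVertex> verts = { toHalf(0.25f, 1.5f, -3.0f),
                                      toHalf(250.0f, 9.0f, 40.0f) };
    for (const auto& v : verts)
        std::printf("inside view box: %d\n",
                    roughlyInside(v, -10.0f, 10.0f, 0.1f) ? 1 : 0);
}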

I don't give a shit about the packed portion of the problem, cause that is easily done through drivers, no programmer intervention. When you have different FPs in the same block and shader, it creates headaches.
If you had to convert them, even then it's a single hardware instruction. The compiler should handle the conversion automatically if you try to pass FP16 into FP32.
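Minimal sketch of that point, assuming a toolchain with the _Float16 extension type (recent Clang/GCC): widening FP16 into an FP32 parameter is implicit and compiles down to a single convert instruction, no authoring work involved.

Code:
// Assumes a compiler that supports the _Float16 extension (recent Clang/GCC).
#include <cstdio>

float lighting(float nDotL) {        // takes a regular FP32 input
    return nDotL * 0.8f + 0.2f;
}

int main() {
    _Float16 h = (_Float16)0.5f;     // half-precision value
    float out = lighting(h);         // implicit FP16 -> FP32 widening here
    std::printf("%f\n", out);
}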

That is why, when you say something like a swizzle covers tensor functionality or that MPS is aka async, it's laughable. You are trying to make GCN the be-all of architectures, which by NO MEANS it is.
What exactly do you think makes a tensor so complex? It's a giant SIMD... I've never said GCN is the be all of architectures, just that what's occurring isn't anything new. It's just vectorizing one operation instead of x parallel operations from different threads.

MPS manages input and balancing from multiple processes. Hence multi process service. Asynchronous tasks, as they are considered unrelated. In the case of ACEs, AMD uses them under a different name to distribute asynchronous work from one or more processes. As Nvidia defined them, they are "performance critical" when dealing with these asynchronous tasks.

It doesn't have all the capabilities of Pascal, nor does Pascal have all the capabilities of GCN; both have their strengths and weaknesses. But when it comes to what is being used today and the end metrics of the architectures, GCN is pretty much going the way of the Dodo.
Kind of odd to be going the way of the dodo when the major consoles are based on GCN, upcoming SM6 uses GCN2 as a foundation, and GCN was designed for async compute, which is the foundation of DX12/Vulkan and derived from Mantle. Which again was designed around GCN. Seems more on the way in than out.

This is not my code, but lo and behold, everything I talked about is mentioned or is in the actual code. This needs to be done for every single FP32 pixel shader that would need to use the FP16 vertex shaders, which is pretty much all in-game shaders. (This applies in reverse too.)
Can't just use the f16tof32() instruction in HLSL? Shave what, 80% of the instructions in the process? That conversion can be pipelined in, so best left to the compiler. FP32 to FP16 could be tricky, but the conversion isn't really the concern there as you hack off so much precision. Regardless, I think all conversions should have hardware instructions as copies are really easy.
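For anyone curious, this is roughly all an f16tof32()-style unpack boils down to, written out in C++ for readability (my toy version, normal numbers only; the real intrinsic and the hardware instruction also handle subnormals, infinities and NaN):

Code:
// Unpack an IEEE half: a few shifts, a mask, and an exponent re-bias.
#include <cstdint>
#include <cstring>
#include <cstdio>

float halfBitsToFloat(uint16_t h) {
    uint32_t sign     = (h >> 15) & 0x1;
    uint32_t exponent = (h >> 10) & 0x1F;   // 5-bit exponent, bias 15
    uint32_t mantissa =  h        & 0x3FF;  // 10-bit mantissa

    // Re-bias the exponent (15 -> 127) and widen the mantissa (10 -> 23 bits).
    uint32_t bits = (sign << 31)
                  | ((exponent - 15 + 127) << 23)
                  | (mantissa << 13);

    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main() {
    // 0x3C00 is 1.0 in IEEE half precision, 0xC500 is -5.0.
    std::printf("%f %f\n", halfBitsToFloat(0x3C00), halfBitsToFloat(0xC500));
}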

I actually prefer Harmeeeedo's response:
So waiting on magic drivers to fix Nvidia's performance then?

Do you see the work being distributed across the CPU cores evenly here?

I don't, I see one core at 100% and the others much less.........

So which one was right, the engine dev or the game dev? The game dev stated they use one core predominantly; that is what this is showing right here.
Looks well distributed, but not really using SMT. Considering the load on the cores that's probably sufficient.

Both devs were right. One core is 100%, but as has been explained multiple times now, isn't doing anything critical. It's just an optimization to make the game more responsive. Using the spare CPU cycles. The main thread is probably the second one that occasionally hits 100%. The spikes could just be data transfers, but hard to tell.
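Toy illustration of that point (my code, obviously nothing from Turn 10): a polling thread that never sleeps shows up as a 100% loaded core even though it's doing next to no real work.

Code:
// A busy-polling thread pegs one core in Task Manager while the rest of
// the program keeps its threads free for the heavy lifting.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> running{true};
std::atomic<int>  latestInput{0};

void pollInput() {
    while (running.load(std::memory_order_relaxed)) {
        // Pretend to sample a controller as often as possible to cut latency.
        latestInput.store(42, std::memory_order_relaxed);
        // No sleep here: that's what makes the core read as "100% busy".
    }
}

int main() {
    std::thread poller(pollInput);
    std::this_thread::sleep_for(std::chrono::seconds(2)); // the "game" runs elsewhere
    running = false;
    poller.join();
    std::printf("last input sample: %d\n", latestInput.load());
}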

So were you able to do the profile of Forza 7 yet?
No. Not sure I'm running the correct Linux kernel for DX12.
 
Stuff isn't difficult once you understand it. That's not really arrogance, and if you can't explain something in simple terms, you probably don't understand it.

It's arrogance when you have programmers and devs with 10+ years of experience saying it's not easy to do.


The context of the original statement was that a single core being loaded wasn't the same problem from prior games. Everyone assumed one core was the reason for Nvidia's performance and the guy refuted that notion, stating it was just a thread polling input. Which was further taken as Nvidia only using one thread, at which point the lead developer further clarified it wasn't the case. That problem could have very easily been fixed a while ago. Hell, Nvidia's game ready driver could have picked a different core even if it was. Both devs said loading of a single core wasn't the issue, it's that simple.


You want to bet? I can show you videos of Forza 6 and 3 on YouTube, done by many, many people, with the same problem!

Here is the first one that popped up in Google.



See the same problem?

Sure, however I'm not sure those are the best areas for packed math. FP16 is largely in compute, which isn't pixel and vertex shaders. If using vertex normals sure, but the better use would be a wholesale conversion to FP16 of the vertices for an early culling pass. That's been the recent async compute approach anyways.

I'm not talking about the steps involved for doing the vertices; it's after all the steps are done that you need to convert.

Go ahead show me.

If you had to convert them, even then it's a single hardware instruction. The compiler should handle the conversion automatically if you try to pass FP16 into FP32.

NO IT DOES NOT do it automatically. It can't; you will get errors because it won't know what to shave off. Every pixel shader is for different effects and will need different inputs, but will need to use the vertex shader as a base input.

What exactly do you think makes a tensor so complex? It's a giant SIMD... I've never said GCN is the be all of architectures, just that what's occurring isn't anything new. It's just vectorizing one operation instead of x parallel operations from different threads.


IT CAN'T DO the TENSOR CORE matrix multiply and accumulate. That is the main benefit of tensor cores; that is what gives them the speed.
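To spell out the operation being argued about, here's a plain scalar C++ sketch of D = A*B + C (toy code, nobody's driver or library). A tensor core runs a tile of this as one fused hardware op on FP16 inputs with FP32 accumulation; a regular SIMD unit has to issue all of these multiply-adds separately.

Code:
// Scalar reference for a 4x4 fused multiply-accumulate: D = A*B + C.
#include <array>
#include <cstdio>

using Mat4 = std::array<std::array<float, 4>, 4>;

Mat4 multiplyAccumulate(const Mat4& A, const Mat4& B, const Mat4& C) {
    Mat4 D{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];              // accumulate into C
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];     // 64 multiply-adds total
            D[i][j] = acc;
        }
    return D;
}

int main() {
    Mat4 I{}, C{};
    for (int i = 0; i < 4; ++i) { I[i][i] = 1.0f; C[i][i] = 0.5f; }
    Mat4 D = multiplyAccumulate(I, I, C);     // expect 1.5 on the diagonal
    std::printf("D[0][0] = %f\n", D[0][0]);
}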

MPS manages input and balancing from multiple processes. Hence multi process service. Asynchronous tasks, as they are considered unrelated. In the case of ACEs, AMD uses them under a different name to distribute asynchronous work from one or more processes. As Nvidia defined them, they are "performance critical" when dealing with these asynchronous tasks.

Ignored; you totally missed the point again.

Kind of odd to be going the way of the dodo when the major consoles are based on GCN, upcoming SM6 uses GCN2 as a foundation, and GCN was designed for async compute, which is the foundation of DX12/Vulkan and derived from Mantle. Which again was designed around GCN. Seems more on the way in than out.


It's the way of the Dodo: it's hot, power hungry, and slow, with nothing left in the tank to keep it afloat.

Can't just use the f16tof32() instruction in HLSL? Shave what, 80% of the instructions in the process? That conversion can be pipelined in, so best left to the compiler. FP32 to FP16 could be tricky, but the conversion isn't really the concern there as you hack off so much precision. Regardless, I think all conversions should have hardware instructions as copies are really easy.

I'm not talking about what instruction is being used; I'm talking about going from an FP16 instruction to an FP32 instruction. It is not easy to do. Well, easy in that it's easy to understand the methodology, but it has ramifications on architecture, cache, etc., and visual anomalies.

So waiting on magic drivers to fix Nvidia's performance then?

More like a dev patch. Did you see Vega 56 having CPU bottlenecks too? I can show you more than one video of that as well.


Looks well distributed, but not really using SMT. Considering the load on the cores that's probably sufficient.

That is not well distributed; that is one main thread with the rest of the work spilling over onto the other CPU threads. That is what is happening.
Both devs were right. One core is 100%, but as has been explained multiple times now, isn't doing anything critical. It's just an optimization to make the game more responsive. Using the spare CPU cycles. The main thread is probably the second one that occasionally hits 100%. The spikes could just be data transfers, but hard to tell.

It is crucial when Vega has the same problems (just not as severe).

No. Not sure I'm running the correct Linux kernel for DX12.

Oh so you haven't even played the demo yet? Do you even play games?
 
I'm downloading the game now and will confirm if I'm seeing the performance from the benchmark.
 
Vega may just be a sleeper card; once real DX12 games hit, it explodes. Well, one can hope.

As for Forza 7, the trees look terrible, using cheap planar trees. Project Cars 2's trees look better, and there are more of them, with branches sticking out into the raceway at times. With PC2, Nvidia is overtaking AMD by a large amount. Of course it's DX11 here and not DX12, but PC2 also does VR. Between the two games, PC2 looks more interesting to me, with better weather effects including wind, while Forza 7 has way more cars, around 700 compared to 190-something for PC2. Being stellar in one game does not mean much at this point; it needs to be consistently better in more games than not.

 
Upon seeing us gleefully step back on the merry-go-round of the same arguments -

And I feel like the guy from Spaceballs re-enacting the Alien breakfast scene, and he looks up and says "Oh no, not again"
 
Got a question for ya PhaseNoise,

Can drivers or hardware automatically convert from FP16 to FP32 without errors?

I don't think it's possible unless the hardware is smart enough to know where the errors might come from.
 
This is laughable, man. THIS HAS NOTHING TO DO WITH VEGA; even the RX 580 is giving better fps than the 1080 Ti. This is an anomaly, something is massively holding NVIDIA back in this game. The 1080 Ti is only 7% faster than the 1080, there is an obvious issue here. Rest assured it will be fixed, just like Hitman and Ashes of the Singularity.

Oh and stop grasping at straws to prove your failed predictions, if this is really the best example you can come up with then you are truly desperate!

https://translate.google.nl/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=https://www.computerbase.de/2017-09/forza-7-benchmark/2/&edit-text=&act=url

Nvidia confirms the deficit
The ranking in Forza 7 is very unusual. However, Nvidia has confirmed to ComputerBase that the results are correct, so there is no problem with the editorial team's test system regarding GeForce.

You do know, right, that every DX12 game uses its own way to access the graphics card? That would make your assessment based on previous games pretty moot.
 
Got a question for ya PhaseNoise,

Can drivers or hardware automatically convert from FP16 to FP32 without errors?

I don't think it's possible unless the hardware is smart enough to know where the errors might come from.

CPUs certainly can; I don't know how GPUs deal with FP16 though. In x86/x64, half precision functionality and conversion instructions are a part of SSE.

You still want to avoid conversions though, as it consumes time. "Automatic" things to take advantage of FP16 will probably not be stellar because of this. You're saving a little time on the math, but you have to convert and pack data types first. It may be faster, may be only slightly faster, or may actually be worse. It would be hard for a driver to know. A developer would know, so I agree with your points it is realistically going to require developer support.

Oh, and anyone who thinks compilers do an even remotely okay job of automatic vectorization - no, they really do not.
I work in math libraries all day long, and the automatic vectorization is minimal, at best. Partially because it's insanely hard to detect when you can effectively use it from a static code inspection standpoint except in absolutely trivial cases where lengths are known at compile time.
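For reference, a minimal sketch of those x86 half-precision conversion intrinsics (F16C, exposed through the usual intrinsics header). Just an illustrative toy, and note the conversions are extra instructions on top of the math itself.

Code:
// Pack 4 floats to halves and back with the F16C instructions.
// Compile with: g++ -mf16c f16c_demo.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) float    input[4]  = { 1.0f, 0.1f, 3.14159f, 65504.0f };
    alignas(16) uint16_t halves[8] = {};
    alignas(16) float    output[4] = {};

    // 4 floats -> 4 halves (VCVTPS2PH), round to nearest even.
    __m128  f = _mm_load_ps(input);
    __m128i h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_store_si128(reinterpret_cast<__m128i*>(halves), h);

    // 4 halves -> 4 floats (VCVTPH2PS).
    __m128 back = _mm_cvtph_ps(_mm_load_si128(reinterpret_cast<__m128i*>(halves)));
    _mm_store_ps(output, back);

    for (int i = 0; i < 4; ++i)
        std::printf("%f -> 0x%04x -> %f\n", input[i], halves[i], output[i]);
}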
 
CPUs certainly can; I don't know how GPUs deal with FP16 though. In x86/x64, half precision functionality and conversion instructions are a part of SSE.

You still want to avoid conversions though, as it consumes time. "Automatic" things to take advantage of FP16 will probably not be stellar because of this. You're saving a little time on the math, but you have to convert and pack data types first. It may be faster, may be only slightly faster, or may actually be worse. It would be hard for a driver to know. A developer would know, so I agree with your points it is realistically going to require developer support.

Oh, and anyone who thinks compilers do an even remotely okay job of automatic vectorization - no, they really do not.
I work in math libraries all day long, and the automatic vectorization is minimal, at best. Partially because it's insanely hard to detect when you can effectively use it from a static code inspection standpoint except in absolutely trivial cases where lengths are known at compile time.


Ah yeah, that makes sense, hence why my co-workers do it by hand :) thx!
 
Ah yeah, that makes sense, hence why my co-workers do it by hand :) thx!

And the big point you were driving towards which I forgot to mention - you really can't "automatically" do FP32 as FP16 on the fly, in the driver. While you can cleanly convert upward (added bits to mantissa and exponent are fine), you truncate going to FP16. That would cause all sorts of odd issues depending upon the shader. Best case, banding in colors. Alternately, big blocky effects, geometry snapping to quantization, etc. Dogs, cats, living together - complete pandemonium.
Can't really do it in a general way. There is a reason FP32 is a suitable general purpose data type and FP16 is considered special case.

So yeah, your coworkers are doing it by hand because they can then apply it to the exact cases where FP16 is not only safe (versus FP32), but provides a benefit. I don't see how a general purpose driver could handle both of those needs without knowing precisely what the nature of the math was in the first place.
In theory, a "game ready" driver could know what the title was doing with each call and replace them selectively on the fly. That's kinda insane, and requires a massive driver team to even consider. And it's still using custom designed libraries, not some heuristic and algorithmic approach.
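To put rough numbers on the banding/quantization point, a quick throwaway sketch (made-up sample values): round-tripping FP32 values through FP16 snaps them onto a much coarser grid, and past 2048 the spacing between representable FP16 values is already greater than 1.0.

Code:
// Demonstrate FP16 round-trip error with the scalar F16C conversions.
// Compile with: g++ -mf16c fp16_banding.cpp
#include <immintrin.h>
#include <cstdio>

float roundTrip(float x) {
    return _cvtsh_ss(_cvtss_sh(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
}

int main() {
    // 2049.3 snaps to the FP16 grid (spacing 2.0 there); 70000 exceeds the
    // FP16 max of 65504 and overflows to infinity.
    const float samples[] = { 0.1f, 0.2003f, 100.07f, 2049.3f, 70000.0f };
    for (float x : samples)
        std::printf("%12.4f -> %12.4f (error %g)\n", x, roundTrip(x), x - roundTrip(x));
}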
 
So FP16, if driver implemented, would be on a game-by-game basis? Compute shader replacement for items where it would give a benefit? Anyway, I thought RPM was more for AI and machine learning stuff than for games, though it could be useful for some game compute shaders. I have no idea what the overall performance advantage would be on a game using RPM. 5%-10%? Less? Well, two games advertised with RPM are coming, Wolfenstein and Far Cry 5; maybe we get to see what it can do then.
 
So just like I stated, NVIDIA released a driver that boosts Forza 7's performance 15 to 25% depending on the GPU. Sound logic pays off in the end, unlike fantasies and dreams.

I expect more optimizations to come for NVIDIA cards as well, from the developer and from NVIDIA.
 