Vega Rumors

Of course not, Razor1 should have been. It runs slowly and is not a good solution. Even with two 1070s in Witcher 3 it drags the frame rate down for almost no reason at times. AMD's solution works great on Nvidia hardware; hairworthless doesn't work great on anything.

Actually, compared to the first iteration of TressFX, HairWorks looked better. HairWorks also had the advantage of handling fur and grass, which TressFX 1.0 could not do realistically. TressFX 1.0 also didn't have LOD (adaptive tessellation), had issues with self-shadowing and shading of the hair strands, and didn't work with all renderers (it only worked with forward rendering).

Yes, not worth talking about. AMD said no tessellation or GS. Vertex in, vertex out.


TressFX 1.0 did use tessellation; I think it was 2.0 when they removed that too and had the vertex shader take care of it. Although 2.0 fixed a lot of issues TressFX had, it had a downside too: performance. Geometry expansion had issues at times and would cause slowdowns in specific tasks with LOD differentials.

It wasn't till TressFX 3.0 that it could really compete with HairWorks on visuals, performance, and flexibility, and this is why TressFX didn't get much uptake. While most renderers were deferred, TressFX wasn't useful until 2.0, which came out so much later than HairWorks that it didn't matter. If you are pushing the same amount of vertices, though, it doesn't matter: HairWorks and TressFX have similar performance, you just won't see the AMD performance cliff with TressFX.
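As a rough illustration of the "vertex in, vertex out" expansion (a generic sketch, not TressFX's actual code): each simulated guide-hair vertex gets expanded into the two edge vertices of a camera-facing ribbon directly in a shader-style kernel, so no tessellation or geometry-shader stage is needed. All names, types, and the strand layout below are assumptions for illustration; it assumes at least two vertices per strand.

Code:
#include <cuda_runtime.h>
#include <math.h>

__device__ float3 sub3(float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x);
}
__device__ float3 norm3(float3 v) {
    float len = sqrtf(v.x * v.x + v.y * v.y + v.z * v.z) + 1e-8f;
    return make_float3(v.x / len, v.y / len, v.z / len);
}

// One thread per guide vertex: emit the two edge vertices of a view-facing ribbon.
__global__ void expandHairStrands(const float3* guideVerts,   // simulated strand points
                                  int vertsPerStrand, int numStrands,
                                  float3 eyePos, float hairWidth,
                                  float3* renderVerts)         // twice as many outputs
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int total = vertsPerStrand * numStrands;
    if (i >= total) return;

    int v = i % vertsPerStrand;                              // index within this strand
    float3 p = guideVerts[i];
    // Tangent along the strand (forward difference, reuse the previous segment at the tip).
    float3 next = guideVerts[(v + 1 < vertsPerStrand) ? i + 1 : i - 1];
    float3 tangent = norm3(sub3(next, p));
    // Side vector perpendicular to both the tangent and the view direction.
    float3 viewDir = norm3(sub3(eyePos, p));
    float3 side = norm3(cross3(tangent, viewDir));

    float w = 0.5f * hairWidth;
    renderVerts[2 * i + 0] = make_float3(p.x - side.x * w, p.y - side.y * w, p.z - side.z * w);
    renderVerts[2 * i + 1] = make_float3(p.x + side.x * w, p.y + side.y * w, p.z + side.z * w);
}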

For more info about the differences between the TressFX versions:

http://markusrapp.de/wordpress/wp-c...usRapp-MasterThesis-RealTimeHairRendering.pdf
 
Of course not, Razor1 should have been . It runs slow and is not a good solution. Even with two 1070s in Witcher 3 it sucks the frames down for almost zero reason at times. AMD solution works great on NVidia hardware, hairworthless doesn't work great on anything.
So, like TressFX?
 
So has anyone else recently, apart from Crystal Dynamics/Square Enix, implemented PureHair or an equivalent?
This is what has in essence replaced TressFX, probably for good reason.
Cheers
 
So, like TressFX?

TressFX never hurt my performance, or it was a very minor difference. Now, I know when it first came out Nvidia hardware hated it, but that was later patched. Though I feel both are a giant waste on anything other than a still photo shot.
 
Using ALU resources to perform front-end functions traditionally handled by fixed function hardware is a rather odd solution given the focus on async compute allowing GCN to circumvent that bottleneck by overlapping processing of frames.
Not when the goal is to allow a programmable replacement. Fixed function does one thing really well. As for async, that works just as well to balance out workloads that are nothing but compute shaders and to handle odd timing behavior. Techniques we still haven't seen because Nvidia can't run it effectively. At least Volta looks to finally be getting the hardware scheduling back.

So anyone else recently apart from Crystal Dynamics/Square Enix implement Purehair or equivalent?
This is what has in essence replaced TressFX, probably for good reason.
Cheers
Purehair was a modified version of TressFX. So not replaced as much as adapted to suit the developer's needs.
 
Purehair was a modified version of TressFX. So not replaced as much as adapted to suit the developer's needs.
It is quite a big 'mod', especially as they (the studio) were also part of creating TressFX in the first place with AMD, and that is part of my point.
How many others have implemented TressFX 3 recently apart from Crystal Dynamics (involved heavily in TressFX with AMD)/Square Enix?

Without PureHair TressFX is pretty limited:

So curious, any other new games by other studios using TressFX?
Cheers
 
Not when the goal is to allow a programmable replacement. Fixed function does one thing really well. As for async that works just as well to balance out nothing but compute shaders and handle odd timing behavior. Techniques we still haven't seen because Nvidia can't run it effectively. At least Volta looks to finally be getting the hardware scheduling back.

Who stated Volta is getting the scheduling hardware back, the same hardware that Kepler removed? From what we have heard about Volta, it doesn't sound like it's the same thing. The pipeline seems to have been changed for independent thread scheduling at a very fine grain, but it seems to work on entire threads, not the instructions themselves.

Pascal was able to get async performance increases (not as much as AMD) without the "dedicated" hardware. nV seems to just ensure a finer grain to split work up better and reduce latency to get better performance for Volta. More likely the caching and register system had to be modified more than anything else, which is also changed in Volta.
 
It is quite a big 'mod', especially as they (studio) also part of creating TressFX in the 1st place with AMD, and that is part of my point.
How many others have implemented TressFX 3 recently apart from Crystal Dynamics (involved heavily in TressFX with AMD)/Square Enix?

Without PureHair TressFX is pretty limited:

So curious, any other new games by other studios using TressFX?
Cheers



I haven't seen any others so far.
 
Not when the goal is to allow a programmable replacement. Fixed function does one thing really well. As for async that works just as well to balance out nothing but compute shaders and handle odd timing behavior. Techniques we still haven't seen because Nvidia can't run it effectively. At least Volta looks to finally be getting the hardware scheduling back.


Purehair was a modified version of TressFX. So not replaced as much as adapted to suit the developer's needs.

Are you seriously replying with both "Nvidia can't do async" and "Nvidia doesn't have a hardware scheduler" in the same post? You are usually less prone to using memes. I'm confident you know quite well that async compute is supported from Maxwell onwards and what limitations there are. Likewise I am fairly confident you also know what was removed in terms of hardware scheduling starting with Kepler.
 
It is quite a big 'mod', especially as they (studio) also part of creating TressFX in the 1st place with AMD, and that is part of my point.
How many others have implemented TressFX 3 recently apart from Crystal Dynamics (involved heavily in TressFX with AMD)/Square Enix?
Keep in mind with GPUOpen, and most open source models, the idea is for others to use it as a building block. So it's put there as a foundation for others to build off of which isn't a bad thing. This is something that only the big studios are likely to have the programming talent to undertake anyways.

As for other companies implementing it, I can't think of any "hairy" games recently, not after Tomb Raider and The Witcher 3. Mass Effect: Andromeda perhaps, but I'm not sure anyone wants to be associated with "making the characters look more real" on that one. Not having played it, I don't think that game had dynamic hair anyway.

No cards for a while for sure, the Chinese supply chain would be pissing out details like a prostate cancer survivor - spurts and squirts.
Yet LiquidSky, to my understanding, is already using them for gaming? HBM-containing parts are assembled elsewhere, so the traditional Chinese leakers might not have access.

Who stated Volta is getting the scheduling hardware back, the same hardware that Kelper removed? From what we have heard about Volta, it doesn't sound like its the same thing. The pipeline seems to be changed for independent thread scheduling at a very fine grain, but its seems to work on entire threads, not the instructions them selves.

Pascal was able to get async performance increases (not as much as AMD) without the "dedicated" hardware. nV seems to just ensure a finer grain for better split up and reduce latency to get better performance for Volta. More likely the caching and register system has to be modified more than anything else. Which is also changed in Volta.
https://devblogs.nvidia.com/parallelforall/inside-volta/#comment-3308952662
Volta Multi-Process Service
Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
Enhanced L1 Data Cache and Shared Memory

The new combined L1 data cache and shared memory subsystem of the Volta SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.
Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 In Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data—the best of both worlds. This combination is unique to Volta and delivers more accessible performance than in the past.

Figure 9: Volta’s L1 data cache narrows the gap between applications that are manually tuned to keep data in shared memory and those that access data in device memory directly. 1.0 represents the performance of an application tuned with shared memory, while the green bars represent the performance of equivalent applications that do not use shared memory.
A key reason to merge the L1 data cache with shared memory in GV100 is to allow L1 cache operations to attain the benefits of shared memory performance. Shared memory provides high bandwidth and low latency, but the CUDA programmer needs to explicitly manage this memory. Volta narrows the gap between applications that explicitly manage shared memory and those that access data in device memory directly. To demonstrate this, we modified a suite of programs by replacing shared memory arrays with device memory arrays so that accesses would go through L1 cache. As Figure 9 shows, on Volta these codes saw only a 7% performance loss running without using shared memory, compared to a 30% performance loss on Pascal. While shared memory remains the best choice for maximum performance, the new Volta L1 design enables programmers to get excellent performance quickly, with less programming effort.
Based on those two features. Volta has the SIMTs organized a bit differently, though. They seem to have done away with the wave concept in favor of a SIMT operating on thread blocks, with the "Independent Thread Scheduling" packing the waves. Best I can tell, it's taking up to 1024 threads of the same kernel and assembling batches of 32 at the same point. It remains to be seen how readily those partitions can change, but given the "Enhanced L1 and Shared Memory" feature listed above it should be possible. Pascal did manage some increases with async code, but it took a lot of care and profiling to get right while being limited to rather specific scenarios where the ratios remained consistent. Basically it was doable so long as the rendering task was relatively simple, without a lot of steps affecting those ratios. An app bouncing between majority graphics and compute with no overlap would crush performance, along with heavy synchronization events having to fall back to the CPU. It's why GCN "just worked" while Nvidia took a lot of care to get right while keeping the techniques relatively simple.
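To make the quoted benchmark change concrete, here is a generic sketch (not NVIDIA's actual test suite) of the same 3-point stencil written twice: once staging data through an explicitly managed __shared__ tile, and once with plain device-memory loads that lean on the L1. Kernel names and tile size are arbitrary, and the shared version assumes blockDim.x == TILE.

Code:
#include <cuda_runtime.h>

#define TILE 256

// Version 1: explicitly staged through shared memory (the hand-tuned path).
__global__ void blur3_shared(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];                 // one-element halo on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];
    if (threadIdx.x == 0)                tile[0]        = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)   tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid > 0 && gid < n - 1)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Version 2: plain device-memory loads, relying on the (Volta-sized) L1 to catch the reuse.
__global__ void blur3_global(const float* in, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid > 0 && gid < n - 1)
        out[gid] = (in[gid - 1] + in[gid] + in[gid + 1]) / 3.0f;
}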

The highlighted green part is almost precisely what the ACEs do for AMD. The exact capabilities are unclear.

Also, from an Nvidia guy:
Q: MPS (Multi-Process Service) has a few restrictions. One of the most mysterious is the lack of support for dynamic parallelism. Is it still prohibited on the Volta generation?

A: Not supported in CUDA 9 due to schedule constraints, but should be supportable on Volta MPS in the future.

Are you seriously replying with both "nvidia can't do async" and "nvidia doesnt have a hardware scheduler" in the same post? You are usually less prone to using memes. I confident you know quite well that async compute is supported from Maxwell onwards and what limitations there are. Likewise I am fairly confident you also know what was removed in terms of hardware scheduling starting with Kepler.
Yes. It would be difficult for MPS to be a "new" feature with Volta if Pascal already had it. It's those limitations that are the concern. Async can be executed serially just like any other workload, so simply executing it isn't the issue. Limitations of "don't use this if you want playable performance" kind of speak against support. Maxwell partitioned the GPU by SMs doing compute or graphics. Pascal partitioned within an SM. Volta MIGHT be agnostic of compute or graphics like GCN, given the shared memory features. Like I mentioned above, they're doing something different with scheduling, so even if it isn't agnostic of compute or graphics it could work out well. It's sort of an SMT approach to an entire thread block, best I can tell, but broken into integer and floating point. Behavior if the entire thread block stalls is still unclear. The agnostic part with GCN would simply switch to another wave if one stalled; Volta may do the same at a thread block level, paging out the active block in favor of something else. The GCN model is a whole lot more forgiving and capable, so it wouldn't be surprising for Volta to incorporate it. Marketing aside, there is a big difference for programmers.

With the consoles moving towards indirect execution, hardware scheduling will be required sooner rather than later. It won't be long before we have a game loop running entirely on the GPU with the CPU only involved when the scene changes. Indirect execution is pretty much just a compute shader going through game objects to figure out what is on the screen and queuing them for execution. Remove user input from that and a GPU running the physics could run a simulation indefinitely, simulating a world until acted on by an external influence.
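For anyone unfamiliar with what "indirect" means here, a toy CUDA-style sketch of the idea follows; the structs and the crude sphere test are made up for illustration, and a real engine would do this with ExecuteIndirect/multi-draw-indirect plus proper frustum and occlusion culling. The point is just that the visible set and its draw arguments are built entirely on the GPU, with no CPU round trip.

Code:
#include <cuda_runtime.h>

// Illustrative object record and indirect-draw argument layout (not a real API struct).
struct SceneObject {
    float3 center;
    float  radius;
    unsigned int indexCount;
    unsigned int firstIndex;
};

struct DrawArgs {                 // what an indirect draw call would later consume
    unsigned int indexCount;
    unsigned int instanceCount;
    unsigned int firstIndex;
    unsigned int baseInstance;
};

// One thread per object: cull against a view sphere and queue the visible ones.
__global__ void cullAndQueue(const SceneObject* objects, int numObjects,
                             float3 viewCenter, float viewRadius,
                             DrawArgs* drawList, unsigned int* drawCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numObjects) return;

    SceneObject o = objects[i];
    float dx = o.center.x - viewCenter.x;
    float dy = o.center.y - viewCenter.y;
    float dz = o.center.z - viewCenter.z;
    float dist2 = dx * dx + dy * dy + dz * dz;
    float reach = viewRadius + o.radius;
    if (dist2 > reach * reach) return;               // not visible: nothing queued

    unsigned int slot = atomicAdd(drawCount, 1u);    // compact the visible set
    drawList[slot] = DrawArgs{ o.indexCount, 1u, o.firstIndex, 0u };
}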

Also of interest is the following:
Enhanced Unified Memory and Address Translation Services
GV100 Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
That's roughly what AMD's HBCC is doing, but the precise capabilities of each are still unknown. It's those per-page access counters I was worried about, along with the definition of "page" from previous unified memory models. A page could be an entire buffer or a 4 KB element of a larger buffer. HBCC to my understanding is the latter, and it's still unclear how virtual pointers are handled on each, or whether the ability extends to consumer products.
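For reference, this is roughly what the oversubscription path looks like from the CUDA side, as I understand it: cudaMallocManaged gives one pointer valid on both CPU and GPU, pages migrate on demand when the working set exceeds VRAM (this needs Pascal or newer on Linux), and cudaMemAdvise/cudaMemPrefetchAsync are the "hints" the no-hints chart leaves out. Sizes and kernel names are arbitrary; treat it as a sketch, not a benchmark.

Code:
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, size_t n, float s)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main()
{
    int dev = 0;
    cudaSetDevice(dev);

    // A working set deliberately larger than most GPUs' VRAM (assumes enough system RAM).
    size_t bytes = 24ull << 30;                      // 24 GB
    size_t n = bytes / sizeof(float);

    float* data = nullptr;
    cudaMallocManaged(&data, bytes);                 // one pointer, valid on CPU and GPU
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // first touch on the CPU

    // Optional "hints" (the chart discussed below is the no-hints case):
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, dev);
    cudaMemPrefetchAsync(data, bytes / 4, dev);      // pre-stage the first chunk in VRAM

    // Pages fault and migrate to the GPU as the kernel touches them.
    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}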
 
Also of interest is the following:

That's roughly what AMD's HBCC is doing, but precise capabilities for each are still unknown. It's those per page access counters I was worried about along with the definition of "page" from previous unified memory models. A page could be an entire buffer or 4k element of a larger buffer. HBCC to my understanding is the latter and it's still unclear how virtual pointers are handled on each. Or if the ability extends to consumer products.

There's a lot of claims, like GP100's 512TB virtual memory system.

When in reality, this is what performance looks like once a dataset exceeds GP100's VRAM capacity:

[Chart: Unified Memory oversubscription, HPGMG performance without hints (unified_memory_oversubscription_hpgmg_perf_no_hints_jan2017.png)]


With a dataset that's only 58.6 GB, P100's performance tanks badly, to barely above CPU processing. It essentially becomes bottlenecked by the connection speed of PCIe or NVLink (IBM CPUs only).

The demos of Vega and its HBCC go far beyond this in capability: a 2 TB real-time ray tracing scene in fluid motion, likewise fluid 8K video editing of a multi-terabyte raw file, or Rise of the Tomb Raider at 4K with only 2 GB of VRAM, with fluid performance.

This is why I think there's a major fundamental difference between HBCC and simple page buffers or virtual pointers (GCN 1, Tahiti, had this).

Before this demonstration from NV, I had thought GP100's unified memory system was awesome, but being limited in performance once the dataset exceeds the GPU's VRAM is really back to square one, and it prevents GPUs from meaningfully accelerating huge datasets that have been restricted to CPU-only processing due to system RAM sizes vs GPU VRAM sizes.
 
The demos of Vega and it's HBCC, goes far beyond this in capabilities
WHERE ARE YOUR NUMBERS TO BACK THAT CLAIM UP?

A 2TB real-time ray tracing scene, in fluid motion, likewise for fluid 8K video editing of a multi-Terabyte raw file
Oh, so you're citing 2 examples of PCI-E bound operation on pre-optimized software as your proof? Not a single number to compare with even? ROFLMAO.

Or Rise of the Tomb Raider 4K with only 2GB VRAM, fluid performance.
"On unknown settings".
 
There's a lot of claims, like GP100's 512TB virtual memory system.

When in reality, this is what performance looks like once a dataset exceeds GP100's VRAM capacity:

[Chart: Unified Memory oversubscription, HPGMG performance without hints (unified_memory_oversubscription_hpgmg_perf_no_hints_jan2017.png)]


With a dataset that's only 58.6GB, P100's performance tanks badly to barely above CPU processing. It essentially becomes bottlenecked by the connection speed of PCI-e or NVLink (IBM CPUs only).

The demos of Vega and it's HBCC, goes far beyond this in capabilities. A 2TB real-time ray tracing scene, in fluid motion, likewise for fluid 8K video editing of a multi-Terabyte raw file. Or Rise of the Tomb Raider 4K with only 2GB VRAM, fluid performance.

This is why I think there's a major fundamental difference to the HBCC vs simply page buffers or virtual pointers (GCN 1, Tahiti had this).

Before this demonstration from NV, I had thought GP100's unified memory system was awesome, but being limited in performance once dataset exceed the GPU's VRAM, is really back to square 1 and prevents GPUs from being meaningfully able to accelerate huge datasets that have been restricted to CPU only due to system RAM sizes vs GPU VRAM sizes.
No wonder Nvidia is rushing out Volta; P100 is worthless for any large-memory task :LOL:. Supposedly the Vega X2 version will have SSD capability; I still think Vega X2 will really hit V100 hard.
 
WHERE ARE YOUR NUMBERS TO BACK THAT CLAIM UP?

Oh, so you're citing 2 examples of PCI-E bound operation on pre-optimized software as your proof? Not a single number to compare with even? ROFLMAO.

"On unknown settings".

See, this is where you people get confused. Nowhere did I make a claim that needs proof. I said: "This is why I think there's a major fundamental difference to the HBCC"

It's an opinion. It's not a statement of fact like the ones you guys enjoy throwing around.

So settle the frak down.
 
No wonder Nvidia is rushing out Volta, P100 for any memory large task is worthless :LOL:. Supposedly Vega X2 version will have SSD capability, still think Vega X2 will really hit V100 hard.

Volta has a faster NVLink; it should help them with larger-than-VRAM datasets, with IBM CPUs anyway.
 
Volta has a faster NVLink, it should help them with larger than VRAM datasets, with IBM CPUs anyway.
Yet Vega can have local cache beyond the VRAM, which would be way faster than NVLink. Will be interesting.
 
Yet Vega can had local cache beyond the VRam which would be way faster then NVLink. Will be interesting.

We don't know how HBCC works. The 3 demos are a pretty interesting showcase of its potential though. That's about all we can say.

I only brought up the P100 performance issues (it drops off like a cliff beyond its VRAM limit) for large datasets because of the claims about unified memory and 512TB; it all sounds great on paper, until reality hits home.

Likewise, the same could be true for Vega's features, including HBCC.
 
See, this is where you people get confused. Nowhere did I claim, to need proof. I said: "This is why I think there's a major fundamental difference to the HBCC"
Now, you did just say that:
The demos of Vega and it's HBCC, goes far beyond this in capabilities.
You have stated it as a fact; that's where it starts and ends. Trying to refute that by using a quote originating from the next paragraph is just futile.
 
So you had an issue with my description? lol

"The demos of Vega and it's HBCC, goes far beyond this in capabilities. A 2TB real-time ray tracing scene, in fluid motion, likewise for fluid 8K video editing of a multi-Terabyte raw file. Or Rise of the Tomb Raider 4K with only 2GB VRAM, fluid performance."

That is an accurate description of what was shown, side by side with HBCC on/off. The gains as data exceeds the GPU's VRAM are not insignificant (compared to P100, which drops off a cliff with a 58.6 GB dataset): Vega with HBCC shows major gains in performance, fluid visuals versus stuttery for the first two demos, and huge FPS gains for Rise of the Tomb Raider (which you can quantify since the numbers were displayed).

The other statement was my opinion: that I think HBCC is unlike typical unified memory technology.
 
There's a lot of claims, like GP100's 512TB virtual memory system.

When in reality, this is what performance looks like once a dataset exceeds GP100's VRAM capacity:

[Chart: Unified Memory oversubscription, HPGMG performance without hints (unified_memory_oversubscription_hpgmg_perf_no_hints_jan2017.png)]


With a dataset that's only 58.6GB, P100's performance tanks badly to barely above CPU processing. It essentially becomes bottlenecked by the connection speed of PCI-e or NVLink (IBM CPUs only).

The demos of Vega and it's HBCC, goes far beyond this in capabilities. A 2TB real-time ray tracing scene, in fluid motion, likewise for fluid 8K video editing of a multi-Terabyte raw file. Or Rise of the Tomb Raider 4K with only 2GB VRAM, fluid performance.

This is why I think there's a major fundamental difference to the HBCC vs simply page buffers or virtual pointers (GCN 1, Tahiti had this).

Before this demonstration from NV, I had thought GP100's unified memory system was awesome, but being limited in performance once dataset exceed the GPU's VRAM, is really back to square 1 and prevents GPUs from being meaningfully able to accelerate huge datasets that have been restricted to CPU only due to system RAM sizes vs GPU VRAM sizes.

The use of that chart is heavily skewed and misrepresented.
It is Nvidia's representation of the most brutal oversubscription possible with simplified coding, running 5 different sets at the same time; where the performance drops, it is still running 2 sets and then 1 set.
It really should be used in context, and it is not a real-world situation (maybe if one is really bad with their workflow and coding design).
Cheers
 
Keep in mind with GPUOpen, and most open source models, the idea is for others to use it as a building block. So it's put there as a foundation for others to build off of which isn't a bad thing. This is something that only the big studios are likely to have the programming talent to undertake anyways.

As for other companies implementing it, I can't think of any "hairy" games recently. Not after Tomb Raider and Witcher3. Mass Effect Andromeda perhaps, but I'm not sure anyone wants associated with "making the characters look more real" on that one. Not having played it, I don't think that game had dynamic hair anyways.


Yet LiquidSky, to my understanding, is already using them for gaming? HBM containing parts are assembled elsewhere, so the traditional Chinese leakers might not have access.


https://devblogs.nvidia.com/parallelforall/inside-volta/#comment-3308952662


Based on those two features. Volta has the SIMTs organized a bit differently though. They seem to have done away with the wave concept in favor of a SIMT operating on thread blocks with the "Independent Thread Scheduling" packing the waves. Best I can tell it's taking up to 1024 threads of the same kernel and assembling batches of 32 at the same point. It remains to be seen how readily those partitions can change, but given the "Enhanced L1 and Shared Memory" feature listed above it should be possible. Pascal did manage some increases with async code, but it took a lot of care and profiling to get right while being limited to rather specific scenarios where the ratios remained consistent. Basically it was doable so long as the rendering task was relatively simple without a lot of steps affecting those ratios. An app bouncing between majority graphics and compute with no overlap would crush performance. Along with heavy synchronization events having to fall back to the CPU. It's why GCN "just worked" while Nvidia took a lot of care to get right while keep the techniques relatively simple.

The highlighted green part is almost precisely what the ACEs do for AMD. The exact capabilities are unclear.

Also, from an Nvidia guy:



Yes. It would be difficult for MPS to be a "new" feature with Volta if Pascal already had it. It's those limitations that are the concern. Async can be executed serially just like any other workload, so simply executing it isn't the issue. Limitations of "dont' use this if you want playable performance" kind of speak against support. Maxwell partitioned the GPU by SMs doing compute or graphics. Pascal partitioned within a SM. Volta MIGHT be agnostic of compute or graphics like GCN given the shared memory features. Like I mentioned above, they're doing something different with scheduling. So even if not agnostic of compute or graphics it could work out well. It's sort of a SMT approach to an entire thread block best I can tell, but broken into integer and floating point. Behavior if the entire thread block stalls is still unclear. The agnostic part with GCN would simply switch to another wave if one stalled. Volta may do the same at a thread block level, paging out the active block in favor of something. The GCN model is a whole lot more forgiving and capable, so it wouldn't be surprising for Volta to incorporate it. Marketing aside, there is a big difference for programmers.

With the consoles moving towards indirect execution, the hardware scheduling will be required sooner rather than later. It won't be long before we have a game loop running entirely on the GPU with the CPU only involved when the scene changed. Indirect is pretty much just a compute shader going through game objects to figure out what is on the screen and queuing them for execution. Remove user input from that and a GPU running the physics could run a simulation indefinitely. Simulating a world until acted on by an external influence.

Also of interest is the following:

That's roughly what AMD's HBCC is doing, but precise capabilities for each are still unknown. It's those per page access counters I was worried about along with the definition of "page" from previous unified memory models. A page could be an entire buffer or 4k element of a larger buffer. HBCC to my understanding is the latter and it's still unclear how virtual pointers are handled on each. Or if the ability extends to consumer products.

My point, which you sort of just agreed with, is that no one else is using TressFX 3, and like I said, probably because to make it work well one has to have PureHair or an equivalent, and that was created by the same studio that was heavily involved with TressFX from the early days all the way through to the current version.
You mention AMD has put the onus on the devs/studios for this, and that fits my point: the only studio to successfully implement this in any meaningful way was involved with TressFX's creation and evolution.
Also, Crystal Dynamics implemented more than just hair in RoTR; it influenced the texture of the snow on AMD GPUs.
http://www.pcgameshardware.de/commoncfm/comparison/indexb2.cfm?id=129757
The texture impression of the snow is improved with Pure Hair, but unfortunately only on AMD machines; the left side was AMD and the right Nvidia.

Regarding the hardware scheduler:
I think it is too early to say just how Nvidia has implemented this, and I doubt it has any similarity to historical implementations, whether HW or, more recently, SW.
Nvidia could have done the scheduler slightly differently with Maxwell and Pascal, but it would have impacted the efficiency of the design at the time. Here is what Nvidia says on the subject with regard to Volta:
Part of this can be seen in the use of the new syncwarp() as well, which is integral to the new arch.
Nvidia at HPCWire said:
Catanzaro, who returned to Nvidia from Baidu six months ago, emphasized how the architectural changes wrought greater flexibility and power efficiency.

“It’s worth noting that Volta has the biggest change to the GPU threading model basically since I can remember and I’ve been programming GPUs for a while,” he said. “With Volta we can actually have forward progress guarantees for threads inside the same warp even if they need to synchronize, which we have never been able to do before. This is going to enable a lot more interesting algorithms to be written using the GPU, so a lot of code that you just couldn’t write before because it potentially would hang the GPU based on that thread scheduling model is now possible. I’m pretty excited about that, especially for some sparser kinds of data analytics workloads there’s a lot of use cases where we want to be collaborating between threads in more complicated ways and Volta has a thread scheduler can accommodate that.

“It’s actually pretty remarkable to me that we were able to get more flexibility and better performance-per-watt. Because I was really concerned when I heard that they were going to change the Volta thread scheduler that it was going to give up performance-per-watt, because the reason that the old one wasn’t as flexible is you get a lot of energy efficiency by ganging up threads together and having the capability to let the threads be more independent then makes me worried that performance-per-watt is going to be worse, but actually it got better, so that’s pretty exciting.”

Added Alben: “This was done through a combination of process and architectural changes but primarily architecture. This was a very significant rewrite of the processor architecture. The Tensor Core part is obviously very [significant] but even if you look at FP32 and FP64, we’re talking about 50 percent more performance in the same power budget as where we’re at with Pascal. Every few years, we say, hey we discovered something really cool. We basically discovered a new architectural approach we could pursue that unlocks even more power efficiency than we had previously. The Volta SM is a really ambitious design; there’s a lot of different elements in there, obviously Tensor Core is one part, but the architectural power efficiency is a big part of this design.”

Both Alben and Catanzaro are heavily involved with the direction and development of Nvidia GPUs.
Too early to tell either way if it is HW or SW or a hybrid mix. My view is they have not regressed, but I'm not entirely convinced we are seeing a hardware scheduler like one may think of from the past, as before Kepler.
Cheers
 
Nvidia has always had hardware scheduling of warps. Scheduling within warps has been relegated to the software stack since the introduction of a fixed latency math pipeline in Kepler. Because it's redundant.

You know this @antichrist.

Similarly, you know that there's no imposition of serial execution with regards to compute and graphics kernels, even on Maxwell.

In that same vein I'm confident you're aware that Pascal doesn't partition within an SM.
 
Warp Scheduling is HW, it's in each SM where there's the dispatch units. Always has been.

What's the GigaThread Engine (global scheduler), though, in recent NV GPUs? I read somewhere a long time ago that it went to a hybrid approach post-Fermi, no longer like Fermi's HW GigaThread, but using software compiling/drivers to determine which SMs the threads get sent to, to then be distributed by the (HW) warp scheduler. Is this still the case, or do I recall wrong?
 
Warp Scheduling is HW, it's in each SM where there's the dispatch units. Always has been.

What's the GigaThread Engine (Global Scheduler) though in recent NV GPUs? I read somewhere a long time ago, it went to a hybrid approach post Fermi, no longer like Fermi's HW GigaThread, but it uses software compiling/drivers to determine which SMs the threads gets sent to, to be then distributed by the Warp Scheduler (hw). Is this still the case or I recall wrong?

The scheduling of warps is a hardware function, always has been. The scheduling of instructions within a warp is done by the driver stack since Kepler. Fixed latency, no need for scheduling.
 
Is an accurate description of what was shown, side-by-side with HBCC on/off.
Once again, you have made a comparison without a point of comparison. That is just a basic fallacy.
It's not insignificant gains (as compared to P100 when it drops off a cliff with 58.6GB dataset)
And what do you have to compare it with? Did you run the same test on Vega? No? Then why the fuck are you talking about "as compared to"?
 
Nvidia has always had hardware scheduling of warps. Scheduling within warps has been relegated to the software stack since the introduction of a fixed latency math pipeline in Kepler. Because it's redundant.

You know this @antichrist.

Similarly, you know that there's no imposition of serial execution with regards to compute and graphics kernels, even on Maxwell.

In that same vein I'm confident you're aware that Pascal doesn't partition within an SM.
First, you know you misspelled his name, I gather intentionally, but in case not maybe you should change it.

Second, you always contested Maxwell COULD do asynchronous compute graphics = parallel. After Pascal I even pointed out further that Pascal could, and with the changes it further reinforced the fact that Maxwell could not. So it only took a year and a half to admit that simple fact.
 
The scheduling of warps is a hardware function, always has been. The scheduling of instructions within a warp is done by the driver stack since Kepler. Fixed latency, no need for scheduling.

Yeah, in essence static scheduling, using the compiler to schedule instructions with known latency.
Like I mentioned earlier, Nvidia could have been more aggressive with their 'SW' approach in Maxwell/Pascal but did not, as their focus was efficiency. With Volta it looks like they managed to get the benefit of both worlds along with handling of independent threads, and like I hinted, syncwarp() with CUDA 9 now brings improvements, as parts of threads can branch and re-converge at sub-warp level (and cross-warp), along with the starvation-free algorithm model (which goes with the independent thread scheduling mechanism).
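A quick sketch of why that forward-progress guarantee matters, assuming a hypothetical per-bucket lock; this is not NVIDIA's example code. Pre-Volta SIMT can livelock when diverged threads of one warp spin on a lock held by a sibling thread in the same warp, while Volta's independent thread scheduling keeps every diverged thread progressing. The second snippet shows the explicit masked-sync style CUDA 9 expects where code used to assume lockstep (it assumes a full 32-lane warp is active).

Code:
#include <cuda_runtime.h>

// Per-bucket spinlock taken by arbitrary threads, possibly several from the same warp.
// On pre-Volta SIMT this pattern can livelock (lock holder and spinners share one PC);
// Volta's independent thread scheduling guarantees each diverged thread makes progress.
__device__ void lockBucket(int* lock)
{
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
}

__device__ void unlockBucket(int* lock)
{
    atomicExch(lock, 0);
}

__global__ void accumulateIntoBuckets(const int* keys, const float* vals, int n,
                                      float* buckets, int* locks, int numBuckets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int b = keys[i] % numBuckets;
    lockBucket(&locks[b]);          // threads within one warp may contend here
    buckets[b] += vals[i];          // protected read-modify-write
    __threadfence();                // make the update visible before releasing
    unlockBucket(&locks[b]);
}

// Where code used to rely on implicit lockstep, CUDA 9 wants explicit masks/syncs.
__device__ float warpSum(float v)
{
    unsigned mask = 0xffffffffu;                       // assumes all 32 lanes active
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(mask, v, offset);
    __syncwarp(mask);                                  // re-converge the warp explicitly
    return v;                                          // lane 0 holds the warp's sum
}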

Just to add.
Anandtech did a nice bit on the differences implemented with Kepler: http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3
Cheers
 
Second you always contested Maxwell COULD do asynchronous compute graphics = parallel.

I'm struggling to make sense of this. Asynchronous compute graphics = parallel? What does this even mean.

Maxwell *can* execute graphics and compute kernels in parallel.

After Pascal I even pointed out further that Pascal could and with the changes it further enforced the fact Maxwell could not.

You pointed out absolutely nothing, instead were inexplicably absent from the discussion following my posting of a comprehensive async compute write-up.

So it only took a 1year and a half to admit that simple fact.

Admit what simple fact? That Maxwell **can** execute graphics and compute kernels in parallel ?

The only difference between pascal and maxwell in this regard is that the former can re-partition resources on the fly, and not at drawcall boundaries. You would know this if you had the common decency to read about a topic before engaging in discussion about it.
 
https://devblogs.nvidia.com/parallelforall/inside-volta/#comment-3308952662


Based on those two features. Volta has the SIMTs organized a bit differently though. They seem to have done away with the wave concept in favor of a SIMT operating on thread blocks with the "Independent Thread Scheduling" packing the waves. Best I can tell it's taking up to 1024 threads of the same kernel and assembling batches of 32 at the same point. It remains to be seen how readily those partitions can change, but given the "Enhanced L1 and Shared Memory" feature listed above it should be possible. Pascal did manage some increases with async code, but it took a lot of care and profiling to get right while being limited to rather specific scenarios where the ratios remained consistent. Basically it was doable so long as the rendering task was relatively simple without a lot of steps affecting those ratios. An app bouncing between majority graphics and compute with no overlap would crush performance. Along with heavy synchronization events having to fall back to the CPU. It's why GCN "just worked" while Nvidia took a lot of care to get right while keep the techniques relatively simple.

The highlighted green part is almost precisely what the ACEs do for AMD. The exact capabilities are unclear.


That is not what ACEs do ;)

Also form an Nvidia guy:

Yes. It would be difficult for MPS to be a "new" feature with Volta if Pascal already had it. It's those limitations that are the concern. Async can be executed serially just like any other workload, so simply executing it isn't the issue. Limitations of "dont' use this if you want playable performance" kind of speak against support. Maxwell partitioned the GPU by SMs doing compute or graphics. Pascal partitioned within a SM. Volta MIGHT be agnostic of compute or graphics like GCN given the shared memory features. Like I mentioned above, they're doing something different with scheduling. So even if not agnostic of compute or graphics it could work out well. It's sort of a SMT approach to an entire thread block best I can tell, but broken into integer and floating point. Behavior if the entire thread block stalls is still unclear. The agnostic part with GCN would simply switch to another wave if one stalled. Volta may do the same at a thread block level, paging out the active block in favor of something. The GCN model is a whole lot more forgiving and capable, so it wouldn't be surprising for Volta to incorporate it. Marketing aside, there is a big difference for programmers.

With the consoles moving towards indirect execution, the hardware scheduling will be required sooner rather than later. It won't be long before we have a game loop running entirely on the GPU with the CPU only involved when the scene changed. Indirect is pretty much just a compute shader going through game objects to figure out what is on the screen and queuing them for execution. Remove user input from that and a GPU running the physics could run a simulation indefinitely. Simulating a world until acted on by an external influence.


They are doing something totally different at the instruction level with Volta, but the warp level is similar; maybe that's what the nV guy is saying.
 
I'm struggling to make sense of this. Asynchronous compute graphics = parallel? What does this even mean.

Maxwell *can* execute graphics and compute kernels in parallel.

You pointed out absolutely nothing, instead were inexplicably absent from the discussion following my posting of a comprehensive async compute write-up.


Admit what simple fact? That Maxwell **can** execute graphics and compute kernels in parallel ?


The only difference between pascal and maxwell in this regard is that the former can re-partition resources on the fly, and not at drawcall boundaries. You would know this if you had the common decency to read about a topic before engaging in discussion about it.

Looks to me like people have forgotten what parallelism and concurrency are, lol. AMD marketing did its job well on the masses.
 
Looks to me people have forgotten what parallel and concurrency are lol. AMD marketing did its job well on the masses.

As I said, JustReason was notably absent from my async compute thread so I am not surprised he is missing quite a lot of information. This is why playing Ostrich is rarely a good idea in the long term.
 
I'm struggling to make sense of this. Asynchronous compute graphics = parallel? What does this even mean.

Maxwell *can* execute graphics and compute kernels in parallel.



You pointed out absolutely nothing, instead were inexplicably absent from the discussion following my posting of a comprehensive async compute write-up.



Admit what simple fact? That Maxwell **can** execute graphics and compute kernels in parallel ?

The only difference between pascal and maxwell in this regard is that the former can re-partition resources on the fly, and not at drawcall boundaries. You would know this if you had the common decency to read about a topic before engaging in discussion about it.
I was trying to simplify it for you, but that obviously did not work. So let's try this.

Prove that the total GPU array can run both graphics and compute, in a game, not some CUDA programming instance, at the same time, back to back, over say 100 simultaneous times. I saw the tests originally run, and only on the first run could it run both; after that first run it would have context switching, switching the entire GPU from graphics to compute and back again. With Pascal creating a quadrant setup (Pascal 1080), it could now set up 3 of them with graphics and the fourth as compute, or any number of configurations. That was the change, and the fact is that Maxwell could not do this.
 
I was trying to simplify it for you but that obviously did not work. So lets try this.

Prove that the total GPU array can run both graphics and compute, in a game not some CUDA programming instance, at the same time back to back over say 100 simultaneous times. I saw the tests originally run and only on the first run could it run both but after said first it would have context switching, switching the entire GPU from graphics to compute and back again. With Pascal creating a quadrant setup ( pascal 1080) it could now setup 3 of them with graphics and the forth as compute or any number of configurations. That was the change and the fact that Maxwell could not do this.

Err, the problem is not the entire GPU array; it's a per-SM issue. Partitioning is done on a per-SM basis, and it's a static partition for Maxwell.

You are using the wrong words too; what you just described there is concurrency, not parallelism. Concurrency of kernels across the entire array is not an issue. Concurrency per SM is an issue with Maxwell, because it has static partitioning, so once partitioned it must stick to that partition or flush the SM and repartition.

This is what changed from Maxwell to Pascal: the static partitioning is now dynamic partitioning. There is no need to flush the SM anymore to reallocate resources based on needs.
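Just to pin down the terminology, here is kernel-level concurrency as seen from plain CUDA: two independent kernels launched into different streams are allowed to overlap on any GPU with spare resources, while in a single stream they would serialize. The graphics-plus-compute partitioning question above can't be expressed from this API, so take this only as an illustration of "concurrent" versus "serial", with made-up kernel names.

Code:
#include <cuda_runtime.h>

__global__ void busyKernel(float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.000001f;   // burn some ALU time
    data[i] = v;
}

int main()
{
    const int n = 1 << 16;             // small grids so both kernels can fit on the GPU at once
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two independent kernels in different streams: the hardware is free to run
    // them concurrently; launched into the same stream they would serialize.
    busyKernel<<<n / 256, 256, 0, s1>>>(a, n, 1 << 20);
    busyKernel<<<n / 256, 256, 0, s2>>>(b, n, 1 << 20);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}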
 
I was trying to simplify it for you but that obviously did not work. So lets try this.

Prove that the total GPU array can run both graphics and compute, in a game not some CUDA programming instance, at the same time back to back over say 100 simultaneous times. I saw the tests originally run and only on the first run could it run both but after said first it would have context switching, switching the entire GPU from graphics to compute and back again. With Pascal creating a quadrant setup ( pascal 1080) it could now setup 3 of them with graphics and the forth as compute or any number of configurations. That was the change and the fact that Maxwell could not do this.

You have no idea what you're talking about.

Quadrant setup? Whole GPU context switch? Is this a random jumble of words?
 
The issue, as it turns out, is that while Maxwell 2 supported a sufficient number of queues, how Maxwell 2 allocated work wasn’t very friendly for async concurrency. Under Maxwell 2 and earlier architectures, GPU resource allocation had to be decided ahead of execution. Maxwell 2 could vary how the SMs were partitioned between the graphics queue and the compute queues, but it couldn’t dynamically alter them on-the-fly. As a result, it was very easy on Maxwell 2 to hurt performance by partitioning poorly, leaving SM resources idle because they couldn’t be used by the other queues.

There you go, JustReason: posting this for the billionth time in a vain attempt to combat your militant sciolism.
 
Is it out yet?
lol, no, and this is what we get above in the meantime.

Bottom line, after all the chatter above, is performance in the end using the different APIs. If Vega can show some kind of astonishing performance doing DX12 stuff over Pascal, then it is a moot point what each piece of hardware does inside if both are about the same.
 