Vega Rumors

well..

I just went to NCIX.com and their 'championship sale' has ONLY NVIDIA cards listed. All the top tiers down to the mid-bottom.... Hmmmmmmmm............ Even the Ti HYBRID is on sale... I have a feeling VEGA is going to be better than everyone thinks............ nvidia trying to get a bit more margin in before the release??

http://www.ncix.com/promo/Championship2017.htm#&minorcatid=108

I have never witnessed this in my time.
 
hmm not much of a sale, with conversion rates the way they are, the cards still look to be priced right around MSRP.
 

They barely have stock on any 570 or 580 gear; miners are scooping them up, so no dice seeing those on sale anytime soon.
 
The point I was getting at is that many devs seem to be refactoring their engines for DX12/Vulkan and we haven't seen much hair as releases have been limited. I wouldn't go as far as saying the onus is on the devs, but creativity is ultimately the responsibility of an artist. The tool just provides a foundation to be expanded upon.


I'd agree we definitely need to know more. That said, the Volta blog did roughly lay out the functionality of the MPS (~ACE in AMD terms). The HWS equivalent is still a bit unclear, and that's where it gets more interesting. The thread scheduler and __syncwarp() being discussed are a bit ambiguous. It looks like it's within a single kernel, but on the other hand a kernel that diverges immediately into radically different paths could be considered two separate kernels. So the graphics-versus-compute distribution is still unclear, but it's not applicable to HPC either, which was the focus of the blog and V100.


Hardware scheduling IN warps/blocks is a more accurate description.
We're talking about an entirely different level of scheduling here. With async, the scheduling concern is the warp/thread-block distribution, not the instructions: thread management versus execution. To put it simply, the execution part has divergence; the threading part does not. Further, async behavior is probably better defined as the ability to adapt to changes in real time. That's not something Nvidia hardware is very proficient at, as its scheduling requires known distributions in advance: reacting to changes as opposed to anticipating them, where one of those is based on guesswork over an increasing number of variables.


That's precisely the problem. The whole point of async and concurrency is to have complementary tasks scheduled together(in addition to prioritization) to better utilize all the execution units. It's basically hyperthreading. The problem with ILP is by definition all threads are doing the same thing. This varies a bit as they won't all be running in lockstep. Hence they all hit the same bottleneck and probably leave some hardware idle. Or they get stuck on one task and can't quickly shift in another.
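To put a rough number on that hyperthreading analogy, here is a toy back-of-envelope model (my own illustrative cycle counts, nothing measured): two kernels with complementary bottlenecks run back-to-back versus interleaved.

```python
def serial_time(kernels):
    # Run alone, each kernel is limited by its own dominant resource.
    return sum(max(alu, mem) for alu, mem in kernels)

def concurrent_time(kernels):
    # Perfectly interleaved, the pair is limited only by total demand
    # per resource, since one kernel's ALU work hides the other's fetches.
    total_alu = sum(alu for alu, _ in kernels)
    total_mem = sum(mem for _, mem in kernels)
    return max(total_alu, total_mem)

compute_bound = (80, 20)   # (alu_cycles, mem_cycles): heavy math
memory_bound  = (20, 80)   # mostly waiting on fetches

print(serial_time([compute_bound, memory_bound]))      # 160 cycles back-to-back
print(concurrent_time([compute_bound, memory_bound]))  # 100 cycles overlapped
```

Run the same pairing against two copies of the same kernel and it gains nothing, which is the ILP point above: identical threads hit identical bottlenecks.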
Well, who else is going to create an equivalent to PureHair, which was done by the studio heavily involved with TressFX from the beginning through the current version?
I would say there is a good reason we are not seeing it implemented well by anyone else.

Yeah, we need more info on the scheduler/threading, but I quoted information suggesting the opposite of your thinking; it is a change to the existing scheduler that has no comparison to historical HW implementations. Like I said, they could have also done this with Maxwell/Pascal to some extent, but the focus was efficiency. Now they have re-written the processor aspect to be thread independent, and integral to all of this is the sync operation and the way branching and convergence can occur at sub- and cross-warp levels.
It might be some kind of hybrid, but even then I would say it has more in common with how they moved away from the previous HW implementation to static scheduling with a compiler and known latencies.
They even talk about a starvation-free algorithm model.

But they moved away from the HW-based scheduler for good reason, and what we have now IMO is the next evolution from their understanding and experience.
If I remember right, they said they would have done things differently with the iteration back in Fermi, and we then saw a jump to Maxwell-Pascal; now comes the next jump in this context, IMO.
I guess it is one's perspective/POV; maybe the approach Nvidia has found now blurs the boundary a bit, but it is still SW in the context of the discussion, IMO anyway *shrug*
Cheers
 

Cross warp convergence is very very interesting as raw throughput has potential to increase massively *if* the algorithm being used allows for the latencies (dependencies) involved. Potential for diverging threads from different threadblocks to be executed as one warp to maximize occupancy etc etc
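As a rough illustration of the potential win (a toy model, not Volta's actual mechanism): count issue slots for 1024 threads alternating between two branch targets, with masked per-warp execution versus hypothetical cross-warp regrouping.

```python
import math

WARP = 32

def naive_issues(paths):
    # Masked SIMT: each 32-thread warp issues once per distinct path it contains.
    warps = [paths[i:i + WARP] for i in range(0, len(paths), WARP)]
    return sum(len(set(w)) for w in warps)

def repacked_issues(paths):
    # Hypothetical cross-warp repacking: same-path threads from different
    # warps regrouped into full warps before issue.
    return sum(math.ceil(paths.count(p) / WARP) for p in set(paths))

# 1024 threads alternating between two branch targets: worst case for masking.
threads = list("AB") * 512
print(naive_issues(threads))    # 64 issue slots: every warp pays for both paths
print(repacked_issues(threads)) # 32 issue slots: full warps per path
```

The gap grows with the number of distinct paths, which is why the dependency/latency caveat above matters: repacking is only free if the regrouped threads can actually run together.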
 
Actually, with the introduction of the Fermi, Kepler, and Maxwell pipelines, divergence was a major problem for them, something that annoyed many CUDA developers. GCN doesn't have this problem inherently (well, not as much, I should say; it's all relative: GCN does have the problem, just not as severe, depending on how the code is written). Those pipelines restricted the flexibility of the code developers could write. Volta looks to solve the problem entirely, and this is what is getting developers talking.

Any case we should keep this thread about Vega lol.
 
Cross warp convergence is very very interesting as raw throughput has potential to increase massively *if* the algorithm being used allows for the latencies (dependencies) involved. Potential for diverging threads from different threadblocks to be executed as one warp to maximize occupancy etc etc
Yep, which comes back to knowing what the fixed latencies are, and this is possible if they do not go back to a pre-Kepler HW scheduler, just like both of us have suggested is one difference between now and back then.
Also Nvidia's approach reduced latency.
So it does put more weight behind the concept of a big evolution of their existing SW/'compiler' scheduler.
Cheers
 
MPS has nothing to do with ACEs, it operates on an upper level in terms of hierarchy. We are talking about multiple applications using one GPU. This has nothing to do with asynchronous execution of two different kernels.
Which is exactly what ACEs do on GCN. Multiple applications and multiple simultaneous kernels are effectively the same thing. While not a guarantee of asynchronous execution, it's an important step in combination with other abilities.

The scheduling *required* (past tense) known distributions in advance to be effective, on Maxwell. On Pascal there is no need to accurately estimate execution times and partition the SMs accordingly: if my compute shader has finished while my graphics kernels are still in execution, Pascal will assign those now-idle SMs to the graphics pipeline. The NV equivalent of the ACEs would be the GMU and its 32 hardware queues.
It still requires those units to flush and presents a limitation that needs to be designed around. Usage of GMUs with graphics kernels presents some issues as well.

Allowing concurrent execution of warps from different pipelines within the SM is clearly not needed to saturate nvidia SMs. If this were not the case then how do you explain Maxwell, Pascal and now surely Volta outperforming GCN in terms of effective performance/theoretical flop ?
Unrelated aspect determining performance? The current examples of 'async' are all designed around the limitations with lots of added development work and tuning. Hitting a geometry or CPU bottleneck isn't a great example when most execution units end up idling anyways.

Yep, which comes back to knowing what the fixed latencies are, and this is possible if they do not go back to a pre-Kepler HW scheduler, just like both of us have suggested is one difference between now and back then.
Also Nvidia's approach reduced latency.
So it does put more weight behind the concept of a big evolution of their existing SW/'compiler' scheduler.
Cheers
The problem is that in a complex system latencies can't be reliably known. It also won't guarantee reduced latency. It only works with simple, predictable patterns. The same software scheduling Nvidia currently performs can largely be done in hardware or rendered irrelevant.

AMD's approach is the one used by streaming supercomputer designs. I'd expect Nvidia to move back towards a similar design as it makes enterprise work far easier. The blog seems to indicate that is the case, although still a bit vague. That design still works with the thread convergence model.
 
Which is exactly what ACEs do on GCN. Multiple applications and multiple simultaneous kernels are effectively the same thing. While not a guarantee of asynchronous execution, it's an important step in combination with other abilities..

The GPU global scheduler makes no distinctions about what application spawned which kernel; it simply manages execution of each. MPS has absolutely nothing to do with this: MPS is a high-level driver component at the moment. With Volta it gains hardware-accelerated functions; it is a virtualization- and HPC-oriented feature.
It still requires those units to flush and presents a limitation that needs to be designed around. Usage of GMUs with graphics kernels presents some issues as well.

Why would you need to flush an SM when it has finished execution? Even on GCN, if you were to pipe in a high-priority compute kernel you would have to either preempt the running threads (not certain if this can be at SIMD granularity, aka 4 per CU) or wait for them to finish executing. Pascal has pixel-level and instruction-level preemption for graphics and compute kernels respectively. I didn't understand the part in bold; the GMU is just the global scheduler, an ARM-based microcontroller which has been redesigned for Volta. I remember watching an Nvidia engineer talk about their plans for the next-generation (vs Maxwell) microcontroller.
Unrelated aspect determining performance? The current examples of 'async' are all designed around the limitations with lots of added development work and tuning. Hitting a geometry or CPU bottleneck isn't a great example when most execution units end up idling anyways..

I don't understand the question. You can argue that current implementations are effectively just using async compute to work around architectural limitations; this is mostly relegated to GCN, with Pascal exhibiting very small gains from it anyway. I agree that there could, and hopefully will be, more interesting use cases of the feature, but I struggle to see how it will change anything, as in all likelihood it will continue to be used to fill large gaps in shader utilization that simply don't exist on the Nvidia side. Vega is the wildcard here.
The problem is that in a complex system latencies can't be reliably known. It also won't guarantee reduced latency. It only works with simple, predictable patterns. The same software scheduling Nvidia currently performs can largely be done in hardware or rendered irrelevant..

Latencies are fixed; the only variation will come from branching. With Volta, what has been confirmed is that state information for each running thread is separate, unlike previous designs where state information was kept per warp. The new scheduling logic exists to converge those branching threads into warps that can run with maximum occupancy, instead of having half the warp masked out, for example. The scheduling logic is there to arbitrate dispatches and account for dependencies in scheduled warps. This has nothing to do with async compute, or MPS, and there is no AMD equivalent to my knowledge.
AMD's approach is the one used by streaming supercomputer designs. I'd expect Nvidia to move back towards a similar design as it makes enterprise work far easier. The blog seems to indicate that is the case, although still a bit vague. That design still works with the thread convergence model.

Can you elaborate?
 
well it came in.


RX 580s, soon to be Vegas, if Vega is 2x the flops and in a reasonable price range (under 650 bucks)
 

http://www.amd.com/en-us/press-releases/Pages/amd-to-highlight-2017may25.aspx


AMD To Highlight New Products and PC Designs During Computex 2017 Press Conference and Webcast

SUNNYVALE, Calif. 5/25/2017
AMD (NASDAQ: AMD) today announced it will hold a press conference and live webcast during Computex 2017 in Taipei, Taiwan. The event will begin on Wednesday, May 31, 2017 at 10 a.m. local time / Tuesday, May 30, 2017 at 10 p.m. EDT. The one-hour event will feature appearances by AMD technology partners and updates on current and upcoming AMD products, including never-before-seen hardware demonstrations shown by AMD President and CEO Dr. Lisa Su and Senior Vice President and General Manager, Computing and Graphics Business Group, Jim Anderson.

A live webcast of the event will be accessible on the AMD Computex webpage. A replay of the webcast can be accessed a few hours after the conclusion of the live event and will be available for one year after the event.
 

No Raja? I am thinking they are removing him from stage, the guy is really bad at presenting a product. Also I am thinking we might see Vega launched at e3 instead of computex. This does not look like a launch event. They seem to be hiding raja lol.
 
More like they have only a few working samples and they'd rather not give Raja the opportunity to drop it
 
Most tech nerds I know (myself included) can be very bad at presentations (hey, PR exists for a reason). The nVidia releases have been cringe-worthy as well.
 
No, it's more like they are launching the FE edition first to get top dollar from fanboys, then a month later launching the gaming versions. Sad part is it will likely work lol.
 
Forgot to add, "Learn more about AMD Ryzen™ processors, the new Radeon™ Vega Frontier Edition graphics card..."

They don't say learn about the new RX Vega graphics card.

I'm thinking it's because there is no RX Vega product page for them to link to. Raja was pretty clear in that Reddit Q+A that there would be RX Vega information at Computex. In fact, I'm pretty sure it was the first thing he wrote to stem the tide of RX Vega questions.
 
I doubt even the most diehard AMD fanboys are going to buy 1000$+++ FE
 
It's not like they are making a million of these and there are more than enough stupid people in the world. People bought titan x for 1200 didn't they? Don't tell me enough people won't buy this for thousand for a pro card. There are always people like that man.
 
I expect this to cost a good bit more than 1000$ considering the prices NV's equivalents command, but it all depends on how well it competes. It would be pretty sad to see AMD's high-end pro solution costing as much as a Titan X tbh.
 
Down the pike, whenever that will be, there should be Vega X2 with SSD - those will probably be the rather big $$$$$ cards.

Fury X and Nano, as far as I recall, were all reference designs; only the Fury did AIBs have dibs on redesigning. I wonder if Vega will be similar? I'd rather AMD open up the GPUs, except maybe the HPC ones, to AIBs. AIB variations can hit the spot for more people due to differences in size, clock speeds, and price, and they add competition, getting us better pricing and of course more fun. Plus you have marketing from many more companies all pushing these GPUs, which is better for AMD. When 3dfx decided to make their own cards and exclude the AIBs, that was the proverbial nail in the coffin for them. Those AIBs just went to the competition, Nvidia and ATI, and started to compete against rather than work with 3dfx.
 


Well, with the 3 different versions of Vega coming out, there probably will be ones that can be modified by AIBs. I expect the watercooled and top-end Vega to stay with the reference design, just for brand recognition.
 
Will I be able to order a Vega GPU on Tuesday?
If not, then I only have one question left and you already know what it is.

where's vega?
 
I hope AIBs will be able to customize the full Vega chip and not a cut-down version like on Fiji. If there had been more versions of the Fury X rather than the AMD version only, it may have done better.
 
MPS has absolutely nothing to do with this, MPS is a high level driver component at the moment. With Volta it has hardware accelerated functions; virtualization and HPC oriented feature.
Which is the async problem. On AMD they track dependencies between command lists. Dispatch kernel A when condition is met and so on. It's serial execution with the ability to look ahead and pull a task forward. It avoids the GPU having to notify the CPU a job finished and kernels can run out of order with unpredictable behavior. A kernel with an early out for example could complete quickly or run forever, making prediction difficult. The Nvidia model would queue a kernel behind it that hopefully covered that worst case, which won't always be possible.
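A toy sketch of that look-ahead behavior (names and numbers illustrative, not any vendor's actual command processor): an in-order queue where one kernel waits on an external fence, run with and without the ability to pull an independent kernel forward.

```python
# Kernel "B" must wait for an external fence that signals at t=50;
# kernel "C" behind it has no dependencies.  Entirely invented timings.
FENCE_T = 50
kernels = [("B", 20, True),    # (name, duration, waits_on_fence)
           ("C", 30, False)]

def finish_time(lookahead):
    t, pending = 0, list(kernels)
    while pending:
        head = pending[0]
        runnable = head
        if head[2] and t < FENCE_T and lookahead:
            # Look ahead: pull forward the first kernel not gated on the fence.
            runnable = next((k for k in pending if not k[2]), head)
        if runnable[2] and t < FENCE_T:
            t = FENCE_T          # stall until the fence signals
        t += runnable[1]
        pending.remove(runnable)
    return t

print(finish_time(lookahead=False))  # 100: idle 0-50, B 50-70, C 70-100
print(finish_time(lookahead=True))   # 70: C runs 0-30, stall 30-50, B 50-70
```

The "early-out kernel" problem in the paragraph above is exactly what this model cannot capture: if the fence time were unknowable in advance, a static schedule built around a worst-case estimate would leave the idle gap in place.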

Why would you need to flush an SM when it has finished execution? Even on GCN if you were to pipe in a high priority compute kernel you would have to either preempt the running threads (not certain if this can be at SIMD granularity aka 4 per CU) or wait for them to finish executing. Pascal has pixel and instruction level preemption for graphics and compute kernels respectively. I didn't understand the part in bold, GMU is just the global scheduler, an ARM design chip which has been redesigned for Volta. I remember watching an nvidia engineer talk about their plans for the next generation (vs maxwell) microcontroller
Because a small or depleted thread group could be scheduled, leading to inefficient use. A single pixel could hold up the entire unit. If a graphics kernel is executing with a single warp remaining, a following compute kernel couldn't schedule work to keep occupancy high. Even worse, that single pixel could be stuck on a series of dependent memory accesses, at which point it can't execute most cycles, so it just sits there. The same behavior could occur with synchronization, where it stalls awaiting a condition.
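To make the straggler effect concrete, here is a toy timeline (slot counts and durations invented for illustration): a unit with four warp slots finishes a graphics kernel whose last warp straggles, and a following compute kernel either waits for a full drain or backfills the freed slots.

```python
import heapq

def completion(graphics_end, compute_warps, warp_time, hard):
    # graphics_end: when each warp slot of the graphics kernel retires.
    if hard:
        # Hard transition: compute may start only after the last graphics warp.
        free = [max(graphics_end)] * len(graphics_end)
    else:
        # Soft transition: each slot is backfilled the moment it frees up.
        free = list(graphics_end)
    heapq.heapify(free)
    done = 0
    for _ in range(compute_warps):
        t = heapq.heappop(free)      # earliest-free slot takes the next warp
        done = t + warp_time
        heapq.heappush(free, done)
    return max(done, max(graphics_end))

g = [10, 10, 10, 100]   # one straggler stuck on dependent memory accesses
print(completion(g, 8, 10, hard=True))   # 120: everything waits on the straggler
print(completion(g, 8, 10, hard=False))  # 100: compute hides under the tail
```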

On GCN, a SIMD is up to 10 waves being cycled through execution. That's the streaming-supercomputer model with all work packetized. Every fourth cycle a new wave is executing, so preemption is pointless unless the CU is fully occupied. A prioritization process determines that order in hardware. In the case preemption was required with full occupancy, waves could be suspended and spilled to memory to make room, but a more efficient method would be to reserve space in anticipation. That's assuming pausing dispatch of one kernel and dispatching the higher-priority one is insufficient. It really depends on the high-priority task.
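The wave-cycling point can be sketched numerically (issue and wait figures invented, not real GCN timings): a single resident wave leaves the SIMD idle while it waits on memory, whereas ten resident waves cover each other's stalls.

```python
# Each wave issues for ISSUE cycles, then waits WAIT cycles on memory,
# repeated ROUNDS times.  The SIMD picks any ready wave each step.
ISSUE, WAIT, ROUNDS = 4, 40, 8

def busy_fraction(n_waves):
    next_ready = [0] * n_waves        # first cycle each wave may issue again
    left = [ROUNDS] * n_waves
    t = busy = 0
    while any(left):
        ready = [i for i in range(n_waves) if left[i] and next_ready[i] <= t]
        if ready:
            w = ready[0]              # round-robin-style pick
            busy += ISSUE
            t += ISSUE
            next_ready[w] = t + WAIT  # wave goes off to wait on memory
            left[w] -= 1
        else:
            # Nothing ready: the SIMD idles until the next wave returns.
            t = min(next_ready[i] for i in range(n_waves) if left[i])
    return busy / t

print(round(busy_fraction(1), 2))   # 0.1: one wave mostly waits on memory
print(round(busy_fraction(10), 2))  # 0.92: other waves' issue hides the waits
```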

The async issue is that Nvidia is a hard transition while AMD a soft one under common circumstances. That's also where the concurrency benefits come from with the ability to naturally schedule complementary bottlenecks. One kernel stalling likely results in another getting scheduled to reach an equilibrium.

I don't understand the question, and you can argue that current implementations are effectively just using async compute to work around architectural limitations, this is mostly relegated to GCN, with Pascal exhibiting very small gains from it anyway.
As I explained above, it's more than just architectural limitations. Nearly all workloads won't map naturally to hardware: kernels could be ALU-, bandwidth-, or cache-bound, etc. One kernel waiting on memory fetches means another just wanting to execute a lot of math from cache is free to do so. Nvidia's issue is these ratios need to be anticipated, while AMD's HWS in conjunction with the ACEs will do it more naturally: poll real metrics from a CU and schedule a kernel that should balance out the work.

Latencies are fixed; the only variation will come from branching. With Volta, what has been confirmed is that state information for each running thread is separate, unlike previous designs where state information was kept per warp. The new scheduling logic exists to converge those branching threads into warps that can run with maximum occupancy, instead of having half the warp masked out, for example. The scheduling logic is there to arbitrate dispatches and account for dependencies in scheduled warps. This has nothing to do with async compute, or MPS, and there is no AMD equivalent to my knowledge.
Instruction latencies are fixed, but memory latencies due to resource contention will be unpredictable.

Volta to my understanding is more like operating on a 1024 wide warp with a 32 wide SIMD and the ability to shift lanes to repack everything. Far less remainders from divergence in only a 32 thread warp. If you make 32 thread groups it would behave like previous generations.

What's still unclear is if it alternates groups from different kernels like GCN while doing that thread convergence thing. As I tried to point out before, that should tie into the hardware MPS feature to present the different kernels. That could offer a really nice mix of capabilities as inherent bottlenecks would balance out and execution resources used more efficiently.
 
Which is the async problem. On AMD they track dependencies between command lists. Dispatch kernel A when condition is met and so on. It's serial execution with the ability to look ahead and pull a task forward. It avoids the GPU having to notify the CPU a job finished and kernels can run out of order with unpredictable behavior. A kernel with an early out for example could complete quickly or run forever, making prediction difficult. The Nvidia model would queue a kernel behind it that hopefully covered that worst case, which won't always be possible.

MPS DOESN'T EVEN WORK in Windows! It's not even analogous to async compute; it's for multi-application functions. It's also a driver-level function that has been there since Fermi, so I don't know why the hell you are still talking about it!


Because a small or depleted thread group could be scheduled, leading to inefficient use. A single pixel could hold up the entire unit. If a graphics kernel is executing with a single warp remaining, a following compute kernel couldn't schedule work to keep occupancy high. Even worse, that single pixel could be stuck on a series of dependent memory accesses, at which point it can't execute most cycles, so it just sits there. The same behavior could occur with synchronization, where it stalls awaiting a condition.

If that is happening, the program was written wrong

The async issue is that Nvidia is a hard transition while AMD a soft one under common circumstances. That's also where the concurrency benefits come from with the ability to naturally schedule complementary bottlenecks. One kernel stalling likely results in another getting scheduled to reach an equilibrium.

With Pascal there is no hard transition; the latency of the repartitioning on Pascal is what gives GCN an advantage. Keep this in mind: GCN has about 1/2 the latency difference, and also needs 1/2 of the resources, because of its finer grain.

As I explained above, it's more than just architectural limitations. Nearly all workloads won't map naturally to hardware: kernels could be ALU-, bandwidth-, or cache-bound, etc. One kernel waiting on memory fetches means another just wanting to execute a lot of math from cache is free to do so. Nvidia's issue is these ratios need to be anticipated, while AMD's HWS in conjunction with the ACEs will do it more naturally: poll real metrics from a CU and schedule a kernel that should balance out the work.

No you did not explain it, because that is not what was going on.


Instruction latencies are fixed, but memory latencies due to resource contention will be unpredictable.

Volta? Nada, that isn't how they stated it, so we can't even assume that.

Volta to my understanding is more like operating on a 1024 wide warp with a 32 wide SIMD and the ability to shift lanes to repack everything. Far less remainders from divergence in only a 32 thread warp. If you make 32 thread groups it would behave like previous generations.

Where did that come from?

What's still unclear is if it alternates groups from different kernels like GCN while doing that thread convergence thing. As I tried to point out before, that should tie into the hardware MPS feature to present the different kernels. That could offer a really nice mix of capabilities as inherent bottlenecks would balance out and execution resources used more efficiently.

And as I pointed out and showed you documentation, MPS is nothing like what is on GCN, nor does it have to with async compute.
 
If that is happening, the program was written wrong
Asynchronously with DX12/Vulkan? Yeah I could see why you would call that written wrong.

MPS DOESN'T EVEN WORK in Windows! It's not even analogous to async compute; it's for multi-application functions. It's also a driver-level function that has been there since Fermi, so I don't know why the hell you are still talking about it!
So multiple applications are different from multiple independent kernels in the same program how exactly? It really looks like you haven't the faintest clue wtf you're talking about here. This shit isn't that difficult to understand either.

With Pascal there is no hard transition; the latency of the repartitioning on Pascal is what gives GCN an advantage. Keep this in mind: GCN has about 1/2 the latency difference, and also needs 1/2 of the resources, because of its finer grain.
So you now have evidence that Pascal can perform graphics and compute jobs concurrently on a SIMT? Because I haven't seen anyone in the entire industry manage that yet. If that can't occur it will be a hard transition. They run either compute or graphics until they finish and move on. The SM granularity is better than Maxwell, but still leaves open the possibility of unfortunate code paths alternating between compute and graphics. Even heterogeneous compute tasks would be difficult without explicit coding.

Where did that come from?
Volta transforms this picture by enabling equal concurrency between all threads, regardless of warp. It does this by maintaining execution state per thread, including the program counter and call stack, as Figure 11 shows.
Nvidia's blog on Volta? It should be the Independent Thread Scheduling section, but I'll take a look at it again. There are only so many ways all threads can execute the same instruction in SIMT while running independently and diverging. Threads from different warps in the same kernel will share instruction paths while only the addressing differs. The only other option is that the SIMT width is much narrower, bordering on a bunch of scalars, maybe 4- or 8-wide SIMTs, but given the "regardless of warp" the joining seems more likely.

And as I pointed out and showed you documentation, MPS is nothing like what is on GCN, nor does it have to with async compute.
Thanks for finally acknowledging Pascal doesn't have the hardware to properly support async.

MPS is not needed to have multiple kernels executing concurrently
Out of order and without context switching it would be, and that's the issue. Waiting for one queue to flush before beginning another is insufficient for concurrency. It would be on par with targeting concurrency between the last 1% of one kernel and the first 1% of the next as they overlap. Assuming two kernels of a million threads each, only so many will be in flight at any given time.
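That "last 1% / first 1%" point can be put in rough numbers (the 20,000 threads-in-flight figure below is a hypothetical, chosen only to illustrate the arithmetic): under a flush-then-start model, the only possible overlap is between the draining tail of kernel A and the starting head of kernel B, bounded by how many threads the device keeps resident at once.

```python
# Back-of-envelope model: if kernel B may only start once kernel A's
# queue has drained, the only candidate for concurrency is one
# resident wave's worth of threads out of the two kernels' total work.

def overlap_fraction(threads_per_kernel, threads_in_flight):
    """Upper bound on the share of combined work that could ever
    overlap under a flush-then-start scheme."""
    return threads_in_flight / (2 * threads_per_kernel)

# Two million-thread kernels, a hypothetical 20k threads resident:
print(f"{overlap_fraction(1_000_000, 20_000):.2%}")  # 1.00%
```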
 
Asynchronously with DX12/Vulkan? Yeah I could see why you would call that written wrong.

MPS is not for that! Do you understand what multi-application needs are? Have you ever looked at how an OS and applications share the same GPU resources across multiple applications?

Just an example: when running an AI application that renders, you will need to use two different applications, one a satellite of the other, but using the same resources (using this example because it was shown off by nV, so it's easy to understand).

Another example: a trading floor using AI to give suggestions on stocks to buy. Predication is done by one application, prediction by another, and number crunching by another: three separate databases and three separate programs with the same resource allocation across the same GPU or GPUs. Data locality is important here, so resources must be shared based on that.

Those are the things MPS is for.
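As a loose analogy only (the real MPS server is a CUDA driver component, not anything like this sketch), the multi-application scenario boils down to a server funneling launches from separate client processes into one shared device context, so their work is resident together instead of each process owning a context that gets time-sliced whole. The client names below are hypothetical.

```python
from collections import deque

# Rough analogy of MPS-style funneling: several clients submit work
# into a single shared context rather than each holding its own.

class SharedContext:
    def __init__(self):
        self.queue = deque()

    def submit(self, client, kernel):
        # A launch from any client lands in the one shared queue.
        self.queue.append((client, kernel))

    def drain(self):
        # Work from different clients coexists in one context;
        # order here is simply submission order.
        return [f"{client}:{kernel}" for client, kernel in self.queue]

ctx = SharedContext()                 # the single server-owned context
ctx.submit("renderer", "shade")       # hypothetical client processes
ctx.submit("inference", "gemm")
ctx.submit("renderer", "post")
print(ctx.drain())                    # ['renderer:shade', 'inference:gemm', 'renderer:post']
```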


So multiple applications are different from multiple independent kernels in the same program how exactly? It really looks like you haven't the faintest clue wtf you're talking about here. This shit isn't that difficult to understand either.

As above.


So you now have evidence that Pascal can perform graphics and compute jobs concurrently on a SIMT? Because I haven't seen anyone in the entire industry manage that yet. If that can't occur, it will be a hard transition: they run either compute or graphics until they finish, then move on. The SM granularity is better than Maxwell's, but it still leaves open the possibility of unfortunate code paths alternating between compute and graphics. Even heterogeneous compute tasks would be difficult without explicit coding.


Fire up Doom, Ashes, or GOW4 and profile them; you will see they are, lol. Even MDolenc's concurrency tester shows it does too.


Nvidia's blog on Volta? It should be the Independent Thread Scheduling section, but I'll take a look at it again. There are only so many ways all threads can execute the same instruction in SIMT while running independently and diverging. Threads from different warps in the same kernel will share instruction paths while only the addressing differs. The only other option is that the SIMT width is much narrower, bordering on a bunch of scalars. Maybe 4- or 8-wide SIMTs, but given the "regardless of warp" wording, the joining seems more likely.

You didn't hear the part where nV's Volta doesn't need sync points? We don't know enough about Volta yet to draw any conclusions. The compute pipeline for the shader array seems vastly more flexible and superior to anything nV has done before.

Thanks for finally acknowledging Pascal doesn't have the hardware to properly support async.

Read above. You are making very astute conclusions based off of very wrong information ;). You are asserting that nV's hardware functions like AMD's hardware, or that nV's pathways have to be similar in function to AMD's pathways to work properly. THEY DO NOT, nor do they even have to. That premise is biasing your understanding of what may be going on and taking you in the wrong direction.

Out of order and without context switching it would be, and that's the issue. Waiting for one queue to flush before beginning another is insufficient for concurrency. It would be on par with targeting concurrency between the last 1% of one kernel and the first 1% of the next as they overlap. Assuming two kernels of a million threads each, only so many will be in flight at any given time.

NOPE that is not it.
 
And to think this all started because some said Volta has gone back to being HW scheduler just like pre-Kepler :)
Not sure why some are also comparing it to GCN in terms of scheduler/thread implementation.
Cheers
 