TechSpot Tests a Multi-GPU Setup in 2018

The question is what's good enough to make it appear as one GPU. They already store textures on each card, which accounts for a lot of the bandwidth.

I don’t know the answer, but what I know for certain is I won’t touch mGPU until it appears as one GPU and “just works”.

Realistically? I doubt it will ever get to the point where the benefits of having combined VRAM outweigh simply having cards with double the RAM and mirroring assets across each card.

Case in point: AMD's Threadripper and its optimal memory allocation. There are performance benefits to making sure that each core is working from its own memory bank, and that's on a single package with dies sitting right next to each other.

Also, with the direction development is being pushed, we are getting further and further away from the "just works" scenario. DX12 and Vulkan are moves away from that paradigm.
 
I was saying mirroring assets is fine, so we don’t need full VRAM speeds, just a link fast enough to transfer whatever is needed to make it appear as one GPU. I don’t know enough to say how close we are or whether mirroring even helps for that case, but what I do know is I’ll have nothing to do with mGPU as it is now.
 
Really, the only point of pooling VRAM is to be able to split the assets across multiple GPUs. As I recall, the rendered frame itself doesn't take much memory; it's the assets that take up most of it, and it's those assets the GPU core has to work from in order to render the frame. Transferring the finished frame between cards isn't that big a deal.
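For a rough sense of scale (the colour-buffer format and the asset figure below are assumptions for illustration, not numbers from any particular game):

```python
# Rough back-of-the-envelope comparison, assuming a plain 32-bit RGBA colour
# buffer at 4K. Real renderers keep several targets (depth, G-buffer, etc.),
# but even a handful of them is tiny next to the assets resident in VRAM.
WIDTH, HEIGHT = 3840, 2160
BYTES_PER_PIXEL = 4                      # 8-bit RGBA

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(f"One 4K colour buffer: {frame_bytes / 2**20:.1f} MiB")       # ~31.6 MiB

asset_bytes = 6 * 2**30                  # hypothetical 6 GiB of textures/meshes
print(f"Assets outweigh the frame by ~{asset_bytes / frame_bytes:.0f}x")
```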

With alternate frame rendering (the current SLI/CrossFire implementation):
GPU1 starts rendering a frame
GPU2 guesses what the next frame will be, based on GPU1's frame and how long it might take to render
GPU1 finishes and spits out its frame
GPU2 confirms its guess for the next frame
GPU1 guesses the frame after that and starts rendering it
GPU2 finishes its frame and hands it to GPU1 to spit out to the monitor
and repeat

The problem comes when the guessed frame doesn't match what actually needs to be rendered and has to be discarded and redone. This results in frametime spikes and microstutter. Drivers have to be written carefully to minimize it, and how well that works differs from game to game and from GPU setup to GPU setup.
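Here is a minimal sketch of the AFR idea described above, assuming two hypothetical GPU handles and a render() call rather than any real driver API:

```python
# Minimal sketch of alternate frame rendering (AFR): whole frames are handed
# out round-robin, so every GPU needs the full asset set mirrored in its own
# VRAM. The GPU class and render() call are hypothetical stand-ins, not a real API.
from dataclasses import dataclass

@dataclass
class GPU:
    name: str

    def render(self, frame_id):
        return f"{self.name} rendered frame {frame_id}"

def afr(frame_count, gpus):
    # Frame i goes to GPU i % len(gpus). If what frame i+1 should contain only
    # becomes known after frame i finishes, the other GPU is working from a
    # guess; a wrong guess means re-rendering, which shows up as a frametime
    # spike (microstutter).
    return [gpus[i % len(gpus)].render(i) for i in range(frame_count)]

print(afr(4, [GPU("GPU1"), GPU("GPU2")]))
# ['GPU1 rendered frame 0', 'GPU2 rendered frame 1',
#  'GPU1 rendered frame 2', 'GPU2 rendered frame 3']
```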

With split frame rendering:
GPU1 takes one half of the frame and GPU2 takes the other half
GPU2 sends its rendered half to GPU1, and GPU1 combines the two halves before spitting the frame out.

The problem with split frame rendering is one of balancing workload. For example, a horizontal frame split may cause one GPU to be overworked rendering everything on the ground while the other is mostly idle rendering a clear sky.
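A sketch of that balancing problem, with made-up per-row render costs purely to show how a naive 50/50 screen split can leave one GPU nearly idle (nothing here reflects a real engine):

```python
# Minimal sketch of split frame rendering (SFR): each GPU renders a horizontal
# slice of the same frame, then the slices are composited. The per-row costs
# are invented purely to illustrate the load imbalance described above.
def sfr_loads(row_costs, n_gpus=2):
    slice_size = len(row_costs) // n_gpus
    return [sum(row_costs[g * slice_size:(g + 1) * slice_size])
            for g in range(n_gpus)]

# Top half: cheap, empty sky. Bottom half: expensive ground geometry.
frame = [1] * 540 + [10] * 540           # hypothetical per-row costs, 1080-row frame
print(sfr_loads(frame))                  # [540, 5400] -> one GPU does ~10x the work
```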

Ultimately, it comes down to properly managing the workload across the GPUs, and that means Nvidia and AMD putting in the effort to write drivers properly in the case of DX11 games, and game developers putting in the proper effort for DX12/Vulkan games. I don't see that happening anytime soon.
 
Your Threadripper example isn't entirely accurate. Two of the CCX modules on a Threadripper have the PCIe and memory controllers. The other two CCX modules lack them and thus have to send data over Infinity Fabric to the CCX modules that do have the controllers. The other issue you can run into is that each of the CCX modules with a memory controller has access to 2 memory channels, not all 4. Having said that, each CCX has an Infinity Fabric link to every other CCX module. If each CCX had its own memory controller and memory channels, along with an Infinity Fabric connection to every other CCX module, then it wouldn't matter which core or memory bank your workload used.

The reason I bring this up is that the same method could be used with multiple video cards through an NVLink-style connector to enable pooled resources with no latency penalty.
 
I don't see how each CCX having its own memory controller eliminates the performance penalty incurred by pulling data from another CCX's memory controller. It's the exact same scenario as a CCX without a memory controller pulling from one that has one. The only way to avoid that penalty is to have the RAM contents being worked on mirrored on each channel, which is exactly what current SLI/CrossFire implementations do.
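A toy model of that argument, with made-up latency numbers (not measurements of Infinity Fabric, NVLink, or anything else), just to show why pooled memory pays the interconnect hop unless the data is mirrored:

```python
# Toy model of the NUMA argument above. Even if every die had its own memory
# controller, a core still pays the interconnect hop whenever the data it needs
# sits behind another die's controller. LOCAL_NS and REMOTE_NS are invented,
# illustrative values, not real Threadripper or GPU figures.
LOCAL_NS, REMOTE_NS = 80, 140

def total_latency(accesses, mirrored=False):
    """accesses: list of (core_node, data_node) pairs."""
    cost = 0
    for core_node, data_node in accesses:
        if mirrored or core_node == data_node:
            cost += LOCAL_NS             # data is in the local bank (or a local mirror)
        else:
            cost += REMOTE_NS            # data has to cross the interconnect
    return cost

workload = [(0, 0), (0, 1), (1, 0), (1, 1)]
print("pooled  :", total_latency(workload), "ns")                  # pays remote hops
print("mirrored:", total_latency(workload, mirrored=True), "ns")   # avoids them, costs 2x memory
```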
 
Ahh, I do see what you are saying about not being able to have *pooled* memory in this specific setup without a performance hit... You are indeed correct on that. I misinterpreted what you initially said because of the "performance benefits to making sure that each core is working from its own memory bank" line, since no Threadripper is actually configured that way.
 
In your opinion, which setup has more GPU cores in it:
  1. A single 2080 Ti with 22 GB of VRAM, or
  2. Two 2080 Tis with 11 GB of VRAM each
Since 4K gaming and RT are compute-limited rather than scratch-paper-limited, doubling the compute capability seems like it would outweigh doubling the amount of scratch paper available to the GPU.
 
You still don't get it. You can either try rereading my posts or continue to plug your ears and live in your fantasy world. I've already explained enough times why you're wrong and why things don't work the way you think they do.
 
I expect an apology from you if somebody ever invents a multi-GHz wireless communication system.
 