Why haven't we seen dual-chip cards that share memory?

So we've had dual GPU video cards for quite some time now - why don't we have video cards that can share the memory between both GPUs? This seems like it would be a win/win for both manufacturers and for consumers. You wouldn't need to needlessly double RAM on a video card just to duplicate the same information for each chip, and you could build video cards with more memory (like a 4GB GTX 690, which would currently require 8GB effectively) for much cheaper.

What is the technological hurdle that we haven't jumped over yet?
 
The technological hurdle is building that logic into the GPUs themselves. And why do that when 0.1% of all your customers will be running more than one at a time?
 
The technological hurdle is building that logic into the GPUs themselves. And why do that when 0.1% of all your customers will be running more than one at a time?
I realize that they would have to develop that logic; my question is why it hasn't happened already. Considering that both AMD and NVIDIA have consistently released dual-chip flagship cards for the past several generations, I would imagine it's not just laziness. If they applied your logic to their business, why even bother with high-end cards (>$300-350) at all? I'm sure that's a tiny fraction of their overall graphics revenue.
 
You'd have to route dual-port RAM and two 256-bit buses (or even 512-bit depending on the chip...big Kepler might have this) on the PCB, which would be difficult as hell; then we'd have to talk about additional memory management support on the silicon itself, and then we'd have to add some features into the drivers to manage the memory as well. Either that, or you'd have to invent a high-speed bus between the two chips that can support 200GB/s or thereabouts, and you'd still have additional work to do with silicon and drivers. It's probably cheaper to just buy 2x the RAM and split it evenly between the two GPUs than to invest all that time and money into a halo product that is going to have relatively low sales compared to the bread-and-butter midrange cards.
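Rough back-of-the-envelope for where that ~200GB/s figure comes from, assuming GTX 680-class numbers (256-bit bus, 6Gbps-effective GDDR5):

256 bits / 8 = 32 bytes per transfer, and 32 bytes x 6 GT/s = 192 GB/s

So any inter-GPU link that wants to stand in for local vRAM has to be in that ballpark, which is why "just use the existing bus" doesn't cut it.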
 
You'd have to route dual-port RAM and two 256-bit buses (or even 512-bit depending on the chip...big Kepler might have this) on the PCB, which would be difficult as hell; then we'd have to talk about additional memory management support on the silicon itself, and then we'd have to add some features into the drivers to manage the memory as well. Either that, or you'd have to invent a high-speed bus between the two chips that can support 200GB/s or thereabouts, and you'd still have additional work to do with silicon and drivers. It's probably cheaper to just buy 2x the RAM and split it evenly between the two GPUs than to invest all that time and money into a halo product that is going to have relatively low sales compared to the bread-and-butter midrange cards.

There ya have it. Those dual-GPU cards are SO low production that it's just not worth the effort.
 
You'd have to route dual-port RAM and two 256-bit buses (or even 512-bit depending on the chip...big Kepler might have this) on the PCB, which would be difficult as hell; then we'd have to talk about additional memory management support on the silicon itself, and then we'd have to add some features into the drivers to manage the memory as well. Either that, or you'd have to invent a high-speed bus between the two chips that can support 200GB/s or thereabouts, and you'd still have additional work to do with silicon and drivers. It's probably cheaper to just buy 2x the RAM and split it evenly between the two GPUs than to invest all that time and money into a halo product that is going to have relatively low sales compared to the bread-and-butter midrange cards.
I guess that does make sense given how cheap RAM is these days. Wasn't ATI working on something like this though, called SidePort?
 
So we've had dual GPU video cards for quite some time now - why don't we have video cards that can share the memory between both GPUs? This seems like it would be a win/win for both manufacturers and for consumers. You wouldn't need to needlessly double RAM on a video card just to duplicate the same information for each chip, and you could build video cards with more memory (like a 4GB GTX 690, which would currently require 8GB effectively) for much cheaper.

What is the technological hurdle that we haven't jumped over yet?

NUMA latency.
 
There's no point in shared memory when all GPUs have integrated memory controllers and they can communicate with each other using a fast bus (like PCIe). In theory you can already "share" memory now using this architecture by pulling data from another card's memory over PCIe (note this is how multiprocessor systems work, they just use QPI/HT instead of PCIe), but the latencies are so huge it's not realistic for GPU workloads.

If you want "real" shared memory you have to introduce a separate northbridge chip that adds tons of latency and is completely useless for single GPUs. Being memory as cheap as it is it's not a big deal to just store a copy of every texture for every GPU.
 
There's no point in shared memory when all GPUs have integrated memory controllers and they can communicate with each other using a fast bus (like PCIe). In theory you can already "share" memory now using this architecture by pulling data from another card's memory over PCIe (note this is how multiprocessor systems work, they just use QPI/HT instead of PCIe), but the latencies are so huge it's not realistic for GPU workloads.

If you want "real" shared memory you have to introduce a separate northbridge chip that adds tons of latency and is completely useless for single GPUs. Being memory as cheap as it is it's not a big deal to just store a copy of every texture for every GPU.

It is more than latency. PCIe = 4GB/s. 256-bit Kepler bus = 192GB/s. Not even close to being useful for inter-GPU rendering communication that involves anything more than basic synchronization and high-level management of rendering code.
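To put numbers on that gap: moving a hypothetical 64MB render target (size picked purely for illustration) once would take

64MB / 4GB/s ≈ 16 ms over PCIe, vs. 64MB / 192GB/s ≈ 0.33 ms out of local vRAM

and at 60fps you only have ~16.7ms for the entire frame, so a single remote copy over PCIe eats the whole frame budget.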
 
It is more than latency. PCIe = 4GB/s. 256-bit Kepler bus = 192GB/s. Not even close to being useful for inter-GPU rendering communication that involves anything more than basic synchronization and high-level management of rendering code.

I was just trying to make the point that even with a proper SMP architecture and a wide inter-GPU bus, the latencies still make it unworkable.
 
Thing is, you don't really need to share memory, since each GPU renders a different frame.

SFR is rarely used nowadays, but it would benefit from shared memory.
 
I guess that does make sense given how cheap RAM is these days. Wasn't ATI working on something like this though, called SidePort?

http://www.anandtech.com/show/2937/5

It was a better way to interconnect the GPUs.

ATI also had the Rage Fury MAXX, which I believe shared memory, but they could never get good support for their dual-VPU card and it ultimately felt like a Rage 128.

The benefits of having shared memory on these cards don't outweigh the engineering costs, which would delay these cards even further. Sadly...
 
SidePort was meant to address sync bottlenecks in AFR, not to share memory resources. What would be the point of a GPU allocating data in RAM that is so "far" away that the best you can do is 5GB/s, when you have a perfectly nice 128GB+/s vRAM bank right there?
 
SidePort was meant to address sync bottlenecks in AFR, not to share memory resources. What would be the point of a GPU allocating data in RAM that is so "far" away that the best you can do is 5GB/s, when you have a perfectly nice 128GB+/s vRAM bank right there?

Yeah, SidePort was such a good idea too; it would effectively have ended microstutter if it had worked as intended, but apparently it required too much power to work effectively. I mean, they were kind of opening up their own lane across the NB (I think it was handled in the NB) on their own motherboards, and in the end it had to be scrapped.

Off topic, but this didn't have anything to do with memory; I was merely quoting because someone asked about it. AMD still has this technology, but a 5GB/s lane dedicated to two GPUs is already slow.
 
Inter-processor communication bandwidth is the limit. In a dual- (or more) processor PC, each processor has its own memory that it controls. Other processors then have to talk to that processor to get access to that memory, and there are latency penalties for that. With a CPU, that's not such a big deal; GPUs, however, are so bandwidth-limited that it wouldn't be practical. The amount of bandwidth that would be needed between the chips would be prohibitive. Cheaper to just buy more memory.
 
I'm wondering why we don't have multicore GPUs yet.

GPUs are already massively parallel processors. The GTX 680, for instance, actually consists of 1536 small general compute processors as well as a few hundred special function processors.
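A toy CUDA kernel shows what "massively parallel" means in practice: one launch fans out far more threads than there are processors, and the hardware schedules them across the chip (the array size here is arbitrary):

Code:
#include <cuda_runtime.h>

// Each thread handles exactly one element; thousands of these run at once.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;              // 1M elements, arbitrary
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // 256 threads per block, enough blocks to cover n: far more threads in
    // flight than the 1536 processors, which is how the GPU hides memory latency.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}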
 
We use the excuse that NVIDIA/AMD won't add the functionality to their top-end, low-production cards, yet every single lower/mid-range to enthusiast card has one or two CrossFire/SLI connectors. To me that says that any high-end tech designed as a pipe dream for engineers trickles down to the low end eventually, much like how automobile companies use concept cars as tech testbeds for their production vehicles. Whatever crazy tech they can engineer for the tippy-top extremely high end can be used on midrange cards eventually... Example: Eyefinity.
 
Are we even sure both cards actually have the same data?
And if they use the same data, it would mean queuing when both have to access the same segment of memory, increasing latencies.
 
Are we even sure both cards actually have the same data?
And if they use the same data, it would mean queuing when both have to access the same segment of memory, increasing latencies.

That's why I said you'd have to have dual-port RAM, so both could read simultaneously without additional latency.
 
So we've had dual GPU video cards for quite some time now - why don't we have video cards that can share the memory between both GPUs? This seems like it would be a win/win for both manufacturers and for consumers. You wouldn't need to needlessly double RAM on a video card just to duplicate the same information for each chip, and you could build video cards with more memory (like a 4GB GTX 690, which would currently require 8GB effectively) for much cheaper.

What is the technological hurdle that we haven't jumped over yet?

Two reasons. Any memory controller that allowed two different processors access to the same memory location would have to be an arbiter... It would have no choice but to stall memory reads and writes for one GPU until the other GPU was done and had shown its frame. That would serialize a process that is currently parallel, and kill multi-GPU scaling by adding lots of delay to the rendering process.

The other reason is that back-to-back frames only share some data. The render targets and buffers hold data unique to each frame... and those represent the lion's share of the memory consumption. So while sharing memory might allow you to cut down on some onboard memory, it certainly wouldn't get you down to a single GPU's worth of memory, because to render some number of frames in parallel you'd need a separate copy of each of those per-frame resources per GPU. No getting around that.
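Made-up numbers, purely to illustrate the split: say a frame needs 1GB of textures/geometry (shareable in principle) plus 1GB of render targets, G-buffers and other per-frame scratch (not shareable). Then:

today (duplicate everything): 2 x (1GB + 1GB) = 4GB
hypothetical shared pool: 1GB + 2 x 1GB = 3GB

So even in the ideal case the savings are partial, and the per-frame portion is exactly the part that grows with resolution and AA.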
 
Are we even sure both cards actually have the same data?
And if they use the same data, it would mean queuing when both have to access the same segment of memory, increasing latencies.

Smart guy, asking the right questions. Only some of the data is identical, ruling out much of the benefit.
 
I guess that does make sense given how cheap RAM is these days. Wasn't ATI working on something like this though, called SidePort?

They weren't just working on it... they marketed it as a feature of one of their dual-GPU products... I think the 3870 X2. They said they'd release a driver to support it once they got it working to the point where it actually benefited performance. They never released it.
 
Yeah, SidePort was such a good idea too; it would effectively have ended microstutter if it had worked as intended, but apparently it required too much power to work effectively. I mean, they were kind of opening up their own lane across the NB (I think it was handled in the NB) on their own motherboards, and in the end it had to be scrapped.

Off topic, but this didn't have anything to do with memory; I was merely quoting because someone asked about it. AMD still has this technology, but a 5GB/s lane dedicated to two GPUs is already slow.

It would not affect microstutter in any way. Microstutter comes from the fact that you're rendering frames in parallel but then presenting them serially, which you'd still end up doing with a high-speed interconnect present. With a single GPU the rate at which you see new frames is fairly consistent because it's naturally gating/pacing itself... it gets a new frame when the last frame is done, and this new frame probably takes roughly as long as the last one. So you naturally get evenly paced/spaced updates. This isn't the case with multi-GPU.
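Quick worked example of that pacing problem (numbers invented for illustration): say each GPU takes 30ms per frame in AFR and the second GPU starts its frame 5ms after the first. Completed frames then come out at

t = 30, 35, 60, 65, 90, 95, ... ms

so the frame-to-frame gaps alternate 5ms, 25ms, 5ms, 25ms even though the average is a perfectly healthy 15ms (~66fps). That alternation is the microstutter, and a fatter link between the GPUs doesn't change when those frames come out.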
 
what shame?

not sure if serious....?

GPUs are already "multi-core" chips; e.g., each 7970 has ~2000 cores, and if you want to break it down into execution clusters or whatever they're called, each GPU has 5 cores (containing 400 execution units each), etc.
 
not sure if serious....?

GPUs are already "multi-core" chips; e.g., each 7970 has ~2000 cores, and if you want to break it down into execution clusters or whatever they're called, each GPU has 5 cores (containing 400 execution units each), etc.

I'm not exactly a tech genius but even I know that Stream Processors aren't exactly the same thing as a physical core.

And don't they have multicore GPUs in mobile phones now? So it's not like they don't exist.
 
Yup, since unified shaders, GPUs are basically massively vectorized multicore CPUs with just a little bit of hardware added for geometry setup and rasterization.
 
I'm not exactly a tech genius but even I know that Stream Processors aren't exactly the same thing as a physical core.

And don't they have multicore GPUs in mobile phones now? So it's not like they don't exist.

A "single core" GPU in a phone is just that. A single processor with fixed function hardware specifically designed for floating point intensive math. Dual core GPU on a phone is two processors with fixed function hardware for accelerating functions used in rendering (sometimes that fixed function hardware is shared between the two cores, like texture units for instance). Think of that single (dual) core as a stream processor. The phone either has one SP, or two SP. Same 'type' of design as the GTX 680 but cut down by many orders of magnitude.

A GTX 680 is 1536 individual processors with supporting fixed-function hardware shared amongst all of them (texture units, ROPs, etc.). It is already multicore.
 
A "single core" GPU in a phone is just that. A single processor with fixed function hardware specifically designed for floating point intensive math. Dual core GPU on a phone is two processors with fixed function hardware for accelerating functions used in rendering (sometimes that fixed function hardware is shared between the two cores, like texture units for instance). Think of that single (dual) core as a stream processor. The phone either has one SP, or two SP. Same 'type' of design as the GTX 680 but cut down by many orders of magnitude.

A GTX 680 is 1536 individual processors with supporting fixed-function hardware shared amongst all of them (texture units, ROPs, etc.). It is already multicore.

Okay, maybe I'm missing something here... are those 1536 physical cores? If they are, then why is there only one core clock rate and not 1536 clock rates? I'm reading about Stream Processors right now and it seems they're more like virtual processors (kinda like Hyper-Threading).
 
A "single core" GPU in a phone is just that. A single processor with fixed function hardware specifically designed for floating point intensive math. Dual core GPU on a phone is two processors with fixed function hardware for accelerating functions used in rendering.

A GTX 680 is 1536 individual processors with supporting fixed-function hardware shared amongst all of them (texture units, ROPs, etc.). It is already multicore. It's many-core. It's 1536-core.

Not exactly. Stream processors by themselves can't execute a complete shader program; they're more like the ALUs and FPUs in a CPU. Shader clusters can definitely execute a program, though. So the 680 technically has 8 cores (with 192 execution units each) if you try to compare it to a CPU.
 
Okay, maybe I'm missing something here... are those 1536 physical cores? If they are, then why is there only one core clock rate and not 1536 clock rates? I'm reading about Stream Processors right now and it seems they're more like virtual processors (kinda like Hyper-Threading).

The one clock is shared amongst all of the cores (see notes below, as even my statements have been slightly inaccurate). Intel (for instance) only has to deal with four cores with a handful of execution units per core, so they can afford to have multiple clocks.

Not exactly. Stream processors by themselves can't execute a complete shader program; they're more like the ALUs and FPUs in a CPU. So the 680 technically has 8 cores (with 192 execution units each) if you try to compare it to a CPU.

This is true. I'm thinking more about the whole CUDA thing (warp schedulers, read/write latencies, etc.) than anything. I guess you could say it's 8 individual processors (streaming multiprocessors), each with four 32-wide dual-issue vector cores (warp schedulers). There are 192 arithmetic units supporting those cores.
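If you want to see how CUDA itself carves this up, cudaGetDeviceProperties reports the multiprocessor count and a single clock for the whole chip. A minimal sketch (note the cores-per-SM number isn't in the struct; it's implied by the compute capability):

Code:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // On a GTX 680 this reports 8 multiprocessors (SMX) and one shared clock;
    // at 192 CUDA cores per SMX that's the familiar 1536 total.
    printf("%s: %d SMs, core clock %d kHz, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.clockRate,
           prop.major, prop.minor);
    return 0;
}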
 
Also, trying to compare CPU cores to GPU cores doesn't really work that well. They work in very different ways.
 