Infinity Fabric: Why Crossfire needs to die and MultiGPU can be reborn

vjhawk

Limp Gawd
Joined
Sep 2, 2016
Messages
451
AMD shocked the CPU world when they released the epic Threadripper 2990WX. This 32-core, 64-thread beast crushed all Intel CPUs in multithreaded applications, most notably Cinebench.

What makes the 2990WX possible is the technology known as Infinity Fabric.

[Image: MQJZSd5.jpg]


This unique technology allows all 32 cores of the Threadripper to gain access to the system memory.

If AMD wants to make a mark on the GPU world, they should implement Infinity Fabric to create true multi-GPU solutions and just let Crossfire die.

Crossfire doesn't succeed because developers are too lazy to dedicate coding resources to make it work when they know that 99% of the market uses single GPU solutions.

So to get around this, leverage the power of Infinity Fabric to create seamless multi-GPU solutions.

Create multi-GPU video cards that use Infinity Fabric to link two GPUs into a single virtual GPU that shares memory access across that fabric. Then design your own drivers to access and utilize this 'virtual GPU' as a single unit.

Nvidia's expansion of its price range actually creates an opportunity gap that AMD can exploit if they can successfully leverage new technology to create a compelling counter-solution.

For example, the RTX 2080 Ti runs at $1200.

If AMD can create a VX 1090 using two smaller GPUs (450-500 mm²) that outperforms the Nvidia solution (750 mm²) at $999, then they are winning the game.

At $699 they can again undercut the 2080 with another dual smaller-GPU card; let's dub it the VX 1080.

As for single-GPU solutions, the RX 1070 and RX 1060 answer the call, bringing Vega-level performance to mainstream consumers at around $299 and $199 respectively.

12 GB on the VX line, 8 and 6 GB on the RX lines, and AMD has achieved market segmentation. Of course, skip HBM2 for these consumer lines and go with GDDR6 to keep memory affordable for these components. You can even downgrade to GDDR5 or GDDR5X for lower-end models (RX 1050, 1030, etc.). Meanwhile, removing Crossfire support from its video cards reduces manufacturing costs and increases profit margins for AMD.

How does Nvidia counter? They either need to cut prices or come out with faster products at reasonable prices. A win for all consumers.

Oh, and before I go, what could be AMD's Titan killer? That would be the AMD Cosmos VX 1099, featuring 4x GPUs configured just like the 2990WX, along with 24-32 GB of HBM2 memory. It would be aimed at the enterprise market, but price it at $2499 for the 24 GB version and $2799 for the 32 GB version, and it would undercut and potentially crush its competition with sheer brute-force compute power.
 
Just a heads up. 'Pooling' GPU memory on two different cards will never be a thing unless you are working with massive datasets. The latency introduced by even IF would be detrimental to performance.

Also, people believe that memory on SLI/CFX is 'mirrored', when in truth, it isn't. Each GPU uniquely controls its own, individual memory bank with no regard for what the other GPU is doing with its memory. It's just that each is usually rendering the exact same thing, so the memory usually holds the same data.

I think if AMD (or even Nvidia) really put their backs into it, they could create a driver that can pool multiple GPUs into one "Virtual GPU" and have it be completely invisible to the software layer. No, I don't think this will be easy, or even likely, but it can be done. I would love to see a firmware/driver-level approach to mGPU.

That said, AFR is literally the best way to do mGPU; any other technique is less efficient. In well-optimized scenarios, AFR can hit nearly 99% scaling. That's rare due to a lack of software optimization, but no other method can reach that level of scaling even in theory. Think about it: you'd violate the laws of physics if you somehow combined two chips and got the power of more than two chips. With dual-GPU AFR able to reach 199% of the performance of a single card, it's achieving close to the maximum theoretical performance possible. SFR is a close candidate, but there are reasons why it isn't ideal for dual-GPU configs. Using an SFAFR (Split-Frame-Alternate-Frame-Rendering) method with quad GPUs is probably the best way to get good performance while keeping latency to a minimum (two groups of two cards alternate AFR-style, each rendering SFR frames), but AFR would achieve a higher overall FPS given the same resources.
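The scaling argument above can be sketched as a toy model. The overhead percentages below are made-up illustrative assumptions, not measurements; the point is only that AFR's per-frame sync cost is small compared to SFR's workload-split waste.

```python
# Back-of-the-envelope model of multi-GPU scaling. Overhead fractions
# are illustrative assumptions, not measured values.

def afr_fps(single_gpu_fps, n_gpus, sync_overhead=0.01):
    """Alternate Frame Rendering: each GPU renders a whole frame in
    parallel, so throughput approaches n * single-GPU FPS minus a
    small per-frame synchronization cost."""
    return single_gpu_fps * n_gpus * (1.0 - sync_overhead)

def sfr_fps(single_gpu_fps, n_gpus, split_overhead=0.15):
    """Split Frame Rendering: each frame is divided among GPUs, but
    uneven splits and duplicated geometry work waste a larger
    fraction of each GPU's time."""
    return single_gpu_fps * n_gpus * (1.0 - split_overhead)

base = 60.0
print(afr_fps(base, 2))  # ~118.8 FPS, i.e. ~199% scaling
print(sfr_fps(base, 2))  # ~102 FPS, i.e. ~170% scaling
```

With these assumed overheads, dual-GPU AFR lands near the 199%-of-one-card ceiling the post describes, while SFR trails it.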
 
I would like to say that AFR is going the way of the dodo. Pooling multiple GPUs (or CPUs), like what was done in Mantle, is the way to go: you would be able to send non-time-sensitive tasks to other computational devices. As a perk, it does not matter how many "cards" you have.
 
For a GPU, I can see Infinity Fabric being deployed for ray tracing cores or other special-purpose cores, breaking up the different chip functions. Upgrades to any of them would be much simpler than redoing a whole large chip. Infinity Fabric in Vega is very interesting and speaks loudly to its bandwidth potential. The Infinity Fabric connects the high-bandwidth cache controller and HBM2 memory to the rest of the chip, meaning Infinity Fabric is just as fast as HBM2. The four-HBM2-stack Instinct 7nm Vegas will have double the bandwidth potential with Infinity Fabric: basically 1 TB/s. Plus the large, wide 4096-bit bus allows memory transfers for a multitude of different functions (via the high-bandwidth cache controller) in the same clock cycle, preventing or limiting latency for different parts of the chip needing to send or receive data.

What Nvidia has over AMD is AI for denoising the rendered ray-traced lighting samples (since the number of rays in real life is orders of magnitude larger, ray tracing has a lot of noise or blank spots). Unless AMD throws a lot more processing at cleaning up the render, they would also need to use AI to clean up the image.
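The "basically 1 TB/s" figure above follows from straightforward bus arithmetic: bus width times per-pin data rate. The per-pin rates below are commonly quoted round numbers, treated here as assumptions.

```python
# Rough HBM2 bandwidth arithmetic behind the "basically 1 TB/s" claim.
# Per-pin data rates are commonly quoted approximate figures.

def hbm2_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Aggregate bandwidth = bus width (bits) * per-pin rate (Gb/s) / 8."""
    return bus_width_bits * gbps_per_pin / 8.0

# Vega 64: two HBM2 stacks, 2048-bit bus, ~1.89 Gb/s per pin
print(hbm2_bandwidth_gbs(2048, 1.89))  # ~484 GB/s

# 7nm Vega (Instinct): four stacks, 4096-bit bus, 2.0 Gb/s per pin
print(hbm2_bandwidth_gbs(4096, 2.0))   # 1024 GB/s, i.e. ~1 TB/s
```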
 
For a GPU, I can see Infinity Fabric being deployed for ray tracing cores or other special-purpose cores, breaking up the different chip functions. Upgrades to any of them would be much simpler than redoing a whole large chip. Infinity Fabric in Vega is very interesting and speaks loudly to its bandwidth potential. The Infinity Fabric connects the high-bandwidth cache controller and HBM2 memory to the rest of the chip, meaning Infinity Fabric is just as fast as HBM2. The four-HBM2-stack Instinct 7nm Vegas will have double the bandwidth potential with Infinity Fabric: basically 1 TB/s. Plus the large, wide 4096-bit bus allows memory transfers for a multitude of different functions (via the high-bandwidth cache controller) in the same clock cycle, preventing or limiting latency for different parts of the chip needing to send or receive data.

What Nvidia has over AMD is AI for denoising the rendered ray-traced lighting samples (since the number of rays in real life is orders of magnitude larger, ray tracing has a lot of noise or blank spots). Unless AMD throws a lot more processing at cleaning up the render, they would also need to use AI to clean up the image.
What Nvidia has over AMD is a boatload of work ahead of them before games will run all of those features exclusive to Nvidia hardware :)
 
For a GPU, I can see Infinity Fabric being deployed for ray tracing cores or other special-purpose cores, breaking up the different chip functions. Upgrades to any of them would be much simpler than redoing a whole large chip. Infinity Fabric in Vega is very interesting and speaks loudly to its bandwidth potential. The Infinity Fabric connects the high-bandwidth cache controller and HBM2 memory to the rest of the chip, meaning Infinity Fabric is just as fast as HBM2. The four-HBM2-stack Instinct 7nm Vegas will have double the bandwidth potential with Infinity Fabric: basically 1 TB/s. Plus the large, wide 4096-bit bus allows memory transfers for a multitude of different functions (via the high-bandwidth cache controller) in the same clock cycle, preventing or limiting latency for different parts of the chip needing to send or receive data.

What Nvidia has over AMD is AI for denoising the rendered ray-traced lighting samples (since the number of rays in real life is orders of magnitude larger, ray tracing has a lot of noise or blank spots). Unless AMD throws a lot more processing at cleaning up the render, they would also need to use AI to clean up the image.

No. Infinity Fabric isn't magic. It's just the latest version of HyperTransport, and off-chip it tends to run on standard-speed PCIe links. It's just an interconnect; it isn't any faster than any other company's interconnects. But give it a cool name and many think it now solves all problems.

Breaking a GPU into pieces makes latency higher and is largely unworkable.

There have been a couple of papers on MCM GPUs, but look closer and you find that these are compute solutions, not gaming solutions.

Compute is not as latency-sensitive as gaming.

For gaming, multi-die GPUs will likely remain identical, fully capable dies, each with its own memory pool, running in AFR. Basically the same SLI/CF we have had for years, with the same pros/cons we have had for years.
 
No. Infinity Fabric isn't magic. It's just the latest version of HyperTransport, and off-chip it tends to run on standard-speed PCIe links. It's just an interconnect; it isn't any faster than any other company's interconnects. But give it a cool name and many think it now solves all problems.

Breaking a GPU into pieces makes latency higher and is largely unworkable.

There have been a couple of papers on MCM GPUs, but look closer and you find that these are compute solutions, not gaming solutions.

Compute is not as latency-sensitive as gaming.

For gaming, multi-die GPUs will likely remain identical, fully capable dies, each with its own memory pool, running in AFR. Basically the same SLI/CF we have had for years, with the same pros/cons we have had for years.

Well, let's say you could do something like one "normal" GPU and two ray-tracing GPUs on one board. That could work out well depending on the size and how well the communication between the GPUs can be handled regarding memory.
With Mantle we've seen that you can delegate tasks to another GPU (or compute source). That means you don't have to stick to AFR or SFR.
 
AMD shocked the CPU world when they released the epic Threadripper 2990WX. This 32-core, 64-thread beast crushed all Intel CPUs in multithreaded applications, most notably Cinebench.

What makes the 2990WX possible is the technology known as Infinity Fabric.

Put down the Kool-Aid, my man. Games != Cinebench R15. Intel still holds the IPC crown, which is important for games. The money-is-no-object Intel CPUs still outperform the 2990WX. Now, that's not to say the chip doesn't have a place where it shines, but this post is about a next-generation Crossfire setup and gaming, and better scaling will help, but it won't bridge the gap, because the gap isn't GPU perf but CPU perf.
 
Put down the Kool-Aid, my man. Games != Cinebench R15. Intel still holds the IPC crown, which is important for games. The money-is-no-object Intel CPUs still outperform the 2990WX. Now, that's not to say the chip doesn't have a place where it shines, but this post is about a next-generation Crossfire setup and gaming, and better scaling will help, but it won't bridge the gap, because the gap isn't GPU perf but CPU perf.

You mean Intel, the company that deemed you only worthy of 4 cores and, if you paid enough, 8 threads?
Single-handedly stopped progress in software as well as hardware.
Besides taking your money, those are the only things Intel was really good at...
 
No. Infinity Fabric isn't magic. It's just the latest version of HyperTransport, and off-chip it tends to run on standard-speed PCIe links. It's just an interconnect; it isn't any faster than any other company's interconnects. But give it a cool name and many think it now solves all problems.

Breaking a GPU into pieces makes latency higher and is largely unworkable.

There have been a couple of papers on MCM GPUs, but look closer and you find that these are compute solutions, not gaming solutions.

Compute is not as latency-sensitive as gaming.

For gaming, multi-die GPUs will likely remain identical, fully capable dies, each with its own memory pool, running in AFR. Basically the same SLI/CF we have had for years, with the same pros/cons we have had for years.
Where do you get "running at standard speeds" from? Infinity Fabric in Vega runs at HBM2 speeds and transfer rates, not PCIe rates.

Even Ryzen's Infinity Fabric runs at DDR4 memory speeds and rates, not PCIe rates. Infinity Fabric is the real dope.
 
It's faster than PCIe 3, but about the same as PCIe 4.

Less than half of NVLink, and NVLink is insufficient to treat the memory pools on the GPUs as one big memory space.
 
It's faster than PCIe 3, but about the same as PCIe 4.

Less than half of NVLink, and NVLink is insufficient to treat the memory pools on the GPUs as one big memory space.

Right, inexpensive inter-device communication standards will always be ten times slower than local memory. Film at eleven.

If you managed to build something fast enough to make this work in theory, the cost of it alone would eclipse the yield savings from splitting the die in two.

It's not really all that much faster than PCIe 3.0 anyway: it's 10.6 GT/s per pin, while PCIe 3.0 is 8.0. Those saying it's closer to PCIe 4.0 are wrong. That's 21.2 GB/s aggregate for an Infinity Fabric x16 slot (PCIe 4.0 = 32 GB/s).

And the expensive options are... well, expensive, and aimed at supercomputers. InfiniBand tops out at 40 Gbps, but that's still too slow to act as local memory for a top-end GPU.

AMD NEEDED Infinity Fabric to make the Ryzen 4-core CCX work in desktops (8-core, 16-core) and servers (32-core). The rest of the world does NOT NEED Infinity Fabric.
 
It's faster than PCIe 3, but about the same as PCIe 4.

Less than half of NVLink, and NVLink is insufficient to treat the memory pools on the GPUs as one big memory space.

Much faster than PCIe 4:

PCIe 4 is ~32 GB/s; Infinity Fabric in the 2950X is 50 GB/s between the two dies, per the first-post image. That is with a memory speed of 2666 MHz; faster memory increases that speed proportionally to the speed increase. 3200 MHz would give chip-to-chip communication 60 GB/s, 3466 -> 65 GB/s, and so on, which would be double the bandwidth of PCIe 4 x16.

Now, talking about Vega's Infinity Fabric: it has to be able to run at HBM2 data rates, otherwise it would become a big bottleneck and it would be pointless to use HBM2 faster than what the Infinity Fabric could transfer. HBM2 is connected to the memory cache controller, which is connected to the rest of the chip via Infinity Fabric.

Infinity Fabric is indeed the cat's meow.
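The proportional-scaling claim in this post can be sketched numerically. The 50 GB/s at DDR4-2666 baseline is the figure quoted in the post itself, taken here as an assumption.

```python
# Linear scaling of Infinity Fabric link bandwidth with memory transfer
# rate, using the 50 GB/s @ 2666 MT/s baseline quoted in this post.

def if_bandwidth_gb_s(mem_mt_s, base_bw=50.0, base_mt_s=2666.0):
    """Scale the quoted baseline bandwidth linearly with memory rate."""
    return base_bw * mem_mt_s / base_mt_s

print(round(if_bandwidth_gb_s(3200), 1))  # ~60.0 GB/s
print(round(if_bandwidth_gb_s(3466), 1))  # ~65.0 GB/s
```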
 
Much faster than PCIe 4:

PCIe 4 is ~32 GB/s; Infinity Fabric in the 2950X is 50 GB/s between the two dies, per the first-post image. That is with a memory speed of 2666 MHz; faster memory increases that speed proportionally to the speed increase. 3200 MHz would give chip-to-chip communication 60 GB/s, 3466 -> 65 GB/s, and so on, which would be double the bandwidth of PCIe 4 x16.

Now, talking about Vega's Infinity Fabric: it has to be able to run at HBM2 data rates, otherwise it would become a big bottleneck and it would be pointless to use HBM2 faster than what the Infinity Fabric could transfer. HBM2 is connected to the memory cache controller, which is connected to the rest of the chip via Infinity Fabric.

Infinity Fabric is indeed the cat's meow.


You have no clue what you are talking about.

[Image: AMD IF IFOP link example diagram]


See that 32-bit link between cores? That means your per-pin transfer bandwidth is just 50 GB/s × 8 bits/byte = 400 Gbps; 400 Gbps / 32 bits = 12.5 GT/s per pin.


THAT IS HALFWAY BETWEEN PCIE 3 AND PCIE 4.
When you're on-package, it's easier to add more lanes than when you're out on a motherboard. So they double the lanes, and pretend they have something unique through press releases.

https://en.wikichip.org/wiki/amd/infinity_fabric

It's just HyperTransport 3.1 with a new brand name.

[Image: Clipboard01.png]


50 GB/s bidirectional bandwidth from a 32-bit connection feels awfully familiar.

It was sorta abandoned after Llano made a high-speed interconnect to the integrated northbridge graphics unnecessary.
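The per-pin arithmetic from the post above, spelled out as a quick check. The 50 GB/s figure is the bidirectional IFOP link bandwidth quoted in the slide the post references.

```python
# Per-pin transfer rate from aggregate link bandwidth and link width.

def per_pin_gt_s(link_gb_s, link_width_bits):
    """Per-pin rate = link bandwidth (GB/s) * 8 bits/byte / width (bits)."""
    return link_gb_s * 8.0 / link_width_bits

print(per_pin_gt_s(50, 32))  # 12.5 GT/s per pin

# For comparison, per-pin raw signaling rates:
# PCIe 3.0: 8.0 GT/s, PCIe 4.0: 16.0 GT/s; 12.5 sits between them.
```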
 
You have no clue what you are talking about.

View attachment 108453

See that 32-bit link between cores? That means your per-pin transfer bandwidth is just 50 GB/s × 8 bits/byte = 400 Gbps; 400 Gbps / 32 bits = 12.5 GT/s per pin.


THAT IS HALFWAY BETWEEN PCIE 3 AND PCIE 4.
When you're on-package, it's easier to add more lanes than when you're out on a motherboard. So they double the lanes, and pretend they have something unique through press releases.

https://en.wikichip.org/wiki/amd/infinity_fabric

It's just HyperTransport 3.1 with a new brand name.

View attachment 108591

50 GB/s bidirectional bandwidth from a 32-bit connection feels awfully familiar.

It was sorta abandoned after Llano made a high-speed interconnect to the integrated northbridge graphics unnecessary.
WTF, I never said anything about HyperTransport??? GAC (Get A Clue). Second, Vega: way more than Ryzen. Third, Infinity Fabric runs at RAM speed, which for DDR4 is half the rated speed, so 2666 MHz -> 1333 MHz actual, which does not look like HyperTransport at all.
 
So instead of acknowledging my awesome interface document AND accompanying math that you can't counter, you're just going to distract the entire fucking conversation with my relatively unimportant mention of HyperTransport?

Real pro move there, noko.

Of course you can use whatever fucking reference clock source you want to drive a bus. HyperTransport is no exception.

https://en.wikipedia.org/wiki/HyperTransport

Infinity Fabric is a superset of HyperTransport announced by AMD in 2016 as an interconnect for its GPUs and CPUs. It is also usable as an interchip interconnect for communication between CPUs and GPUs.[6][7] The company said Infinity Fabric would scale from 30 GB/s to 512 GB/s, and be used in the Zen-based CPUs and Vega GPUs which were subsequently released in 2017.

None of the currently shipping AMD devices has more than 50 GB/s bidirectional bandwidth IN ANY OFFICIAL DOCUMENTATION, so that "up to 512 GB/s bandwidth" looks like "available in 2035." Like PCIe 6.0.

TODAY, THE HIGHEST SHIPPING BANDWIDTH OF 50 GB/s FOR 32-BIT BIDIRECTIONAL INFINITY FABRIC USED FOR ZEN IS THE SAME AS THE PEAK BANDWIDTH FOR HYPERTRANSPORT 3.1 FROM 2008.
 
So instead of acknowledging my awesome interface document AND accompanying math that you can't counter, you're just going to distract the entire fucking conversation with my relatively unimportant mention of HyperTransport?

Real pro move there, noko.

Of course you can use whatever fucking reference clock source you want to drive a bus. HyperTransport is no exception.

https://en.wikipedia.org/wiki/HyperTransport



None of the currently shipping AMD devices has more than 50 GB/s bidirectional bandwidth IN ANY OFFICIAL DOCUMENTATION, so that "up to 512 GB/s bandwidth" looks like "available in 2035." Like PCIe 6.0.

TODAY, THE HIGHEST SHIPPING BANDWIDTH OF 50 GB/s FOR 32-BIT BIDIRECTIONAL INFINITY FABRIC USED FOR ZEN IS THE SAME AS THE PEAK BANDWIDTH FOR HYPERTRANSPORT 3.1 FROM 2008.
And how does that equate to Vega's Infinity Fabric? My argument was based on PCIe speeds from a previous post, which it is much faster than. Amazing that AMD would improve upon their own technology; you're so observant. Moving on.

Here is speculation on Vega 7nm. Interesting video, but the product would have to be viable, sellable, and cost-efficient in the end. AMD, I am sure, is looking at using multiple chips as efficiently as they did with CPUs:

 
And how does that equate to Vega's Infinity Fabric? My argument was based on PCIe speeds from a previous post, which it is much faster than. Amazing that AMD would improve upon their own technology; you're so observant. Moving on.

Here is speculation on Vega 7nm. Interesting video, but the product would have to be viable, sellable, and cost-efficient in the end. AMD, I am sure, is looking at using multiple chips as efficiently as they did with CPUs:

You are missing the point.

NVLink is TWICE as fast as IF, and it doesn't get rid of SLI. Something half NVLink's speed won't get rid of Crossfire.

Multi-die CPUs and GPUs are completely different situations. The bandwidth and latency requirements of multi-CPU are many times lower than for multi-GPU (gaming).

The pain and expense of creating a massive-bandwidth, low-latency connection between GPU dies (enough that different dies can work on the same memory buffers) makes it a non-starter for multi-chip GPUs.

With multi-GPU, we are stuck with the SLI/Crossfire pros/cons even if you connect them with IF.

It's a common but naive assumption that what works for CPUs will also work for GPUs.

If that were true, we would NOT still have issues with CF/SLI when multi-CPU has worked without issues for decades.

IF doesn't bring anything new to this table. We are stuck with separate GPUs working on their own independent memory pools, needing some kind of CF/SLI software to handle it, usually by delivering AFR.
 
You are missing the point.

NVLink is TWICE as fast as IF, and it doesn't get rid of SLI. Something half NVLink's speed won't get rid of Crossfire.

Multi-die CPUs and GPUs are completely different situations. The bandwidth and latency requirements of multi-CPU are many times lower than for multi-GPU (gaming).

The pain and expense of creating a massive-bandwidth, low-latency connection between GPU dies (enough that different dies can work on the same memory buffers) makes it a non-starter for multi-chip GPUs.

With multi-GPU, we are stuck with the SLI/Crossfire pros/cons even if you connect them with IF.

It's a common but naive assumption that what works for CPUs will also work for GPUs.

If that were true, we would NOT still have issues with CF/SLI when multi-CPU has worked without issues for decades.

IF doesn't bring anything new to this table. We are stuck with separate GPUs working on their own independent memory pools, needing some kind of CF/SLI software to handle it, usually by delivering AFR.
I agree with much of what you said, but we are not necessarily talking about just GPUs but also special dedicated processing engines such as ray tracing, physics, and denoising, where this would not be an issue. As funds increase, I am sure AMD is exploring everything possible. Latency is the killer so far for multi-die GPUs: internal GPU bandwidth is much higher than HBM2 bandwidth, but more importantly, internal latency is much lower. Anyway, I would not rule out separately designed chips to aid the GPU in very heavy processing scenarios.
 
And how does that equate to Vega's Infinity Fabric? My argument was based on PCIe speeds from a previous post, which it is much faster than. Amazing that AMD would improve upon their own technology; you're so observant. Moving on.

Okay then, let me destroy your one remaining argument.

PCIe 3.0 has 16 GB/s in EACH DIRECTION. That's 32 GB/s of BIDIRECTIONAL BANDWIDTH for a 16-bit connection, or 64 GB/s for a 32-bit connection.

Infinity Fabric has 50 GB/s bidirectional bandwidth on a 32-bit interface (you saw the slide I posted above). That is SLOWER THAN PCIE 3.0 for the same bus width.


PCIe 4.0 offers a transfer rate of 16 GT/s with flexible lane-width configurations. That is double the raw bitrate of PCIe 3.0 and more than triple the bitrate of PCIe 2.0. In full duplex mode, that translates to around 64 GB/s of bidirectional x16 bandwidth, whereas PCIe 3.0 topped out at around 32 GB/s, and PCIe 2.0 at 16 GB/s. Going back to PCIe 1.0, that spec hit a ceiling of just 8 GB/s.
Read more at https://hothardware.com/news/pci-ex...rectional-bandwidth-64gbs#UBQ9FW8cOkvPsEc3.99

You posted only the unidirectional bandwidth for PCIe, which is why it sounds so slow. It has the same independent receive and transmit bus design as Infinity Fabric.

Infinity Fabric is nothing special as an interconnect. If the rest of the world were unable to create a fast internal interconnect, then AMD's GPUs would outperform Nvidia's.

You can't create a successful high-end GPU without managing massive amounts of bandwidth. The fact that AMD is still in last place says something about their solution.

Maybe you should stop getting distracted by Press Releases (TM), and point me to where this actually helps Vega overcome the GTX 1080?

Vega die size = 477 mm²
GTX 1080 die size = 314 mm²

Nvidia manages better gaming performance for a lower build price. AMD has a raw spec advantage (40% higher on compute and texture, and 50% higher on memory bandwidth!), but the fact that they can't turn that into better gaming performance says a lot about the architecture and drivers.
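The PCIe numbers being argued over can be derived from transfer rate, lane count, and line encoding (128b/130b is the spec encoding for PCIe 3.0 and 4.0). A quick sketch:

```python
# PCIe x16 bandwidth from per-lane signaling rate and encoding
# efficiency. 128b/130b encoding applies to PCIe 3.0 and 4.0.

def pcie_x16_gb_s(gt_s, enc_eff, lanes=16, bidirectional=False):
    """Per-direction (or bidirectional) bandwidth of a PCIe link in GB/s."""
    gb_s = gt_s * enc_eff * lanes / 8.0
    return gb_s * (2 if bidirectional else 1)

# PCIe 3.0: 8 GT/s per lane
print(round(pcie_x16_gb_s(8, 128 / 130), 1))                       # 15.8 GB/s each way
print(round(pcie_x16_gb_s(8, 128 / 130, bidirectional=True), 1))   # 31.5 GB/s both ways
# PCIe 4.0: 16 GT/s per lane
print(round(pcie_x16_gb_s(16, 128 / 130, bidirectional=True), 1))  # 63.0 GB/s both ways
```

This is why quoting one direction (≈16 GB/s for 3.0 x16) versus both directions (≈32 GB/s) leads to the factor-of-two confusion in the thread.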
 
I agree with much of what you said, but we are not necessarily talking about just GPUs but also special dedicated processing engines such as ray tracing, physics, and denoising, where this would not be an issue. As funds increase, I am sure AMD is exploring everything possible. Latency is the killer so far for multi-die GPUs: internal GPU bandwidth is much higher than HBM2 bandwidth, but more importantly, internal latency is much lower. Anyway, I would not rule out separately designed chips to aid the GPU in very heavy processing scenarios.

I am certain you won't see physics or ray-tracing/denoising dies either.

Multi-chip is only easy at a couple of boundary points where the units are more independent and have lower bandwidth/latency requirements: so CPU-CPU, and CPU-GPU.

So other than what we have seen, I expect the next most likely multi-chip arrangement from AMD will be some kind of CPU-GPU split package, like Kaby-G. Though that really isn't groundbreaking, and might wait for AM5, with perhaps a bigger package to make CPU-GPU-HBM possible under one IHS. So an AM5 socket could have:

APU die.
CPU-only die.
CPU+GPU+HBM dies.

But I don't think they can fit that in AM4, though they could do something more like Kaby-G for laptops.
 
Communication speed becomes an issue with distance; the shorter it is, the easier it is to implement. I believe latency is a far bigger issue than bandwidth when discussing Infinity Fabric. Latency is usually a far bigger killer in high-speed communication than bandwidth.
 
Communication speed becomes an issue with distance; the shorter it is, the easier it is to implement. I believe latency is a far bigger issue than bandwidth when discussing Infinity Fabric. Latency is usually a far bigger killer in high-speed communication than bandwidth.

When talking GPUs, it is both. The RTX 2080 Ti has over 600 GB/s of memory bandwidth. That is more than 10 times IF bandwidth; IF bandwidth is more comparable to a GT 1030's. It just isn't in the same ballpark as a modern mid-range or above GPU.
 
When talking GPUs, it is both. The RTX 2080 Ti has over 600 GB/s of memory bandwidth. That is more than 10 times IF bandwidth; IF bandwidth is more comparable to a GT 1030's. It just isn't in the same ballpark as a modern mid-range or above GPU.

That means nothing when you're talking ns vs. ms. Also, memory is dog slow compared to using on-chip cache, and in fact you do everything possible to avoid hitting main memory, because you will stall out while waiting on that data. https://www.extremetech.com/extreme...-why-theyre-an-essential-part-of-modern-chips
 
Okay then, let me destroy your one remaining argument.

PCIe 3.0 has 16 GB/s in EACH DIRECTION. That's 32 GB/s of BIDIRECTIONAL BANDWIDTH for a 16-bit connection, or 64 GB/s for a 32-bit connection.

Infinity Fabric has 50 GB/s bidirectional bandwidth on a 32-bit interface (you saw the slide I posted above). That is SLOWER THAN PCIE 3.0 for the same bus width.




You posted only the unidirectional bandwidth for PCIe, which is why it sounds so slow. It has the same independent receive and transmit bus design as Infinity Fabric.

Infinity Fabric is nothing special as an interconnect. If the rest of the world were unable to create a fast internal interconnect, then AMD's GPUs would outperform Nvidia's.

You can't create a successful high-end GPU without managing massive amounts of bandwidth. The fact that AMD is still in last place says something about their solution.

Maybe you should stop getting distracted by Press Releases (TM), and point me to where this actually helps Vega overcome the GTX 1080?

Vega die size = 477 mm²
GTX 1080 die size = 314 mm²

Nvidia manages better gaming performance for a lower build price. AMD has a raw spec advantage (40% higher on compute and texture, and 50% higher on memory bandwidth!), but the fact that they can't turn that into better gaming performance says a lot about the architecture and drivers.
I used single-direction bandwidth for PCIe 3 because 50 GB/s is the single-direction bandwidth for CCX to CCX; it is also bidirectional, and would be 100 GB/s going by bidirectional rates. Looks like the 50 GB/s bandwidth is from faster RAM than 2666 MHz:

https://thetechaltar.com/amd-ryzen-clock-domains-detailed/ said:
The Data Fabric is responsible for the core's communication with the memory controller, and more importantly, inter-CCX communication. As previously explained, AMD's Ryzen is built in modular blocks called CCXs, each containing four cores and its own bank of L3 cache. An 8-core chip like Ryzen contains two of these. In order for CCX-to-CCX communication to take place, such as when a core from CCX 0 attempts to access data in the L3 cache of CCX 1, it has to do so through the Data Fabric. Assuming a standard 2667 MT/s DDR4 kit, the Data Fabric has a bandwidth of 41.6 GB/s in a single direction, or 83.2 GB/s when transferring in both directions. This bandwidth has to be shared between both inter-CCX communication and DRAM access, quickly creating data contention whenever a lot of data is being transferred from CCX to CCX at the same time as reading or writing to and from memory.

For Vega, Infinity Fabric, which is scalable, scaled to handle HBM2 bandwidth of 484 GB/s. For Vega 7nm with 4 stacks of HBM2, it will scale to double that. AMD is rather tight-lipped on the specifics of how that was achieved. For Ryzen, Infinity Fabric is roughly double the DDR4 RAM speed capability, so it did not have to be some higher number; on Vega, its bandwidth capability is huge.
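The quoted Data Fabric figures can be sanity-checked under the common assumption that the fabric clock is half the DDR4 transfer rate and the fabric moves 32 bytes per fabric clock:

```python
# Data Fabric single-direction bandwidth from memory transfer rate,
# assuming fabric clock = DDR4 rate / 2 and 32 bytes per fabric clock.

def data_fabric_gb_s(mem_mt_s, bytes_per_clock=32):
    fabric_mhz = mem_mt_s / 2.0                       # DDR4-2666 -> 1333 MHz
    return fabric_mhz * 1e6 * bytes_per_clock / 1e9   # GB/s, single direction

print(round(data_fabric_gb_s(2666), 1))  # ~42.7 GB/s (article quotes ~41.6)
print(round(data_fabric_gb_s(3200), 1))  # ~51.2 GB/s
```

The small gap between the computed ~42.7 GB/s and the article's 41.6 GB/s likely comes from the exact clock figure used; the scaling-with-memory-speed behavior is the point.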
 
I am certain you won't see physics or ray-tracing/denoising dies either.

Multi-chip is only easy at a couple of boundary points where the units are more independent and have lower bandwidth/latency requirements: so CPU-CPU, and CPU-GPU.

So other than what we have seen, I expect the next most likely multi-chip arrangement from AMD will be some kind of CPU-GPU split package, like Kaby-G. Though that really isn't groundbreaking, and might wait for AM5, with perhaps a bigger package to make CPU-GPU-HBM possible under one IHS. So an AM5 socket could have:

APU die.
CPU-only die.
CPU+GPU+HBM dies.

But I don't think they can fit that in AM4, though they could do something more like Kaby-G for laptops.
AMD is doing really well incorporating ProRender into multiple professional ray-tracing packages; I use one, MODO. Two Vega FEs do real-time ray tracing (not like Nvidia's), but it is fast enough for preview. AMD is talking about denoising; sure, not about how, but post-processing an image is very doable with a separate chip, just as going over a low-bandwidth PCIe 3 x8 bus with two Vega FEs gives about double the render rate for the ray-traced preview window. Having denoising on board, with some much faster Infinity Fabric configuration to clean up the noise in ray-traced images, I see as very possible, and with how Vega is made it would fit right in with all the other Infinity-Fabric-connected parts of the chip. This was directed at the professional market from AMD, so it probably does not apply to any upcoming gaming chip. If you can do physics on a CPU, then having a dedicated physics processor (not really needed now with the more powerful CPUs that are out) would be easy. As for RT cores or something similar from AMD? No clue, but they have been working with Microsoft on ray tracing as well.
 

Physics can be done on the CPU or the GPU, so what point is there in a separate physics chip? It isn't about what's possible in this case, just the lack of need.

Denoising would be possible using a separate chip. But there is simply no point in doing that, as it isn't the primary hardware speedup for ray tracing, and there are options for denoising without Tensor cores. IIRC BFV is NOT using the Tensor cores to denoise, so it's either CPU or normal GPU compute.
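As a concrete illustration that denoising is ordinary compute, here is a minimal sketch of a non-tensor denoiser: a plain 3x3 box filter in NumPy. Real-time denoisers such as SVGF are edge-aware and temporal, but they are the same kind of ALU work any shader core can do:

```python
import numpy as np

def box_denoise(img):
    """Average each pixel with its 3x3 neighborhood (edges replicated)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return out / 9.0

# Averaging a noisy image pulls pixel values toward their local mean.
noisy = np.ones((8, 8)) + np.random.default_rng(0).normal(0, 0.5, (8, 8))
print(box_denoise(noisy).shape)  # (8, 8)
```

Nothing here needs matrix-multiply hardware; a tensor unit only makes a *learned* (neural) denoiser faster.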

RT cores are the real speedup, and that has to be intimate with the geometry of the game world, so this one really is NOT suitable to be a separate chip.

So no specialized pieces of the GPU will break out into standalone dies.

We will basically just see CPU and GPU dies.
 
Considering that ProRender is about double the speed when ray tracing a scene with two Vega FEs going through a PCIe 3 x8 bus, I am pretty sure a form of RT cores added via Infinity Fabric would have no issue with any geometry transfer on changes to a scene. Basically, the geometry of a scene is already loaded in both Vega FEs and doesn't change; only the camera view changes. Only when you add an object or change the scene would you need to update the geometry or do soft-physics operations. Even multi-core CPUs scale almost linearly as cores are added, going through very slow memory interfaces compared to what Vega uses. A geometry cache accessible to the GPU and RT cores is another possibility. In any case, AMD could have multiple RT cores added.

AMD just has not said much about their future designs and additions. Were they blown away by Nvidia's ray tracing ability, or do they already have plans to do something similar or better? We just do not know. AMD may simply concentrate on offering ray tracing in professional tools rather than games, as they have been doing with ProRender. As for Nvidia's ray tracing, it currently looks too limited for games in general.
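The near-linear scaling claim can be sketched with a toy model (the sample counts and trace rates below are made-up numbers purely for illustration): because the scene is uploaded once and each card then traces an independent share of the samples, almost no per-frame traffic crosses the bus, so the speedup tracks the GPU count:

```python
# Toy model of preview ray tracing split across N GPUs.
def frame_time(samples, gpus, trace_rate_per_gpu, upload_s=0.0):
    # The scene upload happens once per scene change, not per frame,
    # so steady-state preview time is just samples / aggregate rate.
    return upload_s + samples / (gpus * trace_rate_per_gpu)

t1 = frame_time(1_000_000, 1, 2_000_000)  # one GPU:  0.5 s per refinement pass
t2 = frame_time(1_000_000, 2, 2_000_000)  # two GPUs: 0.25 s
print(t1 / t2)  # 2.0
```

This is why a slow link barely matters for progressive preview rendering; it would matter a great deal if geometry had to cross the link every frame.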

Nvidia provides developers a software kit to use Tensor cores for denoising. Maybe BFV is not using it because it is not that good in the end, or because BFV only uses ray tracing for limited things like reflections, which may not need much denoising.
https://blogs.nvidia.com/blog/2018/08/13/jensen-huang-siggraph-turing-quadro-rtx/ said:
An AI for Beauty
At the same time, the Turing architecture’s Tensor Cores — processors that accelerate deep learning training and inferencing — provide up to 500 trillion tensor operations a second. This, in turn, powers AI-enhanced features — such as denoising, resolution scaling and video re-timing — included in the NVIDIA NGX software development kit.

“At some point you can use AI or some heuristics to figure out what are the missing dots and how should we fill it all in, and it allows us to complete the frame a lot faster than we otherwise could,” Huang said, describing the new deep learning-powered technology stack that enables developers to integrate accelerated, enhanced graphics, photo imaging and video processing into applications with pre-trained networks.

“Nothing is more powerful than using deep learning to do that,” Huang said.

Tensor cores will be very interesting in terms of how developers use them; hopefully AMD has something similar (another chip :p).

Edit: here is AMD denoising blurb, should have done some searching first:
 

ProRender is not really real time. It's for single-image rendering, not games.

You can use a render farm for non real time rendering.

Real time (gaming) is a different issue, and has relatively large latency issues.

Real-time ray tracing calculations bounce light rays potentially off every single texturized triangle in the game world. So you need instant access to all the texturized polygons. It isn't practical for a gaming ray tracing chip to try to get this access through a port into the main chip at much higher latency and much lower bandwidth compared to the main GPU.

For gaming, the RT HW must be in the GPU.
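A rough budget illustrates the point (all the per-ray figures below are assumptions for illustration, not measurements): incoherent ray traversal touches scene data scattered through memory, and the resulting traffic dwarfs what an external link could carry:

```python
# Rough scene-traffic budget for real-time ray tracing.
rays_per_frame = 1920 * 1080 * 1   # assumed: 1 ray per pixel at 1080p
fetches_per_ray = 30               # assumed: BVH nodes + triangles touched
bytes_per_fetch = 64               # one cache line per fetch

traffic_gb = rays_per_frame * fetches_per_ray * bytes_per_fetch / 1e9
fps = 60
print(f"{traffic_gb * fps:.0f} GB/s of scene traffic")  # 239 GB/s

# PCIe 3.0 x16 peaks near ~16 GB/s per direction, so an off-package RT
# chip fed over such a link would starve; on-die access is the only way
# to feed traversal at game frame rates.
```

Even if the real fetch counts differ by 2-3x either way, the conclusion survives: the RT unit has to sit inside the GPU's memory system.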
 
Render-to-texture, lightmaps, shadow maps, and irradiance maps can all be updated from external sources and run in real time in a game engine. I do not see this as an issue for a mixed-rendering-method game engine. The full scene geometry for ray tracing could be used to calculate how light bounces off objects, with color values added once in the game engine from the textures or material values. Once you isolate the lighting values, that data can be imported into the GPU for lighting. Reflections can be calculated based on what objects/parts will fill the reflective surface, then rasterized.

Yes, ProRender is real time and updates as you move the viewpoint; quality is poor initially but rapidly improves as more rays are calculated, and it is real ray tracing. As for RTX, it is not pure ray tracing but a mixture of ray-tracing elements and rasterization. Is ProRender fast enough for a game? Hell no. The question is: is RTX fast enough for games without too many sacrifices?
 
While the full scene geometry for ray tracing could be used to calculate how light can bounce off of objects and add color values once in the game engine from the textures or material values. Once you isolate lighting values, that data can be imported into the GPU for the lighting. Reflections can be calculated on what objects/parts will fill the reflective surface then rasterized.

Not could, Must.

Imported? Nonsense. It MUST be calculated in the GPU, because that is where the full scene geometry and textures exist.

Do you think you are going to export the full scene geometry and textures to an RT chip. That makes no sense at all.
 
All right - what does the RT chip itself process? What is it speeding up? How much bandwidth/latency would be required? Since it is computation-heavy - a lot of calculation relative to the data moved - it should not need much.

Intel did this in a real game 10 years ago, at 720p, getting 14-29 fps doing ray tracing with 16 cores (4 CPUs, 4 sockets):
"On June 12, 2008 Intel demonstrated a special version of Enemy Territory: Quake Wars, titled Quake Wars: Ray Traced, using ray tracing for rendering, running in basic HD (720p) resolution. ETQW operated at 14-29 frames per second. The demonstration ran on a 16-core (4 socket, 4 core) Xeon Tigerton system running at 2.93 GHz"
https://en.wikipedia.org/wiki/Ray_tracing_(graphics)
So using socket-to-socket CPU speeds and DDR2 RAM speeds, the whole scene was ray traced on the CPUs and then dumped to the GPU - yet AMD cannot use Infinity Fabric, the High Bandwidth Cache Controller, HBM2, or GDDR6 to make it faster with a hardware RT core sitting right with the GPU? Of course they can. Not sure why you are arguing or what point you are trying to make. The question is more like: will AMD see any benefit in doing this? Will they do it for the professional market first (that is what I think) and then carry it to a gaming card, a console, or both? AMD is heavily into multi-chip designs, and I would not be too surprised if this is already working on a test bench - which does not mean it will make it to market. AMD pushing ray tracing hard and providing tools for it may have more reasons than are apparent.

 
I don't know how I missed out on this article.

It looks like AMD loves multi-GPU but only on the enterprise level.

AMD would have to make the multi-GPU setup appear as a single GPU to make it practical in a consumer card, given the lack of game-developer support.

In theory Infinity Fabric does allow multi-GPU designs, but the software support is lacking.

“To some extent you’re talking about doing CrossFire on a single package,” says Wang. “The challenge is that unless we make it invisible to the ISVs [independent software vendors] you’re going to see the same sort of reluctance."

So it appears true seamless multi-gpu is still dead in the water with Navi.

AMD’s Navi GPUs will start shipping next year in 7nm trim, but what level of performance Navi will deliver is still up for debate. Without an MCM setup it looks likely that we’re talking about a mainstream GPU, an RX 680 successor to the RX 580, rather than something that’s going to punch it out with Nvidia’s GTX 1180.

So if this holds true, Navi will ship in 2019 and still be slower than Nvidia's 2080 Ti, though slightly faster than the 1080 Ti. Kind of unexciting for brand-new 7nm technology. But it looks like AMD is not even aiming for the performance crown, and they know CrossFireX is so bad that multi-GPU is not even worth implementing, even as a halo product anymore.

The frustrating part is that the hardware in theory can do it, but they can't make it work seamlessly - or could they, if they put enough work into it?

So, is it possible to make an MCM design invisible to a game developer so they can address it as a single GPU without expensive recoding?

“Anything’s possible…” says Wang.
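What "invisible to the ISVs" might mean in practice can be sketched as a driver-side abstraction. Everything here is hypothetical - the class and method names are invented - but the idea is that the driver exposes one logical device and splits each dispatch itself, instead of asking the game to manage AFR/SFR as CrossFire does:

```python
# Hypothetical sketch: a driver presenting two physical GPUs as one
# logical device and splitting work split-frame style.
class VirtualGPU:
    def __init__(self, physical_gpus):
        self.gpus = physical_gpus

    def dispatch(self, workload, items):
        """Return a work plan: each physical GPU gets a contiguous slice."""
        n = len(self.gpus)
        chunk = (items + n - 1) // n  # ceiling division
        return [
            (gpu, workload, start, min(start + chunk, items))
            for gpu, start in zip(self.gpus, range(0, items, chunk))
        ]

plan = VirtualGPU(["gpu0", "gpu1"]).dispatch("shade_pixels", 100)
print(plan)  # [('gpu0', 'shade_pixels', 0, 50), ('gpu1', 'shade_pixels', 50, 100)]
```

The hard part, of course, is not the splitting but keeping memory coherent across the fabric fast enough that both slices can share one working set - which is exactly where the bandwidth arguments in this thread come in.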
 
The idea behind MCM was that you could have a good structure with, say, three or more chips, each with a small power footprint - say 60 W - which would make MCM viable if the performance of a 60 W GPU is enough.
The die size of these chips should be small so that yields are stellar and they are extremely cheap to produce.

Since AMD has not solved all of these problems, an MCM GPU solution will not benefit yet (they can't get the right power/performance). The idea is not that weird - it is what makes Ryzen (Zen cores) work so well.

Navi, according to Fudzilla, is not for the high end.

On the software side you can find material about Mantle supporting multiple GPUs as compute devices rather than via "Crossfire".


Until video cards keep running into stupidly large die sizes just to keep increasing performance, MCM is a long way away.
 

I believe the initial release of Navi will replace Polaris, and then they will release new high-end cards to replace Vega. Yields will likely demand that at first anyway.
 
It's faster than PCIe 3, but about the same as PCIe 4.

Less than half of NVLink, and NVLink is insufficient to treat the memory pools on the GPUs as one big memory space.
From what I understand, Vega's IF is 500 GB/s - far faster than the one for Zen or PCIe 4, so yes, it is a sort of special sauce. Keep in mind the die is about 2.2x the size of Zen.
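To put the disputed numbers side by side, here are approximate published peak figures (assumptions drawn from public specs, and the links are not directly comparable: PCIe and NVLink are chip-to-chip, while Vega's fabric feeds HBM2 on the same package):

```python
# Approximate peak bandwidths, GB/s. Figures are assumptions from public
# specs; NVLink 2.0 is the aggregate of six links, both directions.
links_gb_s = {
    "PCIe 3.0 x16": 15.8,
    "PCIe 4.0 x16": 31.5,
    "NVLink 2.0 (6 links)": 300.0,
    "Vega Infinity Fabric (on-package)": 484.0,
}
for name, bw in sorted(links_gb_s.items(), key=lambda kv: kv[1]):
    print(f"{name:34s} {bw:6.1f} GB/s")
```

This is why both posts above can be right at once: Zen's socket-to-socket fabric lands in PCIe 4 territory, while the on-package fabric feeding Vega's HBM2 is an order of magnitude faster.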
 