AMD Stacked V-Cache

DukenukemX · Jun 2, 2021

paradoxical said:
The guy is just delusional when it comes to understanding how large businesses actually work, and how much of a lead Apple has over literally every other silicon company in the world. Because x86. Or something.

Fix tax loop holes and we'll see how much money Apple will have then.

ChadD · Jun 2, 2021

DukenukemX said:
They don't have an L3 cache because most ARM devices don't have ram slots. The ram usually sits right next to the SoC. Putting ram closer to the SoC will decrease the need for an L3, though not entirely. ARM is not magically efficient. ARM still works on the same basic principals as x86. AMD's Vcache suggests that AMD isn't about to put ram next to their CPU's like Apple did. 192MB is a hella lot of cache to the point where it's practically ram. That is going to give AMD a huge performance boost while also having removable ram. Especially for APU's if AMD finally decides to put RDNA2 onto their APU's.

This is similar to Intel's eDram cache from Broadwell where it was fantastic in boosting performance, especially GPU performance but then Intel took it away. It was 128MB worth of the stuff, which is really good for something released in 2017. Cache is king, and you won't get far in CPU and APU performance without a good lot of it.

That isn't the only reason ARM chips don't need as much L3 however. Arm chips have drastically shorter pipes... which drastically reduces the penalty for running data back through a cycle. I mean the days of Pentium 4s insanely long pipes are gone saner heads dialed that back... however its still just reality that x86 is heavily penalized performance wise if it runs out of cache. Running out of cache on x86 means having to run a lot of things back through a 14-16 stage pipeline that it wouldn't have had to do if it enough cache.

As we where talking about before Arm designers are starting to do some very crazy things with L3. Newer Arm and Apple Arm chips as an example use L3 cache as a pool for all processing on the SOC including accelerators meaning the CPU can say hey AI accelerate process X Y and Z on this video stream and save the output to L3. Then as the software messes with that data further the CPU can just pull the processed information directly from L3. So you know I don't know the answer on more L3 for something like a M2 or M3 with 3d stacking. Perhaps with all the co processors on board a ton of L3 may actually make a big difference to performance.

On the core end though ARM has a 2 major advantages over x86 in regards to cache use. One is the basic shorter pipeline. Having to redo stuff isn't as penalizing on Arm vs x86 cause the pipelines are 60% the size. (then again x86 on the high end tends to have a clock speed advantage to nullify that somewhat). Arm also however has second gen big.Little... dynamIQ, Apple uses something like it but no doubt is in house developed. Where they really can use a big and Little core on the same single thread. Its possible for Arm chips to use little cores as shorter pipelines to recalculate small chunks. No doubt its still slower then using cache. However its a lot less punishing on performance then x86 maxing cache space. And again being Apple we are speculating somewhat... I assume Apple is using something like Arms DynamIQ it seems when ARM announces such things Apple had already built their in house version a few years before. (I'm going to guess Apple has been running something like DynamIQ since A7 or so). Also not sure if the small cores as assistance is something that even if Apple has it for sure doesn't have some caveats... Fujitsu as an example has their assistance cores in A64, however the Assistance cores can only be engaged when the the cores they are assisting are at the top speed frequency. (don't quote me I think its 2.0 GHZ... at that speed Fujitsu A64 cores can engage the unexposed 13th core on each cluster to speed single threaded operation)

JasonLD · Jun 2, 2021

DukenukemX said:
The ram usually sits right next to the SoC. Putting ram closer to the SoC will decrease the need for an L3, though not entirely.

There is no particular bandwidth or latency advantage with RAM sitting closer to SoC, it is more for the power savings and cost.

DukenukemX · Jun 2, 2021

JasonLD said:
There is no particular bandwidth or latency advantage with RAM sitting closer to SoC, it is more for the power savings and cost.

You need to understand the purpose of cache is to deal with latency. Putting things closer together reduces latency. Chiplets increase latency and therefore need a method to deal with it. That's where cache comes in, and the more of it you have then the better off you are. The physical distance from components does have a performance impact of devices.

ChadD said:
That isn't the only reason ARM chips don't need as much L3 however. Arm chips have drastically shorter pipes... which drastically reduces the penalty for running data back through a cycle. I mean the days of Pentium 4s insanely long pipes are gone saner heads dialed that back... however its still just reality that x86 is heavily penalized performance wise if it runs out of cache. Running out of cache on x86 means having to run a lot of things back through a 14-16 stage pipeline that it wouldn't have had to do if it enough cache.

Pipeline lengths effect things like clock speed and branch prediction. We're talking about the time needed to fetch data from ram, which is universal between x86 and ARM. The purpose of cache is to avoid going back to ram as often as possible. Cache is a lot faster than ram, both in latency and bandwidth. ARM doesn't make this less of a problem. Also in terms of needing cache, the ARM architecture needs it more than x86. Generally RISC needs cache otherwise performance tanks.

As we where talking about before Arm designers are starting to do some very crazy things with L3. Newer Arm and Apple Arm chips as an example use L3 cache as a pool for all processing on the SOC including accelerators meaning the CPU can say hey AI accelerate process X Y and Z on this video stream and save the output to L3.

This is basically eDram that Intel developed years ago, except it was also used for the GPU as well. ATI/AMD did this with the Xbox 360 back in 2005.

Then as the software messes with that data further the CPU can just pull the processed information directly from L3.

That's how L3 cache works on all CPU's. The branch prediction puts the most frequency used output onto the L2 and the least used on the L3. Everything else will have to be fetched from the ram. The very slow ram. Modern ram is getting too slow for modern CPU's and cache is the solution. Bandwidth isn't the only measurement of ram that is of concern, but latency.

ChadD · Jun 2, 2021

DukenukemX said:
Pipeline lengths effect things like clock speed and branch prediction. We're talking about the time needed to fetch data from ram, which is universal between x86 and ARM. The purpose of cache is to avoid going back to ram as often as possible. Cache is a lot faster than ram, both in latency and bandwidth. ARM doesn't make this less of a problem. Also in terms of needing cache, the ARM architecture needs it more than x86. Generally RISC needs cache otherwise performance tanks.

Yes I'm aware how how cache works. I am not suggesting Arm doesn't need cache. Still it doesn't benefit from massive amounts of cache as x86 does. If Arm has to run something again because it wasn't able to store it in faster cache yes it takes a performance hit but a much smaller hit then x86 does for the same. Its not that cache isn't important... I'm not not convinced Arm with a CD volume of cache available to it would really see any uplift vs what M1 and the like have for cache now. As AMD is showing x86 will benefit greatly from massive cache... but even with x86 there is a limit. If AMD was able to double the cache amounts again I'm sure the uplift would diminish until the point where the extra cache just sits idle anyway.

Red Falcon · Jun 2, 2021

JasonLD said:
There is no particular bandwidth or latency advantage with RAM sitting closer to SoC, it is more for the power savings and cost.

Nope, DukenukemX is correct in that the physical distance between RAM and the memory controller (SoC and/or CPU) will make a substantial difference in memory-access latency.
This was proven back in the 2000s when AMD moved the memory controller from the North Bridge to the CPU, and in the early 2010s on the original Raspberry Pi where the RAM chip is literally sitting with the SoC.

Nobu · Jun 2, 2021

ChadD said:
Yes I'm aware how how cache works. I am not suggesting Arm doesn't need cache. Still it doesn't benefit from massive amounts of cache as x86 does. If Arm has to run something again because it wasn't able to store it in faster cache yes it takes a performance hit but a much smaller hit then x86 does for the same. Its not that cache isn't important... I'm not not convinced Arm with a CD volume of cache available to it would really see any uplift vs what M1 and the like have for cache now. As AMD is showing x86 will benefit greatly from massive cache... but even with x86 there is a limit. If AMD was able to double the cache amounts again I'm sure the uplift would diminish until the point where the extra cache just sits idle anyway.

Assuming no performance lost from the increase in physical size and distance, it would increase up until you reached multiple gigabytes of cache. Unfortunately, physics (and cost) are a thing.

ChadD · Jun 2, 2021

DukenukemX said:
That's how L3 cache works on all CPU's. The branch prediction puts the most frequency used output onto the L2 and the least used on the L3. Everything else will have to be fetched from the ram. The very slow ram. Modern ram is getting too slow for modern CPU's and cache is the solution. Bandwidth isn't the only measurement of ram that is of concern, but latency.

I'm not sure you understand what I'm saying. CPUs have core complexes. Meaning for instance the first 8 cores on a AMD chip shares the same cache pool... the second up to 8 core core complex shares another pool of cache.

ARM SOC with DynamIQ and Apples chips since a7 or so. The L3 cache is not only shared with the CPU complex but also with the attached accelerators. Also since dIQ/a7 the big and little cores are on the same complex where each core has its own L1 and the core complex shares L2... meaning big and Little cores on modern Arm can work on the same data chunks and share the same L2 space. The L3 space in terms of ARM is actually very rarely used by the Arm cores themselves.... they mostly store data from AI / Video decoders ect. Its partly why M1s Audio and video editing (on properly M1 compiled software) is very very good. M1 is engaging with the accerlators like they almost on the actual CPU core complex. That is not how x86 based SOC work... as an example AMD x86 SOC GPUs are not writing directly to L3. If they did it would have some interesting implications for video editing on x86 SOC.... and you know this 3d stacking might actually have some massive implications for performance of next gen x86 SOC. The main reason sharing L3 with x86 cores and GPUs or accelerators would not be a great idea right now is it would just starve the actual cores of cache. However if AMD is now capable of including 3-4x the L3 cache on a SOC die. Well that might be a lot more interesting.

Red Falcon · Jun 2, 2021

ChadD said:
Yes I'm aware how how cache works. I am not suggesting Arm doesn't need cache. Still it doesn't benefit from massive amounts of cache as x86 does. If Arm has to run something again because it wasn't able to store it in faster cache yes it takes a performance hit but a much smaller hit then x86 does for the same. Its not that cache isn't important... I'm not not convinced Arm with a CD volume of cache available to it would really see any uplift vs what M1 and the like have for cache now. As AMD is showing x86 will benefit greatly from massive cache... but even with x86 there is a limit. If AMD was able to double the cache amounts again I'm sure the uplift would diminish until the point where the extra cache just sits idle anyway.

Pfff, cache is overrated, and I am totally fine with a whopping 2MB L2 cache on my AMD Jaguar.

I don't think adding more cache (assuming SRAM and not DRAM) would ever have anything but a positive effect on any CPU ISA or microarchitecture, going back to the beginning of CPUs.

Whatever either can be accessed or fit in the L1-L3 cache then needs to be accessed by the much slower DRAM, just as DukenukemX is saying, and that is definitely a performance hit.
Normally the lack of a large L2 cache and/or presence of L3/L4 cache is due to the cost of the CPU or SoC, not because the performance gains aren't there.

SRAM is extremely costly, and the silicone real estate can also be especially costly, so just simply adding more to a CPU or SoC could greatly increase the cost as well as the TDP.

ChadD · Jun 2, 2021

Nobu said:
Assuming no performance lost from the increase in physical size and distance, it would increase up until you reached multiple gigabytes of cache. Unfortunately, physics (and cost) are a thing.

Well what they are showing off now is basically a + chip... its the same old Zen 3 with just a bit of cache stacked on top. I have a feeling that with Zen 4 and beyond... they could probably arrange the cores on the chip in a different configuration to stack even more cache on top accounting for max efficiency in terms of heat distribution. Stacking the cache around the edges of the die or chunks of it in specific low heat spots of the die. This Zen3+ stacked thing is very cool... but I'm sure the tech will shine when its part of the initial design phase.

Lakados · Jun 2, 2021

primetime said:
whats stopping intel from using TSMC like everyone else i wonder?

Intel has TSMC making a lot of their products currently and have been for years.
"Intel has outsourced the production of about 15-20% of its non-CPU chips, with most of the wafer starts for these products assigned to TSMC and UMC, according to TrendForce’s latest investigations. While the company is planning to kick off mass production of Core i3 CPUs at TSMC’s 5nm node in 2H21, Intel’s mid-range and high-end CPUs are projected to enter mass production using TSMC’s 3nm node in 2H22." https://www.trendforce.com/presscenter/news/20210113-10651.html

Also, Intel's Ponte Vecchio GPU is currently being manufactured and shipping on Intel's 7nm, and TSMC's 5nm they are separated by SKUs but they supposedly perform near identically

Intel orders some 180,000 wafers from TSMC, where AMD currently gets around 240,000 wafers so Intel is a significant TSMC customer.

Lakados · Jun 2, 2021

Red Falcon said:
As for Intel, a slow sinking ship comes to mind...

Intel ships more CPU's a month than AMD makes CPU's in a year, Intel is fine. Their processes are hurting but they have 4 10nm fabs online now with a 5'th coming soon, busy shipping out Xeons as fast as they come off the line to OEM's while AMD has yet to deliver the EPYC's in any meaningful quantity. TSMC may have the process advantage and will for the foreseeable future but Intel still has some 5x their manufacturing capacity and is growing their capacity faster than TSMC is.

Lakados · Jun 2, 2021

DukenukemX said:
Fix tax loop holes and we'll see how much money Apple will have then.

Still a hell of a lot more than all the other guys doing the same things, just to a smaller scale.

Lakados · Jun 2, 2021

I'm hoping this is something they work into Threadripper, and future EPYC designs. As it currently stands I don't see how that large amount of cache could really benefit a consumer workload in a way that justifies the price increase.

chameleoneel · Jun 2, 2021

Lakados said:
I'm hoping this is something they work into Threadripper, and future EPYC designs. As it currently stands I don't see how that large amount of cache could really benefit a consumer workload in a way that justifies the price increase.

It should make AM4 pretty competitive with first gen Alder Lake and AM5 in some workloads. Especially latency dependent ones, such as gaming. So in other words, it gives people options.

Lakados · Jun 2, 2021

chameleoneel said:
It should make AM4 pretty competitive with first gen Alder Lake and AM5 in some workloads. Especially latency dependent ones, such as gaming. So in other words, it gives people options.

Yeah, there are a good number of simulation tasks as well that could see some benefits there as well that the Xeons currently crush in, I really want to see the 10nm Xeon W's face off against a new series of Threadrippers, I think that will be a closer fight than the current Desktop one because ick that is a beat down.

primetime · Jun 2, 2021

chameleoneel said:
It should make AM4 pretty competitive with first gen Alder Lake and AM5 in some workloads. Especially latency dependent ones, such as gaming. So in other words, it gives people options.

Even better if there buyable options unlike Zen 3 desk/mobile APUs

JasonLD · Jun 2, 2021

Red Falcon said:
Nope, DukenukemX is correct in that the physical distance between RAM and the memory controller (SoC and/or CPU) will make a substantial difference in memory-access latency.
This was proven back in the 2000s when AMD moved the memory controller from the North Bridge to the CPU, and in the early 2010s on the original Raspberry Pi where the RAM chip is literally sitting with the SoC.

Apple's M1 has higher latency (around 96ns) on DRAM compared to typical Zen 3(around 50-60ns). Physical distance would matter for Inter-chiplet or cache where it goes down to single digit ns, but it wouldn't matter that much for DRAM since it already has high enough latency to the point where physical distance wouldn't matter that much.

DukenukemX · Jun 2, 2021

ChadD said:
Well what they are showing off now is basically a + chip... its the same old Zen 3 with just a bit of cache stacked on top. I have a feeling that with Zen 4 and beyond... they could probably arrange the cores on the chip in a different configuration to stack even more cache on top accounting for max efficiency in terms of heat distribution. Stacking the cache around the edges of the die or chunks of it in specific low heat spots of the die. This Zen3+ stacked thing is very cool... but I'm sure the tech will shine when its part of the initial design phase.

The stacking makes sense as to reduce latency. Look at the die shot of a 5800X with all the cache in the middle. The cache is like more than half the CPU as well. Putting the cache on the edge of the CPU as shown is probably putting it on top of the cores, for better efficiency. Whether that's a Zen3 with V-Cache or Zen4 is a mystery we'll find out later this year. I'm also worried about heat since this seems like a similar issue to Vega 56/64 cards and getting an even surface to mount a cooler. I would probably avoid ever deliding one of these chips.

JasonLD said:
Apple's M1 has higher latency (around 96ns) on DRAM compared to typical Zen 3(around 50-60ns). Physical distance would matter for Inter-chiplet or cache where it goes down to single digit ns, but it wouldn't matter that much for DRAM since it already has high enough latency to the point where physical distance wouldn't matter that much.

Where you getting those numbers? Has Apple ever released those specs of their M1?

Red Falcon · Jun 2, 2021

JasonLD said:
Apple's M1 has higher latency (around 96ns) on DRAM compared to typical Zen 3(around 50-60ns). Physical distance would matter for Inter-chiplet or cache where it goes down to single digit ns, but it wouldn't matter that much for DRAM since it already has high enough latency to the point where physical distance wouldn't matter that much.

I don't agree that physical distance to DRAM doesn't make an impactful difference, as I remember articles in the mid-2000s to early-2010s clearly showing this very fact.
As for 96ns latency on the M1, looks like you are right, though it does still manage to get to 68.25GB/s, which is very impressive.

As for inter-chiplet communication, it does definitely make a difference, depending on the design, agreed.

ChadD · Jun 3, 2021

DukenukemX said:
The stacking makes sense as to reduce latency. Look at the die shot of a 5800X with all the cache in the middle. The cache is like more than half the CPU as well. Putting the cache on the edge of the CPU as shown is probably putting it on top of the cores, for better efficiency. Whether that's a Zen3 with V-Cache or Zen4 is a mystery we'll find out later this year. I'm also worried about heat since this seems like a similar issue to Vega 56/64 cards and getting an even surface to mount a cooler. I would probably avoid ever deliding one of these chips.
View attachment 362277

I agree ya potentially depending where they are located latency for sure could be improved. Every little bit helps.

I think come zen 4 perhaps even 5 the advantage if this tech may really shine. For now the Zen 3+ seems more a proof of concept. It seems like it will work and for sure be a gain. Looking forward to a Zen 4 design where they actually account for 3D traces. I believe when we see die shots of Zen 4 it will look very odd with cores not lined up in neat rows as we are used to seeing. I can see them spreading them out a bit to move the heat around the die.... with the traces going up. I could imagine a hexagon type shape ring where cache in the middle of the ring could occupy 2 layers while being equidistant to each core. With a dense 2 layer tight ring around the outside outlining the complex of perhaps L2 or even L1 cache as that would likely be per core cache and the distance to each core wouldn't matter. Then again I imagine that could have issues with fab defects. One advantage to the rows of cores probably is better yields as major defects tend to cluster. Anyway should be cool to see what they do with this stuff when they design for it and not just tack it on.

Lakados · Jun 3, 2021

I know Intel has been touting their version of this, but I don’t think I have seen it in a working demo. So I’m interested in seeing how they respond to this.

serpretetsky · Jun 3, 2021

Red Falcon said:
I don't agree that physical distance to DRAM doesn't make an impactful difference, as I remember articles in the mid-2000s to early-2010s clearly showing this very fact.
As for 96ns latency on the M1, looks like you are right, though it does still manage to get to 68.25GB/s, which is very impressive.

I have a feeling he may be right. As a reference, typical propagation delay in pcie traces is 160ps/inch. That would mean 1 foot of of extra pcie trace would add about 2ns of delay. However PCIe is differential signaling, and I'm not sure if dram signals would have signficantly different propogation delays. If it is similar, than 1 extra foot of traces would only add 2ns of delay to already existing 60-90ns.

edit: additional 2ns of propogation delay would actually add more like 4ns of latency (forgot to include return trip)

Lakados · Jun 3, 2021

I’m seeing various “news” articles claiming these will be on chips before the end of the year. I’m hopeful, but I’ll believe it when I see actual stock on something.

illli · Jun 3, 2021

Lakados said:
I’m seeing various “news” articles claiming these will be on chips before the end of the year. I’m hopeful, but I’ll believe it when I see actual stock on something.

there was an update to the anandtech article. My impression is this will be on AM4 and that means Zen 4 + AM5 will be pushed back to late 2022

In a call with AMD, we have confirmed the following:

This technology will be productized with 7nm Zen 3-based Ryzen processors. Nothing was said about EPYC.
Those processors will start production at the end of the year. No comment on availability, although Q1 2022 would fit into AMD's regular cadence.
This V-Cache chiplet is 64 MB of additional L3, with no stepped penalty on latency. The V-Cache is address striped with the normal L3 and can be powered down when not in use. The V-Cache sits on the same power plane as the regular L3.
The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration
As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.
The V-Cache is a single 64 MB die, and is relatively denser than the normal L3 because it uses SRAM-optimized libraries of TSMC's 7nm process, AMD knows that TSMC can do multiple stacked dies, however AMD is only talking about a 1-High stack at this time which it will bring to market.

Lakados · Jun 3, 2021

illli said:
there was an update to the anandtech article. My impression is this will be on AM4 and that means Zen 4 + AM5 will be pushed back to late 2022

In a call with AMD, we have confirmed the following:

This technology will be productized with 7nm Zen 3-based Ryzen processors. Nothing was said about EPYC.
Those processors will start production at the end of the year. No comment on availability, although Q1 2022 would fit into AMD's regular cadence.
This V-Cache chiplet is 64 MB of additional L3, with no stepped penalty on latency. The V-Cache is address striped with the normal L3 and can be powered down when not in use. The V-Cache sits on the same power plane as the regular L3.
The processor with V-Cache is the same z-height as current Zen 3 products - both the core chiplet and the V-Cache are thinned to have an equal z-height as the IOD die for seamless integration
As the V-Cache is built over the L3 cache on the main CCX, it doesn't sit over any of the hotspots created by the cores and so thermal considerations are less of an issue. The support silicon above the cores is designed to be thermally efficient.
The V-Cache is a single 64 MB die, and is relatively denser than the normal L3 because it uses SRAM-optimized libraries of TSMC's 7nm process, AMD knows that TSMC can do multiple stacked dies, however AMD is only talking about a 1-High stack at this time which it will bring to market.

I'm hoping this means a Threadripper announcement, but I doubt it.

primetime · Jun 3, 2021

Any theory's of applications or programs that might benefit from the added cache? Or people just gonna figure out a way to mine on these?

Ebernanut · Jun 3, 2021

primetime said:
Any theory's of applications or programs that might benefit from the added cache? Or people just gonna figure out a way to mine on these?

They're claiming a 15% improvement in gaming with it Which I'm sure is about the best case scenario. I have a hard time seeing how L3 would improve single core performance so it seems likely that it helps most in memory intensive applications or at least ones where the data being accessed by the cpu is getting swapped out frequently.

schmide · Jun 3, 2021

It does make you wonder. AMD's L3 latency went up, Zen 2 40, Zen 3 46. Maybe just to get better clocks. Or maybe extra logic to work with memory stacked on top of it.

Does the cache line increase to 192 or way to 48 ?

sleepeeg3 · Jun 4, 2021

Gideon said:
Pride and TSMC capacity.

primetime said:
whats stopping intel from using TSMC like everyone else i wonder?

They would be giving up the billions of dollars they have invested in their fabs to be at the mercy of a growing monopoly. If they can somehow get their technology back on track, their fabs would be a major resource to A. give them a competitive edge and B. offer as a service to others.

It's like you could resell all of your goods through Amazon, but by doing so, you know you will eventually pay more for everything, once they put their competitors out of business. Same thing is happening with TMSC/Intel. Intel still has it's own storefront and is working for itself.

DukenukemX · Jun 4, 2021

primetime said:
Any theory's of applications or programs that might benefit from the added cache? Or people just gonna figure out a way to mine on these?

Better IPC is the main reason for it. The more cache then the more branch prediction you can do, and therefore more IPC. The L3 cache holds outputted data that was used less often. If the CPU runs into the need for that data then instead of ram it goes to the L3 cache. Ram is slow even for the CPU. This is made even worse every time we upgrade our DDR memory. DDR2 has more bandwidth than DDR1 but has more latency. This trend continues with DDR3, DDR4, and soon DDR5. We're exchanging latency for bandwidth and cache is how we deal with this problem. Assuming that AMD does couple this with their APU's then you can boost graphics performance by a lot, as both AMD and Intel has demonstrated in the past.

AzixTGO · Jun 5, 2021

This doesn't sound like anything that needs to be unique to AMD. Its mainly a TSMC technology? Intel also has their own methods that could produce something like this I think. So its really just a question of decisions.

This should be good for AMD and getting more out of a refresh than might be expected otherwise. Guessing much of the processors in the refresh will be old zen 3 CPUs and the consumer will just have to figure what is what.

cdabc123 · Jun 5, 2021

AzixTGO said:
This doesn't sound like anything that needs to be unique to AMD. Its mainly a TSMC technology? Intel also has their own methods that could produce something like this I think. So its really just a question of decisions.

This should be good for AMD and getting more out of a refresh than might be expected otherwise. Guessing much of the processors in the refresh will be old zen 3 CPUs and the consumer will just have to figure what is what.

Its a complex process dependent on tsmc's capabilities and integration of that into amds architecture. Provided none of this ip is protected it is possible for other fabs and designs to utilize it however it's not as simple as "apple, intel, etc can use the same thing" there is a significant amount of ip and design work currently utilized exclusively by amd.

The process node also makes a huge difference in if this is possible.

I would be reluctant to claim any manufacture can use this as well without seeing them using it.

Axman · Jun 5, 2021

AzixTGO said:
This doesn't sound like anything that needs to be unique to AMD.

If memory serves both AMD and Intel have patents on it. They're different in how they're physically connected but I don't remember how.

Other companies could do it, too, but they'll either have to come up with their own connection system or license it. I wouldn't be surprised if AMD licenses it, but to who? Not a lot of game in the x86 space and ARM doesn't use cache the same way IIRC.

And I have to assume that licensing to Nvidia is a hard no. Assuming it can be beneficial to graphics. I have to imagine the answer to that is, probably?

Strange bird · Jun 5, 2021

Why did amd leave HBM/2 memory in gpu segment? It is a similar comparison,we only have gddr6 and hbm2 is a better memory in everything
Does anyone remember r9 fury x and amd r9 nano (I really wanted to have nano but I didn't)

cdabc123 · Jun 5, 2021

pticurina said:
Why did amd leave HBM/2 memory in gpu segment? It is a similar comparison,we only have gddr6 and hbm2 is a better memory in everything
Does anyone remember r9 fury x and amd r9 nano (I really wanted to have nano but I didn't)

hbm2 could be interesting in the apu space.

illli · Jun 5, 2021

obviously it is easy to say it from where I am sitting, but I always thought it would be interesting if AMD made an APU with like 4Gb of HBM on it, much like that collaboration they did with intel a couple years back (kaby lake G)

funkydmunky · Jun 6, 2021

Red Falcon said:
Pfff, cache is overrated, and I am totally fine with a whopping 2MB L2 cache on my AMD Jaguar.
I don't think adding more cache (assuming SRAM and not DRAM) would ever have anything but a positive effect on any CPU ISA or microarchitecture, going back to the beginning of CPUs.

Whatever either can be accessed or fit in the L1-L3 cache then needs to be accessed by the much slower DRAM, just as DukenukemX is saying, and that is definitely a performance hit.
Normally the lack of a large L2 cache and/or presence of L3/L4 cache is due to the cost of the CPU or SoC, not because the performance gains aren't there.

SRAM is extremely costly, and the silicone real estate can also be especially costly, so just simply adding more to a CPU or SoC could greatly increase the cost as well as the TDP.

You need to display sarcasm, yeah.....?
Do better please.

DukenukemX · Jun 6, 2021

Axman said:
Other companies could do it, too, but they'll either have to come up with their own connection system or license it. I wouldn't be surprised if AMD licenses it, but to who? Not a lot of game in the x86 space and ARM doesn't use cache the same way IIRC.

There aren't many methods to increase IPC, and cache is one of them. Intel was already going to do this themselves but they were busy snorting cocaine off Apple until Apple dumped their 14nm ass. ARM does utilize cache the same way but since most ARM CPU's are cheap SoC's sitting in phones then it doesn't benefit most of the ARM industry to dump that kind of money into something that'll end up in mobile devices. Apple and Nvidia might be interested since both are making ARM CPU's outside of tablets and smart phones. If you want to compete in that level of performance then you'll want similar amounts of cache like AMD's V-Cache.

And I have to assume that licensing to Nvidia is a hard no. Assuming it can be beneficial to graphics. I have to imagine the answer to that is, probably?

Technically it was done before with the PowerPC Xbox 360 and x86 Xbox One. If you have a CPU and GPU sharing the same memory then large amounts of cache is beneficial. Nvidia is definitely looking to get into ARM+Nvidia graphics in their future, so they would want this.

illli said:
obviously it is easy to say it from where I am sitting, but I always thought it would be interesting if AMD made an APU with like 4Gb of HBM on it, much like that collaboration they did with intel a couple years back (kaby lake G)

Would also be interesting if AMD put what's in the PS5 into laptops and desktop motherboards but obviously AMD doesn't want to disrupt it's own market. Something like what's in the PS5 would destroy a good deal of the market for them.

AzixTGO · Jun 7, 2021

DukenukemX said:
Would also be interesting if AMD put what's in the PS5 into laptops and desktop motherboards but obviously AMD doesn't want to disrupt it's own market. Something like what's in the PS5 would destroy a good deal of the market for them.

Seems apple is the most likely to build systems like that. Similar to consoles.

AMD Stacked V-Cache

[H]F Junkie

[H]F Junkie

Weaksauce

[H]F Junkie

[H]F Junkie

[H]ard DCOTM x3

[H]F Junkie

[H]F Junkie

[H]ard DCOTM x3

[H]F Junkie

[H]F Junkie

[H]F Junkie

[H]F Junkie

[H]F Junkie

[H]F Junkie

[H]F Junkie

Supreme [H]ardness

Weaksauce

[H]F Junkie

[H]ard DCOTM x3

[H]F Junkie

[H]F Junkie

2[H]4U

[H]F Junkie

[H]ard|Gawd

[H]F Junkie

Supreme [H]ardness

[H]ard|Gawd

Limp Gawd

Supreme [H]ardness

[H]F Junkie

Limp Gawd

Supreme [H]ardness

VP of Extreme Liberty

[H]ard|Gawd

Supreme [H]ardness

[H]ard|Gawd

2[H]4U

[H]F Junkie

Limp Gawd