The Radeon Technology Group (RTG) has received its first Zen 2 sample!

Was hoping to hear 12 cores / 24 threads. Hopefully AMD can get the core count up on later samples.
 
I think this "leak" is not actually Zen2. More likely, it is the Zen+ 2800X meant to counter the 9900K.
Zen2 will most likely be 16c/32t. No way the RTG would have Zen2 a year before release. There must be some changes in the 2800X that the RTG needs to adjust for.
 
I think this "leak" is not actually Zen2. More likely, it is the Zen+ 2800X meant to counter the 9900K.

So a year tweaking from 14nm to 12nm gives only 100MHz higher base clock*, but now six months of tweaking 12nm gives 300MHz? How?

Zen2 will most likely be 16c/32t. No way the RTG would have Zen2 a year before release. There must be some changes in the 2800X that the RTG needs to adjust for.

Why can't the RTG have Zen2 six months before launch? And that 2800X, if it exists, would only be an overclocked 2700X. The RTG doesn't need it for testing drivers; GPU drivers don't depend on CPU clocks.

* after increasing the TDP as well.
 
Color me skeptical on a 15% IPC increase. Maybe a 15% *overall* performance increase is possible, meaning a combination of clock speed and IPC. AMD shot for a 10% performance uplift with Zen+, which they... more or less delivered on. A combination of a minor 3% IPC bump...

Or only a 1.5% IPC bump; it depends on what you're measuring.

Then Zen 2 moves to an 8 core "CCX" that's the whole die.

Much as I like the 8-core CCX idea, the latest rumor is that the ZP2 die is again a pair of 4-core CCXs.

a 2x 256-bit AVX setup (instead of 2x 128-bit).

This is difficult to believe. Rome increases the core count to 64 cores. If each core now has AVX units twice as wide, the memory subsystem would need to improve by about 4x while maintaining backwards compatibility with the SP3 specs.
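
For what it's worth, the arithmetic behind that objection, assuming bandwidth demand scales linearly with both core count and SIMD width (a simplification, and the Rome core count is the rumoured one):
Code:
# Rough sketch of the bandwidth argument above.
naples_cores, rome_cores = 32, 64        # current vs rumoured core count
zen1_avx_bits, zen2_avx_bits = 128, 256  # per-unit AVX width

scaling = (rome_cores / naples_cores) * (zen2_avx_bits / zen1_avx_bits)
print(f"Relative peak bandwidth demand: {scaling:.0f}x")  # -> 4x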
 
I think this "leak" is not actually Zen2. More likely, it is the Zen+ 2800X meant to counter the 9900K.
Zen2 will most likely be 16c/32t. No way the RTG would have Zen2 a year before release. There must be some changes in the 2800X that the RTG needs to adjust for.

Did you read from the first page? It's Zen 2, no ifs or buts about it. ES samples are worked on 8-12 months in advance, so it's perfectly believable. You really think RTG needs to optimize for the 2800X? An architecture that has been out for two years now? Zen+ wasn't a big improvement; it was a node improvement with a few hundred MHz of extra clock. The 2800X will be the same thing as the 2700X, only with higher clocks if they can get them, because they are not going to add more cores to it. Why the hell would they have to optimize for a 2800X? And why would it crash so much when the 2700X has been out for a while already?

On top of that, the OP is not a noob. He knows what he is talking about lol.
 
Or only a 1.5% IPC bump; it depends on what you're measuring.

Much as I like the 8-core CCX idea, the latest rumor is that the ZP2 die is again a pair of 4-core CCXs.



This is difficult to believe. Rome increases the core count to 64 cores. If each core now has AVX units twice as wide, the memory subsystem would need to improve by about 4x while maintaining backwards compatibility with the SP3 specs.


Leave it to AMD to completely lose their momentum by forfeiting their core count supremacy....
 
I think what a lot of people tend to forget is that adding more cores increases latency as well.

Not really; the mere act of adding more cores does not add latency. Hence the 8700K does not have more latency than the older (but same-architecture) 7700K. Changing core counts while maintaining the same interconnect architecture will have identical latency. The additional latency comes from the underlying interconnect architecture, which can be designed to handle large core counts with less die space (which usually increases latency), or more die space can be spent on the interconnect, which isn't as conducive to large core counts.
 
Not really; the mere act of adding more cores does not add latency. Hence the 8700K does not have more latency than the older (but same-architecture) 7700K. Changing core counts while maintaining the same interconnect architecture will have identical latency. The additional latency comes from the underlying interconnect architecture, which can be designed to handle large core counts with less die space (which usually increases latency), or more die space can be spent on the interconnect, which isn't as conducive to large core counts.

Anytime you add cores you're increasing latency somewhere on the die. There's no such thing as a free lunch.
 
Color me skeptical on a 15% IPC increase. Maybe a 15% *overall* performance increase is possible, meaning a combination of clock speed and IPC. AMD shot for a 10% performance uplift with Zen+, which they... more or less delivered on. A combination of a minor 3% IPC bump, a minor ~5% increase in max boost clock, and the Precision Boost rework (good for a nice practical performance boost) got them over the finish line, if barely. I expect Zen2 will deliver more than this, because of a much better process. So a 15% bump is not out of the question. More, even, is possible if there is a core count increase to go along with it.

One possibility which crossed my mind is that Zen2 may dispense with the CCX. I wondered the other day if 2x 4 core CCXs was a stopgap measure from AMD. Perhaps when designing Zen, they realized that while it was impressive - and ran right over Bulldozer - it wasn't quite going to reach parity with Intel in clock speed or IPC. So AMD used IF and expanded a 4-core design into an 8-core design, so that it would beat out the 7700K in at least some tasks and create a market for Zen. This might have happened early in the design process. Anyway... the IF would still be useful for gluing together Epyc, and so they just added it on die as well, as a way to glue together multiple dies.

Then Zen 2 moves to an 8 core "CCX" that's the whole die. Full ring bus internally. But, of course, you can still glue together multiple dies for Epyc and TR, for rapid, cheap core count scaling. A 4 core version could continue to be produced for APU design, with the interconnect still used.

If true, Zen2's IPC improvements may be tied partly to the reduction in latency from eliminating cross-CCX traffic and, perhaps, a 2x 256-bit AVX setup (instead of 2x 128-bit). Combine that with what we assume will be much better clock scaling... and you have something that should compete well against a 9900K, while still preserving the ability to glue together multiple dies for TR and Epyc at relatively low cost.

Of course, this is all supposition. I have no idea. We could still see a 6-core CCX or something, and this ES (presuming the rumor is true) could have just had some cores disabled because it's a partly broken ES.

After having written my own prediction of where AMD will go with things, I do stroke my beard at yours, but we'll get back to that lol

I'm interested to see where the Samsung-TSMC node takes AMD in terms of clocks. The later Zen ES were arguably at the 3.0-3.2GHz "sweet spot" of the 14nm process, at the point where it was still efficient enough to not need a stupid amount of voltage, was easy to cool, and was capable of good performance (ignoring the fact that come release, it would need higher speeds to actually compete). With that in mind, roughly 1GHz was left on the table versus what it could actually max out at. I don't recall if those had a boost clock or not; part of me wants to say they didn't. Either way, if the 4.0GHz of this sample is similarly held back, juuust maybe we'll see them capable of 5GHz... :D

While this is referencing GloFo at 7nm, and we know they have decided to sideline their 7nm for the high-volume market... I think it's still a valid data point to consider:
"Q17: Does the first generation of 7LP target higher frequency clocks than 14LPP?

GP: Definitely. It is a big performance boost - we quoted around 40%. I don't know how that exactly will translate into frequency, but I would guess that it should be able to get up in the 5GHz range, I would expect."​
(Also funny how, being still rather recent, it showed no sign of GF backing out of 7nm in volume.)

Whilst TSMC is quite different from GloFo, and this doesn't factor in architecture either, I'd say it does leave it a possibility :p


As far as the unification of the CCX, I can kinda see that as a potential situation. I mean, Zen was conceived at a time when 4 cores was enough, and probably when they figured Zen wasn't going to do too well compared to Intel's core (as in thread) performance. In addition to that, they were planning on this being in laptops, likely paired with a GPU, and it would need to be power conscious. As we all recall, AMD said they weren't going to compete on the high end anymore, so a 4C/8T APU that didn't draw a ton of power makes sense. Add in the fact that AMD picked the 14LPP node for low power and it brings it all together rather well; however, I wouldn't be surprised if they determined (either after simulations or a very early ES) that the performance was higher than they anticipated and felt it could be a very real competitor to Intel... So why not do what they did with the Athlon 64: merge two dies together (cut me some artistic license here lol) and create the Athlon 64 X2. Thus, they went from a 4C/8T APU to a 4C/8T CCX that could technically scale within a die, but also on package with multiple dies, as they've done for a long time with Opterons.

That just sets the stage, and to me it makes sense and answers the classic question that would likely come up regarding a large monolithic approach: "why didn't they do that in the first place?"
Well, because it wasn't what they envisioned in the first place. Remember what Jim Keller said when he left AMD after working on Zen: they had the groundwork laid down for as far out as Zen 3. So if we want to speculate... well, that could mean:
Zen 1 - 4C8T CCX x2
Zen 2 - 8C/16T CCX solo
Zen 3 - 8C/16T CCX x2

Personally, I find that just as plausible as my original thought, since it still would result in a new socket to accommodate Zen 3 due to more cores.

The future releases by AMD will be interesting for sure. :)
 
As we know from The_Stilt's deep dive on the original Zen ES chip, it ran the Infinity Fabric (uncore) at 1:1, and it was said that this would likely be the case for the EPYC chips, and TR by association (whether or not that came to be the case I don't know, and I don't know how we'd even be able to determine what speed it runs at).

I don't think, or at least don't recall, it was ever explained why they ended up going with a 1:2 speed.
The IF frequency is clocked 1:1 to the memory frequency. The illusion of 1:2 ratio comes from the fact that what is colloquially called "memory frequency" is actually memory transfer rate.
DDR stands for Double Data Rate which means that DDR memory performs two transfers per clock cycle.
The Stilt's analysis (in its latest form) explains this.
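
A minimal sketch of that relationship, assuming exactly the 1:1 coupling described above:
Code:
# The advertised "DDR4-3200" figure is a transfer rate (MT/s), not a clock.
# DDR performs two transfers per memory clock cycle.
ddr_rating = 3200                # advertised transfer rate in MT/s
memclk_mhz = ddr_rating / 2      # actual memory clock: 1600 MHz
fabric_mhz = memclk_mhz          # IF runs 1:1 with the memory clock

print(f"MEMCLK: {memclk_mhz:.0f} MHz, Infinity Fabric: {fabric_mhz:.0f} MHz")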

An AMD slide (not sure from which presentation) mentioned that the fabric is in the memory clock domain so they didn't have to add a buffer in between that adds latency for memory accesses.

Regarding your memory troubles:
All I can say is that memory is... complicated when you try to run it near its physical limitations.
A memory controller is incredibly complex because it has to implement complex procedures that are parameterized over dozens of memory-specific details (timings etc.).

Color me skeptical on a 15% IPC increase. Maybe a 15% *overall* performance increase is possible, meaning a combination of clock speed and IPC. AMD shot for a 10% performance uplift with Zen+, which they... more or less delivered on. A combination of a minor 3% IPC bump, a minor ~5% increase in max boost clock, and the Precision Boost rework (good for a nice practical performance boost) got them over the finish line, if barely. I expect Zen2 will deliver more than this, because of a much better process. So a 15% bump is not out of the question. More, even, is possible if there is a core count increase to go along with it.

One possibility which crossed my mind is that Zen2 may dispense with the CCX. I wondered the other day if 2x 4 core CCXs was a stopgap measure from AMD.

Then Zen 2 moves to an 8 core "CCX" that's the whole die. Full ring bus internally. But, of course, you can still glue together multiple dies for Epyc and TR, for rapid, cheap core count scaling. A 4 core version could continue to be produced for APU design, with the interconnect still used.

I have also considered that the CCX was just a necessity to reduce design work.
But then again I think there is not really a reason to do away with the general concept. There needs to be some kind of fabric across all cores and abstracting that into different layers makes this a whole lot easier.

AMD is heavily invested in interconnect technology now (as several patents and papers show). They are also doing and developing MCMs and "chiplets" etc.
Overall I think the trend towards "distributed" (on multiple chips) and heterogeneous systems is inevitable, which must result in a departure from the cozy UMA model where every core can be treated the same.

Rather than trying to squeeze ever more stuff into a single chip or "complex", AMD might instead try to come up with an interconnect that makes these boundaries as seamless as possible (although, as said above, this cannot be perfect).
As evidenced by EPYC, scaling a system becomes a lot easier when you have a good interconnect.

Leave it to AMD to completely lose their momentum by forfeiting their core count supremacy....
That "momentum" is worth nothing on its own.
I know this might sound old, but there is still barely any consumer software that can use 8C/16T.

Not really; the mere act of adding more cores does not add latency. Hence the 8700K does not have more latency than the older (but same-architecture) 7700K. Changing core counts while maintaining the same interconnect architecture will have identical latency.
This is incorrect.
The addition of cores makes the ring bus of Intel CPUs longer, thereby adding to average latency and reducing available bandwidth per core (if everything else is constant).
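
To put a number on that, a small sketch (assuming each ring stop is one hop and ignoring cache slices, buffering, and everything else):
Code:
# Average hop distance between two stops on a bidirectional ring.
# Adding stops (cores) lengthens the average trip, all else equal.
def avg_ring_hops(n):
    # a packet needs min(k, n - k) hops to reach the stop k positions away
    return sum(min(k, n - k) for k in range(1, n)) / (n - 1)

for stops in (4, 6, 8, 10):
    print(f"{stops} stops: {avg_ring_hops(stops):.2f} hops on average")
# 4 stops: 1.33, 6 stops: 1.80, 8 stops: 2.29, 10 stops: 2.78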

In Ivy Bridge-EX and Broadwell-E the ring was rearranged on high core count dice.
IB-EX had a hybrid triple-ring design while BDW-E used two rings.

In Zen, the cores of a CCX are connected in a full mesh which scales with quadratic complexity.
This topology is infeasible for six or eight cores.
 
In Zen, the cores of a CCX are connected in a full mesh which scales with quadratic complexity.
This topology is infeasible for six or eight cores.
I have to disagree here. Sure, it won't scale as well with six or eight cores, but you could use the mesh in a similar way. There are a number of possible arrangements, with varying degrees of complexity, depending mostly on the limits of the technology it's implemented on.

Code:
X—X—X
|\|\|
X—X—X

X—X—X
|\ /|
X—X—X

  X—X
 /| |\
X  +  X
|  |  |
X  +  X
 \| |/
  X—X
I'm sure an engineer could come up with more efficient designs, but these are just a few off the top of my head.

Edit: If you have a shared cache, you could even eliminate some of the links:
Code:
      or
CACHE    CACHE
↑ ↑ ↑      ↑
X X X    X—X—X
|\|\|    |\|\|
X X X    X—X—X
↓ ↓ ↓      ↓
CACHE    CACHE
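
Out of curiosity, a rough way to score the first two 6-node layouts sketched above against a full mesh, assuming every link costs one hop and ignoring link width, routing overhead, and the cache entirely. The node numbering is mine: 0-2 top row, 3-5 bottom row.
Code:
# Average shortest-path hops (BFS) for the sketched 6-core layouts,
# compared against a full mesh.
from itertools import combinations
from collections import deque

def avg_hops(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    total = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total / (n * (n - 1))

full_mesh = list(combinations(range(6), 2))                          # 15 links
sketch_1  = [(0,1),(1,2),(3,4),(4,5),(0,3),(1,4),(2,5),(0,4),(1,5)]  # 9 links
sketch_2  = [(0,1),(1,2),(3,4),(4,5),(0,3),(2,5),(0,4),(2,4)]        # 8 links

for name, edges in (("full mesh", full_mesh), ("sketch 1", sketch_1),
                    ("sketch 2", sketch_2)):
    print(f"{name}: {len(edges)} links, {avg_hops(6, edges):.2f} avg hops")
# full mesh: 15 links, 1.00 avg hops; both sketches: ~1.47 avg hops
Roughly half the links of a full mesh for about half an extra hop on average, which is exactly the trade-off being argued about here.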
 
I have to disagree here. Sure, it won't scale as well with six or eight cores, but you could use the mesh in a similar way. There are a number of possible arrangements, with varying degrees of complexity, depending mostly on the limits of the technology it's implemented on.
I specifically meant a full mesh (or fully connected network) as it is now (i.e. all nodes/cores connected directly).
There are certainly ways to reduce the number of links, but this inevitably leads to forwarding - adding routing latency and complexity.

Regarding the cache in the middle, this is how it is done right now on a high level of abstraction; four cores are connected to the L3 cache.
However when looking at the detailed implementation, the cache consists of four slices, each adjacent to one core, which are connected to each other.
The connection between the slices is typically graphically presented as a full mesh (incl. by AMD), but AFAIK the actual topology has never been officially mentioned.
Intel also uses one L3 slice per core, and all are connected via (modified) ring or mesh.
The interesting observation here is that neither AMD nor Intel uses a physically unified large cache block.

As I said before, AMD is evidently very active in the research of interconnects so I am confident they can come up with a good compromise of complexity and performance.
It will be interesting to see whether AMD believes that the drawbacks of their NUMA-esque architecture are worth a paradigm shift.

If the Infinity Fabric and locality-aware software (scheduler and applications) can be improved enough, the CCX might not be worth giving up.
The latency of the CCX-local L3 cache is actually superior to the Intel ring bus counterpart.
 
I specifically meant a full mesh (or fully connected network) as it is now (i.e. all nodes/cores connected directly).
There are certainly ways to reduce the number of links, but this inevitably leads to forwarding - adding routing latency and complexity.

Regarding the cache in the middle, this is how it is done right now on a high level of abstraction; four cores are connected to the L3 cache.
However when looking at the detailed implementation, the cache consists of four slices, each adjacent to one core, which are connected to each other.
The connection between the slices is typically graphically presented as a full mesh (incl. by AMD), but AFAIK the actual topology has never been officially mentioned.
Intel also uses one L3 slice per core, and all are connected via (modified) ring or mesh.
The interesting observation here is that neither AMD nor Intel uses a physically unified large cache block.

As I said before, AMD is evidently very active in the research of interconnects so I am confident they can come up with a good compromise of complexity and performance.
It will be interesting to see whether AMD believes that the drawbacks of their NUMA-esque architecture are worth a paradigm shift.

If the Infinity Fabric and locality-aware software (scheduler and applications) can be improved enough, the CCX might not be worth giving up.
The latency of the CCX-local L3 cache is actually superior to the Intel ring bus counterpart.
Yeah, for some reason I confused the core configuration with the die configuration. You could swap the cache and X's in my ASCII diagram to make it more representative of actual possible configurations. As shown here, though, it's not necessary to have one L3 slice per core: you could have one per two cores. You end up with a smaller cache, but I wonder what the effects on latency would be there?
 
The RTG has just received its first Zen 2 sample (to optimize for) and it's really impressive.

8C/16T

4.0 GHz/4.5 GHz

DDR4-3600 CAS 15

Radeon RX Vega 64 LE

__________________________

The good: It's already nibbling at the Core i7-8700K.

The bad: It crashes a lot.

The ugly: It crashes all the time. Some of the tests have to be run multiple times because they crashed before finishing.

I know it's only early, but that's disappointing. Either they fix up the IPC by at least 5-10% or they push that clock number a bit higher.
Hopefully, very hopefully, this will be resolved in 6 months' time.
Also, to be honest, considering rumoured potential launch dates, I'm surprised this wasn't at least 2 or 3 months ago. (I could've sworn there were indications this had occurred months ago?)
 
I know it's only early, but that's disappointing. Either they fix up the IPC by at least 5-10% or they push that clock number a bit higher.
Hopefully, very hopefully, this will be resolved in 6 months' time.
Also, to be honest, considering rumoured potential launch dates, I'm surprised this wasn't at least 2 or 3 months ago. (I could've sworn there were indications this had occurred months ago?)

It's common for ES samples to need ironing out. It probably has more to do with software than it does with the chip itself. If they can get an early ES sample to run at those clocks, I think they will be fine.

EPYC 2 was already in labs months ago, I think. So this just might be RTG testing it out, and likely a driver issue more than anything. I won't worry too much. Sometimes they have multiple ES samples.
 
The IF frequency is clocked 1:1 to the memory frequency. The illusion of 1:2 ratio comes from the fact that what is colloquially called "memory frequency" is actually memory transfer rate.
DDR stands for Double Data Rate which means that DDR memory performs two transfers per clock cycle.
The Stilt's analysis (in its latest form) explains this.

An AMD slide (not sure from which presentation) mentioned that the fabric is in the memory clock domain so they didn't have to add a buffer in between that adds latency for memory accesses.

Regarding your memory troubles:
All I can say is that memory is... complicated when you try to run it near its physical limitations.
A memory controller is incredibly complex because it has to implement complex procedures that are parameterized over dozens of memory-specific details (timings etc.)
I'm still trying to track down the exact post in his thread (sadly, that's even assuming it was his Deep Dive thread that I read it in), but given I had originally read the at-the-time 24 or so pages, it'll take me a while to track down the post :\
But, to be fair, in case my memory is just playing tricks on me... In his first post, he does state that it runs 1:2, but that it is as you said: only 1:2 because of DDR being doubled. Which, technically speaking, means I have DDR4-3200MT/s memory, not DDR4-3200MHz, but we've all just forgone those technicalities and simply gone with it running "at 3200", where some people imply MHz, and in those cases we know what they mean. I personally try to say it without MHz and don't try to imply it as such, but I don't always catch myself. At any rate... my point being that I could've just as easily misread what he wrote; however, I distinctly recall him mentioning 1:1 in relation to memory speed, which is why I don't really feel like I've misread or misremembered that detail from his first post.

Which after holding off posting for a day so I could research... I finally managed to at least track down one of the posts I was recalling. (I'd still swear he mentioned it again, relating to a chip sample he had, but regardless...)
"UCLK, FCLK & DFICLK default to half of the effective MEMCLK frequency (i.e. DDR-2400 = 1200MHz).
There is a way to configure the memory controller (UCLK) for 1:1 rate, however that is strictly for debug and therefore completely untested. The end-user has neither the knowledge or the hardware to change it.
AFAIK FCLK & DFICLK are both fixed and cannot be tampered with. However certain related fabrics, which run at the same speed have their own frequency control. The "infinity fabric" (GMI) runs at 4x FCLK frequency.
"​
In all fairness, I suppose it's my error for assuming that the UCLK (aka UMC clock; Unified Memory Controller) was directly related to the IF speed. And it still could be, but from what he says, that doesn't sound like the case. I'll have to do my damn best to disassociate the two and forget I ever mentioned it. BUT, it does still leave me curious what having the UMC running 1:1 would translate to performance-wise. The person quoted in his response seems to feel it'd have a sizable impact on things, but since I'm not an engineer, I can't even begin to speculate.
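
Taking the quoted defaults at face value, a small sketch of how those domains relate for a given DIMM rating (my arithmetic, not The Stilt's):
Code:
# Zen 1 clock-domain defaults per The Stilt's quote above:
# UCLK, FCLK, DFICLK = half the *effective* MEMCLK; GMI = 4x FCLK.
ddr_rating = 2400                      # DDR4-2400, i.e. 2400 MT/s effective
uclk = fclk = dficlk = ddr_rating / 2  # -> 1200 MHz each
gmi = 4 * fclk                         # -> 4800 MHz, per the quote

print(f"UCLK/FCLK/DFICLK: {uclk:.0f} MHz, GMI: {gmi:.0f} MHz")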

As for "my memory troubles", I'm not sure which ones you mean? Aside from a couple BIOS releases that caused me issues, I've been able to run my RAM at DDR4-3200 on my 1700X since launch and running my Titanium's v1.10 shipping BIOS.
Or do you mean what I had been talking about with Odd/Even timings? As I understand how complex memory is, and how one timing can play on another, leading to you either gaining more performance or losing stability.
Regardless, it's not really that I'm having troubles, insofar as that it's not something I'm doing wrong... it's just a quirky choice AMD has made. In the times when I was permitted to run Odd timings, I had perfect stability, there were no troubles to speak of. Thus why I find it strange that the ability was once again taken away.


Regarding the cache in the middle, this is how it is done right now on a high level of abstraction; four cores are connected to the L3 cache.
However when looking at the detailed implementation, the cache consists of four slices, each adjacent to one core, which are connected to each other.
The connection between the slices is typically graphically presented as a full mesh (incl. by AMD), but AFAIK the actual topology has never been officially mentioned.
Intel also uses one L3 slice per core, and all are connected via (modified) ring or mesh.
The interesting observation here is that neither AMD nor Intel uses a physically unified large cache block.

In terms of using a 'solid chunk' of cache... To me, the most obvious reason would be: yields.

When you use a solid chunk of cache, if that happens to be the sector of the die with a fault, then that die is junk.
However... when you have segmented cache, I suspect it opens up the ability to laser off the bad slice and bin the chip as a lower-end product like an i5 or i3 (or Pentium and Celeron, when they were options).

Even if that isn't necessary due to high yields, like in AMD's case with Ryzen, it may have proven more difficult to then create the APU, which has 1/2 the L3 in the CCX.

Then again, this is all assuming that the cache is how it is depicted in the die shots of chips. Keep in mind that the die shots of the 8C Zen and 4C Zen APU are provided by AMD. This is not me trying to create a conspiracy, but it's known that some elements are obscured while other elements are altered slightly to better illustrate what they are. After all, they are just PR material we're shown. I've also never bothered to look around for a 3rd-party die shot of Ryzen, assuming there are any published, and as such can't say I've compared to see if there are changes. Most wouldn't have any incentive to do that given the work involved. The only place I know that will sometimes do that is iFixit in their teardowns, to better compare something when there isn't any info available or when the numbers on an IC were removed/changed. I don't know if it'd be worth their time to have a Ryzen sent off and go through the process of taking a peek under the hood.
 
Which after holding off posting for a day so I could research... I finally managed to at least track down one of the posts I was recalling. (I'd still swear he mentioned it again, relating to a chip sample he had, but regardless...)
"UCLK, FCLK & DFICLK default to half of the effective MEMCLK frequency (i.e. DDR-2400 = 1200MHz).
There is a way to configure the memory controller (UCLK) for 1:1 rate, however that is strictly for debug and therefore completely untested. The end-user has neither the knowledge or the hardware to change it.
AFAIK FCLK & DFICLK are both fixed and cannot be tampered with. However certain related fabrics, which run at the same speed have their own frequency control. The "infinity fabric" (GMI) runs at 4x FCLK frequency.
"​
In all fairness, I suppose it's my error for assuming that the UCLK (aka UMC clock; Unified Memory Controller) was directly related to the IF speed. And it still could be, but from what he says, that doesn't sound like the case. I'll have to do my damn best to disassociate the two and forget I ever mentioned it. BUT, it does still leave me curious what having the UMC running 1:1 would translate to performance-wise. The person quoted in his response seems to feel it'd have a sizable impact on things, but since I'm not an engineer, I can't even begin to speculate.
You are right. When I read your first paragraph, I vaguely remembered that he had written something about 1:1.
I immediately searched for it (before reading the above; that would have saved some time).
As for "my memory troubles", I'm not sure which ones you mean? Aside from a couple BIOS releases that caused me issues, I've been able to run my RAM at DDR4-3200 on my 1700X since launch and running my Titanium's v1.10 shipping BIOS.
Or do you mean what I had been talking about with Odd/Even timings? As I understand how complex memory is, and how one timing can play on another, leading to you either gaining more performance or losing stability.
Regardless, it's not really that I'm having troubles, insofar as that it's not something I'm doing wrong... it's just a quirky choice AMD has made. In the times when I was permitted to run Odd timings, I had perfect stability, there were no troubles to speak of. Thus why I find it strange that the ability was once again taken away.
Yeah, it was just a remark. I'm personally stuck at 3133 with my 2700X, but only due to my own laziness.

Who knows why they changed it back. Maybe it was an optimization, maybe they found an issue with it. Maybe they accidentally broke it. :rolleyes:
In terms of using a 'solid chunk' of cache... To me, the most obvious reason would be: yields.

When you use a solid chunk of cache, if that happens to be the sector of the die with a fault, then that die is junk.
However... when you have segmented cache, I suspect it opens up the ability to laser off the bad slice and bin the chip as a lower-end product like an i5 or i3 (or Pentium and Celeron, when they were options).

Even if that isn't necessary due to high yields, like in AMD's case with Ryzen, it may have proven more difficult to then create the APU, which has 1/2 the L3 in the CCX.
Memory blocks typically have quite a bit of redundancy. SRAM takes up a huge portion of most chips so it would be foolish not to consider yields in each slice already.
Yeah, for some reason I confused the core configuration with the die configuration. You could swap the cache and X's in my ASCII diagram to make it more representative of actual possible configurations. As shown here, though, it's not necessary to have one L3 slice per core: you could have one per two cores. You end up with a smaller cache, but I wonder what the effects on latency would be there?
AFAIK the Raven Ridge CCX still has four slices, but each with half the size.
AMD's materials (slide 6) indicate this, although the physical locations are not correct (this is a copied diagram from Zeppelin with modified values).
Also looking at a die shot, going by the patterns I would say there are four slice control blocks and a strip of something else in the middle (that isn't there on Zeppelin).

Unfortunately, the only thing WikiChip said about the interconnect is that "the L3 acts as a crossbar for each of the four cores".
The actual connection structure still remains a mystery to me.

Anyway, four cores (nodes) happens to be IMO the last number where a fully connected network is feasible. Anything beyond that needs switching/forwarding.
A fully connected network of four nodes requires six bi-directional connections. At six nodes this number jumps to 15, at eight nodes it's 28.
Adding more cores adds routing complexity that costs some latency.
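
Those link counts are just the handshake formula n(n-1)/2:
Code:
# Bi-directional links in a fully connected network of n nodes.
for n in (4, 6, 8):
    print(f"{n} nodes: {n * (n - 1) // 2} links")  # -> 6, 15, 28
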
Then again, this is all assuming that the cache is how it is depicted in the die shots of chips. Keep in mind that the die shots of the 8C Zen and 4C Zen APU are provided by AMD. This is not me trying to create a conspiracy, but it's known that some elements are obscured while other elements are altered slightly to better illustrate what they are. After all, they are just PR material we're shown. I've also never bothered to look around for a 3rd-party die shot of Ryzen, assuming there are any published, and as such can't say I've compared to see if there are changes.
There are some excellent die shots by Fritzchens Fritz on Flickr. One of my favourites is this (be sure to check out the full resolution originals).
All the patterns perfectly match the official die shots and diagrams of Zeppelin (like the ones from HotChips 28).
 
The AnandTech article on the new 7nm TSMC Apple A12 processor is quite interesting.

120MHz-200MHz gains when compared to its 10nm TSMC A11 predecessor. Hard to say how that would translate for AMD CPUs coming from the current GlobalFoundries 12nm process.

It could be that these clock boosts are purely the result of better power efficiency. Interesting nonetheless.
 
The AnandTech article on the new 7nm TSMC Apple A12 processor is quite interesting.

120MHz-200MHz gains when compared to its 10nm TSMC A11 predecessor. Hard to say how that would translate for AMD CPUs coming from the current GlobalFoundries 12nm process.

It could be that these clock boosts are purely the result of better power efficiency. Interesting nonetheless.
Don't forget that when you have a mobile processor you are mostly limited by the amount of heat generated by the silicon. You could go higher on raw clock speed but the design would have to be able to cope with it (heat).

If you translate that to what AMD did with Threadripper and Ryzen: you could still get Threadripper to 4GHz if you had decent water cooling. Ryzen (1st generation) never got much further than that, but the difference between air and water solutions becomes quite clear.
 
If any of that shit is true, AMD is gonna need Zen 2 sooner rather than later. That 9900k is looking MIGHTY impressive. Finally a balls-to-the-wall product out of Satan Clara.
 
If any of that shit is true, AMD is gonna need Zen 2 sooner rather than later. That 9900k is looking MIGHTY impressive. Finally a balls-to-the-wall product out of Satan Clara.

Nothing I have seen looks impressive, and it's looking to be priced way too high as well. 12% faster in gaming than a 2700X is just not that impressive, especially when you factor in price.
 
Nothing I have seen looks impressive, and it's looking to be priced way too high as well. 12% faster in gaming than a 2700X is just not that impressive, especially when you factor in price.

The key difference is that it's also faster in multithreaded productivity tasks too. Up until now, there's been a good rationalization for Ryzen > mainstream Intel products due to better multithreaded performance and only a minor gaming hit. Neither the 8700k nor the 2700X could be said to wear that absolute performance crown in the mainstream - it's more use case based. Both are relatively equal. The 9900k turns that on its head, offering better performance in both types of tasks, though admittedly at a higher price point. The 2700X will still win on performance for the dollar - provided it remains competitive with the 9700k (which it probably will) - but speaking for myself, if I were building on a mainstream platform today, it would be 9900k all day.

Either way, hopefully Zen 2 competes favorably with the 9900k.
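
Purely for illustration, plugging in launch-window prices as rough assumptions (neither figure comes from this thread) and the ~12% gaming delta mentioned earlier:
Code:
# Illustrative perf-per-dollar comparison; the prices (~USD) and the
# 12% delta are assumptions for the sake of the arithmetic.
chips = {
    "9900K": {"price": 480, "perf": 1.12},  # assumed ~12% faster overall
    "2700X": {"price": 330, "perf": 1.00},  # baseline
}

for name, c in chips.items():
    print(f"{name}: {1000 * c['perf'] / c['price']:.2f} perf per $1000")
# The 2700X still wins perf/$ even with the 9900K's absolute lead.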
 
The key difference is that it's also faster in multithreaded productivity tasks too. Up until now, there's been a good rationalization for Ryzen > mainstream Intel products due to better multithreaded performance and only a minor gaming hit. Neither the 8700k nor the 2700X could be said to wear that absolute performance crown in the mainstream - it's more use case based. Both are relatively equal. The 9900k turns that on its head, offering better performance in both types of tasks, though admittedly at a higher price point. The 2700X will still win on performance for the dollar - provided it remains competitive with the 9700k (which it probably will) - but speaking for myself, if I were building on a mainstream platform today, it would be 9900k all day.

Either way, hopefully Zen 2 competes favorably with the 9900k.

At double the cost. It looks to have the same IPC as the 8700K too. Don't get me wrong, everyone knows that the 8700K is the best gaming CPU, but not by much right now. All AMD needs to do is increase clock speeds and IPC and Intel is in a world of hurt.

Let's not even bring up how much power the 9900k will use before overclocking. AMD is coming out with 7nm soon. Intel is using 14++++++++++++ and +
 
The key difference is that it's also faster in multithreaded productivity tasks too. Up until now, there's been a good rationalization for Ryzen > mainstream Intel products due to better multithreaded performance and only a minor gaming hit. Neither the 8700k nor the 2700X could be said to wear that absolute performance crown in the mainstream - it's more use case based. Both are relatively equal. The 9900k turns that on its head, offering better performance in both types of tasks, though admittedly at a higher price point. The 2700X will still win on performance for the dollar - provided it remains competitive with the 9700k (which it probably will) - but speaking for myself, if I were building on a mainstream platform today, it would be 9900k all day.

Either way, hopefully Zen 2 competes favorably with the 9900k.

The problem I see is that for almost the same cash you can buy a Threadripper, which is far better at multithreaded performance. I just don't see the average person spending that much more for such a little gain.
 
7% and they up the clock speeds a bit - still big ifs, but that'd be enough, I think.

I was hoping for more cores, but it's starting to look like the core count isn't going up on the mainstream. Not confirmed, of course. But seems that way.

AMD needs ~15% more performance per core if they intend to compete at the same number of cores. Whether that comes from clockspeed or IPC, or both... doesn't really matter. But that's the bar. I think it's very possible, but by no means a sure thing.
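
Treating IPC and clock speed as multiplicative (a first-order assumption), the bar works out like this:
Code:
# Clock speed gain needed to reach a 15% per-core uplift for a given
# IPC gain, assuming the two simply multiply.
target = 1.15
for ipc_gain in (0.00, 0.05, 0.07, 0.10):
    clock_gain = target / (1 + ipc_gain) - 1
    print(f"{ipc_gain:.0%} IPC -> needs {clock_gain:+.1%} clock speed")
# e.g. the 7% IPC floated above would leave ~7.5% to find in clocks.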
 
I was hoping for more cores, but it's starting to look like the core count isn't going up on the mainstream. Not confirmed, of course. But seems that way.

Therein lies the market challenge: if you're not doing massive spreadsheets, compiling large applications, content creation, etc., two cores with hyper-threading is actually just as fast as sixteen for your average desktop/laptop. The most performance-sensitive applications that your average consumer is going to be using are games.

AMD needs ~15% more performance per core if they intend to compete at the same number of cores. Whether that comes from clockspeed or IPC, or both... doesn't really matter. But that's the bar. I think it's very possible, but by no means a sure thing.

While 7nm should give them a boost in power usage per clock, historically speaking, I would hope that they get their per-core performance up through IPC increases. That's what's needed to help them scale performance up in core-limited (size limited) scenarios like consumer sockets and down into the ultrabook/large tablet space effectively.

One thing to add: I hope AMD comes up with a basic graphics block that's more or less equivalent to what Intel is shipping, with at least three display outputs, and puts it on every die they forge. Even TR/Epyc could use that for remote desktops, but really, not having it in their current Ryzen 6+ core CPUs represents a limitation.
 
In Zen, the cores of a CCX are connected in a full mesh which scales with quadratic complexity.
This topology is infeasible for six or eight cores.
...Unless you've forgotten about the active interposer, which AMD is developing and has patented.
 
...Unless you've forgotten about the active interposer, which AMD is developing and has patented.

So long as costs (and worse yields, a la HBM!) don't skyrocket in the process, I see this as being a win for them, supposing that they can keep performance in line with non-interposer parts. How well HBM works is a good sign, though!
 
So long as costs (and worse yields, a la HBM!) don't skyrocket in the process, I see this as being a win for them, supposing that they can keep performance in line with non-interposer parts. How well HBM works is a good sign, though!
I am not aware of them using HBM for the active interposer CPU designs, only in MCMs/GPU stuff; the active interposer is more just a way to provide better mesh/linking without taking up valuable, high-density 7nm die area.
I'm very curious as to how much die space is interlink, as that would effectively be minimised or removed entirely depending on how their designs go.

I see this as being as big of a jump (at this end of the optimisation pile - most free lunches are well gone by now) as the IMC was in the Athlon 64, especially for SMT systems. If it works as well as I expect, Intel is in for a hell of a time. As there is so much room for interconnect using an active interposer, we have the luxury of multiple paths being a reality. E.g. it could be a 4-core CCX and an 8-core CCX at once! Imagine that: having variable yet fixed latency core scaling depending on the task. So something that needs e.g. 4 threads runs on one sub-CCX/optimised routing pattern, but when it occasionally needs 16 threads, it uses alternate traces that are slightly higher latency but invoke both CCXs (e.g. 16 cores) on the die at similar latency, in order to minimise losses to scheduling/synchronisation.
So now (assuming 2x 8-core CCXs) you have the scalability of a 16-core TR rig with far better latency, and the ability to run an octo-core CCX with very low latency. Or 2x octo-cores with far lower latency than TR/EPYC previously allowed, all on a single die. I would imagine that if they can make the active interposers large enough, it will allow far shorter distances between dies for EPYC/TR multi-die layouts. Now they should be much closer to e.g. Intel ring bus latency for a multi-die setup... which is death to Intel's slight monolithic advantage.

When you take factors like this into account, it gives more feasibility to the 13% IPC leak (I still think that's a bit optimistic). IMC tweaks, extreme latency reduction, CCX scaling, interconnect speed, etc. will all help, before we even get to overall clock speed adding to the mix on top of that IPC boost.
It makes further sense when you look at the ES leak by mockingbird. A 4.5GHz ES (add 200MHz+ for release) with ~10% IPC is going to really rock the boat for a 14nm-hamstrung Intel.
 
I think we still don't have a good idea of Zen 2's real performance.

The tests were done with a Radeon RX Vega 64 LE.

It's possible that a lot of the tests ended up slamming into a GPU bottleneck.
 
The magic fairy told you?

Edit: and this is not good.

Engineering samples are meant to crash a lot, but ONLY coming close to matching an 8700K? Who freaking cares? It needs to BEAT a 9900K. Intel is going to have its own 8/16 CPU that will be the definitive gaming champion, and AMD needs to beat that.
Nope, they just have to come anywhere near it and price it appropriately.
 