An In-Depth Analysis of the 7nm 64-Core AMD EPYC "Rome" Server Processor

cageymaru · Nov 9, 2018

Charlie Demerjian of SemiAccurate has written an in-depth analysis of the 7nm 64-core 9-die AMD EPYC "Rome" server processor. He hypothesizes about the processor's real-world performance potential versus benchmarks by analyzing the chip's design features. He explains why the 14nm IOX chip didn't need a shrink, and how that design choice will speed up production of the processor. He articulates how doubling the instruction widths internally for the new Zen 2 cores leads to a 2X FP performance increase per core and a 4X FP increase per socket. Many other revelations about the new "Intel killers" can be found in the article.

AMD has full transparent memory encryption, can secure VMs from the host, and many other useful things that protect users in the real world. Better yet they are enabled with a BIOS switch and have a tiny overhead. Officially there is now more space for keys so you can support more encrypted memory VMs per box. The other big box to check is Spectre mitigations are now rolled in to the core but Meltdown and L1TF/Foreshadow are not. Why? Because AMD wasn't affected by either one and never will be. Patching can work but to be immune from the start is always a better choice.

Sikkyu · Nov 9, 2018

TLDR; AMD's big dick processor and its gonna be slappin around intel for a while.

power666 · Nov 9, 2018

Still waiting for independent reviews when they arrive next year for final judgement, but on paper AMD's design here is VERY impressive. Even without the changes from Zen+ -> Zen 2, there looks to be a nice performance increase just from how the topology changes. The centralized IO removes the need for multiple NUMA nodes on a single socket system. Applications that both need a high PCIe lane count and run multithreaded workloads should see a very healthy boost. This also opens up oddities like an 8 core Epyc chip for dirt cheap: this would only require two dies instead of four due to the IO being centralized. This has implications for Threadripper as well: the NUMA hole goes away, and restricting applications to particular cores to benefit from memory locality will be unnecessary.

Zen 2 looks to be a good step forward. The doubling of floating point throughput is also welcome, but I am curious how AMD is going to tackle power consumption here. On the Intel side, using wider SIMD tends to incur a clock speed offset (still faster due to the doubled width, but it could never be twice as fast).

Charlie is right that yields should be high. 75 mm^2 on the CPU side should have good yields to begin with, to the point that it makes me wonder if there is a point in going below 8 active cores in a package for server (send all the defects down to the consumer market). Binning should make a few higher grades of SKUs available. Not because of any fundamental change in the design, just that the raw number of smaller, higher yielding dies should produce more 'golden samples'. We're looking at a few hundred MHz here, but still welcome.
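To put a rough shape on the yield argument, here is a minimal sketch using the simple Poisson yield model (yield = exp(-area × defect density)). The defect density below is an illustrative assumption, not a published 7nm figure, and the 600 mm^2 "monolithic" die is a hypothetical comparison point that just multiplies the 75 mm^2 figure from the post by eight.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

# ASSUMPTION: illustrative defect density only, not a real foundry number.
ASSUMED_D0 = 0.2  # defects per cm^2

chiplet = poisson_yield(75, ASSUMED_D0)
# Hypothetical monolithic die with the same total CPU area (8 x 75 mm^2).
monolithic = poisson_yield(8 * 75, ASSUMED_D0)

print(f"75 mm^2 chiplet, defect-free share:  {chiplet:.1%}")
print(f"600 mm^2 monolithic, defect-free share: {monolithic:.1%}")
```

Whatever defect density is plugged in, the smaller die always comes out ahead in this model, which is the point being made about binning and 'golden samples'.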

On the Intel side of things, they have the technology to do exactly what AMD is pulling off here and even more. However, the delays in their 10 nm production and apparent 14 nm shortages have wrecked their road map. Once Intel recovers, they can easily catch up but next year is looking brutal for Intel.

Not mentioned anywhere and a bit speculative on my part, but I think Epyc can now scale up to an 8 socket design. Previously four Zeppelin dies had to be linked to four other Zeppelin dies in another socket. Now a single off-die Infinity Fabric link can hit all the coherency points in the remote socket, leaving three others available for additional CPU sockets. The PCIe lanes share connectivity with Infinity Fabric, so even a dual CPU system would be more like a 112 + 112 lane configuration. A quad socket system would have 256 lanes and an eight socket beast would double that. The modular nature of this opens up the possibility of hosting a GPU in the socket. If Navi is modular as well, a compute die linked via Infinity Fabric would be a straight substitution for a CPU die. There is enough room on the package if AMD wanted to add HBM for the GPU dies too.

Atearen · Nov 9, 2018

I hope someone will keep an eye on Intel, they played dirty in the past and the fines were a joke

Glock24 · Nov 9, 2018

power666 said:
Still waiting for independent reviews when they arrive next year for final judgement, but on paper AMD's design here is VERY impressive. Even without the changes from Zen+ -> Zen 2, there looks to be a nice performance increase just from how the topology changes. The centralized IO removes the need for multiple NUMA nodes on a single socket system. Applications that both need a high PCIe lane count and run multithreaded workloads should see a very healthy boost. This also opens up oddities like an 8 core Epyc chip for dirt cheap: this would only require two dies instead of four due to the IO being centralized. This has implications for Threadripper as well: the NUMA hole goes away, and restricting applications to particular cores to benefit from memory locality will be unnecessary.

Zen 2 looks to be a good step forward. The doubling of floating point throughput is also welcome, but I am curious how AMD is going to tackle power consumption here. On the Intel side, using wider SIMD tends to incur a clock speed offset (still faster due to the doubled width, but it could never be twice as fast).

Charlie is right that yields should be high. 75 mm^2 on the CPU side should have good yields to begin with, to the point that it makes me wonder if there is a point in going below 8 active cores in a package for server (send all the defects down to the consumer market). Binning should make a few higher grades of SKUs available. Not because of any fundamental change in the design, just that the raw number of smaller, higher yielding dies should produce more 'golden samples'. We're looking at a few hundred MHz here, but still welcome.

On the Intel side of things, they have the technology to do exactly what AMD is pulling off here and even more. However, the delays in their 10 nm production and apparent 14 nm shortages have wrecked their road map. Once Intel recovers, they can easily catch up but next year is looking brutal for Intel.

Not mentioned anywhere and a bit speculative on my part, but I think Epyc can now scale up to an 8 socket design. Previously four Zeppelin dies had to be linked to four other Zeppelin dies in another socket. Now a single off-die Infinity Fabric link can hit all the coherency points in the remote socket, leaving three others available for additional CPU sockets. The PCIe lanes share connectivity with Infinity Fabric, so even a dual CPU system would be more like a 112 + 112 lane configuration. A quad socket system would have 256 lanes and an eight socket beast would double that. The modular nature of this opens up the possibility of hosting a GPU in the socket. If Navi is modular as well, a compute die linked via Infinity Fabric would be a straight substitution for a CPU die. There is enough room on the package if AMD wanted to add HBM for the GPU dies too.

Where will you put all the sockets, memory slots and PCI-E ports? That would require a new form factor or stacked boards/enclosures.

Monkey34 · Nov 9, 2018

Sikkyu said:
TLDR; AMD's big dick processor and its gonna be slappin around intel for a while.

This is what immediately popped into my mind...

DrBorg · Nov 9, 2018

I would love to see Kyle's evaluation of this Beast.

They should do a mode where one chip gets clocked to ungodly speeds, and just rotate out which chip it is as they thermal soak; moving the heat around the package, and "sharing the wealth."

If the whole thing clocks 4.5GHz or higher, it will eat intel.

And not just servers; I can see some [H]ard individuals making this into a gaming rig. Eventually.

This, and a 4 way Vega 64 Crossfire setup probably Still won't need a 2000W cooler.

This is AMD smacking intel around with a pink GTA V dildo, lol.

Rauelius · Nov 9, 2018

Please don't over-hype this...I don't want to experience Bulldozer again...Then again, this may be more like if AMD had just die-shrunk Thuban/Zosma, decreased cache latency, increased cache size and added SMT instead of making Bulldozer. We could've had a Phenom III x6 2090t with a 3.7GHz base/4.1GHz boost that competed somewhat with Sandy Bridge....what could have been...

defaultluser · Nov 9, 2018

Rauelius said:
Please don't over-hype this...I don't want to experience Bulldozer again...Then again, this may be more like if AMD had just die-shrunk Thuban/Zosma, decreased cache latency, increased cache size and added SMT instead of making Bulldozer. We could've had a Phenom III x6 2090t with a 3.7GHz base/4.1GHz boost that competed somewhat with Sandy Bridge....what could have been...

This isn't Bulldozer.

They took Zen and doubled the AVX2 unit throughput to match Intel (old = 128 bits per clock, new = 256). Which means they will no longer be second fiddle at FPU-heavy scientific compute loads.

They've also tweaked I/O (single CCX for 8-core, and other throughput improvements), branch prediction and other pieces of the architecture, so we should see a 10% bump in IPC overall (almost 2x per-core with FPU compute).

They have managed to do this big 64-core chip by optimizing the architecture (single memory controller chip, instead of memory controllers on each processor die) to double the number of cores on the 7nm process.

It would have been a lot more expensive to do this with memory controllers on each processor die. We will see if they can maintain the same throughput in a single-chip on-package "northbridge".

But even if their northbridge has trouble scaling, you won't have any of those issues with consumer versions of the chip.

juanrga · Nov 10, 2018

defaultluser said:
This isn't Bulldozer.

They took Zen and doubled the AVX2 unit throughput to match Intel (old = 128 bits per clock, new = 256). Which means they will no longer be second fiddle at FPU-heavy scientific compute loads.

They matched Haswell peak throughput. Zen is 16 FLOP/cycle per core. Haswell is 32 FLOP/cycle per core, Xeon Phi and Skylake-X are 64 FLOP/cycle per core, but the GB/FLOP ratio drops by a factor of about 3x on Rome compared to Naples. Moreover Cascade Lake AP, which just replaces Xeon Phi in HPC, will have an extra advantage from having 12 memory channels, so it will have a bigger GB/FLOP ratio than Skylake-X and Rome.
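As a rough illustration of the per-core figures above, the sketch below multiplies them by core counts to get a per-socket, per-clock peak. The core counts are the ones mentioned elsewhere in this thread (32 for Naples, 64 for Rome, 28 for Skylake, 48 for Cascade Lake AP). Treating Zen 2 as 32 FLOP/cycle follows from "matched Haswell"; applying the Skylake-X figure to the server Skylake and Cascade Lake parts is an assumption made here. Clock speeds are deliberately left out, so these are not FLOPS numbers.

```python
# Peak FLOP per cycle per core, taken from the figures quoted in the post.
FLOP_PER_CYCLE = {"Zen": 16, "Zen 2": 32, "Haswell": 32, "Skylake-X": 64}

# (name, cores per socket, core type). Core counts come from this thread;
# ASSUMPTION: the server Skylake / Cascade Lake parts use the Skylake-X figure.
SOCKETS = [
    ("Naples", 32, "Zen"),
    ("Rome", 64, "Zen 2"),
    ("Skylake-SP", 28, "Skylake-X"),
    ("Cascade Lake-AP", 48, "Skylake-X"),
]

def peak_flop_per_cycle(cores, core_type):
    """Per-socket peak per clock; ignores clock speed and memory bandwidth."""
    return cores * FLOP_PER_CYCLE[core_type]

for name, cores, core_type in SOCKETS:
    total = peak_flop_per_cycle(cores, core_type)
    print(f"{name:16s} {total:5d} FLOP/cycle per socket")
```

Under these assumptions Rome comes out at 4x Naples per clock, which matches the "4X FP increase per socket" figure from the article summary.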

It is not strange that TACC just chose Cascade Lake and rejected Rome for the next supercomputer.

ole-m · Nov 10, 2018

juanrga said:
They matched Haswell peak throughput. Zen is 16 FLOP/cycle per core. Haswell is 32 FLOP/cycle per core, Xeon Phi and Skylake-X are 64 FLOP/cycle per core, but the GB/FLOP ratio drops by a factor of about 3x on Rome compared to Naples. Moreover Cascade Lake AP, which just replaces Xeon Phi in HPC, will have an extra advantage from having 12 memory channels, so it will have a bigger GB/FLOP ratio than Skylake-X and Rome.

It is not strange that TACC just chose Cascade Lake and rejected Rome for the next supercomputer.

Where do you see they've dropped it by a factor of 3x? All I see is improved FP, not reduced!

On Cascade Lake you'd have NUMA stuff to play around with:
they have two CPUs with 6 channels each... 2P Cascade Lake is the same as 4P Skylake.

Also, in the real world, what does Intel's floating point performance have to show for it? Not a lot, I'm afraid.
https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/21

Currently Xeon has some great strong points over Zen 1 and has its use cases where you shouldn't be using Zen, but the same is true the other way around.

Nobu · Nov 10, 2018

juanrga said:
They matched Haswell peak throughput. Zen is 16 FLOP/cycle per core. Haswell is 32 FLOP/cycle per core, Xeon Phi and Skylake-X are 64 FLOP/cycle per core, but the GB/FLOP ratio drops by a factor of about 3x on Rome compared to Naples. Moreover Cascade Lake AP, which just replaces Xeon Phi in HPC, will have an extra advantage from having 12 memory channels, so it will have a bigger GB/FLOP ratio than Skylake-X and Rome.

It is not strange that TACC just chose Cascade Lake and rejected Rome for the next supercomputer.

Where'd you get the figure for Rome (what math did you use, or what source)? Is it strange that Cray is using a Zen CPU (maybe not Rome...) for a supercomputer, then?

juanrga · Nov 10, 2018

ole-m said:
Where do you see they've dropped it by a factor of 3x? All I see is improved FP, not reduced!

On Cascade Lake you'd have NUMA stuff to play around with:
they have two CPUs with 6 channels each... 2P Cascade Lake is the same as 4P Skylake.

Also, in the real world, what does Intel's floating point performance have to show for it? Not a lot, I'm afraid.
https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/21

Currently Xeon has some great strong points over Zen 1 and has its use cases where you shouldn't be using Zen, but the same is true the other way around.

I did not say that FLOPS dropped. I said that the GB/FLOP ratio dropped.

NUMA isn't a problem. HPC codes usually work in heterogeneous systems with CPUs and accelerators. Not to mention those CCL dies are connected with a fast interconnect.

That Anandtech review is rather useless. First, in general the performance of Xeons was crippled by 40% (Broadwell) or 60% (Skylake) for that review.

https://www.realworldtech.com/forum/?threadid=169894&curpostid=169970
https://www.realworldtech.com/forum/?threadid=169894&curpostid=169972
https://www.realworldtech.com/forum/?threadid=169894&curpostid=170012

Second, Anandtech is using toy benches such as the C-ray bench, which fits into L1 and so gives an unrealistic performance advantage to the chip with the larger L1 (aka EPYC), and/or benches without proper support for Skylake-X's new FMA units. Next is a popular HPC application with AVX512 support.

Platinum 8158 is 12 cores at 3GHz. EPYC 7601 is the fastest Zen with 32 cores at 2.2GHz.

juanrga · Nov 10, 2018

Nobu said:
Where'd you get the figure for Rome (what math did you use, or what source)? Is it strange that Cray is using a Zen CPU (maybe not Rome...) for a supercomputer, then?

Rome has 4x higher throughput than Naples. So to keep the GB/FLOP ratio, memory BW would have to be increased by a factor of 4x as well. Since Rome has only 8 channels, the ratio will drop. How much?

Naples has 2666MHz DRAM. Assuming that Rome supports 2933MHz, the drop will be 3.6x. Even imagining that Rome supports 3600MHz (which is difficult to believe since it uses the same platform as Naples) the drop would be 2.96x.
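The arithmetic behind those two figures can be reproduced with a short sketch. It assumes, as the post does, that peak throughput goes up 4x, that both platforms have 8 memory channels, and that memory bandwidth scales linearly with the DRAM speed rating; the function name is just for illustration.

```python
def bandwidth_per_flop_drop(flop_gain, old_dram_speed, new_dram_speed,
                            old_channels=8, new_channels=8):
    """Factor by which bandwidth per FLOP shrinks between two platforms.

    ASSUMPTION: memory bandwidth scales linearly with DRAM speed rating
    and channel count, and nothing else changes.
    """
    bandwidth_gain = (new_dram_speed * new_channels) / (old_dram_speed * old_channels)
    return flop_gain / bandwidth_gain

# 4x peak throughput, Naples at 2666 vs Rome at 2933 or 3600.
print(f"{bandwidth_per_flop_drop(4, 2666, 2933):.2f}")  # 3.64
print(f"{bandwidth_per_flop_drop(4, 2666, 3600):.2f}")  # 2.96
```

The first result rounds to the 3.6x quoted above. Note that this only compares peak ratios; it says nothing about how much of the peak a given workload can actually use.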

Cray is using Naples only to feed the Nvidia GPUs, not to crunch numbers. You can feed the GPUs using IBM or ARM CPUs as well. All the praise is for the Nvidia GPUs.

https://www.datacenterdynamics.com/...-supercomputer-featuring-crays-shasta-system/

However, systems such as Stampede2 don't use GPUs, and all the heavy floating point computations are done by the CPUs with AVX512 support.

BrotherMichigan · Nov 10, 2018

juanrga said:
I did not say that FLOPS dropped. I said that the GB/FLOP ratio dropped.

NUMA isn't a problem. HPC codes usually work in heterogeneous systems with CPUs and accelerators. Not to mention those CCL dies are connected with a fast interconnect.

That Anandtech review is rather useless. First, in general the performance of Xeons was crippled by 40% (Broadwell) or 60% (Skylake) for that review.

https://www.realworldtech.com/forum/?threadid=169894&curpostid=169970
https://www.realworldtech.com/forum/?threadid=169894&curpostid=169972
https://www.realworldtech.com/forum/?threadid=169894&curpostid=170012

Second, Anandtech is using toy benches such as the C-ray bench, which fits into L1 and so gives an unrealistic performance advantage to the chip with the larger L1 (aka EPYC), and/or benches without proper support for Skylake-X's new FMA units. Next is a popular HPC application with AVX512 support.

View attachment 119003

Platinum 8158 is 12 cores at 3GHz. EPYC 7601 is the fastest Zen with 32 cores at 2.2GHz.

"C-Ray benefits AMD, so it's useless. Let's look at GROMACS instead because it uses AVX-512!"

It's going to be funny looking at these scores when Rome comes out given the increase in FP throughput. Even half the theoretical increase will put Rome significantly ahead of Intel in this benchmark.

defaultluser · Nov 10, 2018

juanrga said:
They matched Haswell peak throughput. Zen is 16 FLOP/cycle per core. Haswell is 32 FLOP/cycle per core, Xeon Phi and Skylake-X are 64 FLOP/cycle per core, but the GB/FLOP ratio drops by a factor of about 3x on Rome compared to Naples. Moreover Cascade Lake AP, which just replaces Xeon Phi in HPC, will have an extra advantage from having 12 memory channels, so it will have a bigger GB/FLOP ratio than Skylake-X and Rome.

It is not strange that TACC just chose Cascade Lake and rejected Rome for the next supercomputer.

Since the AVX 512 speedup is a lot more of a corner-case than the AVX 256 speedup, you understand this is still a huge step forward for AMD.

This is the same reason AMD skipped 256-bit for its first iteration: if the speedup is only in corner-cases, you can better spend the die space on other things.

If you can fit twice the cores, you can be faster at a lot of other operations (for example, applications that can't scale to AVX 512 effectively).

The GB/s per operation was already overkill for Ryzen 7 (hence why you can double the cores on Threadripper, and get mostly expected speedup in NUMA-aware software), so a larger cache plus slightly higher DDR4 speed should be able to keep it fed.

power666 · Nov 10, 2018

juanrga said:
Rome has 4x higher throughput than Naples. So to keep the GB/FLOP ratio, memory BW would have to be increased by a factor of 4x as well. Since Rome has only 8 channels, the ratio will drop. How much?

Naples has 2666MHz DRAM. Assuming that Rome supports 2933MHz, the drop will be 3.6x. Even imagining that Rome supports 3600MHz (which is difficult to believe since it uses the same platform as Naples) the drop would be 2.96x.

Cray is using Naples only to feed the Nvidia GPUs, not to crunch numbers. You can feed the GPUs using IBM or ARM CPUs as well. All the praise is for the Nvidia GPUs.

https://www.datacenterdynamics.com/...-supercomputer-featuring-crays-shasta-system/

However, systems such as Stampede2 don't use GPUs, and all the heavy floating point computations are done by the CPUs with AVX512 support.

If the purpose of Epyc in Cray's Shasta system is only to feed the GPUs, AMD has a nice option here: ship a chip with only the IO die and one 8 core die. The bandwidth per FLOP ratio for the CPU side would inherently go up, but then again, if most of the computation is not on the CPU side, that is kinda useless. The external devices could leverage that bandwidth.

The big thing here is that Epyc will provide a lot of PCIe 4.0 connectivity and now uniform latency to memory. With their previous NUMA-in-a-socket topology, information may have had to traverse several dies before reaching the external device. Now that is all done internally via the IO chip. Similarly, HPC software is increasingly leveraging code that permits devices to communicate directly with each other and with memory without CPU intervention. A single socket Epyc system can connect to eight nVidia GPUs at full 16x PCIe 4.0 links (though 8x links are more likely, to leave room for networking cards). AMD's change in topology is now far more attractive for heterogeneous compute. It should also be cheaper, not because Epyc is inherently cheaper than Intel's Xeons (that matters too!) but because it lets board designers skip using Avago or Microsemi PCIe switching fabric. Removing these switches also improves latency by a hair.

The strength of AMD's platform is the sheer amount of configurability possible to target specific workloads optimally. AMD is in a very good position going into 2019 and 2020.

power666 · Nov 10, 2018

Glock24 said:
Where will you put all the sockets, memory slots and PCI-E ports? That would require a new form factor or stacked boards/enclosures.

Big iron server hardware generally uses proprietary form factors for motherboards (though the 19" rack is here to stay). Quad socket on a single board with PCIe slots is possible, as we've seen similar systems from HPE and IBM using Xeons and POWER9 respectively.

The trend for 8 and some 4 socket systems has been to split the system into multiple chassis: a chassis for every 4 sockets, paired with an additional IO chassis. This helps serviceability, as these systems also tend to have hot swap PCIe slots, with a select few supporting hot swapping CPU and memory too. Doing all of that in a single chassis would be madness. HPE's Superdome Flex system would be a prime example of these design principles. IBM does it too with their high end POWER9 systems.

Again, this is hypothetical based on the topology changes from Naples to Rome. A single Infinity Fabric link between dies may simply not have enough bandwidth to make scaling up like this feasible.

juanrga · Nov 11, 2018

BrotherMichigan said:
"C-Ray benefits AMD, so it's useless. Let's look at GROMACS instead because it uses AVX-512!"

It's going to be funny looking at these scores when Rome comes out given the increase in FP throughput. Even half the theoretical increase will put Rome significantly ahead of Intel in this benchmark.

The discussion was about HPC exclusively, so the context is HPC, and a popular real-life HPC application such as GROMACS is much more relevant than a toy (thirteen seconds) C-ray rendering.

The "theoretical increase" is a peak throughput, not the throughput one will see in most benches, because Rome has ~3x worse GB/FLOP than Naples, so the sustained performance will be only a fraction of the peak.

"Half the theoretical increase" implies 100 ns/day, which is less than what 28C Skylake does.

juanrga · Nov 11, 2018

defaultluser said:
Since the AVX 512 speedup is a lot more of a corner-case than the AVX 256 speedup, you understand this is still a huge step forward for AMD.

This is the same reason AMD skipped 256-bit for its first iteration: if the speedup is only in corner-cases, you can better spend the die space on other things.

If you can fit twice the cores, you can be faster at a lot of other operations (for example, applications that can't scale to AVX 512 effectively).

The GB/s per operation was already overkill for Ryzen 7 (hence why you can double the cores on Threadripper, and get mostly expected speedup in NUMA-aware software), so a larger cache plus slightly higher DDR4 speed should be able to keep it fed.

AVX512 is a standard in HPC and in servers. And lots of throughput workloads get a nice speedup. It is not a corner case.

Of course, Rome is a huge step for AMD. My complaint was about the DOOM and GLOOM in the article. Charlie is exaggerating again. In the end Rome will win in some cases and Cascade Lake will win in others.

schmide · Nov 11, 2018

juanrga said:
The discussion was about HPC exclusively, so the context is HPC, and a popular real-life HPC application such as GROMACS is much more relevant than a toy (thirteen seconds) C-ray rendering.

The "theoretical increase" is a peak throughput, not the throughput one will see in most benches, because Rome has ~3x worse GB/FLOP than Naples, so the sustained performance will be only a fraction of the peak.

"Half the theoretical increase" implies 100 ns/day, which is less than what 28C Skylake does.

Funny when a toy can out scale a massive project. Reality is GROMACS scales like shit. (source)

I'm not going to dive into the code nor the algorithm but it is obviously an outlier especially when you look at the article you took the bench from.

https://www.servethehome.com/intel-xeon-gold-6152-benchmarks-and-review-top-core-count-xeon-gold/2/

I highly doubt you can attribute any of this to avx512 nor raw FLOPS/core. It's just a unique situation where a linear problem favors core speed over core count.

But great if you kick the ball in an odd direction you may hit a uniquely placed goalpost.

Edit: Any processor especially a 56 core intel beast will operate like crap.

defaultluser · Nov 11, 2018

juanrga said:
AVX512 is a standard in HPC and in servers. And lots of throughput workloads get a nice speedup. It is not a corner case.

Of course, Rome is a huge step for AMD. My complaint was about the DOOM and GLOOM in the article. Charlie is exaggerating again. In the end Rome will win in some cases and Cascade Lake will win in others.

Dynamic frequency scaling when the AVX 512 unit is fully loaded can cause unexpected performance in cases where you have MIXED amounts of AVX512 (think web servers).

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Unfortunately, if you have the cores loaded "just enough," the clock speed can drop significantly. And that will affect all the rest of your software's performance. Unless you port EVERYTHING to AVX512, you could see some performance reductions.
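A toy model makes the trade-off concrete. The sketch below assumes a fraction of the baseline runtime gets an AVX-512 speedup while the core runs at a reduced clock for the whole run. All numbers are made up for illustration, not measurements, and real clock behaviour is more complicated than a single fixed ratio.

```python
def relative_throughput(avx512_fraction, vector_speedup, clock_ratio):
    """Throughput relative to a no-AVX-512 baseline (1.0 means unchanged).

    avx512_fraction: share of baseline runtime that gets vectorized
    vector_speedup:  per-clock speedup of that share
    clock_ratio:     reduced clock / normal clock
    ASSUMPTION: the reduced clock applies to the entire run.
    """
    time_at_full_clock = (1 - avx512_fraction) + avx512_fraction / vector_speedup
    return clock_ratio / time_at_full_clock

def break_even_fraction(vector_speedup, clock_ratio):
    """Vectorized share at which the clock penalty is exactly paid back."""
    return (1 - clock_ratio) / (1 - 1 / vector_speedup)

# Hypothetical numbers: 2x vector speedup, clock drops to 85% of normal.
for fraction in (0.1, 0.3, 0.9):
    result = relative_throughput(fraction, 2.0, 0.85)
    print(f"{fraction:.0%} vectorized -> {result:.2f}x baseline")
print(f"break-even share: {break_even_fraction(2.0, 0.85):.0%}")
```

With these made-up inputs, a lightly vectorized mix ends up slower than the baseline and only a heavily vectorized one comes out ahead, which is the mixed-load concern described above.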

Single type of load servers (compute) are a lot less common than mixed-load servers. And mixed-load servers tend to have a huge mix of software running on them, with various vectorization enhancements.

And for those mixed loads, having double the processors will give a lot more consistent performance.

power666 · Nov 11, 2018

juanrga said:
AVX512 is a standard in HPC and in servers. And lots of throughput workloads get a nice speedup. It is not a corner case.

Of course, Rome is a huge step for AMD. My complaint was about the DOOM and GLOOM in the article. Charlie is exaggerating again. In the end Rome will win in some cases and Cascade Lake will win in others.

The big if is how correct AMD is in its assertions that they were able to increase IPC and clock speeds. AMD is clearly doubling core count. So Cascade Lake-AP has to rely on AVX-512 to pull ahead, as AMD would have a 33% lead in core count even if nothing else changed from Zen 1. AMD has disclosed that they have increased the width of their AVX units, which, combined with the doubling of core count, puts them at a higher peak throughput if Cascade Lake-AP and Rome were clocked the same. Further increases in IPC (which the AVX width increase already contributed to) and clocks would widen AMD's lead.

This is why Cascade Lake-AP exists as Intel has no other solution waiting in the wings to compete in HPC throughout 2019. Xeon Phi is seemingly dead as the next iteration was tied to the 10 nm process. Intel's GPU efforts won't be ready in 2019 and 2020 is optimistic. Staying at 28 core wouldn't have been competitive in the face of a 64 core AMD part.

There are reasons to stay with Intel but right now they are increasingly niche. For HPC, Omnipath is an advantage, but it is not clear if that will even be part of Cascade Lake-AP (Skylake-SP LCC and XCC dies have 64 PCIe lanes but 16 of them are reserved for on package Omnipath) as the number of PCIe lanes available on the socket is unknown. Optane DIMMs have merit in the datacenter world, though they don't inherently increase performance by themselves. FPGA integration would be a huge win, but additional parts in that area are seemingly tied to 10 nm as well. The only thing left is quad/octo socket support, but that is not favored in HPC due to increased cost (it is generally cheaper to get two dual socket boxes). The availability of quad/octo sockets will keep Intel in the ultra high end of the server market for other workloads though.

Other than the old adage of 'no one got fired for choosing Intel' (which may no longer apply in the era of Meltdown/Spectre), what is the case for going Intel in 2019 outside of the few niches mentioned above?

KazeoHin · Nov 11, 2018

Remember guys, juanrga is the guy who said Zen has Sandy Bridge IPC, and that Zen+ would not have any IPC gains over Zen1.

So he lives in his own little world.

juanrga · Nov 16, 2018

schmide said:
Funny when a toy can out scale a massive project. Reality is GROMACS scales like shit. (source)

View attachment 119253

I'm not going to dive into the code nor the algorithm but it is obviously an outlier especially when you look at the article you took the bench from.

https://www.servethehome.com/intel-xeon-gold-6152-benchmarks-and-review-top-core-count-xeon-gold/2/

I highly doubt you can attribute any of this to avx512 nor raw FLOPS/core. It's just a unique situation where a linear problem favors core speed over core count.

The source is requesting cores in a cluster, so showing the scaling of the cluster, not the scaling of the benchmark.

Serve The Home usually tests toy benches such as Dhrystone and basic benches like C-Ray, and some other stuff. They only tested a single AVX512 workload; they didn't test any of the dozens of server/HPC AVX512 workloads.

juanrga · Nov 16, 2018

defaultluser said:
Dynamic frequency scaling when the AVX 512 unit is fully loaded can cause unexpected performance in cases where you have MIXED amounts of AVX512 (think web servers).

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Unfortunately, if you have the cores loaded "just enough," the clock speed can drop significantly. And that will affect all the rest of your software's performance. Unless you port EVERYTHING to AVX512, you could see some performance reductions.

Single type of load servers (compute) are a lot less common than mixed-load servers. And mixed-load servers tend to have a huge mix of software running on them, with various vectorization enhancements.

And for those mixed loads, having double the processors will give a lot more consistent performance.

Cascade Lake solves frequency scaling. That is why TACC mentioned 25-35% higher clocks for AVX.

juanrga · Nov 16, 2018

KazeoHin said:
Remember guys, juanrga is the guy who said Zen has Sandy Bridge IPC, and that Zen+ would not have any IPC gains over Zen1.

So he lives in his own little world.

I said "~ Sandy" and "~ Haswell", and reviews proved that Zen IPC is Sandy-like for games and Haswell-like for stuff such as HandBrake. Do I need to give the benchmarks once again?

Zen+ has the same IPC as the Zen cores in EPYC, Threadripper, and Raven Ridge. Summit Ridge shipped with a flawed implementation of Zen cores (e.g. 17 cycles L2 instead of 12 cycles), and so the 2700X has ~1.5% higher IPC than the 1800X, but the Zen+ cores on Pinnacle Ridge are identical to the Zen cores on Raven Ridge and the IPC is the same. This was also demonstrated and benches given.

So I live in the real world.

juanrga · Nov 16, 2018

power666 said:
The big if is how correct AMD is in its assertions that they were able to increase IPC and clock speeds. AMD is clearly doubling core count. So Cascade Lake-AP has to rely on AVX-512 to pull ahead, as AMD would have a 33% lead in core count even if nothing else changed from Zen 1. AMD has disclosed that they have increased the width of their AVX units, which, combined with the doubling of core count, puts them at a higher peak throughput if Cascade Lake-AP and Rome were clocked the same. Further increases in IPC (which the AVX width increase already contributed to) and clocks would widen AMD's lead.

This is why Cascade Lake-AP exists as Intel has no other solution waiting in the wings to compete in HPC throughout 2019. Xeon Phi is seemingly dead as the next iteration was tied to the 10 nm process. Intel's GPU efforts won't be ready in 2019 and 2020 is optimistic. Staying at 28 core wouldn't have been competitive in the face of a 64 core AMD part.

There are reasons to stay with Intel but right now they are increasingly niche. For HPC, Omnipath is an advantage, but it is not clear if that will even be part of Cascade Lake-AP (Skylake-SP LCC and XCC dies have 64 PCIe lanes but 16 of them are reserved for on package Omnipath) as the number of PCIe lanes available on the socket is unknown. Optane DIMMs have merit in the datacenter world, though they don't inherently increase performance by themselves. FPGA integration would be a huge win, but additional parts in that area are seemingly tied to 10 nm as well. The only thing left is quad/octo socket support, but that is not favored in HPC due to increased cost (it is generally cheaper to get two dual socket boxes). The availability of quad/octo sockets will keep Intel in the ultra high end of the server market for other workloads though.

Other than the old adage of 'no one got fired for choosing Intel' (which may no longer apply in the era of Meltdown/Spectre), what is the case for going Intel in 2019 outside of the few niches mentioned above?

Rome seems to have a 2.35GHz base clock. So the clocks are increased compared to Naples. The IPC is also higher. So AMD is correct on both claims.

Rome has twice the number of cores and FPUs twice as wide, so the peak throughput increases by ~4x. But the memory subsystem is not increased by 4x, so the GB/FLOP ratio is lower than on Naples, which means Rome's sustained throughput will not be 4x that of Naples. The same happens for integer stuff. Rome has about 2x higher peak throughput, but not a 2x better memory subsystem (even Charlie claims that sustained integer-like performance will be in the 1.7x range).
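To put rough numbers on the GB/FLOP argument, here is a minimal sketch. It only works with relative factors: the 2x core count and 2x FP width come from the thread, while the function name and the DDR4-2666 to DDR4-2933 memory bump are assumptions for illustration, not confirmed specs.

```python
# Relative scaling sketch; every factor here is an assumption for illustration.

def bytes_per_flop_scale(core_factor: float, fp_width_factor: float,
                         bandwidth_factor: float) -> float:
    """Return how the bytes-per-FLOP ratio changes relative to the baseline part."""
    # Peak FLOP rate scales with core count times FP width (clocks held equal).
    peak_flop_factor = core_factor * fp_width_factor
    return bandwidth_factor / peak_flop_factor

# 2x cores and 2x FP width, memory bandwidth unchanged.
unchanged = bytes_per_flop_scale(2.0, 2.0, 1.0)

# Same, with a hypothetical DDR4-2666 -> DDR4-2933 bump (rumored, not confirmed).
faster_dram = bytes_per_flop_scale(2.0, 2.0, 2933 / 2666)

print(f"bytes/FLOP vs. baseline, same DRAM:   {unchanged:.3f}x")
print(f"bytes/FLOP vs. baseline, faster DRAM: {faster_dram:.3f}x")
```

Under those assumptions each FLOP gets roughly a quarter of the memory bandwidth it had before, which is why sustained throughput on bandwidth-bound code should scale well below the 4x peak figure. Larger caches would change how often that limit is hit, not the ratio itself.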

TACC already mentioned "performance" as one of the reasons why they chose Cascade Lake SP and rejected Rome.

Cascade Lake SP replaces Skylake SP. Cascade Lake AP replaces Xeon Phi. 28C Cascade Lake SP couldn't replace 72C Phi, so they had to invent 48C Cascade Lake AP.

There is no doubt that Rome will help AMD to win marketshare (as Naples does), but there is a lot of hype about Rome, as there was hype about Naples. I remember perfectly people claiming Naples was the death of Intel, and then Intel posting revenue records in the datacenter after the Zen launch.

schmide · Nov 16, 2018

juanrga said:
The source is requesting cores in a cluster, so it is showing the scaling of the cluster, not the scaling of the benchmark.

An algorithm could bottleneck on the producer leaving the consumers superfluous. The chart I posted shows the scaling of the benchmark regardless of frequency. The high speed low core count processor's performance only reinforces that relationship.

juanrga said:
Serve The Home usually tests toy benches such as Dhrystone and basic benches like C-Ray, and some other stuff. They only tested a single AVX512 workload, they didn't test any of the dozens of server/HPC AVX512 workloads.

Regardless, quoting a benchmark out of context only serves to confuse. Is it that hard to link to your source?

power666 · Nov 16, 2018

juanrga said:
Rome seems to have a 2.35GHz base clock. So the clocks are increased compared to Naples. The IPC is also higher. So AMD is correct on both claims.

Rome has twice the number of cores and FPUs twice as wide, so the peak throughput increases by ~4x. But the memory subsystem is not increased by 4x, so the GB/FLOP ratio is lower than on Naples, which means Rome's sustained throughput will not be 4x that of Naples. The same happens for integer stuff. Rome has about 2x higher peak throughput, but not a 2x better memory subsystem (even Charlie claims that sustained integer-like performance will be in the 1.7x range).

What wasn't discussed were cache sizes or the cache hierarchy. A larger L3 appears to be a given even with the smaller chiplets.

The nice thing is that with AMD's design, even a single CPU chiplet will have access to the entire 512 bit wide memory bus.
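For anyone wondering where the 512 bit figure comes from, a quick sketch: it assumes eight DDR4 channels of 64 data bits each (ECC bits excluded), as on Naples, and uses DDR4-2666 purely as an illustrative transfer rate. The helper name is made up for this example.

```python
# Peak-bandwidth sketch; channel count and transfer rate are assumptions.

def peak_bandwidth_gbs(channels: int, bits_per_channel: int,
                       mega_transfers_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bus_bytes = channels * bits_per_channel // 8  # bytes moved per transfer
    return bus_bytes * mega_transfers_per_s * 1e6 / 1e9

bus_bits = 8 * 64  # eight 64-bit channels
naples_like = peak_bandwidth_gbs(8, 64, 2666)

print(f"bus width: {bus_bits} bits")
print(f"peak bandwidth at 2666 MT/s: {naples_like:.1f} GB/s")
```

This is the theoretical ceiling only; what a single chiplet can actually pull through the IO die depends on the fabric link, which hasn't been disclosed.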

juanrga said:
TACC already mentioned "performance" as one of the reasons why they chose Cascade Lake SP and rejected Rome.

Cascade Lake SP replaces Skylake SP. Cascade Lake AP replaces Xeon Phi. 28C Cascade Lake SP couldn't replace 72C Phi, so they had to invent 48C Cascade Lake AP.

Odd considering that Cascade Lake-SP isn't moving performance forward much compared to Sky Lake-SP. It is essentially an errata-fixed Sky Lake-SP with a bit more process tuning to give it slightly higher clocks. Outside of Optane DIMM support, there is only the AVX-512 VNNI extension to assist in specific workloads.

Depending on workload, Naples is competitive today. Rome steadily improves on Naples in multiple dimensions (higher IPC, higher clocks and more cores), so it is going to be an exceedingly difficult 2019 for Intel.

juanrga said:
There is no doubt that Rome will help AMD to win marketshare (as Naples does), but there is a lot of hype about Rome, as there was hype about Naples. I remember perfectly people claiming Naples was the death of Intel, and then Intel posting revenue records in the datacenter after the Zen launch.

AMD will probably hit the same problem as Intel: production capacity. The good news here is that AMD's 7 nm chiplets are small and should yield exceptionally well. Any partially defective parts could be handed down to Threadripper or lower. The IO hub chip is on the mature 14 nm process, so getting good yields there shouldn't be an issue despite its larger size. The question is how much market share AMD can gain before they hit this limit. Outside of some niche scenarios (quad/octo socket, FPGA, on package fabric, Optane DIMMs), there is very little reason to go with Cascade Lake over Rome.

juanrga · Nov 17, 2018

power666 said:
What wasn't discussed were cache sizes or the cache hierarchy. A larger L3 appears to be a given even with the smaller chiplets.

The nice thing is that with AMD's design, even a single CPU chiplet will have access to the entire 512 bit wide memory bus.

The rumored doubling of L3 and the higher clocked DRAM (2933MHz?) will help to reduce the bottleneck, but that will not provide the 4x memory BW needed to keep the GB/FLOP ratio.

There is a possibility that the IO die in Rome is in fact four symmetrical IO dies joined together. In that case an AM4 chip would only have access to 1/4 of the memory bus, aka 128bit. AnandTech just asked Papermaster about this possibility in the recent interview.

power666 said:
Odd considering that Cascade Lake-SP isn't moving performance forward much compared to Sky Lake-SP. It is essentially an errata-fixed Sky Lake-SP with a bit more process tuning to give it slightly higher clocks. Outside of Optane DIMM support, there is only the AVX-512 VNNI extension to assist in specific workloads.

TACC mentioned 25-35% higher clocks.

Mode13 · Nov 17, 2018

Where do you fine fellows obtain the relevant information to hold such low level arguments to begin with?

My latest text is 80x86 architecture up to the core 2 series by Bray. I'm well over a decade behind now.

power666 · Nov 18, 2018

juanrga said:
The rumored doubling of L3 and the higher clocked DRAM (2933MHz?) will help to reduce the bottleneck, but that will not provide the 4x memory BW needed to keep the GB/FLOP ratio.

It depends on how much data re-usage there is in the algorithm and how it works with prefetching. Sure, the bandwidth didn't go up by a factor of 4x but we have yet to see how bandwidth starved we are to begin with. The current Naples platform seems to be fine in terms of aggregate bandwidth, with the cache hierarchy and multiple NUMA nodes being a bigger issue.

juanrga said:
There is a possibility that the IO die in Rome is in fact four symmetrical IO dies joined together. In that case an AM4 chip would only have access to 1/4 of the memory bus, aka 128bit. AnandTech just asked Papermaster about this possibility in the recent interview.

This doesn't make much sense. Granted there will be some copy/paste in the IO hub (i.e. the PHY part of the memory controller etc.) but there should be a high speed coherent cross bar linking everything together. It'd be an incredibly poor engineering choice to copy/paste the entire die's layout, especially when certain redundant functionality can be removed (for example each Zeppelin die has USB controllers but only one of them is exposed in Naples). Having one huge IO die is about simplifying the layout while combating latency by merit of centralizing coherency. The trade off is that the IO hub becomes more complex in how it links everything together, plus additional packaging costs.

juanrga said:
TACC mentioned 25-35% higher clocks.

I found that quote here which is in the context of only AVX-512. Considering that the turbo AVX-512 clock of a fully loaded Xeon Platinum 8160 is only 2.0 GHz, Cascade Lake likely has at best a 2.4 GHz or 2.5 GHz AVX-512 turbo clock at a similar core count. More realistically, Intel has improved the AVX-512 base clock by a bit and/or tuned how often it can hit the AVX-512 turbo max. Changing how turbo worked was one of the big things in Haswell-EP -> Broadwell-EP that improved average performance while not altering the base or turbo clocks much.

The amusing thing is that such benefits can be had today if they were to upgrade to the Xeon Platinum 8168 or 8180 over their current Xeon Platinum 8160s.
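For reference, the clock arithmetic can be sketched as below. It takes the quoted 25-35% figure at face value and applies it to the 2.0 GHz fully loaded AVX-512 turbo of the 8160; whether TACC's number actually refers to that particular clock is an assumption, and the helper name is made up for this example.

```python
# Clock-uplift sketch; the baseline the percentage applies to is an assumption.

def uplift_range(base_ghz: float, low_pct: float, high_pct: float) -> tuple[float, float]:
    """Apply a percentage uplift range to a baseline clock, returning (low, high) in GHz."""
    return (base_ghz * (1 + low_pct / 100), base_ghz * (1 + high_pct / 100))

low, high = uplift_range(2.0, 25, 35)
print(f"implied AVX-512 clock: {low:.2f} GHz to {high:.2f} GHz")
```

If the percentage instead applies to the AVX-512 base clock or to an average sustained clock, the implied absolute numbers shift accordingly.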
