ARM server status update / reality check

I do like seeing Intel begin to get their ass kicked. After EPYC turned out to be only a marginal threat, this is the first sign of real competition from anyone :D
 
ThunderX2 reviewed and benchmarked. It is still pre-production silicon and firmware, but

we have an Arm chip that can go toe-to-toe with Intel and AMD and come out ahead in some cases. Best of all, the list price of the 32-core top-bin CN9980 part is $1795, about half that of the competing Intel and AMD chips.

https://www.servethehome.com/cavium-thunderx2-review-benchmarks-real-arm-server-option/

Cavium is still working on compiler optimizations for this microarchitecture. Even so, using GCC and standard flags, ThunderX2 runs circles around the best Xeons and EPYC in both integer performance and memory bandwidth.

[Image: Cavium-ThunderX2-SPEC-Int-Rate-Peak-gcc7.jpg]


[Image: Cavium-ThunderX2-Stream-Triad-gcc7.jpg]
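
For anyone unfamiliar with the STREAM Triad chart above: it measures sustained memory bandwidth with a simple vector kernel, which is why ThunderX2's extra memory channels show up so clearly. Below is a minimal, simplified sketch of what the Triad kernel does — my own illustration, not the official STREAM benchmark code; the array size, timing, and build flags are arbitrary assumptions:

```c
/* Simplified sketch of the STREAM Triad kernel (not the official benchmark code).
 * It streams three large arrays through memory: a[i] = b[i] + scalar * c[i].
 * Build with something like: gcc -O3 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 26)   /* ~64M doubles per array (~512MB each), far bigger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    #pragma omp parallel for            /* one stream per core, like the real benchmark */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);   /* two arrays read + one written per element */
    printf("Triad: %.1f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```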


ThunderX2 is also being evaluated by the HPC community and will be part of several supercomputers. The high memory bandwidth is key for memory-bound HPC code.
 
What a difference two years makes! Intel has MOSTLY stood still, and Cavium now has a competitive product. The wait was definitely worth it :D

I'm now a little more interested in how well they fix things in ThunderX 2. Too bad we'll have to wait two years for benchmarks.

The fact that they also managed this with 50% fewer cores means they've solved their single-threaded performance issues (or the threading is much enhanced).

EDIT: yeah, 4-wide issue plus 4 threads per core will do it, but the massive number of threads means threading needs to be optimized for each type of workload, so they don't step on each other.

Has a lot of potential for improvements in threading efficiency with compiler optimization, and even without that it's quite impressive.
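
To make the "tune the threading per workload" point concrete, here's a minimal OpenMP sketch (my own illustration, not from the review; the thread counts are illustrative assumptions, using the CN9980's 32 cores and 4-way SMT): a bandwidth-bound loop often does best with one thread per physical core, while a latency-bound loop can hide stalls by using all four SMT threads.

```c
/* Sketch of per-workload thread tuning on an SMT4 part (illustrative assumptions only).
 * Build with something like: gcc -O2 -fopenmp smt_tuning.c */
#include <omp.h>
#include <stdio.h>

#define N (1L << 24)
static double a[N], b[N];

int main(void)
{
    /* Bandwidth-bound kernel: extra SMT threads mostly fight over memory bandwidth,
       so one thread per physical core (32 on a CN9980) is usually plenty. */
    omp_set_num_threads(32);
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Latency-bound kernel: a dependent chain leaves each thread stalled between results,
       so all four threads per core (128 total) can fill the gaps. */
    omp_set_num_threads(128);
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        double x = a[i];
        for (int k = 0; k < 64; k++)   /* serial dependency within each iteration */
            x = x * 0.999 + 0.001;
        a[i] = x;
    }

    printf("done: a[0] = %f\n", a[0]);
    return 0;
}
```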
 
Well, two years ago some of us at RWT were predicting that Intel and AMD would have lots of trouble with 16/14nm ARM servers. Back then only the 28nm AMD Opteron Seattle and the 40nm XGene-1 were available, and lots of geniuses at RWT told us it wasn't happening: "ARM cannot scale up", "maybe for microservers, but no one will beat Xeon",...

They turned out to be wrong, very wrong.
 

Well, that was because the XGene folks always treated their chip as some kind of science project with no future. And we all know how cobbled-together the AMD Seattle platform was (great software effort, half-assed hardware).

When you don't put in the effort, people don't take you seriously. They were right to poke fun at those projects, but I agree there was a lot of undeserved poking at the ARM server effort. People Fear Change, even though change is good :(

ThunderX was the first platform to take ARM on servers seriously. I mentioned they were the only one to support CUDA back in 2016. That's kinda important if you want to do any sort of large-scale simulations or content creation. Those APIs don't write themselves :D

Dual-socket means they're competitive on the most important server configurations out there (quad is not as popular).

Once Qualcomm learns how to make a dual-socket system, the ARM invasion will officially begin. Their cores are already fairly impressive, so they just need this small addition.
 
XGene was made on an ancient 40nm node, and its goal was more development and testing than real production workloads. Back then we said that the ARM servers that would give Intel trouble would be the ones made on 14/16nm nodes: i.e. K12, Vulcan, and XGene3. The 10nm Centriq wasn't even a rumor yet.

K12 was canceled. Vulcan was sold and is now Cavium ThunderX2. And XGene3 was sold and is now the Ampere eMAG.

I expected Vulcan (now ThunderX2) to be competitive because we knew it was a 4-wide core with SMT4, a 180-entry ROB, 3ALU+3MEM, and a ~3GHz target on a 16FF node, but its performance has really impressed me. I didn't expect it to be this good. It makes me think it could probably have beaten AMD's K12, which I expected to be faster than EPYC.
 
K12 was cancelled for good reason. At 8 native cores (i.e. Seattle with K12 instead of A57) it would not have been enough to bury AMD's server woes anytime soon, and would have delayed Zen another year.

I don't feel that AMD would have been fast enough to stand out from their other ARM competitors. When you're ARM, you really have to overwhelm Intel AND YOUR OTHER ARM COMPETITION to get the design win. Qualcomm offers much higher perf/watt, and Thunder X2 offers similar performance for half the price. Both do this without NUMA issues.

At least with Ryzen they can play pretend in the server market again, because some people are willing to deal with the NUMA mess (or treat each node as a separate server) if they can get their x86 performance for cheap.

What AMD really needs is a native 32-core chip that bypasses the NUMA mess. That will be big and expensive to develop no matter what the architecture.
 
K12 was a 32-core design that targeted the same high-performance servers as Zen. Seattle, aimed at microservers, was initially designed as a 16-core system but was finally released only as an 8-core part, replacing the Jaguar Opteron line.

The problem with AMD canceling K12 because of the competition is that they continue fighting both Intel and the ARM competition anyway! If EPYC had launched a couple of years ago, when the ARM ecosystem was in its infancy, it could have gotten more traction; but now, with an ARM ecosystem that is mature enough, companies can switch away from x86 entirely. Indeed, several companies have migrated from Xeon to ARM servers, ignoring EPYC. During the ThunderX2 server presentation, Microsoft reiterated that it wants more than 50% of its servers powered by ARM.


Yes, AMD needs a 32-core die, but it is not happening. I expect a 12-core or a 16-core die for Zen2 on 7nm and again MCM2 and MCM4 configurations.
 
From what I can gather, both Seattle and K12 were supposed to use the same custom core, much like Skylake and Skylake-X share theirs. So it would be believable that Seattle was originally meant to be 16-core, and K12 32-core.

Unfortunately, AMD didn't have the engineering talent to pull it off and release these products before similar chips came out from better-situated companies. Cavium is ONLY developing ThunderX, so they can put all their engineering effort behind it. And Qualcomm is forever pushed to justify its existence by improving processor performance against ARM's stock cores and Apple, and that experience flows into their server parts.

At least if they compete in the x86 world, they have a chance against a sleeping Intel. Apple/Qualcomm is all you need to look at to find the drive behind unstoppable architectural development, something AMD has never consistently shown before (I can't make this judgement about Cavium yet, it's too early in their lifetime, but they are on the way to becoming a third unstoppable ARM maker).

For CPUs, both AMD and Intel get complacent when they have the lead. This works fine if all they have is each other, but that complacency means they lack the deep-seated drive of other more successful companies, which will slowly kill both of them once ARM is introduced.

Nintendo shows how easy it is to switch (har) the consoles to ARM, if the price/performance/efficiency is right. They've been doing it since the Advance, and now ARM is all grown up in the Switch. AMD only won the PS4/Bone because it had no high-end competition in the custom market. But the smart device revolution has changed that.

Imagine how much easier it will be to sell to console makers once Qualcomm attaches their server core to a powerful custom Adreno. It's only a matter of time before these markets are opened up to a half-dozen custom-core competitors. Unless AMD shakes off their inconsistency, they will fall behind better-run companies.
 
Seattle uses A57 cores and was going to be socket-compatible with a Puma+-based microserver part.



K12 was going to be socket compatible with Zen



But all those ARM-x86 plans were canceled and Keller left.

Both Microsoft and Sony wanted ARM in the consoles. AMD won the PS4/Bone because ARM was only 32-bit back then. There was an evaluation of x86-64 vs ARM32 prototypes, and in the end Microsoft and Sony decided to go the x86 route, which left Nvidia out of the competition.
 
More benches and info for ThunderX2

https://www.nextplatform.com/2018/05/16/getting-logical-about-cavium-thunderx2-versus-intel-skylake/

Interesting article, but with some mistakes, such as when it describes the Xeon used in the SPEC comparison. The model used was the 6148 (not the 6140), and its base clock is 2.4GHz (not 2.5GHz). Also, I am not sure what the author means by "only 27.5 MB of cache on the die activated": the Xeon model simply has the amount of L3 that corresponds to 20 cores, i.e. 20 × 1.375MB = 27.5MB, the same per-core allocation as the rest of the Skylake line. That is more than what ThunderX2 has: 1MB per core.

"So Intel can tune up STREAM Triad a bit better than Cavium can on the Xeons"

How does he know?

"Cavium is quoting internal benchmarks it has run but not yet submitted to SPEC against Intel results that have been submitted, which is not exactly kosher but we have to get the data we can get"

ThunderX2 has already been benchmarked by third parties and SPEC scores are in the expected range.

"As for floating point math, the custom Armv8 cores in the Vulcan chips have a pair of 128-bit NEON math units, and the Xeon SP Gold chips have a 512-bit AVX-512 unit"

No. The Gold 6148 used in the SPEC comparison has two AVX-512 units.

"On the SPEC floating point test, the ThunderX2 can beat the Intel chips using GCC compilers, but Intel pulls ahead on its own iron using its own compilers by about 26.5 percent over the ThunderX2 using GCC compilers. The important thing is that Cavium is working with Arm Holdings, which now owns software tools maker Allinea, to create optimized compilers that goose the performance of integer and floating point jobs by around 15 percent, which will put ThunderX2 ahead on integer performance (for these parts anyway) and close the gap considerably on floating point (with about a 10 percent gap still to the advantage of Intel)."

In my opinion better results will come from using Cray's optimized compiler (CCE), which already provides a boost of more than 20% over GCC across several dozen HPC workloads.
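
To give a feel for how much the compiler and target flags matter in these comparisons, here is a toy kernel with typical target-specific build lines. This is my own illustration, not the article's actual build setup, and the flags shown are just common examples for each target:

```c
/* Toy reduction kernel to illustrate how target-specific compiler flags enter these results.
 * The compile lines below are typical examples, not the reviewers' actual setups:
 *
 *   gcc -Ofast -mcpu=thunderx2t99    dot.c -o dot     (GCC tuned for ThunderX2)
 *   gcc -Ofast -march=skylake-avx512 dot.c -o dot     (GCC tuned for Skylake-SP)
 *   icc -Ofast -xCORE-AVX512         dot.c -o dot     (Intel's own compiler on its own iron)
 */
#include <stdio.h>

#define N 10000000
static double x[N], y[N];

int main(void)
{
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double dot = 0.0;
    for (long i = 0; i < N; i++)   /* a simple loop the auto-vectorizer can chew on */
        dot += x[i] * y[i];

    printf("dot = %f\n", dot);     /* expect 2.0 * N */
    return 0;
}
```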

On the HPC results mentioned at the end, the Xeon systems used are E5-2695 v4 and Gold 6152. I guess the Skylake results are using ICC, whereas the ThunderX2 results use a mixture of GCC and Cray compilers.

For many server workloads ThunderX2 is able to match or beat Skylake, but at nearly half the price. For many HPC workloads ThunderX2 provides about 85 percent of the performance, but with "42 percent better performance per dollar".
 
More benches, this time at Anandtech. More praise for the platform, but they're withholding the power consumption figures until they have a shipping system to test on (power management is broken on the pre-production hardware).

https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality

I like the threaded SPEC performance analysis. Those extra threads can really help, depending on the load.

Just wish the fuckers had put the results on the same page, instead of splitting them up for ad views.
 

I took the average of the single-thread SPEC results and normalized for the different clocks: 3.8GHz vs 2.5GHz. If I didn't make any mistakes, the TX2 core delivers 99% of the IPC of the Skylake core.
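
For clarity, the normalization is just score-per-GHz on each side. A tiny sketch of the arithmetic — the scores below are placeholders, NOT the actual Anandtech numbers; only the 3.8GHz vs 2.5GHz clocks come from this post:

```c
/* Back-of-the-envelope per-clock comparison (placeholder scores, illustrative only). */
#include <stdio.h>

int main(void)
{
    /* hypothetical single-thread SPEC averages */
    double score_skl = 40.0, ghz_skl = 3.8;   /* Skylake Xeon at 3.8GHz turbo */
    double score_tx2 = 26.0, ghz_tx2 = 2.5;   /* ThunderX2 at 2.5GHz          */

    double per_ghz_skl = score_skl / ghz_skl; /* score per GHz ~ crude IPC proxy */
    double per_ghz_tx2 = score_tx2 / ghz_tx2;

    printf("TX2 per-clock performance = %.0f%% of Skylake\n",
           100.0 * per_ghz_tx2 / per_ghz_skl);
    return 0;
}
```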

Not bad not bad. (y)
 
Well, that didn't take long. Qualcomm may end up ditching servers entirely. And even if they don't, the creator has already left the team, so progress will hit a wall.

https://www.theregister.co.uk/2018/05/24/qualcomm_snapdraon_710/

It's amazing how short-sighted big-company management can be. It's barely been on the market a year, and they forget that you have to stay committed to build a market. Not to mention they still need to add a 2-socket solution to the pot if they want to be taken seriously.

But Qualcomm has gotten used to being #2 behind Apple and in a race for third place with ARM themselves, and they are in no hurry to change this. Guess it's back to business-as-usual.
 

Well yes, I thought they might be jumping to conclusions there at El Reg. But it doesn't counter the fact that the chief engineer left. That will probably slow down progress on Centriq v2.0, while they find a new dream team.

Mind you, the Thunder X2 team is in the same boat, since they bought this new design from Broadcom. Nobody seems to be willing to stick it out in the ARM server chip design market.
 

The Centriq v2 design is almost finished, so the chief engineer leaving is not a problem. The Centriq v3 design is canceled and will be replaced by a new core designed by the Qualcomm CDMA Technologies unit.
 
4x4 A72 cores with no custom L3? 2014 called, and it wants its phone back (there is only a ~10% IPC difference between A57 and A72).

You can tell it's no more than a science project when they didn't even target modern cores. The A75 has been in shipping products for a year now, and its absence shows in real benchmarks:


https://www.servethehome.com/putting-aws-graviton-its-arm-cpu-performance-in-context/

The funny thing is, Amazon is a big enough cloud provider that they could actually benefit from a real effort to build their own cutting-edge ASIC. But at this rate of disinterest from Amazon, that's five to ten years away.
 

Customers such as SmugMug have already moved away from x86. They get similar performance but with ~40% cost savings.

https://www.datacenterdynamics.com/news/aws-starts-offering-graviton-custom-arm-cpu-built-amazon/
 
Arm's new Neoverse platforms
https://www.anandtech.com/show/13959/arm-announces-neoverse-n1-platform

Seems interesting and a good path forward.

32bit ARM server -- opening move
40nm 64bit ARM server -- move
28nm 64bit ARM server -- move
16nm 64bit ARM server -- check
10nm 64bit ARM server -- check
7nm 64bit ARM server -- checkmate


This 64C Neoverse offers scalar performance similar to 64C Rome, but at about half the power: ~100W vs ~200W. And Intel's Cascade Lake and Cooper Lake will require ~400W to reach that level of performance.

And a 128C Neoverse is in the pipeline

https://www.hpcwire.com/2019/02/20/arm-unveils-neoverse-n1-platform-with-up-to-128-cores/
 


What about GROMACS AVX-512? A processor can't be a processor without that. (PS: this is an SSE-class processor, aka 128-bit.)

But a simulated SPECrate2006 (cache fit) number means checkmate.
 
I took SPECint numbers from Anandtech's Andrei for Zen, Skylake, and ThunderX2, and the ARM scores for the new N1 core to plot the IPC of each core.

[Chart: SPEC performance per GHz for Zen, Skylake, ThunderX2, and the N1]


A mistake: the SPEC score for TX2 used GCC7 instead of GCC8. The TX2 score would be a bit higher with GCC8, and its performance/GHz would be above 10 points. The gap with Zen would increase from ~5% to ~15%, and it would almost match Skylake.
 