ARM server status update / reality check

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700
Things are getting more and more interesting.
I can't recommend Jeff Geerling's YouTube channel enough.




NVMe booting is currently in beta, but the Raspberry Pi Foundation will soon be adding it to the main feature set.
Jeff's video shows that it is pretty quick to get it up and running, though, and ~400MB/s is pretty good for such a low-power SBC.
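As a rough sanity check (my numbers, not from the video): the Pi 4's BCM2711 exposes a single PCIe 2.0 lane, which tops out at 500 MB/s after encoding overhead, so ~400 MB/s is already most of the link:

```python
# Back-of-envelope PCIe 2.0 x1 bandwidth, in decimal MB/s.
# Line rate and encoding come from the PCIe 2.0 spec; the ~400 MB/s
# figure is the one quoted above.
line_rate_gt_s = 5.0   # PCIe 2.0: 5 GT/s per lane
encoding = 8 / 10      # 8b/10b encoding overhead
theoretical_mb_s = line_rate_gt_s * 1e9 * encoding / 8 / 1e6

print(f"theoretical: {theoretical_mb_s:.0f} MB/s")        # 500 MB/s
print(f"~400 MB/s uses {400 / theoretical_mb_s:.0%} of the link")  # 80%
```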
 
Last edited:

defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
14,399
Wow, two fucking years after release, and they finally fixed the feature-list to make it competitive with every other x86 SBC that includes NVMe out there?

If you want to know why Broadcom has zero customers for their SoCs outside the Pi, their complete lack of support here is a pretty easy answer.
 
  • Like
Reactions: travm

BlueLineSwinger

[H]ard|Gawd
Joined
Dec 1, 2011
Messages
1,260
Wow, two fucking years after release, and they finally fixed the feature-list to make it competitive with every other x86 SBC that includes NVMe out there?

If you want to know why Broadcom has zero customers for their SoCs outside the Pi, their complete lack of support here is a pretty easy answer.

Because it's silly for Broadcom to waste time and effort on getting NVMe up on the single PCIe 2.0 lane for I/O speeds that are slower than SATA3?

I'm not sure why RPi are bothering. The PCIe connectivity is only available from RPi 4 Compute Modules, on the 4B it's dedicated to the USB ports. Any application that really needs such I/O would probably be better off with a much higher-end ARM SoC (like Snapdragon or Apple M1 level (I can never keep all of the various ARM architectures straight)) or x86-64.

AFAICT Broadcom's SOC business is doing fine. They're embedded in tons of devices from manufacturers willing to pay for proper support and docs that the cheap Chinese SoCs don't offer (e.g., you're not going to find AllWinner in your car).
 

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700
Because it's silly for Broadcom to waste time and effort on getting NVMe up on the single PCIe 2.0 lane for I/O speeds that are slower than SATA3?
NVMe has much greater queue depths and more command queues than SATA.
This makes it a low-cost boon for ARM developers looking to test high-queuing tasks such as databases, where raw sequential transfer rates aren't as important, especially compared to the SATA, eMMC, or microSD flash storage currently available natively.
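For scale, the spec-level queue limits (these are maxima from the AHCI and NVMe specs; real devices implement far fewer):

```python
# Spec maxima, not what any given drive actually implements.
sata_queues, sata_depth = 1, 32            # AHCI NCQ: one queue, 32 commands
nvme_queues, nvme_depth = 65_535, 65_536   # NVMe: up to 64K I/O queues x 64K entries

print(sata_queues * sata_depth)   # 32 outstanding commands
print(nvme_queues * nvme_depth)   # 4294901760 outstanding commands
```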

I'm not sure why RPi are bothering. The PCIe connectivity is only available from RPi 4 Compute Modules, on the 4B it's dedicated to the USB ports. Any application that really needs such I/O would probably be better off with a much higher-end ARM SoC (like Snapdragon or Apple M1 level (I can never keep all of the various ARM architectures straight)) or x86-64.
It depends on the cost and the task; other ARM-based solutions that natively offer more PCIe lanes come with much higher costs attached.
x86-64 solutions don't really apply unless the ISA doesn't matter to the end-user.
 
Last edited:

BlueLineSwinger

[H]ard|Gawd
Joined
Dec 1, 2011
Messages
1,260
NVMe has much greater queue depths and more command queues than SATA.
This makes it a low-cost boon for ARM developers looking to test high-queuing tasks such as databases, where raw sequential transfer rates aren't as important, especially compared to the SATA, eMMC, or microSD flash storage currently available natively.

True, NVMe is technically more capable. But I doubt the RPi SoC or similar is going to be a top choice for anyone who needs to process a ton of IOPS. I'd be concerned that the CPU wouldn't be able to keep up and load average would go through the roof. Initial dev, yeah, sure.


It depends on the cost and the task; other ARM-based solutions that natively offer more PCIe lanes come with much higher costs attached.
x86-64 solutions don't really apply unless the ISA doesn't matter to the end-user.

Well yeah, that's a given, more capable == more $$$.
 

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700

NVIDIA Announces CPU for Giant AI and High Performance Computing Workloads

Credit goes to Lakados


https://nvidianews.nvidia.com/news/...t-ai-and-high-performance-computing-workloads
Underlying Grace’s performance is fourth-generation NVIDIA NVLink® interconnect technology, which provides a record 900 GB/s connection between Grace and NVIDIA GPUs to enable 30x higher aggregate bandwidth compared to today’s leading servers.

Grace will also utilize an innovative LPDDR5x memory subsystem that will deliver twice the bandwidth and 10x better energy efficiency compared with DDR4 memory. In addition, the new architecture provides unified cache coherence with a single memory address space, combining system and HBM GPU memory to simplify programmability.

https://www.anandtech.com/show/1661...formance-arm-server-cpu-for-use-in-ai-systems
The company isn’t directly gunning for the Intel Xeon or AMD EPYC server market, but instead they are building their own chip to complement their GPU offerings, creating a specialized chip that can directly connect to their GPUs and help handle enormous, trillion parameter AI models.

Image%20-%20Grace_678x452.jpg

Old design infrastructure with x86-64 and PCIE:
PCIe_575px.jpg

New design infrastructure with AArch64 and NVLINK:
NVLink_575px.jpg
 
Last edited:

defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
14,399

NVIDIA Announces CPU for Giant AI and High Performance Computing Workloads

Credit goes to Lakados


https://nvidianews.nvidia.com/news/...t-ai-and-high-performance-computing-workloads


https://www.anandtech.com/show/1661...formance-arm-server-cpu-for-use-in-ai-systems


View attachment 347381

Old design infrastructure with x86-64 and PCIE:
View attachment 347382

New design infrastructure with AArch64 and NVLINK:
View attachment 347383
the reason I didn't post it is that it's still a mystery. I assume this thing is built to accept as many GPU as you have slots for (just like the existing AMD servers)?

They can't even tell you any fucking details about the CPU.

So yea, this is a Pointless Press Release (tm) that we will have to wait a goddamned year on, only to find out NVIDIA added a tiny tweak to the standard N2 design (plus the obvious addition of on-chip custom AI communicating with that CPU).

This feels like another empty Orin Press Release (along with an 18-month delay before specs and boards were shown)

https://www.anandtech.com/show/12598/nvidia-arm-soc-roadmap-updated-after-xavier-comes-orin

Fuck this preannounce shit, man. You're not putting these in cars (so you don't have to give car designers two years of empty PR notice to design them in).
 
Last edited:

juanrga

2[H]4U
Joined
Feb 22, 2017
Messages
2,804
the reason I didn't post it is that it's still a mystery. I assume this thing is built to accept as many GPU as you have slots for (just like the existing AMD servers)?
The full thing is rendered in the post by Red Falcon.

There are no slots, because slots are slow.
They can't even tell you any fucking details about the CPU.
It is based on a future microarchitecture yet to be announced by ARM, but Nvidia has already claimed over 300 points on SPECrate2017_int_base.
So yea, this is a Pointless Press Release (tm) that we will have to wait a goddamned year on, only to find out NVIDIA added a tiny tweak to the standard N2 design (plus the obvious addition of on-chip custom AI communicating with that CPU).

This feels like another empty Orin Press Release (along with an 18-month delay before specs and boards were shown)

https://www.anandtech.com/show/12598/nvidia-arm-soc-roadmap-updated-after-xavier-comes-orin

Fuck this preannounce shit, man. You're not putting these in cars (so you don't have to give car designers two years of empty PR notice to design them in).
The Swiss National Supercomputing Center and the Los Alamos National Laboratory will build supercomputers based on this.
 

defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
14,399
The full thing is rendered in the post by Red Falcon.

There are no slots, because slots are slow.

It is based on a future microarchitecture yet to be announced by ARM, but Nvidia has already claimed over 300 points on SPECrate2017_int_base.

The Swiss National Supercomputing Center and the Los Alamos National Laboratory will build supercomputers based on this.
TL;DR: an empty press release that shows tons of potential but is, in reality, pure hype. I'm more pissed off because Tegra has made this "normal" for NVIDIA.
 

defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
14,399
Ampere moving to custom cores - Anandtech Link

Interesting to see them jump to custom cores after their success with the Neoverse cores. I find it absolutely exciting to see a bunch of custom ARM stuff popping up in both the server and consumer space.

Now just need to see some more RISC-V movement.

They own X-Gene, so it will be interesting to see what revision 4 brings!

Will it be faster than N2, or is ARM upping the license costs after the surprise success of N1?
 
Last edited:

schmide

Limp Gawd
Joined
Jul 22, 2008
Messages
366
The Ampere Altra Max Review: Pushing it to 128 Cores per Socket

Very unique. Moar Cores / Moar Problems.

TL;DR: less L3, cache coherency rears its head, throughput for some things is amazing, as is compiling (not linking), transactional Java sux.
 

defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
14,399
The Ampere Altra Max Review: Pushing it to 128 Cores per Socket

Very unique. Moar Cores / Moar Problems.

TL;DR: less L3, cache coherency rears its head, throughput for some things is amazing, as is compiling (not linking), transactional Java sux.


Well, we knew that pathetic refresh was coming while they do the major work of making X-Gene 4 faster.
 

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700

Nvidia Unveils 144-core Grace CPU Superchip, Claims Arm Chip 1.5X Faster Than AMD's EPYC Rome

RVjW8BTVzKJWJjGycR9C8h-970-80.jpg.webp


The Grace CPU Superchip memory subsystem provides up to 1TB/s of bandwidth, which Nvidia says is a first for CPUs and more than twice that of other data center processors that will support DDR5 memory. The LPDDR5X comes spread out in 16 packages that provide 1TB of capacity. In addition, Nvidia notes that Grace uses the first ECC implementation of LPDDR5X.

This brings us to benchmarks. Nvidia claims the Grace CPU Superchip is 1.5X faster in the SPECrate_2017_int_base benchmark than the two previous-gen 64-core EPYC Rome 7742 processors it uses in its DGX A100 systems. Nvidia based this claim on a pre-silicon simulation that predicts the Grace CPU at a score of 740+ (370 per chip). AMD's current-gen EPYC Milan chips, the current performance leader in the data center, have posted SPEC results ranging from 382 to 424 apiece, meaning the highest-end x86 chips will still hold the lead. However, Nvidia's solution will have many other advantages, such as power efficiency and a more GPU-friendly design.
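Running the article's numbers (only figures quoted above; the dual-Rome baseline is implied by the 1.5X claim):

```python
grace_superchip = 740                  # Nvidia's pre-silicon SPECrate2017_int_base estimate
per_grace_chip = grace_superchip / 2   # 370 per chip, as stated
implied_dual_rome = grace_superchip / 1.5
milan_low, milan_high = 382, 424       # per-chip EPYC Milan results cited above

print(per_grace_chip)                  # 370.0
print(round(implied_dual_rome))        # 493: the implied dual EPYC 7742 baseline
print(per_grace_chip < milan_low)      # True: Milan still leads per chip
```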
 

whateverer

[H]ard|Gawd
Joined
Nov 2, 2016
Messages
1,796
I'm glad to hear they finally gave up on the VLIW distractions; there's no way in hell they could make it work as a general-purpose server chip!

I guess once they released SVE2, they figured they could get the same compute throughput without hacking up the rest of the core with clunky designs?
 

schmide

Limp Gawd
Joined
Jul 22, 2008
Messages
366
It looks really neat.

The new mesh

CMN-700
12 x 12 = 144

CMN-600 (old)
6 x 6 = 36

Which I guess implies each 72-core chip is a 6 x 6 mesh with 2 cores per crosspoint: 2 x 6 x 6 = 72.

Further, a 4-chip module would be 4 x 2 x 6 x 6 = 288.

So CMN-700 is 4 x CMN-600. How nicely incremental.
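The arithmetic checks out (crosspoint counts come from the CMN mesh sizes; "2 cores per crosspoint" is the inference above, not a confirmed spec):

```python
cmn600_xp = 6 * 6      # CMN-600: 6 x 6 mesh crosspoints
cmn700_xp = 12 * 12    # CMN-700: 12 x 12 mesh crosspoints
cores_per_xp = 2       # inferred, not confirmed

print(cmn700_xp // cmn600_xp)        # 4: CMN-700 is 4x CMN-600
print(cores_per_xp * cmn600_xp)      # 72: one chip
print(4 * cores_per_xp * cmn600_xp)  # 288: hypothetical 4-chip module
```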
 

whateverer

[H]ard|Gawd
Joined
Nov 2, 2016
Messages
1,796
Things are starting to get interesting. 🍎:penguin:




Of course it is.

OS X will always be hindered by its microkernel. This is really sad this generation, because there is no longer any officially supported kick-in-the-ass for Apple from other OS options.
 

whateverer

[H]ard|Gawd
Joined
Nov 2, 2016
Messages
1,796
At long last, GPU functionality on the Raspberry Pi. (y)




Executive Summary: the drivers plus I/O architecture are so bad on the ARM platform, the only folks who can get GPU compute working on ARM servers are supercomputer vendors like NVIDIA.
 
Last edited:

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700
Executive Summary: the drivers plus I/O architecture are so bad on the ARM platform, the only folks who can get GPU compute working on ARM servers are supercomputer vendors like NVIDIA.
It is a start, and that's how everything begins, very small.
As explained in the video, ARM has many different platforms with various PCIe standards enabled/disabled, so GPU functionality may very well be handled on a case-by-case basis.
 

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700


From the comments:

AMD EPYC 7601 = 3.2 GHz Ryzen Zen1 | 894 @ 3.2 GHz
Ampere Q80-30 = 3 GHz Neoverse N1 | 882 @ 3.0 GHz

To compare, N1 has +5% higher 1T GB5 IPC vs Zen1.
Arm claims the 2022 N2 has +40% IPC gains. Iso-power, +10% perf (though with 8MB vs 4MB L3). Iso-perf, -30% power (both are iso-process, but presumably N2 will be on some 5-nm nodes).
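Deriving that +5% figure from the scores above (score per GHz as a rough IPC proxy):

```python
# GB5 single-thread scores and clocks quoted above.
zen1_per_clock = 894 / 3.2   # EPYC 7601 (Zen1) @ 3.2 GHz
n1_per_clock = 882 / 3.0     # Ampere Q80-30 (Neoverse N1) @ 3.0 GHz

gain = n1_per_clock / zen1_per_clock - 1
print(f"{gain:+.1%}")        # +5.2%: roughly the +5% claimed
```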
 
Last edited:

Red Falcon

[H]F Junkie
Joined
May 7, 2007
Messages
11,700

Raspberry Pi Cluster Versus Ampere Altra Max Supermicro Arm Server



The conclusion is about what you would expect, especially if you saw our AoA Analysis Marvell ThunderX2 Equals 190 Raspberry Pi 4. I also discuss the Ampere Altra Max’s HPL performance versus the newly released AMD EPYC Genoa as in the floating point workload, there is a massive gap between the efficiency of AMD and Ampere parts. There is a reason that the Arm server CPUs typically have integer-focused performance figures for things like web serving and SPEC CPU2017 integer rates, not floating point.
 