The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster

Author: Andrei Frumusanu

by Andrei Frumusanu on December 18, 2020 6:00 AM EST


Test Bed and Setup - Compiler Options

For the rest of our performance testing, we’re disclosing the details of the various test setups:

Ampere "Mount Jade" - Dual Altra Q80-33

Obviously, for the Ampere Altra system we’re using the provided Mount Jade server as configured by Ampere.

The system features 2 Altra Q80-33 processors within the Mount Jade DVT motherboard from Ampere.

In terms of memory, we’re using the bundled 16 x 32GB Samsung DDR4-3200 DIMMs for a total of 512GB, or 256GB per socket.

CPU	2x Ampere Altra Q80-33 (3.3 GHz, 80c, 32 MB L3, 250W)
RAM	512 GB (16x32 GB) Samsung DDR4-3200
Internal Disks	Samsung MZ-QLB960NE 960GB, Samsung MZ-1LB960NE 960GB
Motherboard	Mount Jade DVT Reference Motherboard
PSU	2000W (94%)

The system came preinstalled with CentOS 8, and we continued to use that OS. Note that the server is Arm SBSA compatible, so you can run any kind of Linux distribution on it.

Ampere makes special note of Oracle’s active support of their variant of Oracle Linux for Altra, which makes sense given that Oracle a few months ago announced adoption of Altra systems for their own cloud-based offerings.

The only other note to make of the system is that the OS is running with 64KB pages rather than the usual 4KB pages. This can be seen either as a testing discrepancy or as an advantage for the Arm system: the next page size step on x86 systems is 2MB, which isn’t feasible for general use-case testing and is something deployments would have to explicitly enable.
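
To put those page sizes in perspective, here is a minimal sketch that reads the page size of the running kernel and counts how many pages a working set spans at 4KB, 64KB and 2MB. The 1 GiB working set and the function name are assumptions made purely for illustration; they are not part of our test suite.

```python
import os


def pages_needed(working_set_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map a working set (ceiling division)."""
    return -(-working_set_bytes // page_bytes)


if __name__ == "__main__":
    # Page size reported by the running kernel (POSIX systems only).
    print("kernel page size:", os.sysconf("SC_PAGE_SIZE"), "bytes")

    working_set = 1 << 30  # hypothetical 1 GiB working set
    for label, size in (("4KB", 4 << 10), ("64KB", 64 << 10), ("2MB", 2 << 20)):
        print(f"{label:>4} pages: {pages_needed(working_set, size):>7}")
```

Fewer pages for the same working set means fewer address translations to keep track of, which is why the larger base page size can be read as an advantage.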

The system has all relevant security mitigations activated, including SSBS (Speculative Store Bypass Safe) against Spectre variants.
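
For readers who want to check the mitigation state on their own Linux machine, the kernel reports it through sysfs. The sketch below simply lists whatever the kernel exposes there; the function name is our own, and the output depends entirely on the kernel version and CPU, so none is shown here.

```python
from pathlib import Path

# Standard Linux sysfs location for per-vulnerability mitigation status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigations(vuln_dir: Path = VULN_DIR) -> dict:
    """Map each vulnerability name to the kernel's one-line status string."""
    if not vuln_dir.is_dir():
        return {}  # not Linux, or a kernel without this interface
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    for name, state in read_mitigations().items():
        print(f"{name:<28} {state}")
```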

AMD - Dual EPYC 7742

For our AMD system, we unfortunately hit some issues with our Daytona reference server motherboard and moved over to a test-bench setup on a SuperMicro H11DSI0.

We’re also equipping the system with 256GB per socket of DDR4-3200 memory, with one 32GB DIMM in each of the 8 channels, matching the Altra system.

CPU	2x AMD EPYC 7742 (2.25-3.4 GHz, 64c, 256 MB L3, 225W)
RAM	512 GB (16x32 GB) Micron DDR4-3200
Internal Disks	OCZ Vector 512GB
Motherboard	SuperMicro H11DSI0
PSU	EVGA 1600 T2 (1600W)

As an operating system we’re using Ubuntu 20.10 with no further optimisations. In terms of BIOS settings we’re using complete defaults, including retaining the default 225W TDP of the EPYC 7742s, as well as leaving further CPU configurables on auto, except for the NPS settings, where we explicitly state the configuration in the results.

The system has all relevant security mitigations activated against speculative store bypass and Spectre variants.

Intel - Dual Xeon Platinum 8280

For the Intel system we’re also using a test-bench setup, with the same SSD and OS image as the AMD system; we didn’t have enough RAM to run both systems concurrently.

Because the Xeons only have 6-channel memory, their maximum capacity is limited to 384GB of the same Micron memory, running at a default 2933MHz to remain in-spec with the processor’s capabilities.
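
As a quick sanity check on the memory configurations, the sketch below works out the installed capacity and the theoretical peak DRAM bandwidth per socket. It assumes one 32GB DIMM per channel and the standard 64-bit (8-byte) DDR4 channel width; the function names are ours, and the bandwidth figures are theoretical peaks, not measured results.

```python
def capacity_gb(sockets: int, channels: int, dimm_gb: int,
                dimms_per_channel: int = 1) -> int:
    """Total installed memory in GB across all sockets."""
    return sockets * channels * dimms_per_channel * dimm_gb


def peak_bandwidth_gbs(channels: int, mega_transfers: int) -> float:
    """Theoretical peak per socket: 8 bytes per transfer per DDR4 channel."""
    return channels * 8 * mega_transfers / 1000


if __name__ == "__main__":
    # (channels per socket, transfer rate in MT/s)
    systems = {
        "Altra Q80-33 / EPYC 7742": (8, 3200),
        "Xeon Platinum 8280": (6, 2933),
    }
    for name, (channels, rate) in systems.items():
        print(f"{name}: {capacity_gb(2, channels, 32)} GB total, "
              f"{peak_bandwidth_gbs(channels, rate):.1f} GB/s peak per socket")
```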

CPU	2x Intel Xeon Platinum 8280 (2.7-4.0 GHz, 28c, 38.5MB L3, 205W)
RAM	384 GB (12x32 GB) Micron DDR4-3200 (Running at 2933MHz)
Internal Disks	OCZ Vector 512GB
Motherboard	ASRock EP2C621D12 WS
PSU	EVGA 1600 T2 (1600W)

The Xeon system was similarly run on BIOS defaults on an ASRock EP2C621D12 WS with the latest firmware available.

The system has all relevant security mitigations activated against the various vulnerabilities.

Compiler Setup

For compiled tests, we’re using the release version of GCC 10.2. The toolchain was compiled from scratch on both the x86 systems as well as the Altra system. We’re using shared binaries with the system’s libc libraries.
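
When reproducing compiled tests across several machines, it helps to confirm that each one really picks up the same toolchain. The following is a minimal sketch of such a check using GCC's `-dumpfullversion` option; the helper names are hypothetical, and this is not the script we used.

```python
import re
import shutil
import subprocess


def parse_version(text: str):
    """Turn a string such as '10.2.0' into a tuple like (10, 2, 0)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group().split("."))


def compiler_version(cc: str = "gcc"):
    """Return the compiler's version tuple, or None if it cannot be determined."""
    path = shutil.which(cc)
    if path is None:
        return None  # compiler not found on PATH
    result = subprocess.run([path, "-dumpfullversion"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None  # e.g. a compiler that does not accept this option
    return parse_version(result.stdout)


if __name__ == "__main__":
    print("gcc version:", compiler_version())
```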



148 Comments


Brane2 - Saturday, December 19, 2020 - link
Meh. Nothing special. It has been benchmarked on Phoronix and it performed more or less on par with Rome. 80 newest ARM cores against 64 mature x86 cores within constrained power envelope.
Naples is just about to come out and I suspect some time after that AMD will have something like really wide new RISC-V cores.
Wilco1 - Saturday, December 19, 2020 - link
It won most benchmarks on Phoronix while using significantly less power. Yes Milan is about to be released, and it will have to compete with the 128-core Altra Max. Which do you believe is going to win - 64 SMT cores or 128 real cores?
mode_13h - Sunday, December 20, 2020 - link
It actually won less than half of the benchmarks on phoronix, since a number of those graphs just re-state the results in score/W. There are also questions over some of the compiler options used on those benchmarks, since many of the tests are compiled with options that won't enable AVX on benchmarks where it should be beneficial (yet, not having SVE, the N1 cores are at no such disadvantage).
Wilco1 - Monday, December 21, 2020 - link
"should be beneficial" -> "might help in a few limited cases". AVX/AVX512 isn't that useful for general C/C++ code. You typically only see large gains when people optimize using intrinsics.
mode_13h - Monday, December 21, 2020 - link
Intrinsics don't compile if they're for a CPU arch beyond what the compiler is being instructed to target. So, even packages where people take the time to optimize with intrinsics need to guard them with compile-time checks to ensure the CPU target is capable of executing those instructions.

Compilers do generate vectorized code. I don't know how well GCC is doing on that front, lately, but the TNN tests should be a good way to see that. Too bad those tests don't use -march=native.

What's interesting about TNN is I'm looking at the exact source revision Phoronix is using, and it seems they've completely dropped their backend for x86. The source/tnn/device/x86/ is simply missing. So, I wonder if they decided the compiler was good enough that they didn't need to bother with their own hand-optimized code for it, or if they just decided they don't care how fast their stuff runs on it.

See:
* https://openbenchmarking.org/innhold/83a730ed41d4e...
* https://github.com/Tencent/TNN/tree/v0.2.3
Wilco1 - Monday, December 21, 2020 - link
TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU. The makefile selects the right ISA variant for any files using intrinsics. But none of this is used in the TNN test.
mode_13h - Monday, December 21, 2020 - link
> TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

At this point, I probably will.

> Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU.

That's a whole additional level of effort for the developers. For them to bother compiling and conditionally calling different versions only makes sense if they think their main userbase aren't going to bother recompiling specifically for their hardware. In the case of specialized packages, it's reasonable to expect your users to take a little trouble for the best performance. It's really things like very low-level libs or multimedia code where you tend to see the sort of elaborate runtime detection and dynamic codepath selection that you're describing.
mode_13h - Monday, December 21, 2020 - link
I think Basis Universal and High Performance Conjugate Gradient are some other cases where the wider SIMD of Zen2 and Skylake-SP should confer significant benefit.
Wilco1 - Monday, December 21, 2020 - link
"should give significant benefit" -> "might give some benefit". I suggest you try out. Autovectorization is not nearly as good as you seem to believe, and the overall speedup is often disappointing even if some loops are 10-20x faster.
vinayshivakumar - Saturday, December 19, 2020 - link
I am a bit puzzled why none of these processors support SMT... Can someone shed light on why this is the case?
