Test Bed and Setup - Compiler Options

For the rest of our performance testing, we’re disclosing the details of the various test setups:

Ampere "Mount Jade" - Dual Altra Q80-33

Obviously, for the Ampere Altra system we’re using the provided Mount Jade server as configured by Ampere.

The system features 2 Altra Q80-33 processors within the Mount Jade DVT motherboard from Ampere.

In terms of memory, we’re using the bundled 16 DIMMs of 32GB of Samsung DDR4-3200 for a total of 512GB, 256GB per socket.

CPU ​2x Ampere Altra Q80-33 (3.3 GHz, 80c, 32 MB L3, 250W)
RAM 512 GB (16x32 GB) Samsung DDR4-3200
Internal Disks Samsung MZ-QLB960NE 960GB
Samsung MZ-1LB960NE 960GB
Motherboard Mount Jade DVT Reference Motherboard
PSU 2000W (94%)

The system came preinstalled with CentOS 8 and we continued usage of that OS. It’s to be noted that the server is naturally Arm SBSA compatible and thus you can run any kind of Linux distribution on it.

Ampere makes special note of Oracle’s active support of their variant of Oracle Linux for Altra, which makes sense given that Oracle a few months ago announced adoption of Altra systems for their own cloud-based offerings.

The only other note to make of the system is that the OS is running with 64KB pages rather than the usual 4KB pages – this either can be seen as a testing discrepancy or an advantage on the part of the Arm system given that the next page size step for x86 systems is 2MB – which isn’t feasible for general use-case testing and something deployments would have to decide to explicitly enable.

The system has all relevant security mitigations activated, including SSBS (Speculative Store Bypass Safe) against Spectre variants.

AMD - Dual EPYC 7742

For our AMD system, unfortunately we had hit some issues with our Daytona reference server motherboard, and moved over to a test-bench setup on a SuperMicro H11DSI0.

We’re also equipping the system with 256GB per socket of 8-channel/DIMM DDR4-3200 memory, matching the Altra system.

CPU ​2x AMD EPYC 7742 (2.25-3.4 GHz, 64c, 256 MB L3, 225W)
RAM 512 GB (16x32 GB) Micron DDR4-3200
Internal Disks OCZ Vector 512GB
Motherboard SuperMicro H11DSI0
PSU EVGA 1600 T2 (1600W)

As an operating system we’re using Ubuntu 20.10 with no further optimisations. In terms of BIOS settings we’re using complete defaults, including retaining the default 225W TDP of the EPYC 7742’s, as well as leaving further CPU configurables to auto, except of NPS settings where it’s we explicitly state the configuration in the results.

The system has all relevant security mitigations activated against speculative store bypass and Spectre variants.

Intel - Dual Xeon Platinum 8280

For the Intel system we’re also using a test-bench setup with the same SSD and OS image actually – we didn’t have enough RAM to run both systems concurrently.

Because the Xeons only have 6-channel memory, their maximum capacity is limited to 384GB of the same Micron memory, running at a default 2933MHz to remain in-spec with the processor’s capabilities.

CPU 2x Intel Xeon Platinum 8280  (2.7-4.0 GHz, 28c, 38.5MB L3, 205W)
RAM 384 GB (12x32 GB) Micron DDR4-3200 (Running at 2933MHz)
Internal Disks OCZ Vector 512GB
Motherboard ASRock EP2C621D12 WS
PSU EVGA 1600 T2 (1600W)

The Xeon system was similarly run on BIOS defaults on an ASRock EP2C621D12 WS with the latest firmware available.

The system has all relevant security mitigations activated against the various vulnerabilities.

Compiler Setup

For compiled tests, we’re using the release version of GCC 10.2. The toolchain was compiled from scratch on both the x86 systems as well as the Altra system. We’re using shared binaries with the system’s libc libraries.

Topology, Memory Subsystem & Latency SPEC - Single-Threaded Performance
Comments Locked

148 Comments

View All Comments

  • Brane2 - Saturday, December 19, 2020 - link

    Meh. Nothing special. it has been benchmarked on Phoronix and it performed more or less on par with Rome. 80 newest ARM cores against 64 mature x86 cores within constrained power envelope.
    Naples is just about to come out and I suspect some time after that AMD will have something like really wide new RISC-V cores.
  • Wilco1 - Saturday, December 19, 2020 - link

    It won most benchmarks on Phoronix while using significantly less power. Yes Milan is about to be released, and it will have to compete with the 128-core Altra Max. Which do you believe is going to win - 64 SMT cores or 128 real cores?
  • mode_13h - Sunday, December 20, 2020 - link

    It actually won less than half of the benchmarks on phoronix, since a number of those graphs just re-state the results in score/W. There are also questions over some of the compiler options used on those benchmarks, since many of the tests are compiled with options that won't enable AVX on benchmarks where it should be beneficial (yet, not having SVE, the N1 cores are at no such disadvantage).
  • Wilco1 - Monday, December 21, 2020 - link

    "should be beneficial" -> "might help in a few limited cases". AVX/AVX512 isn't that useful for general C/C++ code. You typically only see large gains when people optimize using intrinsics.
  • mode_13h - Monday, December 21, 2020 - link

    Intrinsics don't compile if they're for a CPU arch beyond what the compiler is being instructed to target. So, even packages where people take the time to optimize with intrinsics need to guard them with compile-time checks to ensure the CPU target is capable of executing those instructions.

    Compilers do generate vectorized code. I don't know how well GCC is doing on that front, lately, but the TNN tests should be a good way to see that. Too bad those tests don't use -march=native.

    What's interesting about TNN is I'm looking at the exact source revision Phoronix is using, and it seems they've completely dropped their backend for x86. The source/tnn/device/x86/ is simply missing. So, I wonder if they decided the compiler was good enough that they didn't need to bother with their own hand-optimized code for it, or if they just decided they don't care how fast their stuff runs on it.

    See:
    * https://openbenchmarking.org/innhold/83a730ed41d4e...
    * https://github.com/Tencent/TNN/tree/v0.2.3
  • Wilco1 - Monday, December 21, 2020 - link

    TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

    Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU. The makefile selects the right ISA variant for any files using intrinsics. But none of this is used in the TNN test.
  • mode_13h - Monday, December 21, 2020 - link

    > TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

    At this point, I probably will.

    > Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU.

    That's a whole additional level of effort for the developers. For them to bother compiling and conditionally calling different versions only makes sense if they think their main userbase aren't going to bother recompiling specifically for their hardware. In the case of specialized packages, it's reasonable to expect your users to take a little trouble for the best performance. It's really things like very low-level libs or multimedia code where you tend to see the sort of elaborate runtime detection and dynamic codepath selection that you're describing.
  • mode_13h - Monday, December 21, 2020 - link

    I think Basis Universal and High Performance Conjugate Gradient are some other cases where the wider SIMD of Zen2 and Skylake-SP should confer significant benefit.
  • Wilco1 - Monday, December 21, 2020 - link

    "should give significant benefit" -> "might give some benefit". I suggest you try out. Autovectorization is not nearly as good as you seem to believe, and the overall speedup is often disappointing even if some loops are 10-20x faster.
  • vinayshivakumar - Saturday, December 19, 2020 - link

    I am a bit puzzled why none of these processors support SMT... Can someone shed light on why this is the case ?

Log in

Don't have an account? Sign up now