Test Bed and Setup - Compiler Options

For the rest of our performance testing, we’re disclosing the details of the various test setups:

Ampere "Mount Jade" - Dual Altra Q80-33

Obviously, for the Ampere Altra system we’re using the provided Mount Jade server as configured by Ampere.

The system features 2 Altra Q80-33 processors within the Mount Jade DVT motherboard from Ampere.

In terms of memory, we’re using the bundled 16 DIMMs of 32GB of Samsung DDR4-3200 for a total of 512GB, 256GB per socket.

CPU ​2x Ampere Altra Q80-33 (3.3 GHz, 80c, 32 MB L3, 250W)
RAM 512 GB (16x32 GB) Samsung DDR4-3200
Internal Disks Samsung MZ-QLB960NE 960GB
Samsung MZ-1LB960NE 960GB
Motherboard Mount Jade DVT Reference Motherboard
PSU 2000W (94%)

The system came preinstalled with CentOS 8 and we continued usage of that OS. It’s to be noted that the server is naturally Arm SBSA compatible and thus you can run any kind of Linux distribution on it.

Ampere makes special note of Oracle’s active support of their variant of Oracle Linux for Altra, which makes sense given that Oracle a few months ago announced adoption of Altra systems for their own cloud-based offerings.

The only other note to make of the system is that the OS is running with 64KB pages rather than the usual 4KB pages – this either can be seen as a testing discrepancy or an advantage on the part of the Arm system given that the next page size step for x86 systems is 2MB – which isn’t feasible for general use-case testing and something deployments would have to decide to explicitly enable.

The system has all relevant security mitigations activated, including SSBS (Speculative Store Bypass Safe) against Spectre variants.

AMD - Dual EPYC 7742

For our AMD system, unfortunately we had hit some issues with our Daytona reference server motherboard, and moved over to a test-bench setup on a SuperMicro H11DSI0.

We’re also equipping the system with 256GB per socket of 8-channel/DIMM DDR4-3200 memory, matching the Altra system.

CPU ​2x AMD EPYC 7742 (2.25-3.4 GHz, 64c, 256 MB L3, 225W)
RAM 512 GB (16x32 GB) Micron DDR4-3200
Internal Disks OCZ Vector 512GB
Motherboard SuperMicro H11DSI0
PSU EVGA 1600 T2 (1600W)

As an operating system we’re using Ubuntu 20.10 with no further optimisations. In terms of BIOS settings we’re using complete defaults, including retaining the default 225W TDP of the EPYC 7742’s, as well as leaving further CPU configurables to auto, except of NPS settings where it’s we explicitly state the configuration in the results.

The system has all relevant security mitigations activated against speculative store bypass and Spectre variants.

Intel - Dual Xeon Platinum 8280

For the Intel system we’re also using a test-bench setup with the same SSD and OS image actually – we didn’t have enough RAM to run both systems concurrently.

Because the Xeons only have 6-channel memory, their maximum capacity is limited to 384GB of the same Micron memory, running at a default 2933MHz to remain in-spec with the processor’s capabilities.

CPU 2x Intel Xeon Platinum 8280  (2.7-4.0 GHz, 28c, 38.5MB L3, 205W)
RAM 384 GB (12x32 GB) Micron DDR4-3200 (Running at 2933MHz)
Internal Disks OCZ Vector 512GB
Motherboard ASRock EP2C621D12 WS
PSU EVGA 1600 T2 (1600W)

The Xeon system was similarly run on BIOS defaults on an ASRock EP2C621D12 WS with the latest firmware available.

The system has all relevant security mitigations activated against the various vulnerabilities.

Compiler Setup

For compiled tests, we’re using the release version of GCC 10.2. The toolchain was compiled from scratch on both the x86 systems as well as the Altra system. We’re using shared binaries with the system’s libc libraries.

Topology, Memory Subsystem & Latency SPEC - Single-Threaded Performance
Comments Locked

148 Comments

View All Comments

  • Wilco1 - Saturday, December 19, 2020 - link

    SMT gives very little benefit (only 15% faster on SPECINT and 3.5% *slower* on SPECFP), adds a lot of area and complexity, and results in very bad per-thread performance.

    It's always better to use 2 real cores instead of 1 SMT core. So if you have small cores like Neoverse N1, adding SMT makes no sense at all.
  • mode_13h - Sunday, December 20, 2020 - link

    Whether SMT makes sense depends on the workload. Many tasks have greater branch-density and less computational density, in which case SMT is a massive win. I'm compiling code all the time and see huge speedups from SMT.

    That said, of course I'll take two real cores instead of 2-way SMT, all else being equal, but that always costs more. If SMT really made as little sense as you say, then it wouldn't be nearly so widespread.
  • Wilco1 - Monday, December 21, 2020 - link

    There are certainly cases where SMT helps, but having some wins doesn't mean it is worth adding SMT. All too often people talk up the upsides and ignore the downsides. Let's see how Altra Max does vs Milan next year, that should answer which is best.

    Note almost none of the billions of CPUs sold every year have SMT (even if we exclude embedded). Adding another core is simpler, cheaper and gives more performance.
  • mode_13h - Monday, December 21, 2020 - link

    > Let's see how Altra Max does vs Milan next year, that should answer which is best.

    I disagree. That's a bit apples and oranges. The differences between Zen2/Skylake and N1 are too big. You need to look at the overhead of SMT for a particular core vs. the benefits for that core.

    These x86 cores are large not just because of SMT, but also the x86 tax, their wide vector units, and other things. It could be that SMT adds just 5% overhead, and that's not enough to increase your core count hardly at all, if you drop it.
  • Wilco1 - Monday, December 21, 2020 - link

    It's never going to be a perfect comparison - nobody will design SMT and non-SMT variants of the same core! There are many differences because they use different design principles. However it will clearly show which of these designs works out best for top-end server performance.
  • Spunjji - Monday, December 21, 2020 - link

    SMT is a clear win where you already have large cores with a lot of execution resources, whereby the extra resources required aren't a large proportion of overall die area. It also helps if your tasks are focused on overall performance for a given number of threads, rather than performance-per-thread.

    Where the cores are this small, though, simply adding more of them seems to be the better option.
  • mode_13h - Monday, December 21, 2020 - link

    Well said.
  • mode_13h - Sunday, December 20, 2020 - link

    Plus, per-thread performance is only an issue if that's how you're paying for CPU time, and then what you actually care about is thread performance per dollar, which would compensate for any cost differences due to SMT, as well.
  • Wilco1 - Monday, December 21, 2020 - link

    And Graviton 2 shows which is cheaper and faster.
  • mode_13h - Monday, December 21, 2020 - link

    That comparison is only valid for Amazon customers and in the short term. It can't be used to support a broader conclusion about SMT, because we lack transparency into the cost structure of Amazon's hosting, like whether they're subsidizing Graviton2 servers or even just charging enough for them to simply break even on the hardware.

Log in

Don't have an account? Sign up now