Just under four years ago, Arm announced their Neoverse family of infrastructure CPU designs. Deciding to double-down on the server and edge computing markets by designing Arm CPU cores specifically for those markets – and not just recycling the consumer-focused Cortex-A designs – Arm set about tackling the infrastructure market in a far more aggressive manner. Those efforts, in turn, have increasingly paid off handsomely for Arm and its partners, whom thanks to the likes of products like Amazon’s Graviton and Ampere Altra CPUs have at long last been able take a meaningful piece of the server CPU market.

But as Arm CPUs finally achieve the market penetration that eluded them in the previous decade, Arm needs to make sure it isn’t resting on its laurels. Of the company’s three lines of Neoverse core designs –the efficient E, flexible N, and high-performance V – the company is already on its second generation of N cores, aptly dubbed the N2. Now, the company is preparing to update the rest of the Neoverse lineup with the next generation of V and E cores as well, announcing today the Neoverse V2 and Neoverse E2 cores. Both of these designs are slated to bring the Armv9 architecture to HPC and other server customers, as well as significant performance improvements.

Arm Neoverse V2: Armv9 Graces High-Performance Computing

Leading the charge for Arm’s new CPU core IP is the company’s second-generation V-series design, the Neoverse V2. The complete V2 platform, codenamed Demeter, marks Arm’s first iteration on their high-performance V-series cores, as well as the transition of this core lineup from the Armv8.4 ISA to Armv9. And while this is only Arm’s second go at a dedicated high-performance core for servers, make no mistake: Arm aims to be ambitious. The company is claiming that Neoverse V2 CPUs will offer the highest single-threaded integer performance available in the market, eclipsing next-generation designs from both AMD and Intel.

While this week’s announcement from Arm is not a full-on deep-dive of the new architecture – and, more annoyingly, the company is not talking about specific PPA metrics – Arm is offering a high-level look at some of the changes and features that will be coming with the V2 platform. To be sure, the V2 IP is already finished and shipping to customers today (most notably NVIDIA), but Arm is playing coy to some degree with what they’re saying about V2 before the first chips based on the IP ship in 2023.

First and foremost, the bump to Armv9 brings with it the full suite of features that come with the latest Arm architecture. That includes the security improvements that are a cornerstone feature of the architecture (and especially handy for cloud shared environments) along with Arm’s newer SVE2 vector extensions.

On the latter, Arm is making an interesting change here by reconfiguring the width of their vector engines; whereas V1 implemented SVE(1) using a 2 pipeline 256-bit SIMD, V2 moves to 4 pipes of 128-bit SIMDs. The net result is that the cumulative SIMD width of the V2 is not any wider than V1, but the execution flow has changed to process a larger number of smaller vectors in parallel. This change makes the SIMD pipeline width identical to Arm’s Cortex parts (which are all 128-bit, the minimum size for SVE2), but it does mean that Arm is no longer taking full advantage of the scalable part of SVE by using larger SIMDs. I expect we’ll find out why Arm is taking this route once they do a full V2 deep dive, as I’m curious whether this is purely an efficiency play or something more akin to homogenizing designs across the Arm ecosystem.

Past that, it’s likely worth noting that while Arm’s presentation slides put bfloat16 and int8 matmul down as features, these are not new features. Still, Arm is promising that V2’s SIMD processing will provide microarchitecture efficiency improvements over the V1.

More broadly, V2 will also be introducing larger L2 cache sizes. The V2 design supports up to 2MB of private L2 cache per core, double the maximum size of V1. V2 will also be introducing further improvements to Arm’s integer processing performance, though the company isn’t going into further detail at this point. From an architectural standpoint, the V1 borrowed a fair bit from the Cortex-X1 CPU design, and it wouldn’t be too surprising if that was once again the case for the V2, borrowing from the X2. In which case consumer chips like the Snapdragon 8 Gen1 and Dimensity 9000 should provide a loose reference on what to expect.

For the Demeter platform Arm will be reusing their CMN-700 mesh fabric, which was first introduced for the V1 generation. CMN-700 is still a modern mesh design with support for up to 144 nodes in a 12x12 configuration, and is suitable for interfacing with DDR5 memory as well as PCIe 5/CXL 2 for I/O. As a result, strictly speaking the V2 isn’t bringing anything new at the fabric level – even the 512MB of SLC could be done with a V1 + CMN-700 setup – but this does mean that the CMN-700 mesh and its features is now a baseline moving forward with V2.

The Neoverse V2 core, in turn, is going to be the cornerstone of the upcoming generation of high-performance Arm server CPUs. The de facto flagship here will be NVIDIA’s Grace CPU, which will be one of the first (if not the first) V2 design to ship in 2023. NVIDIA had previously announced that Grace would be based on a Neoverse design, so this week’s announcement from Arm finally confirms the long-held suspicion that Grace would be based on the next-generation Neoverse V core.

NVIDIA, for its part, has their fall GTC event scheduled to take place in just a few days. So it’s likely we’ll hear a bit more about Grace and its Neoverse V2 underpinnings as NVIDIA seeks to promote the chip ahead of its release next year.

Neoverse E2: Cortex-A510 For Use With N2

Alongside the Neoverse V2 announcement, Arm is also using this week’s briefing to announce the Neoverse E2 platform. Unlike the V2 reveal, this is a much smaller scale announcement, and Arm is only offering a handful of technical details. Ultimately, E2’s day in the sun will be coming a bit later on.

That said, the E2 platform is being delivered to partners with an eye towards interoperability with the existing N2 platform. For this, Arm has paired the Cortex-A510 CPU, Arm’s little/high-efficiency Cortex CPU core, and paired that with the CMN-700 mesh. This is intended to give server operators/vendors further flexibility by providing an alternative CPU core to the N2, while still offering the modern I/O and memory features of Arm’s mesh. Underscoring this, the E2 system backplane is even compatible with the N2 backplane.

Neoverse Next: Poseidon, N-Next, and E-Next

Finally, Arm’s announcement this week provides a glimpse at the company’s future roadmap for all three Neoverse platforms, where, unsurprisingly, Arm is working on updated versions of each of the platforms.

Notably, all three platforms call for adding PCIe 6 support as well as CXL 3.0 support. This would come from the next iteration of Arm’s CMN mesh network, which as Arm already does today, is shared between all three platforms.

Meanwhile, it’s interesting to see the Poseidon name once again pop up in Arm’s roadmaps. Going back to Arm’s very first Neoverse roadmap, Poseidon was the name attached to Arm’s 5mn/2021 platform, a spot since taken by N2 and V1/V2 in various forms. With V2 not landing in hardware until 2023, Poseidon/V3 is still years off, but there’s likely some significance to Arm keeping the codename (such as new microarchitecture).

But first out of the gate will be the N-Next platform – the presumable Neoverse N3. With the Neoverse N platform a generation ahead of the rest (N2 was first announced in 2020), it’ll be the next platform due for a refresh. N3 is due to be available to partners in 2023, with Arm broadly touting generational performance and efficiency improvements.

Comments Locked


View All Comments

  • brucethemoose - Thursday, September 15, 2022 - link

    "This change makes the SIMD pipeline width identical to Arm’s Cortex parts (which are all 128-bit, the minimum size for SVE2), but it does mean that Arm is no longer taking full advantage of the scalable part of SVE by using larger SIMDs. I expect we’ll find out why Arm is taking this route once they do a full V2 deep dive, as I’m curious whether this is purely an efficiency play or something more akin to homogenizing designs across the Arm ecosystem"

    I propose an even simpler reason: faster NEON performance, which is what basically all existing hand coded ARM intrinsics use now.

    And other customers are probably using binaries without SVE2, or compilers that can't emit SVE2 instructions from autovectorized code yet.
  • mode_13h - Saturday, September 17, 2022 - link

    I think there's wisdom in your point about the amount of Neon code in the wild. But it really makes you think. If narrow SVE isn't such a disadvantage, then maybe the whole SVE concept is overblown. Going with 4x 128-bit suggests the CPU is good enough at speculative execution that it can blast through SVE loops and dispatch instructions fast enough to keep the pipelines full. Otherwise, I doubt they'd have sacrificed width in what's meant to be their most HPC-oriented core.

    There's no way they would accept a regression in floating point IPC, going from V1 -> V2. So, maybe SVE > 128-bit is dead in any incarnation other than proprietary HPC cores that really prioritize FLOPS/W.
  • brucethemoose - Saturday, September 17, 2022 - link

    I bet the 4x 128 bit units are less area efficient than 2x 256, at the very least. But yeah, its probably not a huge regression if they did it.

    However, the whole point of SVE2 is long term standardization. There will come a point where devs on some platforms can basically assume an ARM CPU is an SVE2 capable one, as is happening with AVX/AVX2 now. And one doesn't want to run 8+ 128-bit units on those future CPUs.

    TBH there seem to be some much better HPC designs than Neoverse in the pipe, like the SiPearl Rhea or whatever Fujitsu comes up with.
  • mode_13h - Monday, September 19, 2022 - link

    One thing about SVE is that the *code* doesn't have to scale, even if the hardware can. As we saw in a recent glibc memcpy optimization, which only used up to 256-bit vectors. Running such code on a 512-bit hardware implementation should give no advantage over 256-bit.

    BTW, one thing I find a little bit shocking is how vendor-specific some of the glibc optimizations are. It's not just ARMv8.4 or whatever, but even going so far as to look at the model ID of the specific CPU.
  • Wilco1 - Monday, September 19, 2022 - link

    It still scales: it works correctly at different vector sizes without needing to rewrite your code or write a new compiler backend for yet another vector extension. There is no guarantee that wider vectors give a speedup since that is highly dependent on the microarchitecture. A64FX and Neoverse V2 are wildly different design points - wide vectors make sense for a low-clocked supercomputer but a wide OoO with narrow vectors wins all general purpose code.

    And yes one needs to test for specific cpus and features in string functions since not all microarchitectures have a fast SIMD unit unfortunately (now that's the part that should be shocking...).
  • Dante Verizon - Thursday, September 15, 2022 - link

    They can't even beat Apple and now want to face AMD... What a joke.
  • Lakados - Thursday, September 15, 2022 - link

    You seem to be under the impression that the N1 chips didn’t already stand up against all the Intel and AMD server offerings at the time of its release.

    For the right workloads the N1 CPUs are absolute beasts. The N2s look promising especially if it’s true we can expect up to a 40% increase in performance.
  • smilingcrow - Thursday, September 15, 2022 - link

    Not a particularly good joke as Apple don't even compete in the server market.
    Even their workstation offerings are still half baked.
  • Samus - Friday, September 16, 2022 - link

    Seriously, talk about comparing Apples to Oranges. Apple left the server industry ages ago, which is too bad because OSX Server was a potent OS. The hardware was total shit for any application larger than a very small business. It's almost as if they intended OSX Server for residential use.

    Now, though, the Apple M CPU could be repurposed in the server space but I doubt it or the underlying architecture would be appropriate for data centers. The GPU is proprietary and too weak, the cache hierarchy favors singular multithreaded desktop workloads, not wide parallelism, and it is lacking basic pipeline and die-level security features, specifically microsegmentation.

    Most obviously, though, Apple hasn't yet demo'd a scalable implementation. They just keep blowing the chip up (like Pro, Max) and gluing them together (Ultra.) I doubt they will ever use the M CPU's in their own data centers let alone making them for consumers.

    Obviously Apple has the talent to make a data center CPU, but they currently have not done that and as a company obsessed with margins, I doubt they will unless there is a cost benefit.
  • Threska - Friday, September 16, 2022 - link

    I don't think anywhere in their history they would have the interest or expertise for entering the server space. The prosumer server would be a very niche market.

Log in

Don't have an account? Sign up now