Another year, another TechDay from Arm. Over the last several years Arm’s event has come like clockwork in the May timeframe and has every time unveiled the newest flagship CPU and GPU IPs. This year is no exception as the event is back on the American side of the Atlantic in Austin, Texas, where Arm has one of its major design centres.

Two years ago during the unveiling of the Cortex A73 I talked a bit more about Arm’s CPU design teams and how they’re spread across locations and product lines. The main design centres for the Cortex-A series of CPUs are found in Austin, Texas; Cambridge, United Kingdom; and Sophia-Antipolis in the south of France near Nice. For the last two years the Cortex A73 and Cortex A75 were designs that mainly came out of the Sophia team, while the Cortex A53 and more recently the A55 were designs coming out of Cambridge. This means that we haven’t seen any recent designs coming out of Austin, and the last of the “Austin family” of CPUs were the A57 and A72.

The project being worked on in Austin had been hyped up for several years – I remember that even as early as the A73 release back in 2016 the company had pulled forward some elements from an advanced future microarchitecture on the back-end pipelines, especially on the FP/SIMD side. The Cortex A75 was further described as pulling in more elements from this mysterious new project.

Today we can finally unveil what the Austin team has been working on – and it’s a big one. The new Cortex A76 is a brand new microarchitecture which has been built from scratch and lays the foundation for at least two more generations of what I’ll call “the second generation of the Austin family” of CPUs.

The Cortex A76 is important for Arm from a design perspective as it represents a new start from a clean sheet. It’s rare for an IP company to be able to do this, as it represents a great investment of resources and time, and if it weren’t for the Sophia design team taking over the steering wheel for the last two generations of products it wouldn’t have been reasonable to execute. The execution of the CPU design teams should be emphasised in particular, as Arm claims this is the 5th generation “annual beat” product, where the company delivers a new microarchitecture every year. Think of it as an analogue to Intel’s past Tick-Tock strategy, but rather Tock-Tock-Tock for Arm, with a steady CAGR (compound annual growth rate) of 20-25% every generation coming from µarch improvements.
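To put that “Tock-Tock-Tock” cadence into numbers, here is a minimal sketch of how a 20-25% generational gain compounds. The function name and the generation counts are my own illustration, and it assumes the same gain every generation, which is a simplification rather than anything Arm has stated.

```python
def compounded_gain(rate_per_generation: float, generations: int) -> float:
    """Cumulative performance multiplier after a number of generations.

    Assumes (hypothetically) an identical relative gain each generation.
    """
    return (1.0 + rate_per_generation) ** generations


if __name__ == "__main__":
    # The 20% and 25% rates come from the article; the generation
    # counts below are illustrative choices.
    for rate in (0.20, 0.25):
        for gens in (1, 3, 5):
            gain = compounded_gain(rate, gens)
            print(f"{rate:.0%} per generation over {gens} generation(s): {gain:.2f}x")
```

Under that assumption, five generations at 20% compound to roughly 2.5x, and at 25% to roughly 3x.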

So what is the Cortex A76? In Arm’s words, it’s a “laptop-class” performance processor with mobile efficiency. The vision of the A76 as a laptop-class processor had been emphasised throughout the TechDay presentation so it seems Arm is really taking advantage of the large performance boost of the IP to cater to new market segments such as the emerging “Always connected PCs” which Qualcomm is spearheading with their SoC platforms.

The Cortex A76 microarchitecture has been designed for high performance while maintaining power efficiency. Starting from a clean sheet allowed the designers to remove bottlenecks throughout the design and to break previous microarchitectural limitations. The focus here was again maximum performance while remaining within an energy efficiency envelope that is fit for smartphones.

In broad metrics, what we’re promised in actual products using the A76 is as follows: a 35% performance increase alongside 40% improved power efficiency. We’ll also see a 4x improvement in machine learning workloads thanks to new optimisations in the ASIMD pipelines and how dot products are handled. These figures are baselined on A75 configurations running at 2.8GHz on 10nm processes, while the A76 is projected by Arm to come in at 3GHz in products based on TSMC’s 7nm process.
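As a back-of-the-envelope check, the 35% figure can be split into a clock-speed part and a per-clock part. The sketch below assumes performance scales linearly with frequency and ignores memory and process effects; that is my simplification, not Arm’s methodology, and the function name is hypothetical.

```python
def implied_per_clock_gain(perf_ratio: float, old_ghz: float, new_ghz: float) -> float:
    """Per-clock speedup left over once the clock increase is factored out.

    Assumes performance scales linearly with frequency (a simplification).
    """
    frequency_ratio = new_ghz / old_ghz
    return perf_ratio / frequency_ratio


if __name__ == "__main__":
    # Figures from the article: +35% performance, A75 at 2.8GHz, A76 at 3GHz.
    gain = implied_per_clock_gain(perf_ratio=1.35, old_ghz=2.8, new_ghz=3.0)
    print(f"Implied per-clock gain: {gain:.2f}x")
```

With the quoted numbers this works out to about 1.26x, which would suggest that most of the promised uplift comes from the microarchitecture rather than from the 200MHz clock bump.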

The new CPU is naturally still compatible with DynamIQ’s common cluster topology, and Arm envisions designs pairing it with Cortex A55s as the little, more power-efficient CPUs. The configuration scalability of the DynamIQ IP was again reiterated and we were presented with example configurations such as 1+7 or 2+6 with either Cortex A75 or A76 CPU IP. This presentation slide was one of the rare ones where Arm referred to the area size of the A76, pointing out that the A75 still had better PPA and thus might still be a valid design choice for companies, depending on their needs. One comparison that was made during the event is that in terms of area, three A76s with larger caches would fit inside the size of a Skylake core – all while being within 10% of the IPC of the Intel CPU, but obviously there are also process node scaling considerations to take into account.

A standout claim is that Arm aims to outperform the competition at half the area and half the power. Arm was slightly beating around the bush here on what it considers the competition, but generally the answer was that it considers everybody the competition. Taking into account Intel, AMD or Samsung, it’s actually not that hard to imagine Arm beating them in PPA, as historically the company has always had the smallest CPU designs and that directly translates into more efficient microarchitectures.

Before we get into more detailed breakdowns of the performance and power improvements and what I’m expecting to happen in products, let’s see the microarchitectural improvements on the core and how Arm managed to extract this much performance while maintaining power efficiency.


123 Comments

  • name99 - Friday, June 1, 2018

    FFS. the issue is NOT "Older batteries might not be able to supply enough power for a big core", it is that the battery cannot supply enough CURRENT.

    If you can't be bothered to understand the underlying engineering issue and why the difference between current and power matters, then your opinions on this issue are worthless.
  • serendip - Friday, June 1, 2018

    Whoa, chill there buddy, I'm not an electrical engineer.
  • name99 - Friday, June 1, 2018

    "Does anyone actually use the full performance of the A11 or A12 in daily tasks?"
    Absolutely. I've updated iPhones every two years, and every update brings a substantial boost in "fluidity" and just general not having to wait. I can definitely feel the difference between my iPhone 7 and my friend's iPhone X; and I expect I will likewise feel the difference when I get my iPhone 2018 edition (whatever they are naming them this year...)

    Now if you want to be a tool, you can argue "that's because Apple's software sux. Bloat, useless animations, last good version of iOS was version 4, blah blah". Whatever.
    MOST people find more functionality distributed throughout the dozens of little changes of each new version of the OS, and MOST people find the "texture" of the OS (colors, animations, etc) more pleasant than having some sort of text only Apple II UI, though doubtless that could run at 10,000 fps.

    So point is, yeah, you DO notice the difference on phones. Likewise on iPads. I use my iPad to read technical PDFs, and again, each two year update provides a REALLY obvious jump in how quickly complicated PDF pages render. With my very first iPad 1 there was a noticeable wait almost every page (only hidden, usually, because of page caching). By the A10X iPad Pro it's rare to encounter a PDF page that ever makes you wait, cached or not.

    I've also talked in the past about Wolfram Player, a subset of Mathematica for iPad. This allows you to interact with Mathematica "animations" (actually they're 3D interactive objects you construct that change what is displayed depending on how you move sliders or otherwise tweak parameters). These are calculating what's to be displayed (which might be something like numerically solving a partial differential equation, then displaying the result as a 3D object) in realtime as you move a slider.
    Now this is (for now) pretty specialized stuff. But Wolfram's goal, as they fix the various bugs in the app and implement the bits of Mathematica that don't yet work well (or at all), is for these things to be the equivalent of video today. We used to put up with explanations (in books, or newspapers) that were just words. Then we got BW diagrams. Then we got color diagrams. Then we got video. Now we have web sites like NYT and Vox putting up dynamic explainers where you can move sliders --- BUT they are limited to the (slow) performance of browsers, and are a pain to construct (both the UI, and the underlying mathematical simulation). Something like Mathematica's animations are vastly more powerful, and vastly easier to create. One day these will be as ubiquitous as video is today, just one more datatype that gets passed around. But for them to work well requires a CPU that can numerically solve PDEs in real time on your mobile device...
  • techconc - Tuesday, June 5, 2018

    The benefits of having a fast single core are seen on most common operations, including UI and scrolling, etc. Moreover, Apple has demonstrated that a powerful core can in fact be more efficient in race to sleep conditions whereby it completes the work more quickly then sleeps. The overall effect is a more responsive system that is just as efficient overall.
  • tipoo - Tuesday, September 4, 2018

    Sure, every time I render a webpage.
  • ZolaIII - Friday, June 1, 2018

    Well, let's put it this way: the A73, which is two instructions wide, had no problems on 14 nm FinFET. The A76 is 4 instructions wide and, for the sake of argument, let's say 2x the size. So switching from 14 nm to 7nm (60% reduction on power) covers it. The A76 is approximately 65% faster than the A73 MHz per MHz, so it's able to deliver approximately 1.8x the performance at the same TDP. The second part is the manufacturing process in relation to the core size. FinFET transistors leak like hell once the 2.1~2.2 GHz limit is reached, regardless of OEM or vendor/foundry. So if you employ 50% wider cores (6 instructions wide) that won't cross the 2.1~2.2 GHz limit, it's not the same as pushing the 4 instructions wide one to 3GHz, as the power consumption will be doubled compared to the same core operating at 2.1~2.2 GHz, and in the end you lose on both theoretical throughput (performance) and the power consumption metric, but it still costs you 33% less. In reality it's much harder to feed the wider core optimally (especially on something which is a mobile OS). ARM (cowboy camp) did great work optimising instruction latency and cache latency/throughput, which will both increase real instruction output per clock and help the predictor without a significant increase in needed resources (cost - size), and the A76 is the first of its kind (of any CPU ever) regarding the implemented solution for this. However, the thing that ARM didn't deliver is a better primary workhorse which could make a difference in the base user experience. The A55s aren't exactly a powerhouse regarding performance, and now there is more headroom regarding power when scaled down to 7 nm, enough for, let's say, an A73 at slightly lower clocks to replace the A55 (the A73 has 1.6x the integer performance of the A55 MHz/MHz, so A73 @ 1.7GHz = A55 @ 2.7 GHz, while switching from 14 to 7nm would make the TDP of the A55 and the A73 the same). But the A73 doesn't work in a DynamIQ cluster.
    So there is a need for a new two instructions wide OoO core with merged architectural advancements (front end, predictor, cache, ASIMD...), as the in-order ones hit the brick wall a long time ago.
  • vladx - Friday, June 1, 2018

    > So switching from 14 nm to 7nm (60% reduction on power)

    That might've been true if both 14nm and 7nm fab processes were actually the real deal. But alas, they are not.
  • ZolaIII - Saturday, June 2, 2018

    Based on the TSMC projections 60% power reduction.
  • beginner99 - Monday, June 4, 2018

    The thing is that CPU power use might already be a small part of phone power use. The display is usually the main consumer, and when the display isn't running, most likely the big core will also not be running. Saving 40% power sounds great on paper. But in real designs the saving will already be smaller, and the total impact on phone battery life will be much, much smaller. Probably a single-digit percentage, depending on how much you use the phone. The more it is idle, the less the big core's efficiency matters.
  • Dazedconfused - Thursday, May 31, 2018

    I get this, but when comparing an iPhone X and, say, an Android flagship next to each other in pretty much every day-to-day task, they appear evenly matched. There are some good comparisons on YouTube. There are definitely strengths to each platform, but it's not clear cut at all.
