HiSilicon Kirin 970 - Android SoC Power & Performance Overview
by Andrei Frumusanu on January 22, 2018 9:15 AM EST
CPU Performance: SPEC2006
SPEC2006 has been a natural goal to aim for as a keystone analysis benchmark, as it's a respected industry standard that even silicon vendors use for architecture analysis and development. With SPEC2017 released last year, SPEC2006 was officially retired on January 9th – a funny coincidence, as we're only now finally starting to use it.
As Android SoCs improve in power efficiency and performance, it's now becoming more practical to use SPEC2006 on consumer smartphones. The main concerns in the past were memory usage for subtests such as MCF, but more importantly the sheer test runtimes on battery-powered devices. For the last couple of weeks I've been busy porting SPEC2006 over to a custom Android application harness.
The results are quite remarkable, as we can see both the generational performance and efficiency improvements from the various Android SoC vendors. The Kirin 970 in particular closes in on the efficiency of the Snapdragon 835, leapfrogging the Kirin 960 and Exynos SoCs. We also see essentially no improvement in absolute performance, as the Kirin 970 actually shows a slight performance regression compared to the Kirin 960 – with all SoC vendors showing just meagre performance gains over the past generation.
Going Into The Details
Our new SPEC2006 harness is compiled using the official Android NDK. For this article the NDK version used is r16rc1, with Clang/LLVM as the compiler and just the -Ofast optimization flag (alongside the applicable test portability flags). Clang was chosen over GCC because Google has deprecated GCC in the NDK toolchain and will be removing the compiler altogether in 2018, making it unlikely that we'll revisit GCC results in the future. It should be noted that in my testing GCC 4.9 still produced faster code in some SPEC subtests when compared to Clang. Nevertheless, the choice of Clang should also facilitate better Androids-to-Apples comparisons in the future.

While there are arguments that SPEC scores should be published with the best compiler flags for each architecture, I wanted a more apples-to-apples approach using identical binaries (which is also what we expect to see distributed in real applications). As such, for this article I've chosen to pass the -mcpu=cortex-a53 flag to the compiler, as it gave the best average overall score among all tested CPUs. The only exception was the Exynos M2, which profited from an additional 14% performance boost in perlbench when compiled with its corresponding CPU architecture target flag.
As the following SPEC scores are not submitted to the SPEC website, we have to disclose that they represent only estimated values and are thus not officially validated submissions.
Alongside the full CINT2006 suite we are also running the C/C++ subtests of CFP2006. Unfortunately, 10 of the 17 tests in the CFP2006 suite are written in Fortran, which can only be compiled with great difficulty using GCC on Android, while the NDK's Clang lacks a Fortran front-end altogether.
As an overview of the various subtests we run, here are the application areas and descriptions as listed on the official SPEC website:
SPEC2006 C/C++ Benchmarks

| Suite | Benchmark | Application Area | Description |
|---|---|---|---|
| SPECint2006 (Complete Suite) | 400.perlbench | Programming Language | Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs). |
| | 401.bzip2 | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O. |
| | 403.gcc | C Compiler | Based on gcc Version 3.2, generates code for Opteron. |
| | 429.mcf | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport. |
| | 445.gobmk | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game. |
| | 456.hmmer | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs). |
| | 458.sjeng | Artificial Intelligence: Chess | A highly-ranked chess program that also plays several chess variants. |
| | 462.libquantum | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm. |
| | 464.h264ref | Video Compression | A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2. |
| | 471.omnetpp | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network. |
| | 473.astar | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm. |
| | 483.xalancbmk | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types. |
| SPECfp2006 (C/C++ Subtests) | 433.milc | Physics / Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs with dynamical quarks. |
| | 444.namd | Biology / Molecular Dynamics | Simulates large biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I. |
| | 447.dealII | Finite Element Analysis | deal.II is a C++ program library targeted at adaptive finite elements and error estimation. The testcase solves a Helmholtz-type equation with non-constant coefficients. |
| | 450.soplex | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models. |
| | 453.povray | Image Ray-tracing | Image rendering. The testcase is a 1280x1024 anti-aliased image of a landscape with some abstract objects with textures using a Perlin noise function. |
| | 470.lbm | Fluid Dynamics | Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D. |
| | 482.sphinx3 | Speech Recognition | A widely-known speech recognition system from Carnegie Mellon University. |
It's important to note one key distinguishing aspect of SPEC CPU versus other CPU benchmarks such as GeekBench: it's not just a CPU benchmark, but rather a system benchmark. While benchmarks such as GeekBench serve as a good quick view of basic workloads, the vastly greater workload and codebase size of SPEC CPU stresses the memory subsystem to a much greater degree. To demonstrate this, we can look at the individual subtest performance differences when limiting just the memory controller frequency, in this case on the Mate 10 Pro with the Kirin 970.
An increase in main memory latency from just 80ns to 115ns (random access within the access window) can have dramatic effects on many of the more memory-access-sensitive tests in SPEC CPU. Meanwhile the same handicap has essentially no effect on the GeekBench 4 single-threaded scores and only a marginal effect on some subtests of the multi-threaded scores.
In general the benchmarks can be grouped into three categories: memory-bound, balanced (sensitive to both memory and execution resources), and execution-bound. From the memory latency sensitivity chart it's relatively easy to tell which benchmarks belong to which category based on their performance degradation. The most memory-bound benchmarks include the infamous 429.mcf, alongside 433.milc, 450.soplex, 470.lbm and 482.sphinx3. The least affected are 400.perlbench, 445.gobmk, 456.hmmer, 464.h264ref, 444.namd and 453.povray, with 458.sjeng and 462.libquantum even increasing slightly in performance, which points to very saturated execution units. The remaining benchmarks are more balanced and see a reduced performance impact. Of course this is an oversimplification and the results will differ between architectures and platforms, but it gives us a solid hint in terms of separation between execution-bound and memory-access-bound tests.
As well as tracking performance (SPECspeed), I also included a power tracking mechanism which relies on the device's fuel gauge for current measurements. The values published here represent only the active power of the platform, meaning idle power is subtracted from the total absolute load power during the workloads to compensate for platform components such as the display. Again I have to emphasize that the power and energy figures don't just represent the CPU, but the SoC system as a whole, including interconnects, memory controllers, DRAM, and PMIC overhead.
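As a rough illustration of this methodology, the sketch below shows how active power can be derived from Android's standard BatteryManager fuel-gauge interface. This is not the actual harness code: the class and helper names are hypothetical, and real devices differ in fuel-gauge update rates, current sign conventions, and accuracy, so treat it only as a minimal sketch of the idea (load power sampled during the workload, minus a previously measured idle baseline).

```kotlin
import android.content.Context
import android.content.Intent
import android.content.IntentFilter
import android.os.BatteryManager
import kotlin.math.abs

// Hypothetical helper (not the actual harness): estimates active platform power
// from the device's fuel gauge, i.e. load power minus a previously measured
// idle baseline, as described above.
class ActivePowerEstimator(private val context: Context) {

    private val batteryManager =
        context.getSystemService(Context.BATTERY_SERVICE) as BatteryManager

    // Instantaneous battery current in amperes. BATTERY_PROPERTY_CURRENT_NOW
    // reports microamperes; the sign convention differs between devices, so we
    // simply take the absolute value here.
    private fun currentAmps(): Double {
        val microAmps =
            batteryManager.getIntProperty(BatteryManager.BATTERY_PROPERTY_CURRENT_NOW)
        return abs(microAmps).toDouble() / 1_000_000.0
    }

    // Battery voltage in volts, read from the sticky ACTION_BATTERY_CHANGED
    // broadcast (EXTRA_VOLTAGE is reported in millivolts).
    private fun voltageVolts(): Double {
        val intent = context.registerReceiver(
            null, IntentFilter(Intent.ACTION_BATTERY_CHANGED))
        val milliVolts = intent?.getIntExtra(BatteryManager.EXTRA_VOLTAGE, 0) ?: 0
        return milliVolts / 1000.0
    }

    // Total platform power draw right now, in watts.
    fun instantaneousPowerWatts(): Double = currentAmps() * voltageVolts()

    // Active power of a workload: the average of the power samples taken while
    // the benchmark was running, minus the idle power measured beforehand
    // (which accounts for the display and other platform components).
    fun activePowerWatts(idlePowerWatts: Double, loadSamplesWatts: List<Double>): Double =
        loadSamplesWatts.average() - idlePowerWatts
}
```

In practice instantaneousPowerWatts() would be sampled at a fixed interval for the whole duration of each subtest, with the idle baseline measured beforehand under identical screen conditions; energy then follows as average active power multiplied by runtime.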
Alongside the current generation SoCs, I also included a few predecessors to be able to track the progress that has happened over the last two years in the Android space and across CPU microarchitecture generations. Because the runtime of all benchmarks is in excess of 5 hours even for the fastest devices, we are actively cooling the phones with an external fan to ensure consistent DVFS frequencies across all of the subtests, and to make sure we don't favour the early tests.
116 Comments
zorxd - Monday, January 22, 2018
Samsung isn't vertically integrated? They also have their own SoC and even fabs (which Huawei and Apple don't)
Myrandex - Monday, January 22, 2018
Agreed, but they don't even always use their own components like Huawei and Apple do. I think I've yet to own a Samsung smartphone that uses a Samsung SoC. I think they will eventually get there, though I can't say I have any idea why they aren't doing it today.
jospoortvliet - Saturday, January 27, 2018
Samsung always uses their own SoCs unless they are legally forced not to, like in the US...
levizx - Saturday, January 27, 2018
They are DEFINITELY NOT legally forced to use Qualcomm chips in China, especially for the world's biggest carrier China Mobile. As for the US, if Huawei is not forced to use Qualcomm, I can't imagine why Samsung would be, unless they signed a deal with Qualcomm - then again that's just by choice.
Andrei Frumusanu - Monday, January 22, 2018
Samsung's mobile division (which makes the phones) still makes key use of Snapdragon SoCs for certain markets. Whatever the reason for this - and we can argue a lot about it - the fact is that the end product more often than not ends up being the lowest common denominator in terms of features and performance between the two SoCs' capabilities. In that sense, Samsung is not vertically integrated and does not control the full stack in the same way Apple and Huawei do.
Someguyperson - Monday, January 22, 2018
No, Samsung simply isn't so vain as to use its own solutions when they are inferior. Samsung skipped the Snapdragon 810 because their chip was much better. Samsung used the 835 instead of their own chip last year because the 835 performed nearly exactly the same as the Samsung chip, but was smaller, so they could get more chips out of an early 10 nm process. Huawei chooses their chips so they don't look stupid by making an inferior chip that costs more compared to the competition.
Samus - Monday, January 22, 2018
Someguyperson, that isn't the case at all. Samsung simply doesn't use Exynos in various markets for legal reasons. Qualcomm, for example, wouldn't license Exynos for mobile phones as early as the Galaxy S III, which is why a (surprise) Qualcomm SoC was used instead. Samsung licenses Qualcomm's modem IP, much like virtually every SoC designer, for use in their Exynos. The only other option has historically been Intel, who until recently made inferior LTE modems.

I think it's pretty obvious to anybody that if Samsung could, they would sell their SoCs in all their devices. They might even sell them to competitors, but again, Qualcomm won't let them do that.
lilmoe - Monday, January 22, 2018
Since their Shannon modem integration in the Exynos platform, I've struggled to understand why...

My best guess would be a bulk deal they made with Qualcomm in order to build Snapdragons on both their 14nm and 10nm processes. Samsung offered a fab deal, Qualcomm agreed to build using Samsung fabs and provide a generous discount on Snapdragons for Galaxies, but on the condition that Samsung buy a big minimum quantity of SoCs. That minimum quantity was more than what was needed for the US market. Samsung did the math, and figured that it was more profitable to keep their fabs ramped up and save money on LTE volume licensing. So Samsung made a bigger order and included the Chinese variants in the bulk.
I believe this is all a bean counter decision, not technical or legal.
KarlKastor - Thursday, January 25, 2018
That's easy to answer. Samus is right, it's a legal problem. The reason is named CDMA2000. Qualcomm owns all IP concerning CDMA2000.
Look at the regions where a Galaxy S is shipped with a Snapdragon and look at the countries using CDMA2000. That's North America, China and Japan.
Samsung has two choices: Using a Snapdragon SoC with integrated QC Modem or plant a dedicated QC Modem alongside their own SoC.
The latter is a bad choice concerning space, and I think it's more expensive to buy an extra chip instead of just using a Snapdragon.
I bet all this will end when Verizon quits CDMA2000 in late 2019, and Samsung will use their Exynos SoCs only. CDMA2000 has been useless since LTE and is just maintained for compatibility reasons.
In all regions not using this crappy network, Samsung uses Exynos SoCs in every phone from low cost to high end.
So of course Samsung IS vertically integrated. Saying otherwise is pretty ridiculous.
They have their own fabs, produce and develop their own SoC, modem, DRAM and NAND flash, and have their own CPU and modem IP. They only lack their own GPU IP.
So who is more vertically integrated?
KarlKastor - Thursday, January 25, 2018
I forgot their own displays and cameras. Especially the first is very important. The fact that they make their own displays enables more options in design.

Think of their Edge displays - you may like them or not, but with them the whole design differed greatly from their competitors'.