Exploring the Floating Point Performance of Modern ARM Processors
by Rahul Garg on June 2, 2013 2:30 PM EST
As a programmer who wants to write decent performing code, I am very interested in understanding the architectures of CPUs and GPUs. However, unlike desktop and server CPU vendors, mobile CPU and GPU vendors tend to do very little architectural disclosure - a fact that we've been working hard to change over the past few years. Often, all that's available are marketing slides with fuzzy performance claims. This situation frustrates me to no end. We've done quite a bit of low-level mobile CPU analysis at AnandTech in pursuit of understanding architectures for which there is no publicly available documentation. In that spirit, I wrote a few synthetic tests to better understand the performance of current-gen ARM CPU cores without having to rely on vendor-supplied information. For this article I'm focusing exclusively on floating point performance.
We will look at 5 CPU cores today: the ARM Cortex A9, ARM Cortex A15, Qualcomm Scorpion, Qualcomm Krait 200 and Qualcomm Krait 300. The test devices are listed below.
| Device | OS | SoC | CPU | Frequency | Number of cores |
|---|---|---|---|---|---|
| Samsung Galaxy SIIX (T989D) | Android 4.0 | Qualcomm APQ8060 | Scorpion | 1.5 GHz | 2 |
| Boundary Devices BD-SL-i.mx6 | Ubuntu Oneiric | Freescale i.mx6 | Cortex-A9 | 1.0 GHz | 4 |
| Blackberry Z10 | Blackberry 10 (10.1) | Qualcomm MSM8960 | Krait 200 | 1.5 GHz | 2 |
| Google Nexus 10 | Android 4.2.2 | Samsung Exynos 5250 | Cortex-A15 | 1.7 GHz | 2 |
| HTC One | Android 4.1.2 | Qualcomm Snapdragon 600 | Krait 300 | 1.7 GHz | 4 |
I wanted to test the throughput of various floating point instructions. I wrote a simple benchmark consisting of a loop with a large number of iterations. The loop body consisted of many (say, 20) floating point instructions with no data dependences between them. The tests were written in C++ with gcc NEON intrinsics where required, and I always inspected the generated assembly to verify it was as expected. There were no memory instructions inside the loop, so memory performance was not an issue, and there were minimal dependences in the loop body. I tested the throughput of scalar addition, multiplication and multiply-accumulate for 32-bit and 64-bit floating point datatypes. All the tested ARM processors also support NEON, a SIMD (single instruction, multiple data) instruction set for ARM covering integer and floating point operations. I tested the throughput of 128-bit floating point NEON instructions for addition, multiplication and multiply-accumulate.
Apart from testing the throughput of individual instructions, I also wrote a test measuring the throughput of a program consisting of two types of instructions: scalar additions and scalar multiplications. The instructions were interleaved, i.e. the program consisted of an addition followed by a multiply, followed by another add, then another multiply, and so on. There were no dependences between the additions and the following multiplies. You may be wondering about the reasoning behind this mixed test. Some CPU cores (such as AMD's K10) have two floating point units that are not identical. For example, one floating point unit may only support addition while the other may only support multiplication. If we only tested additions and multiplications separately, we would never see the peak throughput of such a machine; the mixed test identifies such cases.
All the tests measure the time taken for a given number of instructions, which gives us instructions executed per second. We also need to know the frequency to get instructions executed per cycle. Knowing the peak frequency of the device is not enough, because CPUs have multiple frequency states and the tests may not run at the advertised peak speed. Thus, I also wrote code to monitor the percentage of time spent in each frequency state as reported by the kernel. The frequency was calculated as the average of the frequency states, weighted by the percentage of time spent in each state. The observed frequencies on Scorpion (APQ8060), Cortex A9 (i.mx6) and Cortex A15 (Exynos 5250) were 1.242 GHz, 992 MHz and 1.7 GHz respectively, on all tests except where noted in the results below.
However, as it turns out, the method I used for measuring the time spent in each frequency state does not work on aSMP (asynchronous SMP) designs like the Krait 200 based Snapdragon S4 and the Krait 300 based Snapdragon 600. For Krait 200, the results reported here are for the MSM8960, which shouldn't really have thermal throttling issues. My results on the MSM8960 also line up neatly with the assumption that the CPU spent most or all of its time in the peak frequency state. Brian also ran the test on a Nexus 4, which has the same peak frequency, and the results were essentially identical - additional confirmation that our results are likely correct. Thus I will assume a frequency of 1.5 GHz when discussing Krait 200 results. Results on Krait 300 (Snapdragon 600), however, are more mixed. I am not sure it reached peak frequency on all the tests, so I am less sure of the per-cycle estimates for this chip. Brian also ran the tests on another handset (LG Optimus G Pro) with the same Snapdragon 600, and the results were qualitatively very similar.
Now the results. First up, the raw data collected from the tests in gigaflops:
| | Scorpion (APQ8060) | Cortex-A9 (i.mx6) | Krait 200 (MSM8960) | Cortex-A15 (Exynos 5250) | Krait 300 (Snapdragon 600) |
|---|---|---|---|---|---|
| Add (fp64) | 1.23 | 0.99 | 1.33 | 1.55 @ 1.55 GHz | 1.6 |
| Add (fp32) | 1.19 | 0.99 | 1.46 | 1.69 | 1.72 |
| Mul (fp64) | 0.61 | 0.50 | 1.48 | 1.69 | 1.72 |
| Mul (fp32) | 1.22 | 0.99 | 1.49 | 1.69 | 1.72 |
| Mixed (fp64) | 0.82 | 0.99 | 1.48 | 1.63 | 1.72 |
| Mixed (fp32) | 1.23 | 0.99 | 1.47 | 1.69 | 1.72 |
| MAC (fp64) | 1.23 | 0.99 | 1.48 | 3.35 | 2.65 |
| MAC (fp32) | 2.47 | 1.98 | 1.47 | 3.39 | 3.13 |
| Add (fp32 NEON) | 4.94 | 1.99 | 5.86 | 6.77 | 6.89 |
| Mul (fp32 NEON) | 4.89 | 1.99 | 5.76 | 6.77 | 6.89 |
| MAC (fp32 NEON) | 9.88 | 3.98 | 5.91 | 13.55 | 12.5 |
Before we discuss the results, keep in mind that the results and per-cycle throughput estimates reported here are what I observed from my tests. I did my best to ensure that the design of the tests was conducive to achieving high throughput. However, there may be cases where an architecture can achieve higher performance than what I was able to get out of these tests. With that out of the way, let's look at the results.
In the data, we need to distinguish between the number of instructions and the number of flops. I count a scalar addition or multiply as one flop and a scalar MAC as two flops. I count a NEON addition or multiply as four flops and a NEON MAC as eight flops. Thus, we get the following per-cycle instruction throughput estimates:
| | Scorpion | Cortex A9 | Krait 200 | Cortex A15 | Krait 300 |
|---|---|---|---|---|---|
| Add (fp64) | 1 | 1 | 1 | 1 | 1 |
| Add (fp32) | 1 | 1 | 1 | 1 | 1 |
| Mul (fp64) | 1/2 | 1/2 | 1 | 1 | 1 |
| Mul (fp32) | 1 | 1 | 1 | 1 | 1 |
| Mixed (fp64) | 2/3 | 1 | 1 | 1 | 1 |
| Mixed (fp32) | 1 | 1 | 1 | 1 | 1 |
| MAC (fp64) | 1/2 | 1/2 | 1/2 | 1 | 7/9 |
| MAC (fp32) | 1 | 1 | 1/2 | 1 | 10/11 |
| Add (fp32 NEON) | 1 | 1/2 | 1 | 1 | 1 |
| Mul (fp32 NEON) | 1 | 1/2 | 1 | 1 | 1 |
| MAC (fp32 NEON) | 1 | 1/2 | 1/2 | 1 | 10/11 |
We start with the Cortex A9. It achieves a throughput of 1 operation/cycle for most scalar instructions, except fp64 mul and fp64 MAC, which can only be issued once every two cycles. The mixed test reveals that although fp64 muls can only be issued every two cycles, the Cortex A9 can issue an fp64 add in the otherwise empty pipeline slot, so in the mixed test it achieves a throughput of 1 instruction/cycle. The NEON implementation in the Cortex A9 has a 64-bit datapath, and all NEON instructions take 2 cycles. Qualcomm's Scorpion behaves similarly to the Cortex A9 on scalar instructions, except that it seems unable to issue fp64 adds immediately after fp64 muls in the mixed test. Scorpion uses a full 128-bit datapath for NEON and has twice the NEON throughput of the Cortex A9.
Krait 200 features an improved multiplier and offers 1 instruction/cycle throughput for most scalar and NEON instructions. Interestingly, Krait 200 has half the per-cycle throughput for MAC instructions, which is a regression compared to Scorpion. Krait 300 improves MAC throughput over Krait 200, but still appears unable to reach 1 instruction/cycle, possibly revealing some issue in the pipeline. An alternate explanation is that the Snapdragon 600 reduced its frequency in the MAC tests for some unknown reason; without accurate frequency information, it is difficult to make that judgment. Cortex A15 is the clear leader here, offering a throughput of 1 FP instruction/cycle in all our tests.
In the big picture, readers may want to know how the floating point capabilities of these cores compare to x86 cores. I consider Intel's Ivy Bridge and Haswell as datapoints for big x86 cores, and AMD's Jaguar as a datapoint for a small x86 core. For double precision (fp64), current ARM cores appear to be limited to 2 flops/cycle for FMAC-heavy workloads and 1 flop/cycle for non-FMAC workloads. Ivy Bridge can reach up to 8 flops/cycle, Haswell can do 16 flops/cycle with AVX2 instructions, and Jaguar can execute up to 3 flops/cycle. Thus, current ARM cores are noticeably behind here. Apart from the usual reasons (power and area constraints, very client-focused designs), current ARM cores particularly lag in fp64 because NEON currently has no fp64 vector instructions. The ARMv8 ISA adds fp64 vector instructions, and high performance implementations of the ISA such as the Cortex A57 should begin to close the gap.
For fp32, Ivy Bridge can execute up to 16 flops/cycle, Haswell up to 32 flops/cycle, and AMD's Jaguar 8 flops/cycle. Current ARM cores can do up to 8 flops/cycle using NEON instructions. However, ARM NEON instructions are not fully IEEE 754 compliant, whereas SSE and AVX floating point instructions are. Thus, comparing flops obtained with NEON instructions to SSE instructions is not an apples-to-apples comparison. Applications that require IEEE 754 compliant arithmetic cannot use NEON, but more consumer-oriented applications, such as multimedia, should be able to. Again, ARMv8 will fix this issue and bring fully IEEE 754 compliant fp32 vector instructions.
To conclude, the Cortex A15 clearly leads amongst the CPUs tested today, with Krait 300 close behind. It is also somewhat disappointing that none of the CPU cores tested sustained more than 1 FP instruction/cycle in these tests. I end on a cautionary note: these are synthetic tests that only stress the FP units. Floating point ALU peaks are only one part of a microarchitecture, and the performance of real-world applications will depend on the rest of it: the cache hierarchy, out-of-order execution capabilities and so on. We will continue to investigate these CPUs to understand them better.
39 Comments
skiboysteve - Monday, June 3, 2013 - link
My work is going to be using Cortex A9 for a project soon and that team is deciding on NEON vs VFPv3. Can you comment on the precision and performance tradeoffs? Thanks for the great article!
Wilco1 - Monday, June 3, 2013 - link
Neon supports 32-bit float only, but with Neon the A9 can do 2 FMACs per cycle rather than 1 with VFP. There is no tradeoff in precision if your code already uses 32-bit floats (Neon flushes denormals to zero by default; with VFP you can choose - either way, it doesn't affect any real code).
eiriklf - Monday, June 3, 2013 - link
Is there any chance to see the scores from a third Krait 200 device, for instance a Krait-based One X, GS3 or Optimus G? I know all of those devices have about 3x the performance of the Nexus 4 in Linpack per core, so I would love to know if you found a difference with your script.
srihari - Monday, June 3, 2013 - link
Can you compare with Intel? I understand you have NEON instructions in your test, but x86 vs ARM would be a good comparison.
srihari - Monday, June 3, 2013 - link
Performance is not the only criterion to compare. I would conclude Krait 300 clearly leads considering performance+power.
banvetor - Wednesday, June 5, 2013 - link
Great article, thanks for the work. Looking forward to more in the series... :)
Parhelion69 - Wednesday, June 5, 2013 - link
Could you update this article with numbers from the Exynos 5 Octa, from the SGS IV? I've run some benchmarks and its A15 seems like quite a beast.
Antutu 28086, CPU float-point: 5923
javascript:
sunspider: 652 ms
kraken: 6392ms
Riabench focus: 1468 ms
I don't have geekbench but found these numbers:
http://browser.primatelabs.com/geekbench2/2014946
Geekbench score: 3598, floating point: 6168
Arkantus - Wednesday, June 19, 2013 - link
Hello, just a dumb question: the article says "I count NEON addition and multiply as four flops and NEON MACs are counted as eight flops.", and the A9 Add (fp32 NEON) is rated at 1/2 per cycle. So does this mean the Add (fp32 NEON) is slower than its VFP counterpart? Since for each cycle the NEON version only performs half an operation according to this table.
Thanks
sonsequence@HOTMAIL.COM - Friday, June 27, 2014 - link
Hey, this is good stuff. Can anybody here help explain something for me though? I'm a database apps and integration guy, not formally trained and just starting to get interested in this kind of low-level stuff. I've just been reading up on DMIPS and wondering how they relate to flops.
What I think I know so far:
A flop is floating point calculation.
The "ip" in "Mip" is an instruction so a broader term (is a flop a type of ip or does it take 2 ips to make a flop drop?)
Instructions per second is about the rawest, most non-contextualised metric of computing power you can get. Flops are a close second.
Squeezing more instructions out of a single CPU cycle is the hard problem. There aren't massive variances in what can be done in this regard. The Krait 300 manages about 3.3 instructions per cycle, which on 4 cores at 1.7GHz works out to about 22 GigaIPS (semi-source: http://investorshub.advfn.com/boards/read_msg.aspx...
My question is, firstly why are GPUs seemingly never measured in DMips and CPUs rarely in flops?
Secondly, would knowing the answer to the "firstly" explain why, despite no huge variance in DMIPS/MHz across different devices, the top GPUs manage 1000x faster performance than these ARM chips when measured in flops? They get tera- not gigaflops whilst using a similar number of cores and a lower frequency.
Obviously they consume a tonne more power to do it, so I know it's not something for nothing, but what's the heart of that something when it comes to how much you can do in a cycle?
Ah. It's just occurred to me. Is it that an "instruction" refers to an item in a linear thread, but that just one of them in a GPU might include setting the RGB values for all pixels in a frame at once? That'd be a few million flops in parallel for one instruction?
Hmmm, well the real world numbers don't add up for that but is that along the right lines? If so why are these Gigaflop numbers lower than Gigamips?
Sorry it was a long one. It can be very hard to find an intermediate starting point when you google an advanced subject.