Analyzing Falkor’s Microarchitecture: A Deep Dive into Qualcomm’s Centriq 2400 for Windows Server and Linuxby Ian Cutress on August 20, 2017 11:00 AM EST
Enterprise Features: Security
With security being a strong focal point in data center tasks, all of the major players that want to provide processors for cloud deployments have been getting their hands dirty and talking security. The ability to provide security keys for hypervisors, VMs, and everything else that can be sandboxed from other users is paramount. To this extent, the Centriq 2400 is supporting two levels of security: EL3 and EL2. This means TrustZone at a system level (EL3) as well as a hypervisor level (EL2), although Qualcomm has not gone into detail if this extends through to having some VMs secure and others not within the same hypervisor environment. Where some of Qualcomm’s competitors are using ARM’s TrustZone implementation – which is an ARM Cortex-A5 for the management – Qualcomm has stated that their solution is not ARM based but a custom design that is TrustZone compliant. We confirmed that this wasn’t another re-use of an ARM architecture license.
Also for security, Qualcomm has added instructions geared towards cryptography acceleration, supporting AES, SHA1, and SHA2-256.
Enterprise Features: Secure Boot
Implementing a Root of Trust has also being making the rounds in recent years. With nefarious code potentially rewriting firmware, or zero-day flaws in technology being exploited by friend and foe, being able to verify the underlying system is as intended and only as intended becomes paramount. Qualcomm’s Centriq 2400 will use Secure Boot functionality.
This is accomplished by providing an Immutable Boot ROM via an integrated management controller, with burned in code and cryptographic keys to authenticate firmware and software before any other firmware is loaded. Qualcomm states that this guarantees knowledge of ownership at the base level, as it allows customers to store (at purchase) public keys from Qualcomm, the OEM or the customer to authenticate secondary and tertiary bootloaders with an anti-rollback check. The management controller also supports accelerated cryptography on SHA for digital signatures and RSA public key operations.
Enterprise Features: QoS
Also on the cards is L3 quality of service. In shared resource environments, mission critical applications can be disturbed by ‘noisy neighbors’. With multiple virtual machines vying for the same resources on a single machine, issues such as shared cache contention have flared up in recent quarters for data center use. If one VM is relying on consistent performance from memory accesses from cache but another program is thrashing it and causing inconsistent performance, the user experience can noticeably be disturbed.
There are multiple ways to tackle this, such as increasing the amount of private cache per core/VM, or by providing L3 cache Quality of Service (QoS) features. Intel has done both in recent years, such as increasing the amount of L2 private cache on the Skylake-SP Xeons from 256KB to 1MB, as well as offering L3 QoS since Broadwell-EP. AMD uses 512KB of L2 private cache, and also has QoS in play. Qualcomm isn’t disclosing the amount of L2 or L3 cache in today’s announcement, but were happy to discuss their QoS strategy.
Qualcomm has stated (despite some odd diagrams perhaps suggesting otherwise) that the L3 cache in the Centriq family is a distributed cache, which likely means that each core (or Duplex, more on that later) has a certain amount of associated L3 cache and L3 cache tags with it. By using a hardware abstracted QoS identification method per client, the SoC can monitor resources and enforce L3 QoS policies per domain ID and per L3 segment, down to the instruction and data level granularity. This is done using Way-based allocation, and policies can be adjusted or fine-tuned on the fly per thread or class of threads. Qualcomm’s implementation can support up to 256 defined environments, one of which can be designated or the SoC IO.
Enterprise Features: Memory Bandwidth Compression
One of Qualcomm’s angles in the data center space is going to be that many data center workloads are memory bandwidth constrained. ‘Feeding the Beast’ is the limit for the markets they want to enter, so by enabling transparent memory compression out to DRAM, Qualcomm is attempting to address the issue. This feature will be transparent to any software, with the effect seen mostly in compressible data streams and memory streaming benchmarks.
By using a proprietary algorithm, Qualcomm’s inline compression will attempt to reduce a 128-byte cache line to 64-bytes with ECC as it moves into main memory. When recalling the data back into the core or for committing to storage, decompression adds an additional 2-4 cycles (1-2% on 250-cycle latency) but aims to bring more data in per request than uncompressed data. There could be a slight added benefit of lower power consumption as well, as less data is transferred. We’ve seen these techniques in the GPU space for a number of years.
From the software perspective, the effect will vary considerably from test to test depending on the workload. The Centiq 2400 series comes with six DDR4 memory channels, supporting two DIMMs per channel and up to DDR4-2667, so there’s going to be a lot of bandwidth to begin with – but sometimes that just isn’t enough.