Qualcomm Demos 48-Core Centriq 2400 Server SoC in Action, Begins Sampling
by Anton Shilov on December 16, 2016 6:00 PM ESTQualcomm this month demonstrated its 48-core Centriq 2400 SoC in action and announced that it had started to sample its first server processor with select customers. The live showcase is an important milestone for the SoC because it proves that the part is functional and is on track for commercialization in the second half of next year.
Qualcomm announced plans to enter the server market more than two years ago, in November 2014, but the first rumors about the company’s intentions to develop server CPUs emerged long before that. In fact, being one of the largest designers of ARM-based SoCs for mobile devices, Qualcomm was well prepared to move beyond smartphones and tablets. However, while it is not easy to develop a custom ARMv8 processor core and build a server-grade SoC, building an ecosystem around such chip is even more complicated in a world where ARM-based servers are typically used in isolated cases. From the very start, Qualcomm has been rather serious not only about the processors themselves but also about the ecosystem and support by third parties (Facebook was one of the first companies to support Qualcomm’s server efforts). In 2015, Qualcomm teamed up with Xilinx and Mellanox to ensure that its server SoCs are compatible with FPGA-based accelerators and data-center connectivity solutions (the fruits of this partnership will likely emerge in 2018 at best). Then it released a development platform featuring its custom 24-core ARMv8 SoC that it made available to customers and various partners among ISVs, IHVs and so on. Earlier this year the company co-founded the CCIX consortium to standardize various special-purpose accelerators for data-centers and make certain that its processors can support them. Taking into account all the evangelization and preparation work that Qualcomm has disclosed so far, it is evident that the company is very serious about its server business.
From the hardware standpoint, Qualcomm’s initial server platform will rely on the company’s Centriq 2400-series family of microprocessors that will be made using a 10 nm FinFET fabrication process in the second half of next year. Qualcomm does not name the exact manufacturing technology, but the timeframe points to either performance-optimized Samsung’s 10LPP or TSMC’s CLN10FF (keep in mind that TSMC has a lot of experience fabbing large chips and a 48-core SoC is not going to be small). The key element of the Centriq 2400 will be Qualcomm’s custom ARMv8-compliant 64-bit core code-named Falkor. Qualcomm has yet has to disclose more information about Falkor, but the important thing here is that this core was purpose-built for data-center applications, which means that it will likely be faster than the company’s cores used inside mobile SoCs when running appropriate workloads. Qualcomm currently keeps peculiarities of its cores under wraps, but it is logical to expect the developer to increase frequency potential of the Falkor cores (vs mobile ones), add support of L3 cache and make other tweaks to maximize their performance. The SoCs do not support any multi-threading or SMP technologies, hence boxes based on the Centriq 2400-series will be single-socket machines able to handle up to 48 threads. The core count is an obvious promotional point that Qualcomm is going to use over competing offerings and it is naturally going to capitalize on the fact that it takes two Intel multi-core CPUs to offer the same amount of physical cores. Another advantage of the Qualcomm Centriq over rivals could be the integration of various I/O components (storage, network, basic graphics, etc.) that are now supported by PCH or other chips, but that is something that the company yet has to confirm.
From the platform point of view, Qualcomm follows ARM’s guidelines for servers, which is why machines running the Centriq 2400-series SoC will be compliant with ARM’s server base system architecture and server base boot requirements. The former is not a mandatory specification, but it defines an architecture that developers of OSes, hypervisors, software and firmware can rely on. As a result, servers compliant with the SBSA promise to support more software and hardware components out-of-the-box, an important thing for high-volume products. Apart from giant cloud companies like Amazon, Facebook, Google and Microsoft that develop their own software (and who are evaluating Centriq CPUs), Qualcomm targets traditional server OEMs like Quanta or Wiwynn (a subsidiary of Wistron) with the Centriq and for these companies having software compatibility matters a lot. On the other hand, Qualcomm’s primary server targets are large cloud companies, whereas server makers do not have their samples of Centriq yet.
During the presentation, Qualcomm demonstrated Centriq 2400-based 1U 1P servers running Apache Spark, Hadoop on Linux, and Java: a typical set of server software. No performance numbers were shared and the company did not open up the boxes so not to disclose any further information about the CPUs (i.e., the number of DDR memory channels, type of cooling, supported storage options, etc.).
Qualcomm intends to start selling its Centriq 2400-series processors in the second half of next year. Typically it takes developers of server platforms a year to polish off their designs before they can ship them, normally it would make sense to expect Centriq 2400-based machines to emerge in the latter half of 2H 2017. But since Qualcomm wants to address operators of cloud data-centers first and companies like Facebook and Google develop and build their own servers, they do not have to extensively test them in different applications, but just make sure that the chips can run their software stack.
As for the server world outside of cloud companies, it remains to be seen whether the server industry is going to bite Qualcomm’s server platform given the lukewarm welcome for ARMv8 servers in general. For these markets, performance, compatibility, and longevity are all critical factors in adopting a new set of protocols.
Related Reading:
- Evaluating Futuremark's Servermark VDI on the Supermicro SYS-5028D-TN4T
- New GIGABYTE Server Motherboards Show Xeon D Round 2
- AMD Exits Dense Microserver Business, Ends SeaMicro Brand
Source: Qualcomm
88 Comments
View All Comments
deltaFx2 - Tuesday, December 20, 2016 - link
Constant reference to 32-bit ARM: Well, Wilco brought it up while arguing code density. I'm assuming that Thumb wouldn't be less dense than Aarch64; the referenced paper didn't have Aarch64 (circa 2009), so you have to read between the lines. I know most server players don't support Aarch32, and Apple is phasing it out. (I think they got rid of it in their ecosystem, but apps may be A32).Re. x86 being ld-store: That's a stretch. How else would you build an out-of-order pipeline? You could build a pipeline that handles fused ld-ex in the same pipeline, but that would be stupid because there are plenty of combinations of EX. Splitting it across two pipes is the logical thing to do. In an alternative universe where no RISC existed, do you really believe this would not be how an OoO x86 pipeline is designed?
I think I stated earlier that x86's biggest bottleneck is the variable length decode, not "CISC".
Spill-Fill of dead regs: It would be nice if compilers could do analysis across multiple function calls. Some do better than others, but it's not uncommon for compilers to be conservative or simply unable to do inter-proc optimizations beyond a point. The other thing is, the extra 16 regs are architected state. So, you need dedicated storage in the PRF for them, per thread. An x86 SMT-2 machine needs 16*2 PRFs for dedicated architected state. An ARM SMT-2 machine needs 32*2. Intel in Haswell, I think, had roughly 172 entry PRF, so 140 entries for speculative state. In ARM, it would be 108 entries. ARM pays this tax even when I can make do with 8 architected registers. Since the x86 temp has only one consumer, one can reuse the destination of the ld-op as the temp.
Consistency Model: That's an interesting one. Do you burden the programmer with the responsibility of inserting serializing barriers in their code (DSB/ISB, etc in ARM), or do you burden the hardware designers? There are far more programmers than hardware designers, after all, and a release consistency like memory model is likely to be painful to debug. I have heard of (apocryphal) instances in which programmers go to town with inserting barriers because of "random" ordering. Note also, that in x86 and other strong memory models, the default speculation is that there is no ordering violation (this is in the core itself, in the uncore one has to obey it). If, an ordering violation is found after the fact (like older load saw a younger store and a younger load saw an older store), then the core flushes and retries. As opposed to a DSB which has to be obeyed all the time. IDK. In the common case, I would think speculation is more effective. That's the whole point of TSX, right? It's hard to compare though.
Just one final note on decode width: It only really matters after a flush, or in very high IPC workloads. IBM needs a wide decode (8) because it's SMT-8. Apple has 6-wide decode @2.3GHz fmax. They would've probably needed more decode stages if they targeted 4GHz (fewer than x86 for sure). The op-cache in Intel is as much a power feature as it is a performance feature. It allows 6-wide dispatch, shuts down decode, and also cuts the mispredict latency. You could build a 6-wide straight decode in x86 at more power, or have this op-cache. My guess is that the op-cache won, and I would guess it was due to variable length. Would a high-freq, high performane ARM design benefit from an op-cache? Not as much as x86, but I'm sure a couple of cycles of less mispredict latency alone should make it interesting.
Meteor2 - Tuesday, December 20, 2016 - link
Who says Anandtech is going downhill? Best discussion I've seen in years. Would love a conclusion though...Mine is that ISA makes no significant difference to power/performance. It's all about the process now. And with EUV looking like the end of the road, at least for conventional silicon, I think everything will converge at about '5' nm in the early 2020s.
In which case it's probably not worth investing in a software ecosystem different than x86.
deltaFx2 - Tuesday, December 20, 2016 - link
@Meteor2 : I agree. ARM has to bring something awesome to the table that x86 does not have. x86 has decades of debugged and optimized software (including system software), similar cost structure to ARM (arguably better since the cores also sell in the PC market as opposed to ARM server makers), and higher single-threaded and multithreaded performance (at the moment). With AMD's future products looking promising (we'll know for sure next year), there's also the competition aspect that ARM vendors keep harping about. But let's see. SHould be interesting. Fujitstu has announced that they will use ARM in their HPC servers with vector extensions, so we'll see.deltaFx2 - Sunday, December 18, 2016 - link
"what folks are missing is the obvious: X86 is a CISC ISA which is "decoded" to RISC (we don't, so far as I know, know the ISA of this real machine)." Before RISC became the buzzword, this was called microcode. It's always been around, just that RISC started killing old CISC machines like VAX by promising smaller cores and higher frequencies (both true until out-of-order became mainstream and it mattered much less). Intel's marketing response was to say that we're RISC underneath, but honestly, how else would you implement an instruction like rep movsb (memcpy in 1 instruction, essentially) in an out-of-order machine?Threska - Tuesday, January 3, 2017 - link
With virtualization I could see it being viable.evilspoons - Saturday, December 17, 2016 - link
Anandtech hits you with Wall of Text! It's super effective!Euughh, could use a couple more paragraphs, maybe some bullet points between those first two images, this is making me cross-eyed.
cocochanel - Saturday, December 17, 2016 - link
One important fact is being overlooked by many posting comments. Qualcomm is a big company and their market analysis showed there is a market for this product. A company of this size would not spend big bucks on a new server architecture just because they have nothing else to do. The x86 vs ARM debate has been around for a while with both camps digging in rather hard. Only future will decide the winner. On a side note, ARM efficiency is a big advantage plus the ability to scale and not to mention licensing advantages. As hard as Intel and AMD try, is hard to squeeze more and more from the old x86. I mean, look at AMD. They spent a fortune and 4-5 years on Zen ( or Ryzen ) and what are the results ? A processor that is not that much better than Intel's current lineup.deltaFx2 - Sunday, December 18, 2016 - link
@ cocochanel: Re Zen comparison, that comparison would make sense if ARM (or Power) wipe the floor with x86 in performance. Clearly they do not. Power burns a ton of power to be competitive with x86 in multithreaded performance (1T is still behind), and ARM isn't in the ballpark.deltaFx2 - Sunday, December 18, 2016 - link
One more thing to add: QC is feeling pressured by low cost providers in the cellphone space, and it needs to move out. As servers are a high margin business, it makes sense for QC to try. Anand Chandrashekhar, who heads this project at QC said the market wants choice. The question is, does it want choice in providers or choice in ISA? In other words, is the existence of Zen sufficient to provide an alternative to Intel, obviating the need for ARM? After all, there are other ISAs around. Power is here; it has high performance; It's not IBM's first rodeo. Power probably has a software stack ready to go. Where's Power in the data center?smilingcrow - Sunday, December 18, 2016 - link
Large companies looking to move into another sector is a Riscy business. ;)Sure they'll have analysed it and part of their desire to change tack is a defensive move so they will take a risk as a punt on survival as well as in hoping to expand.
So it may be just as much a defensive move as anything as diversification often is; don't have all your eggs in one basket.