Morawka - Thursday, June 26, 2014 - link
Holy batman, that's a lot of x86 cores. 9,300 nodes * 72 cores per node = 669,600 Bay Trail-class cores with AES. Kind of crazy when you think about it.
I wonder if cooling a socketed version with this Hybrid Memory Cube-style memory will be a real challenge. Heat doesn't dissipate through memory very well.
Friendly0Fire - Thursday, June 26, 2014 - link
Yeah, I'm really curious about the form factor of a socketed variant. It sounds like it could be an extremely interesting combo with a dual-socket Xeon board for massively parallel workloads, perhaps even eating straight into NVIDIA and AMD's GPGPU lunch.
ddriver - Saturday, June 28, 2014 - link
Probably as much as it did eat from their GPU lunches. AMD already has a part delivering 2.6 DP tflops at 28 nm. And knowing Intel, this chip will probably cost an arm and a leg, and a set of good kidneys.
quixver - Wednesday, July 9, 2014 - link
Shut up and take my money ... and my body parts!
bassportout - Monday, July 21, 2014 - link
I'd give my left nut for this. Jk
madmilk - Thursday, June 26, 2014 - link
Traditional BGA packaged memory doesn't dissipate heat well because of the plastic case. On-package memory shouldn't be harder to cool than any other MCM.
FaaR - Friday, June 27, 2014 - link
DRAM doesn't dissipate very much power in the first place, so cooling it should not be a big challenge. Traditionally, most of the power burnt by high-bandwidth memory ICs is due to the large, high-power transceivers needed. 7GHz GDDR5 is >50% transceiver IIRC. Stacked memory on the CPU substrate itself should require much less power... The Crystalwell eDRAM used in Haswell GT3e CPUs burns what, a watt or two at most? The tech used here is probably more advanced*.
*Blind guess, but seems logical. :)
tipoo - Thursday, June 26, 2014 - link
Very interesting architecture; I'll be interested in seeing its impact. However, I'm wondering, with these massively parallel chips with up to 72 cores, if the x86 ISA isn't going to be a huge hindrance. The x86 decode logic is still a fairly big part of the chip: if you look at, say, a Jaguar floorplan, it's about as big as the FPU, one of the biggest functional units in a CPU core. Multiply that by 72 and you're spending a lot of die area and power just on ISA compatibility. I wonder if this is an area where starting from scratch wouldn't benefit Intel tremendously, lest ARM swoop in with its smaller ISA and do something better in less die area.
p3ngwin1 - Thursday, June 26, 2014 - link
"I'd wonder if this is an area where starting from scratch wouldn't benefit Intel tremendously"Well, you know what happened last time.....Itanium.....
Cogman - Thursday, June 26, 2014 - link
Which is a vastly superior uarch to x86. One of the main reasons it didn't catch on is that its x86 emulation wasn't fast enough.
Itanium can kick the crap out of x86 for a large number of workloads.
makerofthegames - Friday, June 27, 2014 - link
No, Itanium failed because nobody could make a good compiler for it, including Intel. VLIW sucks for general-purpose computing. It works for a few workloads (graphics, mainly) but on anything relying on memory performance it's worthless.
x86 decoding isn't hard - it's instruction reordering that takes up all that die space. Itanium tried to do all that at compile time, which doesn't work because it needs to know which data is cached or not.
basroil - Monday, June 30, 2014 - link
VLIW has been around for a while, and you can make some ridiculously fast programs with it... In assembly. Even AMD dabbled with it in their 5000-6000 series GPUs, but yes, compiling generalized problems is a mess, which is why Intel has positioned Itanium for infrastructure and application-specific computing in recent years.
tipoo - Thursday, June 26, 2014 - link
The Itanium business is still larger than all of AMD, hah. But yes, it never caught on enough to replace x86, and it was going in the opposite direction anyway as far as I know: instead of reducing ISA complexity, it added some, with even longer instructions. Intel is more limited going down into low-power territory than up into high power; the more you shrink cores to fit low power budgets, the more the x86 ISA starts to look too large.
dylan522p - Thursday, June 26, 2014 - link
Having more complicated instructions is beneficial in the long run, especially when you break them into microcode. They can change how the microcode works with every generation, as opposed to ARM or other RISC designs, which stay with the SAME instruction internally for years and years, which bottlenecks performance. It helps somewhat with power, but as you scale up that is irrelevant.
Homeles - Thursday, June 26, 2014 - link
ARM has an instruction decode block too. It's probably not terribly far off from an areal perspective, at a given level of performance. It may have been the case back in the 80s/90s, but the differences between RISC and CISC are much more nuanced now.
RealWorldTech has an article from the year 2000 showing that while the differences between RISC and CISC had closed considerably, they still mattered. However, that was 14 years ago.
The ISA is basically meaningless today, in the context you're speaking of. The underlying architecture and its implementation is far, far more critical.
Take the Apple A7 vs. Intel's Silvermont. Despite being fairly close in performance, two Silvermont cores are significantly smaller than two Cyclone cores. The area difference worsens when you include the A7's L3 cache, which Silvermont designs do not rely on.
ARM-to-ARM comparisons can vary widely too, even between custom implementations. The A7 is a dual-core design that outpaces the quad-core Krait 400. The CPU area is fairly similar between the two, although the A7 probably loses out by a small margin once the L3 cache is included.
Back to RISC vs. CISC: the University of Wisconsin published a paper comparing the A9 against AMD's Bobcat and Intel's Saltwell Atom. Their conclusion: "Our study shows that RISC and CISC ISA traits are irrelevant to power and performance characteristics of modern cores."
http://research.cs.wisc.edu/vertical/papers/2013/h...
There's no doubt RISC had a considerable lead over CISC, but that was decades ago.
tabascosauz - Thursday, June 26, 2014 - link
True. I too am sick and tired of hearing ARM mentioned as this huge, inevitable looming monster on the horizon for Intel and proponents of x86. It isn't. Saying that Krait 400 = Cortex-A15 and Cyclone = A57 is like placing Bulldozer beside K10 and making direct comparisons. The different implementations of ARMv7 and ARMv8 in the ARM world are wholly different entities in terms of CPU and cache design.
abufrejoval - Thursday, June 26, 2014 - link
I believe you have a rather valid point there: creating a massively parallel, SIMD-enhanced, general-purpose CPU to compete with GPUs was certainly a valid exercise, because it could be productive and effective on a far wider range of problems, a far wider existing code base and a far wider population of programmers.
With ARM64 (or MIPS or any other new/clean 64-bit CPU design for that matter) more silicon real estate might be used to create additional cores. How much or how many, and how relevant that would be vs. the die area used for caches, I don't know. Perhaps yields could be improved, because a smaller complex-logic core means perhaps more cores for the same compute power, and less is lost to a defect.
What I could not gauge is how code using these AVX-512F SIMD instructions would actually be written these days: x86/AVX-512F assembler won't easily convert to equivalent SIMD instructions on ARM64, but high-level language code just might--with the right compiler.
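As a rough illustration (my own sketch, not anything from Intel's material, so treat the flags as an assumption), the same portable high-level loop can be compiled to AVX-512F on x86 or to NEON/SVE on ARM64 just by switching compiler targets (e.g. -mavx512f with GCC/Clang, or -xMIC-AVX512 with Intel's compiler for Knights Landing):

#include <stddef.h>

/* Portable SAXPY kernel: a vectorizing compiler can map this one loop to
   AVX-512F on x86 (16 floats per 512-bit register per iteration) or to
   NEON/SVE on ARM64, with no source changes -- only different flags. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    #pragma omp simd   /* vectorization hint; honored with -fopenmp-simd */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Hand-written AVX-512F intrinsics or assembler, by contrast, would have to be rewritten for every target ISA.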
Because Intel doesn't just do the CPU but (I believe) provides compilers and libraries around them, they most likely have a lead of a couple of years against any direct ARM competition.
But these days it's become far too easy to add and use FPGA or other special-function IP blocks on ARM SoCs, and all of a sudden Intel might find itself in an arena with far more "knights" than it ever imagined.
They can't quite escape the fact that, for any fixed workload, a general-purpose architecture can be outperformed by a special-purpose one. More than ever, chips aren't simply "best" or "better"; their fitness depends on the mix of problems they are used for.
Kevin G - Friday, June 27, 2014 - link
The thing with ARM is that some instructions in earlier iterations (I'd have to check ARMv8) don't need to be decoded at all. That's where the RISC vs. CISC differences mainly come into play today: die size. The area savings from a simpler decoder add up: in the case of Knights Landing, how many cores could be added if the decoder were half its current size? 80 instead of 72?
That's the problem that Intel is currently facing. Intel isn't using its high-IPC cores here but ones typically found in Atom. ARM designs can reach similar levels of IPC, potentially in a smaller die area. Thus, even if Silvermont wins on IPC, competing RISC-based SoCs could still have higher throughput thanks to more cores.
Intel does have an ace up its sleeve in owning its own fabs. A fab process advantage lets it produce larger dies while keeping raw CPU areas similar. However, that process advantage is not going to be long-lived, due to the difficulties of future shrinks.
Vlad_Da_Great - Thursday, July 23, 2015 - link
@Kevin G. Not only that, but also chemical process and material process restrictions. The biggest advantage for INTC will always be design and fab in one house. ARM minions will have to rely on "others", and on many occasions that will not work.
Vlad_Da_Great - Thursday, July 23, 2015 - link
* material properties
ravyne - Monday, July 7, 2014 - link
x86/x64 ISA compatibility really isn't the problem -- you could make the argument that these ISAs are large and kind of ugly -- but their effect on die size is minimal. In fact, only between 1-2 percent of your CPU is actually doing direct computation -- nearly all the rest of the complexity is wiring or latency-hiding mechanisms (OoO execution engines, SMT, buffers, caches). Silvermont, IIRC, is something of a more direct x86/x64 implementation than is typical (others translate all x86/x64 instructions to an internal RISC-like ISA), so it's hard to say, but I think in general these days the ISA has very little impact on die size, performance, or power draw -- regardless of vendor or ISA, there's a fairly linear scale relating process size, die size, power consumption, and performance.
Madpacket - Thursday, June 26, 2014 - link
Looks impressive. Now, if only we could have a generous amount of stacked DRAM to sit alongside a new APU, we could gain respectable gaming speeds.
nathanddrews - Thursday, June 26, 2014 - link
Mass quantities of fast DRAM... McDRAM is the perfect name.
Homeles - Thursday, June 26, 2014 - link
Clever :-)
Homeles - Thursday, June 26, 2014 - link
You'll undoubtedly see it, but the question is when. Stacked DRAM won't be debuting in consumer products until late this year at the earliest. AMD will likely be the first to market, with their partnership with Hynix for HBM. There was a chance Intel would put Xeon Phi out first, but with a launch in the second half of 2015, that's pretty much erased.
Still, initial applications will be for higher margin parts, like this Xeon Phi, and AMD's higher-end GPUs. It will be some time before costs come down to be used in the more cost-sensitive APU segment.
abufrejoval - Thursday, June 26, 2014 - link
Well, a slightly different type of stacked DRAM is already in wide use in consumer devices today: my Samsung Galaxy Note uses a nice stack of six dies mounted on top of the CPU.
It's using BGA connections rather than through-silicon vias, so it's not quite the same.
What I don't know (and would love to know) is whether that BGA stacking allows Samsung to maintain CMOS signalling levels between the DRAM chips or whether they need to switch to "TTL-like" signals and amplifiers already.
If I don't completely misunderstand, this ability to maintain CMOS levels and signalling across dice would be one of the critical differentiators for through-silicon vias.
But while stacked DRAM (of both types) may work okay on smartphone SoCs not exceeding 5 watts of power dissipation, I can't help but imagine that with ~100 watts of Knights Landing power consumption below, stacked DRAM on top may add significant cooling challenges.
Perhaps these through-silicon vias will be coupled with massive copper (cooling) towers going through the entire stack of CPU and DRAM dice, but again, that seems to entail quite a few production and packaging challenges!
Say you have 10x10 cooling towers across the chip stack: just imagine the cost of a broken drill on hole #99.
I guess thermally optimized BGA interconnects may actually be easier to manage overall, but at 100 watts?
I can certainly see how this won't be ready tomorrow, and not because it's difficult to do the chips themselves.
Putting that stack together is an entirely new ball game (grid or not ;-), and while a solution might never pay for itself in the Knights Corner ecosystem, that tech in general would fit anything else Intel produces.
My bet: they won't actually put the stacked DRAM on top of the CPU, but go for a die-carrier solution very much like the eDRAM on Iris Pro.
iceman-sven - Thursday, June 26, 2014 - link
This might hint at a faster Skylake-EP/-EX release. EP could be moved from late 3Q17 to late 4Q16. They could even skip Broadwell-EP.
I hope so.
iceman-sven - Thursday, June 26, 2014 - link
I mean from late 3Q16 to late 4Q15.
Why no edit? Dammit.
Pork@III - Thursday, June 26, 2014 - link
Just silicon; it makes no trouble for me.
So, what's the FLOPS/W?
Pork@III - Friday, June 27, 2014 - link
Better than any other big cores except IBM POWER8 and the future supercomputer-on-a-chip, IBM POWER9.
toyotabedzrock - Friday, June 27, 2014 - link
Why not just add more AVX pipes instead of more cores? Then use a design where a few logic cores handle moving data to the pipes.
And is it just me or did anyone else read the MCDRAM as McRAM?
puplan - Friday, June 27, 2014 - link
"nearly 50% more than Knights Corner’s GDDR5" - the slide shows "5x bandwidth vs. GDDR5".iceman-sven - Friday, June 27, 2014 - link
5x bandwidth vs. DDR4, not GDDR5.
Ryan Smith - Friday, June 27, 2014 - link
The 5x comment is on a per chip/module basis. In terms of total bandwidth, KC was ~350GB/sec versus KL targeting over 500GB/sec.
celestialgrave - Friday, June 27, 2014 - link
With all the talk about the importance of co-processors, does anyone else remember how big a deal integrating math co-processors was way back when?
History sure likes to repeat itself (in a good way this time, at least).
frozentundra123456 - Friday, June 27, 2014 - link
Yeah, I remember back in the late '80s using chromatography control and analysis software that used a math co-processor. I think the computer's main CPU was something like 50 MHz, and all the programs and data were stored on floppy disks. Installing the board for the coprocessor and getting the software to work properly was a real bear.
Laststop311 - Wednesday, July 2, 2014 - link
Coprocessors, I remember having one with a 486 SX or DX, can't remember which, so long ago, I'm old :(
Could have even been with a 386.
Tikcus9666 - Saturday, July 5, 2014 - link
If I remember correctly, the 486 DX had a co-processor within the CPU; the SX variant didn't, and one had to be added, although I'm old and that was 20+ years ago (and cba to check).
Wolfpup - Tuesday, July 1, 2014 - link
I still want to run Folding @ Home on one of these :-D
jamescox - Thursday, July 31, 2014 - link
I don't see these competing that well against GPUs. Binary compatibility isn't usually that much of an issue in HPC. Using something like OpenCL allows you to run across a wide variety of hardware architectures. The run-time compilation should not be much of a factor, since it is very fast compared to the rest of the processing. Also, adding more extensions to the ISA seems unnecessary and possibly counterproductive. It gets you "closer to the metal", but it also locks the implementation in place. Using an intermediate layer allows greater changes to the underlying implementation in the future.
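For anyone curious how that plays out in practice, here is a bare-bones sketch of OpenCL's run-time compilation (my own example, error handling omitted): the same kernel source string is handed to whatever device driver is present -- a GPU, a CPU, or an accelerator like Xeon Phi -- and compiled on the spot, which is why binary compatibility rarely matters.

#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* Kernel shipped as plain source text; the driver compiles it at run time
   for whatever device it finds (GPU, CPU, Xeon Phi, ...). */
static const char *src =
    "__kernel void saxpy(__global float *y, __global const float *x, float a) {\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    /* This is the run-time compilation step -- fast relative to the actual
       number crunching, as noted above. */
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    printf("kernel build for the installed device: %s\n",
           err == CL_SUCCESS ? "ok" : "failed");

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}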