Intel Launches 4S and 8S Broadwell-EX Xeons: E7-4800 v4 and E7-8800 v4 Families, up to 384 Threads
by Ian Cutress on June 13, 2016 8:30 AM EST
The super-high-end of Intel’s Xeon CPU range, built for servers with as many cores and as much memory as you can throw at them, represents a good part of Intel’s business with the potential to offer large margins: some customers want the most, the best, the most powerful, and are willing to pay for it. For a number of generations, this has come via the Intel E7 line, consisting of two families of products designed for quad-socket servers (the E7-4800 v4) and eight-socket servers (the E7-8800 v4). The new element to this launch is the use of ‘v4’, meaning that following the launch of Broadwell-EP for 1S/2S systems a couple of months ago and Broadwell-E (high-end desktop, HEDT) two weeks ago, Intel has now filled out the v4 product line as we would typically expect. The new Xeons fall under the Broadwell-EX nomenclature (following Haswell-EX, Ivy Bridge-EX and so on) and use the Brickland platform aimed at mission-critical environments.
Intel currently runs several processor lines in the Xeon/enterprise space, from E3-1200 v5 processors offering consumer-level performance in a Xeon package, through the recently released E3-1500 v5 processors with embedded DRAM to help accelerate visual/video workflows, all the way up to the large EX core platforms.
Intel Xeon Families (June 2016)
| | E3-1200 v5 | E3-1500 v5 | E5 v4 | E7-4800 v4 | E7-8800 v4 |
|---|---|---|---|---|---|
| Core Count | 2 to 4 | 2 to 4 | 4 to 22 | 8 to 16 | 4 to 24 |
| Integrated Graphics | Few, HD 520 | Yes, Iris Pro | No | No | No |
| Max DRAM Support (per CPU) | 64 GB | 64 GB | 1536 GB | 3072 GB | 3072 GB |
| DMI/QPI | DMI 3.0 | DMI 3.0 | 2600: 1xQPI | 3 QPI | 3 QPI |
| Multi-Socket Support | No | No | 2600: 1S or 2S | 1S, 2S or 4S | Up to 8S |
| Suited For | Entry Workstations | QuickSync, … | High-End Workstation | Many-Core Server | World Domination |
As covered in Johan’s very detailed review of the dual-socket E5-2600 v4 platform, Broadwell Xeon processors come in three die sizes: a low core count (LCC) die featuring ten physical cores at 246.24 mm² for ~3.2 billion transistors, a medium core count (MCC) die with fifteen physical cores at 306.18 mm² for ~4.7 billion transistors, and a high core count (HCC) die with 24 physical cores at 456.12 mm² for ~7.2 billion transistors. The MCC and HCC arrangements use dual memory controllers to address four memory channels, whereas the LCC die uses a single memory controller, which results in a slight performance hit compared to the other two. Most of the new E7 v4 processors, however, will be using the HCC die.
Intel has formally announced eleven processors between the 4S and 8S families, varying in core count, frequency, power consumption and L3 cache. The design of the HCC die is such that a processor can have certain cores fused off while the remaining cores keep access to the full L3 cache, providing some SKUs with more ‘total cache per core’, such as the E7-8893 v4, which will be a four-core design but with 60 MB of L3 cache between them. These are classified by Intel as 'segment optimized', aimed at applications that benefit from more cache rather than more cores. This is arguably a stone's throw away from an eDRAM SKU with 64 MB of eDRAM, but in this case Intel is still going with a large (and faster than eDRAM) L3 cache.
Intel E7-8800 v4 Xeon Family
| | E7-8860 v4 | E7-8867 v4 | E7-8870 v4 | E7-8880 v4 | E7-8890 v4 | E7-8891 v4 | E7-8893 v4 |
|---|---|---|---|---|---|---|---|
| TDP | 140 W | 165 W | 140 W | 150 W | 165 W | 165 W | 140 W |
| Cores / Threads | 18 / 36 | 18 / 36 | 20 / 40 | 22 / 44 | 24 / 48 | 10 / 20 | 4 / 8 |
| L3 Cache | 45 MB | 45 MB | 50 MB | 55 MB | 60 MB | 60 MB | 60 MB |
| QPI (GT/s) | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 |
| PCIe Support | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 |
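To make the ‘total cache per core’ point concrete, here is a small sketch that divides each SKU's L3 cache by its core count, using only the figures from the table above. The dictionary and function names are illustrative, and the calculation assumes the full L3 is shared evenly, which is a simplification.

```python
# Core count and L3 size (MB) per SKU, copied from the E7-8800 v4 table.
# The names below are illustrative, not Intel terminology.
E7_8800_V4 = {
    "E7-8860 v4": (18, 45),
    "E7-8867 v4": (18, 45),
    "E7-8870 v4": (20, 50),
    "E7-8880 v4": (22, 55),
    "E7-8890 v4": (24, 60),
    "E7-8891 v4": (10, 60),
    "E7-8893 v4": (4, 60),
}


def l3_per_core_mb(cores, l3_mb):
    """Return the L3 capacity in MB available per physical core."""
    return l3_mb / cores


for sku, (cores, l3_mb) in E7_8800_V4.items():
    print(f"{sku}: {l3_per_core_mb(cores, l3_mb):.1f} MB of L3 per core")
```

The mainstream parts all land on the same ratio, while the two segment-optimized parts (E7-8891 v4 and E7-8893 v4) stand out with several times as much cache per core.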
The flagship model is the E7-8890 v4, a 165 W processor enabling all 24 cores of the HCC die with Hyper-Threading, for 48 threads per CPU. At a base frequency of 2.2 GHz, this processor can be used in an eight-socket glueless configuration (an 8S implementation means 192 cores/384 threads) or in systems of up to 128 sockets using third-party controllers. In the eight-socket configuration, a system can support up to 24 TB of DDR4 LRDIMMs (three modules per channel, 12 modules per socket, 256 GB per module). All the CPUs listed will support DDR4 and DDR3 with the dual controller configuration.
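The headline numbers follow directly from the per-socket figures quoted above. The short sketch below simply reproduces that arithmetic; the constant names are illustrative.

```python
# Per-socket figures as quoted in the article for the E7-8890 v4.
SOCKETS = 8
CORES_PER_SOCKET = 24
THREADS_PER_CORE = 2      # Hyper-Threading
DIMMS_PER_SOCKET = 12
GB_PER_DIMM = 256

total_cores = SOCKETS * CORES_PER_SOCKET
total_threads = total_cores * THREADS_PER_CORE
gb_per_socket = DIMMS_PER_SOCKET * GB_PER_DIMM
total_tb = SOCKETS * gb_per_socket / 1024

print(f"{total_cores} cores, {total_threads} threads")
print(f"{gb_per_socket} GB per socket, {total_tb:.0f} TB in total")
```

The per-socket result matches the 3072 GB maximum listed for the E7 v4 families in the first table.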
Intel E7-4800 v4 Xeon Family
| | E7-4809 v4 | E7-4820 v4 | E7-4830 v4 | E7-4850 v4 |
|---|---|---|---|---|
| TDP | 115 W | 115 W | 115 W | 115 W |
| Cores / Threads | 8 / 16 | 10 / 20 | 14 / 28 | 16 / 32 |
| L3 Cache | 20 MB | 25 MB | 35 MB | 40 MB |
| QPI (GT/s) | 3 x 6.4 | 3 x 6.4 | 3 x 8.0 | 3 x 8.0 |
| PCIe Support | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 |
The E7-4800 v4 line by comparison uses a reduced QPI speed (6.4 or 8.0 gigatransfers per second, compared to 9.6 gigatransfers per second on the E7-8800 v4), and some members of the family have no Turbo frequencies. These non-Turbo processors will run at their rated frequency regardless of load.
The new E7 v4 carries over all of the new features that Johan covered in our E5 v4 review, including:
- VM cache allocation (the ability for a supported hypervisor to mark a VM as high priority or partition cache as needed for QoS),
- New memory bandwidth monitoring tools,
- New frequency/power management tools to reduce frequency adjustment latency (see slide 29),
- Transactional extension support (TSX, was a feature in Haswell but disabled due to a fundamental hardware bug),
- A new non-deterministic random bit generator instruction for seed generation,
- Haswell to Broadwell generational improvements (decreased divider latency, 40% faster vector floating point multiplier, hardware assist for vector gather, cryptography focused instructions),
- AVX Turbo modes affect single cores rather than the whole processor,
- Entry/Exit latency for virtualization environments reduced to ~400 cycles from ~500 cycles.
There are a couple of features for the HCC-based processors that may be more relevant for the 4S systems, such as an upgraded version of Cluster on Die. Due to the configuration of the die and the dual ring design, if a core needs data held in L3 cache on the other side of the die, the latency is higher than if the data were closer to the core. To alleviate this, Haswell E5/E7 Xeons separated each die into two clusters such that each part would be seen by the BIOS as a non-uniform memory access (NUMA) domain. This allows the home agent/system agent to increase the likelihood that memory requests are aimed at data closer to the core that needs it. In Broadwell, this feature is now brought up from dual-processor systems to four-processor systems, and should reduce last level cache latency and improve performance for larger systems.
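One way to see the effect of Cluster on Die from software is to list the NUMA domains the operating system reports. The sketch below is a minimal, Linux-only illustration that assumes the standard sysfs layout under /sys/devices/system/node; the function names are our own, and nothing here is specific to Intel's implementation. If each die is split into two clusters as described above, a 4S system would presumably report eight domains rather than four.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means a node without CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [logical CPU ids]} for each NUMA node under root."""
    nodes = {}
    base = Path(root)
    if not base.is_dir():
        return nodes  # not Linux, or sysfs is not mounted
    for node_dir in base.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node{node_id}: {len(cpus)} logical CPUs")
```

Software that pins threads and allocates memory within a single reported domain is the kind of workload that stands to benefit from the lower local latency.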
The new E7 v4 processors use the same socket as the previous generation, the E7 v3 processors. With a BIOS update, the new processors are a drop-in upgrade for the older platform. The usual Intel partners (Supermicro, HP Enterprise, Dell, Cray) are expected to offer systems based on the new processors. We expect pricing to be in line with the previous generation, with a typical generational increase. I believe Johan is currently in the process of testing a few parts, and I’m looking forward to the review.
ZeDestructor - Monday, June 13, 2016
I think that's a typo, since it's a more or less drop-in upgrade to the Brickland platform that only has 4-channel memory. Besides, ark also lists it as having 4 channels.
Ian Cutress - Monday, June 13, 2016
It's technically four, but using memory expanders effectively splits each memory channel into two, allowing for 3DPC.
ZeDestructor - Tuesday, June 14, 2016
Ahh, very nice.
How I wish I had the cash and power to have a Brickland machine for my homeserver... would do wonders for a silly ZFS host...
Eden-K121D - Monday, June 13, 2016
Well, I heard a rumour about a Zen Naples server processor having 32 cores, 8-channel DDR4 memory and 128 PCIe Gen 3 lanes.
Meteor2 - Monday, June 13, 2016
So... What are 8S servers used for? VM farms? When is it effective to buy one of these rather than use several smaller, cheaper servers?
FunBunny2 - Monday, June 13, 2016
-- When is effective to buy one of these rather than use several smaller, cheaper servers?
any embarrassingly parallel problem. OLTP systems are the archetype.
mdw9604 - Sunday, June 19, 2016
I am not embarrassed that my problems are parallel. It's the perpendicular ones that I tend to cover up.
Kevin G - Monday, June 13, 2016
Several smaller, cheaper servers introduce networking overhead and, in most cases, centralized storage. A single system image with equivalent processing power ends up being faster due to the removal of this overhead, sometimes by a surprising amount.
The other thing is that these systems support a lot of memory per socket: 24 TB using the largest DIMMs available today in an eight-socket configuration. Many production datasets can fit into that amount. Intel is offering quad-core chips with full support for this capacity, which is interesting from a licensing cost standpoint.
Meteor2 - Tuesday, June 14, 2016
Big in-memory databases are interesting, though I understand it takes something like 15 minutes to load them into memory. Plus NVMeF is blurring local and remote memory.
I guess the problem for these machines though is there aren't many embarrassingly parallel problems out there. We run HPC workloads where I am, and they're best suited to just so many E5s on a very fast network. The jobs are many times too big to fit on one of these.
mapesdhs - Tuesday, June 14, 2016
On the contrary, there are many relevant workloads, from GIS to medical and defense imaging. Just look into the history of the customer base at SGI, those who bought their Origin systems, etc. Hence the existence of the modern UV series (256 sockets atm). Customers were already dealing with multi-GB datasets 20 years ago, and SGI was the first to design something that could load such files in mere seconds (Group Station for Defense Imaging). I'm not sure about the modern UV systems, but the bisection bandwidth of the last-gen Origin was 512GB/sec (or it might be 1TB/sec if they made the usual 2X larger system for selected customers) and the tech has moved on a lot since then, with new features such as hw MPI offload, etc.
But yes, other loads don't scale well; it all depends on the task. Hence the existence of cluster products as well, and of course the ability to partition a UV into multiple subunits, each of which can be optimised to match the task scalability, while also allowing fast intercommunication between them, as well as shared access to data, etc. Meanwhile, work goes on to improve scalability methods, e.g. an admin at the Cosmos centre told me they're working hard to improve the scaling of various cosmological simulation codes to exploit up to 512 CPUs. In other fields, however, GPU acceleration has taken over, but often that needs big data access as well. It's a mixed bag as usual.
Speaking of which, Ian, re the usual Intel partners, you forgot to mention SGI. There's no doubt they'll be using these new CPUs in their UV range.
One thing I don't get concerning the E7-8893 v4: if it only has 4 cores, why aren't the max Turbo levels much higher? Indeed, the base clock could surely be a lot higher as well.