This week Microsoft launched a new blog dedicated to Windows Kernel internals. Its purpose is to dive into the kernel across a variety of architectures and explore its elements: the evolution of the kernel, its components, and its organization, with this first post focusing on the scheduler. The plan is to develop the blog over the next few months with insights into what goes on behind the scenes and why the kernel does what it does. In the process, however, we got a sneak peek at a big system that Microsoft appears to be working on.

For those that want to read the blog, it’s really good. Take a look here:
https://techcommunity.microsoft.com/t5/Windows-Kernel-Internals/One-Windows-Kernel/ba-p/267142

When discussing the scalability of Windows, the author Hari Pulapaka, Lead Program Manager in the Windows Core Kernel Platform, showcases a screenshot of Task Manager from what he describes as a ‘pre-release Windows DataCenter class machine’ running Windows. Here’s the image:


Task Manager on the pre-release DataCenter class machine (unfortunately the original image is low resolution)

If you weren’t amazed by the number of threads in Task Manager, you might notice the scroll bar on the side. That’s right: 896 cores means 1792 threads when Hyper-Threading is enabled, which is more than Task Manager can show at once, and this new type of ‘DataCenter class machine’ looks like it has access to them all. But what are we really seeing here, aside from every single thread loaded at 100%?

So to start, the CPU listed is the Xeon Platinum 8180, Intel’s highest core count, highest performing Xeon Scalable ‘Skylake-SP’ processor. It has 28 cores and 56 threads, and a little math (1792 ÷ 56) gives a 32-socket system. In fact, in the bumf below the threads all running at 100%, it literally says ‘Sockets: 32’. So this is 32 full 28-core processors all acting together under one instance of Windows. Again, the question is how?
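For anyone who wants to sanity check those numbers programmatically, Windows exposes machines with more than 64 logical processors through ‘processor groups’ of up to 64 each. The following is a minimal, illustrative Win32 sketch of my own (not anything from Microsoft’s post) that counts the groups and logical processors:

// Minimal sketch: count the logical processors Task Manager is summarizing.
// Windows splits anything over 64 logical processors into processor groups.
#define _WIN32_WINNT 0x0601   // processor group APIs need Windows 7 / Server 2008 R2 or later
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    DWORD total = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);

    printf("Processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; g++)
        printf("  group %u: %lu logical processors\n", g, GetActiveProcessorCount(g));

    // Sanity check against the screenshot: 32 sockets x 28 cores x 2 threads = 1792
    printf("Total: %lu (expected for 32x Xeon Platinum 8180: %d)\n", total, 32 * 28 * 2);
    return 0;
}

The group split is also why older tools that only understand a single 64-bit affinity mask can only ever see a slice of a machine like this.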

Normally, Intel only rates Xeon Platinum processors for up to eight sockets. It does this by using the three UPI links per processor to form a dual-box configuration. The Xeon Gold 6100 range does up to four sockets with its three UPI links, ensuring each processor is directly linked to every other processor, and then the rest of the range does single socket or dual socket.

What Intel doesn’t mention is that with an appropriate fabric connecting them, system builders and OEMs can chain several 4-socket or 8-socket systems together into a single many-socket system. Aside from the choice of fabric and the messaging over it, there are other factors in play here, such as latency and memory architecture, which are already present on 2-8 socket platforms but get substantially worse when going beyond eight sockets. If one processor needs memory that is two fabric hops and a processor hop away, to a certain extent having that data on a local SSD might be quicker.
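That locality penalty is also something software can see: each socket (or each node behind the fabric) shows up to Windows as one or more NUMA nodes, and anything performance-sensitive has to query where its processors and memory actually live. As a rough illustrative sketch using the standard Win32 NUMA APIs (this is my own example, not something from the blog post):

// Rough sketch: list the NUMA nodes Windows reports and the processor group
// each one maps to. On a fabric-linked many-socket box, the number of nodes
// (and the hops between them) is what makes remote memory accesses so expensive.
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes: %lu\n", highest + 1);

    for (USHORT node = 0; node <= (USHORT)highest; node++) {
        GROUP_AFFINITY mask;
        if (GetNumaNodeProcessorMaskEx(node, &mask))
            printf("  node %u -> processor group %u, mask 0x%llx\n",
                   node, mask.Group, (unsigned long long)mask.Mask);
    }
    return 0;
}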

As for the fabric: I’m actually going to use an analogy here. AMD’s EPYC platform goes up to two sockets, but for the interconnect between sockets it uses 64 PCIe lanes from each processor to carry AMD’s Infinity Fabric protocol, giving the link the combined bandwidth of 128 PCIe lanes. If EPYC had 256 PCIe lanes, for example, or cut the number of lanes per link down to 32, then we could end up with EPYC servers with more than two sockets built on Infinity Fabric. With Intel CPUs we’re still using the PCIe lanes, but in one of three ways: control over Omni-Path using PCIe, control over InfiniBand using PCIe, or control using custom FPGAs, again over PCIe. This is essentially how modern supercomputers are run, albeit not as one unified system.

Unfortunately this is where we head beyond my depth. When I spoke to a large server OEM last year, they said that quad-socket and eight-socket systems are becoming rarer and rarer: as each CPU by itself has more cores, the need for systems that big just doesn’t exist anymore. Back in the days before Nehalem, the big eight-socket 32-core servers were all the rage, but today not so much, and unless a company is willing to spend $250k+ (before support contracts or DRAM/NAND) on a single 8-socket system, this market is reserved for the big players in town. Today, those are the cloud providers.

In order to get 32 sockets, we’re likely seeing eight quad-socket systems connected in this way in one big blade infrastructure. It likely takes up half a rack, if not a whole one, and your guess is as good as mine on the price or the power consumption. In the screenshot above it does say ‘Virtualization: Enabled’, and given that this is Microsoft we’re talking about, this might be one of their planned internal Azure systems that is either rented to defence contractors and the like, or partitioned off in instances for others.

I’ve tried reaching out to Hari to get more information on what this system is, and will report back if we hear anything. Microsoft may make an official announcement if these large 32-socket systems are going to be ‘widespread’ (in the loosest sense of the word) offerings on Azure.

Note: DataCenter is stylized with a capital C, as quoted from Microsoft’s blog post.

Comments

  • alpha754293 - Saturday, October 27, 2018 - link

    I think that you've accidentally hit the nail on the head though:

    All of those systems that you've mentioned above (Cray, SGI, Fujitsu, etc.) with Alphas, MIPS, SPARC, etc. - note that none of those are x86 or x86_64 ISA class systems.

    a) Those were all proprietary systems with proprietary splinters of the original BSD/SVR4 UNIX, and they would "appear" to be single image systems to the user, but in reality, you KNOW that they weren't because the technology wasn't actually truly there. For example, the head node would deploy the slave nodes and the slave processes of said slave nodes, and transferring said slave processes across or between slave nodes was generally NOT a trivial task. Once a slave process had been assigned to a slave node, migrating it between nodes was, at best, not trivial.

    Depending on what you're running, that can be both a good thing and a bad thing. If you have an HPC run and one of the slave nodes is failing, or its hardware has already failed, the death of the slave process will be reported back to the head node as the entire run having failed. The lack of portability in the slave processes means that you don't inherently have some level of HA available to you within the single image.

    This is what, I presume, would be one of the key differences with the Windows-based single image.

    If you want to lock a slave process out by way of a processor affinity mask, you can do so across the sockets/nodes. But if you want to have HA, say, "within a box", you can do that as well.

    And again, I think that you accidentally hit the nail on the head because unlike the Cray/SGI/Fujitsu/POWER systems, the single image isn't initialized via the headnode. In this case, I don't think that there IS a headnode in the classical sense per se anymore, and THAT is what makes this vastly more interesting (perhaps) than the stuff that you mentioned.

    (Also, BTW, on the systems that you've mentioned, the head node does NOT typically perform any of the actual computationally intensive work. It's not designed to; it's only designed to keep everything else coordinated. Thus, if you only have two nodes (for example), trying to get the head node to also perform the computations is neither a common nor a recommended practice. You'll see that in the various sysadmin guides from the respective vendors.)
  • theeldest - Friday, October 26, 2018 - link

    HPE superdome flex supports 32 Intel Xeon Platinum sockets: https://psnow.ext.hpe.com/doc/PSN1010323140USEN.pd...
  • HStewart - Friday, October 26, 2018 - link

    Interesting, check out the link within -- it states it supports Windows DataCenter Server 2016

    https://h20195.www2.hpe.com/v2/GetDocument.aspx?do...

    My question is whether this is the same unit as Microsoft is using - it is modular and supports 4 to 32 CPUs in 4-CPU increments. Each 4-CPU module is pluggable, for 99% uptime.
  • Alex_Haddock - Friday, November 2, 2018 - link

    Azure is typically reticent to name vendor kit and whilst I don't work in the Superdome Flex group directly I'd say there is a high chance...The system is a very nice amalgamation of the SGI UV acquisition and the mission critical aspects of the Superdome range. Roadmap is very cool (any channel partners reading this attending TSS Europe in March 19 or the US equivalent I'd highly recommend attending some of the Superdome sessions). Will see if any chance of getting Ian remote access to one for an AnandTech test as well...I've enjoyed CPU articles on this site myself since its inception :-).
  • arnd - Friday, October 26, 2018 - link

    Huawei have another one with similar specifications:
    https://e.huawei.com/kr/products/cloud-computing-d...
  • HStewart - Friday, October 26, 2018 - link

    This one is slightly different - it only supports 1-96 cores per partition, for a total of 80 partitions.

    So basically it's like 80 connected machines, not treated as a single machine.
  • loony - Friday, October 26, 2018 - link

    Microsoft already uses the SD Flex for their Linux based Azure SAP HANA offerings, so I assume they just finally got Windows working on the SGI box.
  • cyberguyz - Friday, October 26, 2018 - link

    Y'know, PC is short for "Personal Computer". I think I can say with a straight face that if a system has 32 sockets, it ain't no stinkin' PC!
  • HollyDOL - Friday, October 26, 2018 - link

    Some persons are more demanding than others :-)
  • HStewart - Friday, October 26, 2018 - link

    Well, you could say the same thing about a 16 or 32 core computer. But this is not actually intended as a PC - it's a server - though one could make a killer workstation out of it.

    But who knows, we may be talking about 896 cores in a laptop in 10 years.
