Original Link: https://www.anandtech.com/show/4285/westmereex-intels-flagship-benchmarked



Intel's Best x86 Server CPU

The launch of the Nehalem-EX a year ago was pretty spectacular. For the first time in Intel's history, the high-end Xeon did not have any real weakness. Before the Nehalem-EX, the best Xeons trailed behind the best RISC chips in RAS, memory bandwidth, or raw processing power. The Nehalem-EX chip was well received in the market. In 2010, Intel's datacenter group reportedly brought in $8.57 billion, an increase of 35% over 2009.

The RISC server vendors have lost a lot of ground to the x86 world. According to IDC's Server Tracker (Q4 2010), the RISC/mainframe market share has halved since 2002, while Intel x86 chips now command almost 60% of the market. Interestingly, AMD grew from a negligible 0.7% to a decent 5.5%.

Only one year later, Intel is upgrading the top Xeon by introducing Westmere-EX. Shrinking Intel's largest Xeon to 32nm allows it to be clocked slightly higher, get two extra cores, and add 6MB L3 cache. At the same time the chip is quite a bit smaller, which makes it cheaper to produce. Unfortunately, the customer does not really benefit from that fact, as the top Xeon became more expensive. Anyway, the Nehalem-EX was a popular chip, so it is no surprise that the improved version has persuaded 19 vendors to produce 60 different designs, ranging from two up to 256 sockets.

Of course, this isn't surprising, as even mediocre chips like the Intel Xeon 7100 series got a lot of system vendor support, a result of Intel's dominant position in the server market. With their latest chip, Intel promises up to 40% better performance at slightly lower power consumption. Considering that the Westmere-EX is the most expensive x86 CPU, it needs to deliver on these promises, on top of providing rich RAS features.

We were able to test Intel's newest QSSC-S4R server, with both "normal" and new "low power" Samsung DIMMs.

Some impressive numbers

The new Xeon can boast some impressive numbers. Thanks to its massive 30MB L3 cache, it has even more transistors than the Intel "Tukwila" Itanium: 2.6 billion versus 2 billion transistors. Not that such numbers really matter without the performance and architecture to back them up, but they ably demonstrate the complexity of these server CPUs.

Processor Size and Technology Comparison

CPU                 Transistor count (million)   Process   Die size (mm²)   Cores
Intel Westmere-EX   2600                         32 nm     513              10
Intel Nehalem-EX    2300                         45 nm     684              8
Intel Dunnington    1900                         45 nm     503              6
Intel Nehalem       731                          45 nm     265              4
IBM Power 7         1200                         45 nm     567              8
AMD Magny-Cours     1808 (2x 904)                45 nm     692 (2x 346)     12
AMD Shanghai        705                          45 nm     263              4
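
To put the transistor counts and die sizes in the table above side by side, here is a minimal sketch that derives transistor density per chip. The figures are copied from the table; the `density` helper and the dictionary layout are illustrative names, not anything from the original article.

```python
# Transistor count (millions) and die size (mm^2), copied from the table above.
chips = {
    "Intel Westmere-EX": (2600, 513),
    "Intel Nehalem-EX": (2300, 684),
    "Intel Dunnington": (1900, 503),
    "Intel Nehalem": (731, 265),
    "IBM Power 7": (1200, 567),
    "AMD Magny-Cours": (1808, 692),  # totals for the two-die package
    "AMD Shanghai": (705, 263),
}


def density(transistors_million, die_mm2):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_million / die_mm2


densities = {name: density(t, a) for name, (t, a) in chips.items()}

# Print the chips from densest to least dense.
for name, d in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {d:5.2f} M transistors/mm^2")
```

Density depends heavily on how much of the die is cache, so this is only a rough way to compare chips built on different processes.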

 



Intel Quanta QSSC-S4R Benchmark Configuration

CPU            4x Xeon X7560 at 2.26GHz or 4x Xeon E7-4870 at 2.4GHz
RAM            16x4GB Samsung Registered DDR3-1333 at 1066MHz
Motherboard    QCI QSSC-S4R 31S4RMB00B0
Chipset        Intel 7500
BIOS version   QSSC-S4R.QCI.01.00.S012,031420111618
PSU            4x Delta DPS-850FB A S3F E62433-004 850W

The Quanta QSSC-S4R is an updated version of the server we reviewed a year ago. The memory buffers consume less power and support low power (1.35V) DDR3 ECC DIMMs. The server can accept up to 64x32GB Load Reduced DIMMs (LR-DIMMs), so the new server platform can offer up to 2TB of RAM!

LR-DIMMs are the successors of FB-DIMMs. Fully Buffered DIMMs reduced the load on the memory channel courtesy of a serial interface between the memory controller and the AMB. The very high serial input frequency, however, increased heat generation significantly, so the memory vendors abandoned FB-DIMMs after DDR2. Until recently, all large DDR3 DIMMs have been registered DIMMs.

The new Load Reduced DIMM is a registered DIMM on steroids that buffers the address signals just like registered DIMMs, but it also buffers the data lines. LR-DIMMs therefore fully buffer the DIMMs and greatly increase the number of memory chips that can be used per channel without the power-hogging serial interface of the AMBs. The downside is that buffering the data lines increases latency, especially with bus turnarounds.

The QSSC-S4R comes with a rich BIOS. Below you can see the typical BIOS configuration that we used; we tested the Xeon with Turbo Boost and Hyper-Threading enabled.

Dell PowerEdge R815 Benchmarked Configuration

CPU            4x Opteron 6174 at 2.2GHz
RAM            16x4GB Samsung Registered DDR3-1333 at 1333MHz
Motherboard    Dell Inc 06JC9T
Chipset        AMD SR5650
BIOS version   v1.1.9
PSU            2x Dell L1100A-S0 1100W

The R815 is not a direct competitor to the quad Xeon platform; it is more limited in RAS features and expandability (512GB of RAM max). However, it is an attractive alternative for some of the more cost sensitive quad Xeon buyers. Its very compact 2U design takes half the space of the quad Xeon servers, and a fully equipped quad Opteron server with 256GB of RAM can be purchased for less than $20,000. A similar quad Xeon system can set you back $30,000 or more.

Storage Setup

The storage setup is the same as what we described here.



SAP S&D Benchmark

The SAP SD (Sales and Distribution, 2-tier internet configuration) benchmark is interesting because it is a real-world client-server application. We looked at SAP's benchmark database for these results. The results below all run on Windows 2003 Enterprise Edition and MS SQL Server 2005 database (both 64-bit). Every 2-tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are NOT comparable with any benchmark performed before 2009. The new 2009 version of the benchmark produces scores that are 25% lower. We analyzed the SAP Benchmark in-depth in one of our earlier articles. The profile of the benchmark has remained the same:

  • Very parallel resulting in excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Somewhat limited by memory bandwidth
  • Likes large caches (memory latency!)
  • Very sensitive to sync ("cache coherency") latency

SAP Sales & Distribution 2 Tier benchmark

There is no doubt here: the Westmere-EX Xeon delivers, with 30% higher performance than the previous x86 quad CPU record. The 40-core, 80-thread quad Xeon server cannot beat the 32-core, 128-thread IBM Power 750, the RISC champion; however, the high-end IBM servers start at $100,000, two to three times more than a comparable Xeon system.

The 30% extra performance that the new 32 nm Xeon delivers over its predecessor also increases the gap with AMD. The best quad Xeon now offers 50% more performance than the best quad Opteron. The ERP market is a market where RAS, scalability, and performance are the top priorities and hardware pricing is only a secondary thought. There is little doubt in our mind that Intel will continue to dominate the x86 ERP server market.



vApus Mark II

vApus Mark II is our newest benchmark suite that tests how well servers cope with virtualizing "heavy duty applications". We explained the benchmark methodology here. We used vSphere 4.1 Update 1, based upon the 64-bit ESX 4.1.0 b348481 hypervisor.

vApus Mark II score
* with 128GB of RAM

Benchmarks cannot be interpreted easily, and virtualization adds another layer of complexity. As always, we need to explain quite a few details and nuances.

First of all, we tested most servers with 64GB of RAM. However, the memory subsystem of the Quad Xeon needs 32 DIMMs before it can deliver maximum bandwidth. As some of these server systems will get those 32 DIMMs while others will not, we tested both with 16 (64GB) and 32 DIMMs (128GB). Our vApus Mark test requires only 11GB per tile: 4GB for the OLTP database, 4GB for the OLAP and 1GB for each of the three web applications (3GB in total). So even a five tile test demands only 55GB. Thus, in this particular benchmark there is no real advantage to having 128GB of RAM other than the bandwidth advantage for the quad Xeon platform. That is why we do not test the Quad Opteron with more than 64GB: it makes no difference and makes the graph even more complex.
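
The memory arithmetic above can be checked with a short sketch. The per-VM sizes come from the text; the constant and function names are illustrative, and the calculation assumes no hypervisor overhead, which the article does not quantify.

```python
# Memory assigned to each VM type in one vApus Mark II tile (GB), per the text.
OLTP_GB = 4
OLAP_GB = 4
WEB_VM_GB = 1
WEB_VMS_PER_TILE = 3


def tile_memory_gb():
    """Memory demanded by a single tile, ignoring hypervisor overhead (assumption)."""
    return OLTP_GB + OLAP_GB + WEB_VM_GB * WEB_VMS_PER_TILE


def total_memory_gb(tiles):
    """Memory demanded by a test run with the given number of tiles."""
    return tiles * tile_memory_gb()


for tiles in (4, 5):
    needed = total_memory_gb(tiles)
    print(f"{tiles} tiles need {needed}GB; fits in 64GB: {needed <= 64}")
```

Both the four and five tile runs stay under 64GB, which is why the extra 64GB on the 128GB configuration only matters for bandwidth.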

Then there's the problem that every virtualization benchmark encounters: the number of tiles (a tile is a group of VMs). With VMmark, the benchmark folks add tiles until the total throughput begins to decline. The problem with this approach is that you favor throughput over response time. In the real world, response time is more important than throughput. We test with both four (20 VMs, 72 vCPUs) and five tiles (25 VMs, 90 vCPUs). Which benchmark gives you the most accurate number for a given system? Let us delve a little deeper and take the response time into account.



vApus Mark II Response Time

Each tile in vApus Mark II demands 18 virtual CPUs: four for the Oracle OLTP test, eight for the MS SQL Server OLAP test, and six for the three web application VMs (two CPUs each). Therefore, a four tile test will require 72 virtual CPUs. A quad Xeon E7-4870 contains 40 cores and 80 threads with Hyper-Threading enabled. With a test that puts 72 virtual CPUs to work, you cannot measure the total throughput of the quad Xeon E7. In fact, some of those 72 virtual CPUs are not working at 100% all of the time. For example, the CPU load caused by the web VMs shows a lot of spikes. Thus, we cannot interpret the throughput numbers without a look at the response times.
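
As a rough illustration of why four tiles cannot saturate the quad Xeon E7, the sketch below totals the virtual CPUs per tile and compares them with the 80 hardware threads mentioned above. The vCPU counts are from the text; the names and the "vCPUs per hardware thread" ratio are illustrative choices, not part of the vApus methodology.

```python
# vCPUs assigned to each VM in one vApus Mark II tile, per the text.
VCPUS_PER_VM = {
    "Oracle OLTP": 4,
    "MS SQL Server OLAP": 8,
    "web application 1": 2,
    "web application 2": 2,
    "web application 3": 2,
}

# Quad Xeon E7-4870: 40 cores, 80 threads with Hyper-Threading enabled.
E7_HW_THREADS = 80


def vcpus_per_tile():
    """Total virtual CPUs demanded by one tile."""
    return sum(VCPUS_PER_VM.values())


def total_vcpus(tiles):
    """Total virtual CPUs demanded by a run with the given number of tiles."""
    return tiles * vcpus_per_tile()


def vcpus_per_hw_thread(tiles, hw_threads=E7_HW_THREADS):
    """Ratio of requested vCPUs to hardware threads; above 1.0 means oversubscribed."""
    return total_vcpus(tiles) / hw_threads


for tiles in (4, 5):
    print(f"{tiles} tiles: {total_vcpus(tiles)} vCPUs, "
          f"{vcpus_per_hw_thread(tiles):.3f} vCPUs per hardware thread")
```

With four tiles the ratio stays below 1.0, so some hardware threads have no vCPU assigned to them; only the five tile run asks for more vCPUs than the machine has threads.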

vApus Mark II Response time

Back to our benchmark or throughput scores. Ideally, we should measure throughput at exactly the same response times. But with our current stress testing software, trying to keep response time the same would be an extremely time-consuming process.

vApus Mark II score revisited

Since the quad Opteron shows a 40% increase in response time from 4 to 5 tiles (or from 20 to 25 VMs), we believe that the four tile score (149) is more representative of the "real performance". The extra throughput that the five tile test delivers comes at a response time price that is too high.

The response time of the Quad Xeon 7560 increases 9% when we try to load it with five extra VMs. In this case, the "real and fair" throughput score is a little bit harder to determine. It is somewhere between the score of 4 tiles and 5 tiles, probably around 180 or so.

In case of the Quad Xeon E7, however, things are crystal clear. Running 20 or 25 VMs does not make any difference: the response times stay in the same league. In this case we take the highest score to be the real one.

So if we take response times into account, the quad E7-4870 is about 35% faster than its predecessor (243 vs 180) and about 63% faster than the AMD system in our test (243 vs 149). AMD's fastest processor is the 2.5GHz 6180SE now. This CPU is clocked around 13% higher and should thus be able to reach a score of around 168. That means the Xeon E7-4870 should still have a 44% (or more) advantage over its nearest but much cheaper competitor in this particular workload.
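
The percentages in the paragraph above follow from simple ratios, sketched here. The scores are the ones quoted in the text; `advantage_pct` and `scale_with_clock` are illustrative helpers, and the projection assumes performance scales perfectly linearly with clock speed, which is the same simplification the paragraph makes.

```python
# vApus Mark II scores quoted in the article (after taking response times into account).
E7_4870_SCORE = 243       # quad Xeon E7-4870, five tiles
X7560_SCORE = 180         # quad Xeon X7560, the article's estimate
OPTERON_6174_SCORE = 149  # quad Opteron 6174, four tiles


def advantage_pct(score, baseline):
    """Percentage by which `score` exceeds `baseline`."""
    return (score / baseline - 1) * 100


def scale_with_clock(score, old_ghz, new_ghz):
    """Project a score to a new clock speed, ASSUMING perfectly linear scaling."""
    return score * new_ghz / old_ghz


# Hypothetical projection for the 2.5GHz Opteron 6180SE from the 2.2GHz 6174 result.
projected_6180se = scale_with_clock(OPTERON_6174_SCORE, 2.2, 2.5)

print(f"E7-4870 vs X7560:            {advantage_pct(E7_4870_SCORE, X7560_SCORE):.1f}%")
print(f"E7-4870 vs Opteron 6174:     {advantage_pct(E7_4870_SCORE, OPTERON_6174_SCORE):.1f}%")
print(f"Projected 6180SE score:      {projected_6180se:.0f}")
print(f"E7-4870 vs projected 6180SE: {advantage_pct(E7_4870_SCORE, projected_6180se):.1f}%")
```

Using the exact 2.5/2.2 clock ratio instead of the rounded 13% gives a projected score slightly above 168, so the resulting advantage lands a little under the 44% quoted above. Either way, the conclusion is the same.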



Power Extremes: Idle and Full Load

Idle and full load power measurements are hardly relevant for virtualized environments but they offer some interesting datapoints. In the first test we report the power consumption running vApus Mark II, which means that the servers are working at 85-98% CPU load. We measured with four tiles, but we also tested the Xeon with five tiles (E7-4870 5T).

We test with redundant power supplies working, so the Dell R815 uses 1+1 1100W PSUs and the QSSC-S4R uses 2+2 850W PSUs. Remember that the QSSC server has cold redundant PSUs, so the redundant PSUs consume almost nothing. There's more to the story than just the PSU and performance, of course: the difference in RAS features, chassis options, PSU, and other aspects of the overall platform can definitely impact power use, and that's unfortunately something that we can't easily eliminate from the equation.

vApus Mark II Full Power

The Quad E7-4870 Xeons save about 7.5% power (894 vs 966) compared to their older brothers. The power consumption numbers look very bad compared to the AMD system in absolute terms. However, with five tiles the Quad Xeon E7 delivers 63% higher performance while consuming 57% more power. We can conclude that at very high loads, the Xeon E7 performance/watt ratio is quite competitive.
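
A small sketch of the arithmetic behind these power claims follows. The two full-load readings (presumably watts) and the 63%/57% figures are taken from the paragraph above; the helper name and the relative performance/watt ratio are illustrative, not values reported by the article.

```python
def percent_change(new, old):
    """Percentage change from `old` to `new` (negative means a reduction)."""
    return (new / old - 1) * 100


# Full-load power readings quoted in the text (presumably watts).
E7_4870_FULL_LOAD = 894
X7560_FULL_LOAD = 966

saving_pct = -percent_change(E7_4870_FULL_LOAD, X7560_FULL_LOAD)

# Quad Xeon E7 (five tiles) relative to the quad Opteron, per the text.
PERF_GAIN = 0.63       # 63% higher performance
POWER_INCREASE = 0.57  # 57% more power

# Relative performance per watt: above 1.0 means the Xeon E7 is ahead.
relative_perf_per_watt = (1 + PERF_GAIN) / (1 + POWER_INCREASE)

print(f"E7-4870 power saving vs X7560: {saving_pct:.1f}%")
print(f"Xeon E7 perf/watt relative to Opteron: {relative_perf_per_watt:.2f}x")
```

The ratio comes out only slightly above 1.0, which matches the article's reading that the Xeon E7 is competitive, rather than clearly ahead, in performance/watt at very high loads.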

For the idle tests, we tested with both the Samsung 1333MHz 4GB 1.35V (low power) DDR3 registered DIMMs (M393B5273DH0) and 1.5V DIMMs (M393B5170FH0-H9).

vApus Mark II Idle Power

Despite the fact that the Xeon X7560 does not support low power DIMMs officially, it was able to save about 5% of the total power use. The Xeon E7's more advanced memory controller was able to reduce power by 8%. But the picture is clear: if your servers run idle for large periods of time, the quad Opteron server is able to run on a very low amount of power, while the quad Xeon server needs at least 600W if you do not outfit it with low power 1.35V DIMMs. How much of the power difference comes from the platform in general and how much is specific to the servers being tested is unfortunately impossible to say without testing additional servers.

As we have stated before, maximum power and minimum power are not very realistic, so let us check out our real world scenario.



Real-World Power

In the real world you do not run your virtualized servers at their maximum load just to measure the potential performance, but neither do they run idle. The user base will create a certain workload and expect this workload to be performed with the lowest response times. We created a real world “equal load” scenario as we described here.

vApus Mark II Real world energy

The numbers above show that there is more to energy consumption than just measuring idle and full load. The quad Xeon needs 67% more energy to run the same workload. Granted, our methodology favors the Opteron a bit: the average CPU load is around 5-30% for the 80 thread Xeon E7, while the Opteron is running at an average of 10-40%. So the load on the Xeon E7 is a bit too low. Still, the real problem is that the Xeon E7 power consumption is high at low loads.



Conclusion

Performance and RAS features took a giant leap forward when Intel replaced the Xeon 7400 with the Xeon 7500. The memory subsystem went from a high latency, totally bandwidth choked loser (hardly 10GB/s for 24 cores) to a low latency and very high bandwidth champion (up to 70GB/s). The Xeon E7 builds further on that excellent platform, and adds up to 35% higher performance.

We now have a proven platform with excellent RAS features that needs slightly less power while providing a decent performance boost. That's excellent, but the Xeon E7 still has a few weaknesses. One weakness is the relatively high power consumption at idle load. Compared to the high-end Power 7 servers, this kind of power consumption is probably very reasonable. The Power 7 CPUs are in the 100 to 170W TDP range, while the Xeon E7s are in the 95 to 130W TDP range. A quad 3.3GHz Power 755 server (with 256GB RAM) consumes 1650W according to IBM (slide 24), while our first measurements show that our 2.4GHz E7-4870 server will consume about 1300W in those circumstances.

Considering that the 3.3GHz Power 7 and 2.4GHz E7-4870 perform at the same level, we'll go out on a limb and assume that the new Xeon wins in the performance/watt race. AMD might take advantage of this "weakness", but availability of quad 16-core "Bulldozer" servers is still months away and we don't know what the power use will be yet.

The 10-core Xeons are pretty expensive ($3000-4600 per CPU), but many of these systems are bought to run software that will cost 10 times more. In a nutshell, Intel's Xeon E7 moves up the server CPU food chain. The Xeon E7 closes the performance gap with the best RISC CPUs (see the SAP benchmarks), offers lower power and cost, and the rest of the x86 competition is relegated to the low-end of the quad x86 market.

For those looking for a virtualization platform, there is no x86 server that is able to offer such low response times at such high consolidation ratios. However, in order to get a good performance/watt ratio, you need to make sure that your quad Xeon E7 servers are working under high CPU loads. The quad Xeon E7 server is a good platform for consolidating CPU intensive applications. For less intensive VMs, it makes a lot more sense to check out the dual Xeon and quad Opteron offerings.

I would also like to thank Tijl Deneut for his invaluable assistance.
