The Best Server CPUs part 2: the Intel "Nehalem" Xeon X5570
by Johan De Gelas on March 30, 2009 3:00 PM EST - Posted in IT Computing
Benchmark Configuration
None of our benchmarks required more than 16GB RAM.
Each server had an Adaptec 5805 connected to the Promise 300js DAS. Database files were placed on a six-drive RAID 0 set of Intel X25-E SLC 32GB SSDs, and log files on a four-drive RAID 0 set of 15000RPM Seagate Cheetah 300GB hard disks.
We used AMD 8356 and 8384 CPUs in dual CPU configurations. Performance-wise they are identical to the Opteron 2356 and 2384. So to avoid confusion, we list the Opterons 83xx as Opteron 2356 and Opteron 2384.
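As a quick sanity check on the storage layout above, the nominal capacity of each RAID 0 set can be worked out from the drive counts and sizes given in the text. The helper name `raid0_capacity_gb` is hypothetical, and the figures are nominal capacities that ignore formatting and controller overhead.

```python
def raid0_capacity_gb(drive_sizes_gb):
    """Nominal usable capacity of a RAID 0 set in GB.

    Striping uses the same amount of space on every member, so the
    usable capacity is the drive count times the smallest drive.
    """
    if not drive_sizes_gb:
        raise ValueError("a RAID 0 set needs at least one drive")
    return len(drive_sizes_gb) * min(drive_sizes_gb)


# Drive counts and sizes as listed in the benchmark configuration.
data_set_gb = raid0_capacity_gb([32] * 6)   # six Intel X25-E 32GB SSDs
log_set_gb = raid0_capacity_gb([300] * 4)   # four Seagate Cheetah 300GB disks

print(f"Database set: {data_set_gb} GB, log set: {log_set_gb} GB")
```

RAID 0 offers no redundancy, so these sets trade fault tolerance for throughput, which is a reasonable choice for a benchmark rig.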
Xeon Server 1: ASUS RS700-E6/RS4 barebone
CPU: Dual Xeon "Gainestown" X5570 2.93GHz
MB: ASUS Z8PS-D12-1U
RAM: 6x4GB (24GB) ECC Registered DDR3-1333
NIC: Intel 82574L PCI-E Gbit LAN
Xeon Server 2: Intel "Stoakley" platform server
CPU: Dual Xeon E5450 at 3GHz
MB: Supermicro X7DWE+/X7DWN+
RAM: 16GB (8x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC
Xeon Server 3: Intel "Bensley" platform server
CPU: Dual Xeon X5365 at 3GHz, Dual Xeon L5320 at 1.86 GHz and Dual Xeon 5080 at 3.73 GHz
MB: Supermicro X7DBE+
RAM: 16GB (8x2GB) Crucial Registered FB-DIMM DDR2-667 CL5 ECC
NIC: Dual Intel PRO/1000 Server NIC
Opteron Server: Supermicro SC828TQ-R1200LPB 2U Chassis
CPU: Dual AMD Opteron 8384 at 2.7GHz or Dual AMD Opteron 8356 at 2.3GHz
MB: Supermicro H8QMi-2+
RAM: 24GB (12x2GB) DDR2-800
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1200W w/PFC (Model PWS-1K22-1R)
vApus/DVD Store/Oracle Calling Circle Client Configuration
CPU: Intel Core 2 Quad Q6600 2.4GHz
MB: Foxconn P35AX-S
RAM: 4GB (2x2GB) Kingston DDR2-667
NIC: Intel Pro/1000
The Platform: ASUS RS700-E6/RS4
We were quite surprised to see that Intel chose the ASUS RS700-E6/RS4 barebone, but it became clear that ASUS is really gearing up to compete with companies like Supermicro and Tyan. This ASUS 1U barebone has a new Tylersburg-36D (Intel 5520) chipset and ICH10R Southbridge.
The ASUS RS700-E6 is a completely cable-less design, which is quite rare. According to ASUS, the gold finger mating mechanism delivers more reliable signal quality. That is hard to verify, but it is clear that a loose connection is much less likely than with cables. We have only had the server in the labs a few weeks, so it is too early to talk about the reliability, but we can say that the build quality of the server is excellent. The 6-phase power regulation that feeds each CPU comes from very high quality solid capacitors that are guaranteed to survive 5 years of working at 86°C (typically this is only 2 years). The same is true for the 3-phase memory power regulation. A special energy process unit (EPU) steers the VRMs to obtain higher power efficiency.
A rather unique feature is that this 1U server also supports two full height PCI-E expansion slots and one half-height slot (close to the PSU). The two full height slots are PCI-E x16 slots and the low profile slot is PCI-E x8. In addition, you can add a proprietary PIKE card, which allows you to add a SAS controller. This can be an LSI 1064E Software RAID solution (RAID 0 or 1) or a real hardware RAID card (the LSI 1078) with support for RAID 0, 1, 10, 5 and even 6.
The expandability is thus excellent, especially if you consider that the ASUS RS700 has room for two (1+1) redundant PSUs. We still have a few items on our wish list, though. We would like a less exotic video card with slightly more video RAM; ASUS uses the AST2050 with only 8MB. While many people will never use the onboard video, some of us do need to use it from time to time. The card comes with decent Windows and Linux drivers. Our distribution (SUSE SLES10SP2) would only work well at 1024x768 and refused to work in text mode until we installed the video driver, so it took a bit of tinkering before we were even able to install the right driver.
ESX 3.5 Update 3 does not recognize the new Intel SATA controller well, but luckily the ASUS server can be equipped with an ESX3i USB stick. ASUS offers a special USB port inside the server to attach the stick. We are currently circumventing the SATA-ESX issue with an install via ftp.
Overall, this is one of the finest 1U barebones that we have seen to date. We are pleased with the expandability, the excellent fabrication quality, and the 3-year warranty that ASUS provides.
44 Comments
snakeoil - Monday, March 30, 2009
oops it seems that hyperthreading is not scaling very well too bad for intel
eva2000 - Tuesday, March 31, 2009
Bloody awesome results for the new 55xx series. Can't wait to see some of the larger vBulletin forums online benefiting from these monsters :)
ssj4Gogeta - Monday, March 30, 2009
huh?
ltcommanderdata - Monday, March 30, 2009
I was wondering if you got any feeling whether Hyperthreading scaled better on Nehalem than Netburst? And if so, do you think this is due to improvements made to HT itself in Nehalem, just due to Nehalem's 4+1 instruction decoders and more execution units, or because software is better optimized for multithreading/hyperthreading now? Maybe I'm thinking mostly desktop, but HT had kind of a hit or miss reputation in Netburst, and it'd be interesting to see if it just came before its time.
TA152H - Monday, March 30, 2009
Well, for one, the Nehalem is wider than the Pentium 4, so that's a big issue there. On the negative side (with respect to HT increase, but really a positive) you have better scheduling with Nehalem, in particular, memory disambiguation. The weaker the scheduler, the better the performance increase from HT, in general.
I'd say it's both. Clearly, the width of Nehalem would help a lot more than the minor tweaks. Also, you have better memory bandwidth, and in particular, a large L1 cache. I have to believe it was fairly difficult for the Pentium 4 to keep feeding two threads with such a small L1 cache, and then you have the additional L2 latency vis-a-vis the Nehalem.
So, clearly the Nehalem is much better designed for it, and I think it's equally clear software has adjusted to the reality of more computers having multiple processors.
On top of this, these are server applications they are running, not mainstream desktop apps, which might show a different profile with regards to Hyper-threading improvements.
It would have to be a combination.
JohanAnandtech - Monday, March 30, 2009
The L1-cache and the way that the Pentium 4 decoded was an important (maybe even the most important) factor in the mediocre SMT performance. Whenever the trace cache missed (and it was quite small, roughly the equivalent of 16 KB), the Pentium 4 had only one real decoder. This means that you have to feed two threads with one decoder. In other words, whenever you get a miss in the trace cache, HT did more harm than good in the Pentium 4. That clearly is not the case in Nehalem with excellent decoding capabilities and larger L1.
And I fully agree with your comments, although I don't think mem disambiguation has a huge impact on the "usefulness" of SMT. After all, there are lots of reasons why the ample execution resources are not fully used: branches, L2-cache misses etc.
IntelUser2000 - Tuesday, March 31, 2009
Not only that, Pentium 4 had the Replay feature to try to make up for having such a long pipeline stage architecture. When Replay went wrong, it would use resources that would be hindering the 2nd thread.
Core uarch has no such weaknesses.
SilentSin - Monday, March 30, 2009
Wow...that's just ridiculous how much improvement was made, gg Intel. Can't wait to see how the 8-core EX's do, if this launch is any indication that will change the server landscape overnight.
However, one thing I would like to see compared, or slightly modified, is the power consumption figures. Instead of an average amount of power used at idle or load, how about a total consumption figure over the length of a fixed benchmark (i.e., how much power was used while running SPECint). I think that would be a good metric to illustrate very plainly how much power is saved from the greater performance with a given load. I saw the chart in the power/performance improvement on the Bottom Line page but it's not quite as digestible or as easy to compare as a straight kW per benchmark figure would be. Perhaps give it the same time range as the slowest competing part completes the benchmark in. This would give you the ability to make a conclusion like "In the same amount of time the Opteron 8384 used to complete this benchmark, the 5570 used x watts less, and spent x seconds in idle". Since servers are rarely at 100% load at all times it would be nice to see how much faster it is and how much power it is using once it does get something to chew on.
Anyway, as usual that was an extremely well done write up, covered mostly everything I wanted to see.
7Enigma - Wednesday, April 1, 2009
I think that is a very good method for determining total power consumption. Obviously this doesn't show cpu power consumption, but more importantly the overall consumption for a given unit of work.
Nice thinking.
JohanAnandtech - Wednesday, April 1, 2009
I am trying hard, but I do not see the difference with our power numbers. This is the average power consumption of one CPU during 10 minutes of DVD-store OLTP activity. As readers have the performance numbers, you can perfectly calculate performance/watt or per kWh. Per server would be even better (instead of per CPU) but our servers were too different.
Or am I missing something?
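The two views discussed in this thread (average power under load versus total energy over a fixed window) can be related with a little arithmetic. The sketch below assumes each system draws a constant load power while busy and a constant idle power afterwards. The function names and all wattage and timing figures are hypothetical, chosen only for illustration, and are not measurements from the article.

```python
def energy_wh(avg_power_w, seconds):
    """Energy in watt-hours for a constant average power over a duration."""
    return avg_power_w * seconds / 3600.0


def energy_over_window(load_w, idle_w, busy_s, window_s):
    """Energy (Wh) for a system that is busy for busy_s seconds and then
    idles for the rest of a fixed comparison window of window_s seconds."""
    if busy_s > window_s:
        raise ValueError("busy time exceeds the comparison window")
    return energy_wh(load_w, busy_s) + energy_wh(idle_w, window_s - busy_s)


# Hypothetical figures: the window is the time the slower system needs
# to finish the same fixed amount of work.
window_s = 600
slow_wh = energy_over_window(load_w=300, idle_w=200, busy_s=600, window_s=window_s)
fast_wh = energy_over_window(load_w=330, idle_w=150, busy_s=360, window_s=window_s)

print(f"slower system: {slow_wh:.1f} Wh, faster system: {fast_wh:.1f} Wh")
print(f"energy saved over the window: {slow_wh - fast_wh:.1f} Wh")
```

With these made-up inputs the faster system draws more power under load yet uses less energy over the window, because it finishes early and spends the remainder at idle. Average load power alone would not show that.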