Intel's newest Quad Xeon MP versus HP's DL585 Quad Opteron
by Johan De Gelas on November 10, 2006 12:00 PM EST. Posted in IT Computing
The Xeon 70xx
Tulsa, or the Xeon MP 71xx, is the last Mohican of the "NetBurst / Pentium 4" tribe. It is the successor of the Xeon MP 70xx, also known as the infamous Paxville CPU. The Xeon MP 70xx was one of the worst CPUs in history from a performance/Watt point of view. The max TDP of Paxville was no less than 173W, and the CPU was limited to "only" 3 GHz, which is low for a NetBurst CPU, as NetBurst CPUs were initially built for 4 GHz and more. According to Intel's own graphs, the fastest Opteron beats the best Xeon MP 7041 by no less than 30% in integer benchmarks...
... and by no less than 76% in Java Server benchmarks!
Needless to say, the Xeon 70xx is and was a small disaster and one of the reasons why AMD's Opteron gained so much support so quickly. With that kind of heritage, the expectations for the Xeon MP 71xx, aka Tulsa, are not high. Is Tulsa yet another power gobbling CPU which can't outperform the competition? Although the CPU is sitting completely in the shadow of Intel's newest Core based Xeons, Intel engineering did spend a lot of time on trying to make the last NetBurst CPU perform well and consume less.
Tulsa is a dual core Xeon built on Intel's very successful 65 nm process. It is a true dual core, with both cores sharing some control logic and a large L3 cache which can be 4, 8 or 16 MB in size. Tulsa can scale up to 3.4 GHz, but we tested the more affordable 3.2 GHz version with 8 MB cache.
The Tulsa Die
The biggest Tulsa die weighs in at 435 mm², a result of containing 1.3 billion transistors. By using slower but 3 times less "leaky" transistors, and letting the parts of the caches that are not accessed "sleep", the caches consume less than 1 W/MB. Tulsa can be used as an upgrade for Paxville and uses the same "Truland platform" with the Twin Castle chipset. If that sounds like gibberish, the Truland platform has been tested and explained here at AnandTech by Jason.
Two independent 800 MHz FSBs give each of the 2 sockets (4 cores) a 6.4GB/s pipe to the Northbridge. By using four XMBs (eXternal Memory Bridge), capacity and bandwidth are maximized. The XMBs find a place on a hot swappable memory board, and each XMB drives 4 memory slots. Below you can see the memory board; the XMB is under the heatsink.
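The 6.4GB/s figure follows directly from the bus parameters. As a rough sketch, assuming the usual 64-bit wide Xeon front side bus (the width is not stated in this article) and reading "800 MHz" as the effective transfer rate, the arithmetic looks like this; the function name is ours, purely for illustration:

```python
def fsb_bandwidth_gb_s(transfers_mt_s: float, bus_width_bits: int = 64) -> float:
    """Peak bus bandwidth in GB/s (1 GB = 10^9 bytes).

    transfers_mt_s: effective transfer rate in millions of transfers per second.
    bus_width_bits: data width of the bus; 64 bits is an assumption here.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


per_fsb = fsb_bandwidth_gb_s(800)   # one FSB, shared by two sockets
both_fsbs = 2 * per_fsb             # two independent FSBs into the Northbridge

print(f"per FSB: {per_fsb:.1f} GB/s, both FSBs: {both_fsbs:.1f} GB/s")
```

Keep in mind that this is a theoretical peak per bus, and that two sockets (four cores) have to share each bus.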
The big performance booster is Tulsa's L3 cache. Tulsa's massive L3 is protected by Pellston technology. As caches get bigger, the probability of a data error also increases. Pellston can disable a faulty cache line (128 bytes) during BIOS initialization, when all cache lines are checked, or it can even do so while the CPU is processing. The Pellston technology is in fact an algorithm that checks whether a cache line error is the result of a hard error or a soft error. The actual checking of whether a cache line is bad is done by an ECC algorithm on the 32 ECC bits which protect the L3 cache lines. In other words, Pellston makes the ECC-protected cache a little smarter, allowing it to act on ECC errors rather than only report them.
The L3 cache is inclusive: it also contains the contents of the L2 caches. Thanks to the shared and inclusive nature of the L3 cache, coherency traffic between the four CPUs is significantly reduced. Too much coherency traffic can slow down multithreaded applications that share variables among their threads, such as OLTP databases and web servers.
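To make the hard versus soft distinction a bit more concrete, here is a toy model of the decision. It is only a sketch under our own assumptions: Intel does not describe the actual algorithm here, so the "rewrite the line and check again" step, the `CacheLine` class and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    has_hard_defect: bool = False  # hypothetical flag: a permanent fault in the cell
    disabled: bool = False

    def passes_ecc_after_rewrite(self) -> bool:
        # Assumption: a soft (transient) error vanishes once the line is
        # rewritten, while a hard defect keeps failing the ECC check.
        return not self.has_hard_defect


def handle_ecc_error(line: CacheLine) -> str:
    """Classify an ECC error and take the line out of service if it is hard."""
    if line.passes_ecc_after_rewrite():
        return "soft"
    line.disabled = True
    return "hard"


def lines_in_cache(size_mib: int, line_bytes: int = 128) -> int:
    """Number of cache lines in an L3 of the given size."""
    return size_mib * 1024 * 1024 // line_bytes


print(handle_ecc_error(CacheLine()))                      # transient error
print(handle_ecc_error(CacheLine(has_hard_defect=True)))  # permanent fault
print(lines_in_cache(16), "lines of 128 bytes in a 16 MB L3")
```

The last line shows why disabling a line is cheap: with well over a hundred thousand 128-byte lines in the largest L3, losing a handful costs a negligible amount of capacity.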
So higher clock speeds, the newer 65 nm process, much less leaky transistors, and an extra shared L3 should allow the Xeon 71xx "Tulsa" to perform much better than the Xeon 70xx "Paxville" and consume quite a bit less. Considering that the Xeon 71xx has a TDP of 95W at 3 GHz while the Xeon 70xx needed 165W at the same speed, it appears that Intel's engineers have been very successful in reducing power consumption.
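For a quick sense of scale, the two TDP numbers quoted above can be turned into a relative reduction. This is plain arithmetic on the 95W and 165W figures from the text; TDP is a design limit and not measured power draw, so treat the result as indicative only.

```python
def relative_reduction(old: float, new: float) -> float:
    """Fraction by which `new` is lower than `old`."""
    return (old - new) / old


tdp_paxville_w = 165  # Xeon 70xx at 3 GHz, as quoted in the text
tdp_tulsa_w = 95      # Xeon 71xx at 3 GHz, as quoted in the text

print(f"TDP reduction at 3 GHz: {relative_reduction(tdp_paxville_w, tdp_tulsa_w):.1%}")
```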
Intel's own benchmarks indicate 42% higher integer throughput while the clock speed has increased by 13%. The most spectacular graph is the SPECjbb one: according to Intel, the Xeon 7140 is no less than 2.5 times faster than the old Xeon 7041. However, the benchmark is rather vague, as Intel does not reveal whether the JVMs were completely the same. A different JVM can make a big difference. Tulsa also supports EM64T, the XD bit, HW Virtualization Technology and EIST, as you can see from our BIOS setup screenshot.
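One way to read Intel's integer numbers is to separate out the clock speed increase. The sketch below assumes, as a simplification, that throughput scales linearly with clock speed; whatever is left is the gain not explained by frequency alone. The function name is ours.

```python
def gain_beyond_clock(throughput_gain: float, clock_gain: float) -> float:
    """Throughput improvement left over after dividing out the clock increase.

    Both arguments are fractional gains, e.g. 0.42 for +42%.
    Assumes throughput scales linearly with clock speed (a simplification).
    """
    return (1 + throughput_gain) / (1 + clock_gain) - 1


residual = gain_beyond_clock(0.42, 0.13)
print(f"gain not explained by clock speed: {residual:.1%}")
```

Under that assumption roughly a quarter more integer throughput per clock remains, which is consistent with the large shared L3 being the big performance booster, although Intel's graphs by themselves do not prove where the gain comes from.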
88 Comments
duploxxx - Monday, November 13, 2006
It's nice to say that the new Intel systems have RAS support and the AMD one does not; however, keep in mind that you are using an old Opteron socket (you can say you have the latest revision 2006).
AMD's Opteron 800/200-series (1207-pin, Socket F). The 1207-pin Socket F "Santa Rosa" core AMD Opteron CPU features DDR-2 memory support and Virtualization technology, in addition to Memory RAS security.
Slappi - Sunday, November 12, 2006
Please sell your Intel stock and then rewrite the article please.
Thank you,
Slappi
LuxFestinus - Tuesday, November 14, 2006
Taken from Scientia's post here: AMD Forum Board (http://www.amdzone.com/index.php?name=PNphpBB2&...)
Kiijibari - Sunday, November 12, 2006
They are misleading as it is unclear what you mean with "mem bandwidth".
Is it FSB bandwidth? System memory bandwidth? CPU bandwidth...?
It is correct that Intel can deliver 21 GB/s from the memory, however one CPU so far can "just" handle ~11GB/s. So why should 1 Xeon DP have a memory bandwidth of 21 GB/s? That statement is not valid if you limit it to one CPU.
Obviously, you meant the System memory bandwidth, but then I really wonder about your Opteron Socket-F numbers ...
First it would be only fair to write the system bandwidth for a 2P System (or whatever compares to the Intel configuration), too. This would then be ~21 GB/s, too, for a 2P Opteron System, and 42 GB/s for a Quad System.
Then I wonder how you calculate that 8.5 GB/s mentioned with the Socket-F Opterons.
As far as I know, these chips support DDR2-667 and that means 10.6 GB/s, not 8.5. Please be fair and correct at least that obvious error ...
cheers
Kiijibari
spaceoddity - Saturday, November 11, 2006
Hi Johan,
Thank you very much for doing some Linux benchmarks. They are not easy to come by. There are virtually no Linux benchmarks for desktops (perhaps understandable, but frustrating for us Linux users), but for servers, which is where Linux has a sizeable presence, they are always welcome. I hope Anandtech continues to provide good Linux/UNIX benchmarks, and doesn't abandon them for more windoze benchmarks, which are everywhere anyway.
Cheers!
JohanAnandtech - Sunday, November 12, 2006
Well, I firmly believe the market share of Linux servers can only grow and that therefore Linux benchmarking will only get more important. A colleague of mine pointed out that Novell has launched e-directory: a very solid alternative to MS Small Business Server with the same functionality and ease of use, but much cheaper per connection, and with the ability to grow with the enterprise.
It is just yet another reason why Linux on servers is so attractive, besides much lower cost and much more control over your own IT infrastructure.
Justin Case - Saturday, November 11, 2006
In the OpenSSL 1024-bit signs, the quad Opteron has an almost 40% advantage over the Xeon when using 8 threads (in fact, that advantage rises to more than 90% when using optimized binaries), and is still the best of the bunch at 16 threads (32 wasn't tested), and yet the article text completely fails to mention this.
It mentions the point where the (more expensive and more power-hungry) Xeon has its biggest advantage (4 threads, with a whopping 9% advantage over the Opteron), and the point where the Sun server (even more expensive) has its biggest advantage (32 threads, but the Opteron wins if using optimized binaries), but completely ignores the Opteron's trouncing of all the competition at 8 and 16 threads, and the fact that the Xeon 5160 cannot scale past its 4-thread performance at all.
http://images.anandtech.com/reviews/it/2006/tulsa-...
So the fact is the Opteron can handle a load of 10000 signs per second (over 12000 with optimized binaries), while the Xeon can't even reach 6000 (6200 with optimized binaries).
http://images.anandtech.com/reviews/it/2006/woodcr...
And yet, according to the article, "the Opteron no longer beats the Xeon". Huh? A 40% advantage isn't enough to win? Who compared the scores, Diebold?
So what if the Xeon performs better when you cripple the Opteron by reducing the number of threads? In any real-world situation, the server admin is going to use the number of threads that delivers the best performance (and is going to use the optimized binaries, of course, if he's competent). Just because the Xeon tops out at 4 threads doesn't mean the (better) results delivered by the Opteron should be discarded.
If this was a "normal" Anandtech article, I wouldn't be surprised by the bias and "selective reporting", but I never expected Johan to "toe the party line" like this.
JohanAnandtech - Sunday, November 12, 2006
From the article:
Yeah, I am really doing Intel a favor here, pointing out one of the weaknesses of their core architecture and showing yet another very weak point of Netburst.
Again from the article:
More than one thread per core doesn't give any performance advantage (unless you have a multithreaded CPU), so of course a Dual Xeon 5160 doesn't scale beyond 4 threads, just like a Dual Opteron. As OpenSSL scales almost perfectly, the important thing here is performance/core, as you don't want to pay for a multi socket machine if you don't want to.
You should definitely read more carefully. "Selective reporting" would not include the MySQL, Power consumption or even the NUMA specjbb results as they are favorable for the Opteron.
Justin Case - Monday, November 13, 2006
The fact is, the quad Opteron box reviewed (DL585) _can_ sustain higher performance than the Xeon 5160 (close to 90% higher, using optimized binaries), correct? So, unless the Opteron box costs twice as much as the 5160 box (identically supported and configured, apart from the CPUs / MB), it delivers more bang for the buck.
Is this a server test or a CPU core test? It's filed under "IT / Computing", not under "CPU / Chipset", so I have to assume it's supposed to be the former.
So what if one server has twice (or 100 times) as many cores as the other? You might as well argue that the servers must be compared at the same clock speed, with the same amount of on-die cache, or with the same type of memory. All those things might be relevant when comparing CPU architectures (then again...), but not when you're comparing complete systems. The whole point of a server comparison is to see what kind of performance you get for the price. If one server is 70% more expensive but 80% faster, it's still a better deal for people who need the extra performance. That extra performance can be due to a higher clock speed, more CPUs, more cores per CPU, better memory bandwidth, a dedicated coprocessor, magic imps, whatever. But it doesn't make any sense to "compensate" for those variables (or for one of those variables) and ignore the fact that server X can and does deliver better performance than server Y when both make full use of their resources.
At 4 threads, the 5160 is the fastest system of those tested. So what if it has a 20% clock speed advantage? It's still the fastest, right? You're not going to artificially cripple its clock speed to match the others; doing that wouldn't make any sense (because, in the real world, no buyer / server admin would do that). So why cripple the other systems by limiting the number of threads they are running? In that test (with unoptimized binaries), the Sun box reaches the highest performance, period. With optimized binaries, the Opteron box manages to pull slightly ahead. Of course, then you have to take price into account, and maybe for a lot of people the 5160-based server will be the better deal, but you can't say it performs better when, objectively, it does not.
nah - Saturday, November 11, 2006
Great job Johan---as always