AMD's 3rd generation Opteron versus Intel's 45nm Xeon: a closer look
by Johan De Gelas on November 27, 2007 6:00 AM EST. Posted in IT Computing
The memory subsystem (Linux 64-bit)
Most server and HPC applications are well multi-threaded and scale nicely with more cores. With the exception of some rendering engines, this also means that our hard-working quad-core CPUs will require quite a bit more bandwidth when they are processing these multi-threaded applications. In our previous review, we found that:
- Barcelona's L2 cache is 50 to 60% faster than that of the older Opteron (22xx), so each core gets at least 50% more L2 bandwidth.
- Each of Barcelona's L2 caches is almost as fast as the shared L2 cache of a similarly clocked 65nm Core-based Xeon. The Intel Xeon has a big advantage of course: its L2 cache is 8 to 12 times larger!
- Barcelona's single-threaded memory bandwidth is 26% to 50% better than the older Opteron and almost twice as good as what a similar Intel Xeon gets.
The problem is of course the word "single-threaded". Those bandwidth numbers are not telling us what we really want to know: how well does the memory subsystem keep up if all cores are processing several heavy threads?
We only had access to the Intel and GCC compilers, and we felt we should use a different compiler to create our multi-threaded Stream binary. GCC would probably not create the fastest binary, and Intel's compiler might give the Core architecture too many software prefetch hints (or use other tricks that might artificially boost the bandwidth numbers). Alf Birger Rustad helped us out and sent us a multi-threaded, 64-bit Linux Stream binary built with v2.4 of PathScale's C compiler. We used the following compiler switches:
-Ofast -lm -static -mp
We tested with one, two, and four threads on a single CPU. "2 CPUs" means that both sockets were in use: four threads on the dual dual-core systems and eight threads on the dual quad-core systems. In other words, the one- to four-thread results use only one CPU, and the second CPU only comes into play in the "2 CPUs" column.
Note that clock speeds do not really matter, except for the Socket-F Opteron. Although we did not include this in the graph above (to avoid color chaos), the clock speed of the socket-F Opteron only matters for the single-threaded bandwidth numbers. Look at the table below:
AMD vs. AMD Multi-threaded Stream
CPU | 1 thread | 2 threads | 2 CPUs |
Dual Opteron 2212 2.0 | 5474 | 6330 | 12220 |
Dual Opteron 2222 3.0 | 6336 | 6472 | 12664 |
Difference 3GHz vs. 2GHz | 16% | 2% | 4% |
Dual Opteron 23xx | 6710 | 8232 | 16614 |
Difference Opteron 23xx vs. Opteron 22xx | 23% | 30% | 36% |
With one thread, the 2GHz Opteron 2212 is clearly not fast enough to take advantage of the bandwidth that DDR2-667 can deliver. However, once you make both cores work, this is no longer the case. The Opteron 23xx numbers make clear that the deeper buffers really help: each quad-core has about 30% more bandwidth available than the dual-core. That should be more than enough to keep twice as many cores happy.
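The percentage rows of the AMD vs. AMD table can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original test setup: the dictionary layout and the helper names `percent_gain` and `gains` are illustrative, and it assumes that the "Opteron 23xx vs. Opteron 22xx" row uses the 2GHz Opteron 2212 as its reference, since that pairing matches the published percentages.

```python
# Stream results copied from the AMD vs. AMD table.
stream = {
    "Opteron 2212 2.0": {"1 thread": 5474, "2 threads": 6330, "2 CPUs": 12220},
    "Opteron 2222 3.0": {"1 thread": 6336, "2 threads": 6472, "2 CPUs": 12664},
    "Opteron 23xx": {"1 thread": 6710, "2 threads": 8232, "2 CPUs": 16614},
}


def percent_gain(new, old):
    """Return how much higher `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


def gains(faster, slower):
    """Percentage gain per column, rounded to whole percent like the table."""
    return {
        column: round(percent_gain(stream[faster][column], stream[slower][column]))
        for column in stream[faster]
    }


# Clock speed effect on the dual-core Opteron (3GHz vs. 2GHz).
print(gains("Opteron 2222 3.0", "Opteron 2212 2.0"))
# Quad-core vs. dual-core (assumed reference: the 2GHz Opteron 2212).
print(gains("Opteron 23xx", "Opteron 2212 2.0"))
```

The first line of output shows that a 50% higher clock only pays off with a single thread, while the second shows the gain of the quad-core growing as more threads and sockets are used.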
The graph above also quantifies the platform superiority that many ascribe to AMD. Likewise, it confirms that the new Intel platform has a much better memory subsystem thanks to the Seaburg chipset. To make this easier to see, we expressed the bandwidth numbers relative to the "Bensley + Clovertown" platform, which represents our baseline of 100%.
AMD vs. Intel Multi-threaded Stream
Platform | 1 thread | 2 threads | 4 threads | 2 CPUs |
Opteron 23xx | 232% | 207% | 150% | 308% |
Xeon 54xx + Seaburg + 800MHz RAM | 164% | 225% | 158% | 172% |
Xeon 54xx + Seaburg + 667MHz RAM | 159% | 196% | 128% | 138% |
If you use two CPUs, the Opteron 23xx has no less than 3 times the amount of bandwidth compared to the "old" 65nm Xeon. However, it is much less likely that bandwidth will be a bottleneck for the "new" Xeon 45nm as it has 40% to 60% more bandwidth (with the same kind of memory) compared to the "old" Xeon. If necessary, you'll be able to use 800MHz FBDIMMs that will offer more bandwidth (9GB/s versus 7.7GB/s).
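Because the relative table uses the "Bensley + Clovertown" platform as 100%, the absolute Opteron 23xx results from the AMD vs. AMD table can be used to back out an approximate baseline, and from there approximate absolute numbers for the new Xeon. The sketch below is only an estimate under two assumptions that the text does not state explicitly: that the Opteron 23xx rows in both tables come from the same runs, and that the Stream figures are in MB/s. The percentages are rounded, so the results are approximate. All names are illustrative.

```python
# Absolute Opteron 23xx results, copied from the AMD vs. AMD table.
opteron_23xx_abs = {"1 thread": 6710, "2 threads": 8232, "2 CPUs": 16614}

# Relative results (baseline "Bensley + Clovertown" = 100%), from the AMD vs. Intel table.
relative_pct = {
    "Opteron 23xx": {"1 thread": 232, "2 threads": 207, "2 CPUs": 308},
    "Xeon 54xx + Seaburg + 800MHz RAM": {"1 thread": 164, "2 threads": 225, "2 CPUs": 172},
    "Xeon 54xx + Seaburg + 667MHz RAM": {"1 thread": 159, "2 threads": 196, "2 CPUs": 138},
}


def implied_baseline(absolute, percent):
    """Baseline value that makes `absolute` equal to `percent` of it."""
    return absolute / (percent / 100.0)


# Estimated "Bensley + Clovertown" bandwidth per column (assumption: same runs).
baseline = {
    column: implied_baseline(opteron_23xx_abs[column], relative_pct["Opteron 23xx"][column])
    for column in opteron_23xx_abs
}


def estimate_abs(platform, column):
    """Estimated absolute bandwidth for a platform, derived from the baseline."""
    return baseline[column] * relative_pct[platform][column] / 100.0


for platform in relative_pct:
    row = {column: round(estimate_abs(platform, column)) for column in baseline}
    print(platform, row)
```

Under these assumptions, the two-thread estimates for the Xeon 54xx come out at roughly 8.9 GB/s with 800MHz memory and 7.8 GB/s with 667MHz memory, which is in the same range as the 9GB/s and 7.7GB/s quoted above.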
It becomes clear why even a 3GHz Xeon 5365 is not able to beat AMD in SPECFP2006rate: running eight instances of SPECFP2006 is bandwidth limited.
The memory subsystem, latency
To understand the memory subsystem of the different CPUs, we also need to look at latency. We have noticed that many latency measurement benchmarks are inaccurate when you have two CPUs running, so we tested with only one socket filled. Below you can see the numbers for a stride of 128 Bytes, measured with the CPU-Z 1.41 latency test.
CPU-Z Memory Latency (clock cycles)
Data size (kB) | Opteron 2212 2.0 | Opteron 2350 | Opteron 2360SE | Dual Xeon 5472 (DDR2-667) | Xeon E5365 |
4 | 3 | 3 | 3 | 3 | 3 |
8 | 3 | 3 | 3 | 3 | 3 |
16 | 3 | 3 | 3 | 3 | 3 |
32 | 3 | 3 | 3 | 3 | 3 |
64 | 3 | 3 | 3 | 15 | 14 |
128 | 12 | 15 | 15 | 15 | 14 |
256 | 12 | 15 | 15 | 15 | 14 |
512 | 12 | 15 | 15 | 15 | 14 |
1024 | 12 | 44 | 48 | 15 | 14 |
2048 | 114 | 44 | 48 | 15 | 14 |
4096 | 117 | 111 | 121 | 15 | 14 |
8192 | 117 | 113 | 126 | 242 | 215 |
16384 | 117 | 113 | 125 | 344 | 282 |
32768 | 117 | 113 | 126 | 344 | 282 |
The quad-core Opteron had to make a compromise or two. As the 463 million transistor chip is already 285 mm² in size, each core only gets a 512 KB L2 cache. That means that in some situations (>512 KB) the old 90nm Opteron 22xx is better off as it has access to a very fast 12 cycle L2 cache while the Opteron 23xx has to access a rather slow 44-48 cycle L3 cache.
Note also that the 2.5GHz Opteron 2360SE "sees" a slower L3 cache than the 2350: 48 cycles versus 44. The memory controller seems to be OK: the slightly higher latency compared to the Opteron 22xx series is a result of the fact that the Opteron 23xx cores have to check the L3 cache tags, while the Opteron 22xx doesn't have to do that. Notice that the memory latency of the on-die memory controller (roughly 60 ns) is still far better than what the Seaburg or Blackford chipset (roughly 70-90 ns) can offer to the Xeon cores. We have encountered situations where Barcelona's memory controller accesses memory with much higher latencies (86 ns and more) than the Opteron 22xx, but we have to study this in more detail to understand whether it has a real-world impact.
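The latency table is in clock cycles, while the paragraph above talks in nanoseconds. Converting between the two is a single division: latency in ns equals cycles divided by the clock speed in GHz. The sketch below illustrates this with values from the table. The helper name `cycles_to_ns` is our own, and the 2.0GHz clock for the Opteron 2350 is an assumption, since the text only states the clocks of the Opteron 2212 (2.0GHz) and the Opteron 2360SE (2.5GHz).

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core clock cycles to nanoseconds."""
    return cycles / clock_ghz


# (CPU, clock in GHz, L3 latency in cycles or None, memory latency in cycles at 32768 kB)
# Latencies are copied from the CPU-Z table.
# The Opteron 2350 clock (2.0GHz) is an assumption, not stated in the text.
cpus = [
    ("Opteron 2212", 2.0, None, 117),
    ("Opteron 2350", 2.0, 44, 113),
    ("Opteron 2360SE", 2.5, 48, 126),
]

for name, clock_ghz, l3_cycles, mem_cycles in cpus:
    mem_ns = cycles_to_ns(mem_cycles, clock_ghz)
    if l3_cycles is None:
        print(f"{name}: memory {mem_ns:.1f} ns (no L3 cache)")
    else:
        l3_ns = cycles_to_ns(l3_cycles, clock_ghz)
        print(f"{name}: L3 {l3_ns:.1f} ns, memory {mem_ns:.1f} ns")
```

Expressed in nanoseconds, the memory latencies of all three Opterons land in the 50 to 60 ns range. Under the assumed clocks, the 48-cycle L3 of the 2360SE is also not slower in absolute time than the 44-cycle L3 of the 2350; it simply costs more core cycles at the higher clock.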
Native quad-core versus dual dual-core, part 2
Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time.
We noticed that our Cache2Cache benchmark gives more accurate results if you measure same-die transfers with only one CPU installed, and then measure transfers from one CPU die to another with two CPUs installed. Cache2Cache quantifies the delay that a "snooping" CPU encounters when it tries to get up-to-date data from another CPU's cache.
Cache coherency ping-pong (ns)
CPU | Same die, same package | Different die, same package | Different die, different socket |
Opteron 2350 - Stepping B1 | 127 | N/A | 199 |
Opteron 2360SE - Stepping B2 | 107 | N/A | 199 |
Xeon E5472 3.0 | 53 | 150 | 237 |
Xeon E5365 3.0 | 53 | 150 | 237 |
The Xeon syncs very quickly via its shared L2 cache (26.5 ns), but a bit slower from the first core to the third one, which sits on the other die (75 ns). AMD's native quad-core design is a bit faster in the latter case (53.5 ns with the 2360SE). The difference is slightly smaller when you have to sync between two sockets (99.5 ns versus 118.5 ns).
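The one-way figures quoted in the paragraph above follow from the table: since the published ping-pong results are approximately twice the propagation time, halving them gives the approximate store-to-load delay. A minimal sketch, with an illustrative data layout and helper name (`one_way_ns`); `None` stands for the N/A entries:

```python
# Ping-pong results in ns, copied from the table; None marks N/A.
ping_pong_ns = {
    "Opteron 2350 - Stepping B1": {"same die": 127, "other die, same package": None, "other socket": 199},
    "Opteron 2360SE - Stepping B2": {"same die": 107, "other die, same package": None, "other socket": 199},
    "Xeon E5472 3.0": {"same die": 53, "other die, same package": 150, "other socket": 237},
    "Xeon E5365 3.0": {"same die": 53, "other die, same package": 150, "other socket": 237},
}


def one_way_ns(round_trip_ns):
    """Approximate propagation time: half of the published ping-pong value."""
    if round_trip_ns is None:
        return None
    return round_trip_ns / 2


for cpu, results in ping_pong_ns.items():
    print(cpu, {path: one_way_ns(value) for path, value in results.items()})
```

For the Xeon E5472 this yields 26.5 ns, 75 ns, and 118.5 ns; for the Opteron 2360SE it yields 53.5 ns on the same die and 99.5 ns between sockets.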
43 Comments
befair - Friday, November 28, 2008 - link
ok .. getting tired of this! Intel loving Anandtech employs very unfair & unreasonable tactics to show AMD processors in bad light every single time. And most readers have no clue about the jargon Anandtech uses every time.
1 - HPL needs to be compiled with appropriate flags to optimize code for the processor. Anandtech always uses the code that is optimized for Intel processors to measure performance on AMD processors. As much as AMD and Intel are binary compatible, when measuring performance even a college grad who studies HPC knows the code has to be recompiled with the appropriate flags
2 - Clever words: sometimes even 4 GFLOPS is described as significant performance difference
3 - "The Math Kernel Libraries are so well optimized that the effect of memory speed is minimized." - So ... MKL use is justified because Intel processors need optimized libraries for good performance. However, they dont want to use ACML for AMD processors. Instead they want to use MKL optimized for Intel on AMD processors. Whats more ... Intel codes optimize only for Intel processors and disable everything for every other processors. They have corrected it now but who knows!! read here http://techreport.com/discussions.x/8547
I am not saying anything bad about either processor but an independent site that claims to be fair and objective in bringing facts to the readers is anything but fair and just!!! what a load!
DonPMitchell - Friday, December 7, 2007 - link
I think a lot of us are intrigued by AMD's memory architecture, its ability to support NUMA, etc. A lot of benchmarks test how fast a small application runs with a high cache-hit rate, and that's not necessarily interesting to everyone.
The MySQL test is the right direction, but I'd rather see numbers for a more sophisticated application that utilizes multiple cores -- Oracle or MS SQL Server, for example. These are products designed to run on big iron like Unisys multi-proc servers, so what happens when they are running on these more economical Harpertown or Barcelona.
kalyanakrishna - Thursday, November 29, 2007 - link
http://scalability.org/?p=453
kalyanakrishna - Thursday, November 29, 2007 - link
a much better review than the original one. But I still see some cleverly put sentences, wish it were otherwise.
Viditor - Thursday, November 29, 2007 - link
Nice review Johan!
On the steppings note you made, it's not the B2 stepping that is supposed to perform better, it's the BA stepping...
The BA stepping was the improved form for B1s, and the B3 stepping is the improved form of the B2. BA and B2 came out at the same time in Sept (though BA was the one launched, B1 was what was reviewed), B2 for Phenom and performance clockspeeds, BA for standard and low power chips.
Do you happen to have a BA chip to test (those are the production chips)?
BitByBit - Wednesday, November 28, 2007 - link
Despite K10's rather extensive architectural improvements, it looks like its core performance isn't too different to K8. In fact, the gains we've seen so far could easily be attributable to the improved memory controller and increased cache bandwidth. It seems that introducing load reordering, a dedicated stack, improved branch prediction, 32B instruction fetch, and improved prefetching has had little impact, certainly far less than expected. The question is, why?
JohanAnandtech - Wednesday, November 28, 2007 - link
Well, we are still seeing 5-10% better integer performance on applications that are running in the L2, so it is more than just a K8 with a better IMC. But you are right, I expected more too.
However, the MySQL benchmark deserves more attention. In this case the Barcelona core is considerably faster than the previous generation (+ 25%). This might be a case where 32-byte fetch and load reordering are helping big time. But unfortunately our CodeAnalyst failed to give all the numbers we needed.
BaronMatrix - Wednesday, November 28, 2007 - link
At any rate, it was the most in-depth review I've seen, especially with the code analysis. I too, thought it would be higher, but remember that Barcelona is NOT HT3 and doesn't have the advantage of "ganging/unganging." There was an interesting article recently that showed perf CAN be improved by unganging (maybe it was ganging, can't find it) the HT3 links.
I really hate that OEMs decided to stand up to the big, bad AMD and DEMAND that Barcelona NOT have HT3 with ALL OF ITS BENEFITS.
I mean people complain that Barcelona uses more power, but HT3 would cut that somewhat. At least in idle mode, and even in cases where IMC is used more than the CPU or vice versa.
I also may as well use this to CONDEMN all of these "analysts" who insist on crapping on the underdog that keeps prices reasonable and technology advancing.
INSERT SEVERAL EXPLETIVES. REPEATEDLY. FOR A FEW DAYS. A WEEK. FOR A YEAR.
INSERT MORE EXPLETIVES.
donaldrumsfeld - Wednesday, November 28, 2007 - link
Conjecture regarding why AMD went quad core on the same die... and this has nothing to do with performance. I think one place where Intel is way ahead of AMD is package technology. Remember they were doing a type of Multichip module with the P6. Having 2 dice instead of a single die allows them to have an overall lower defect rate, higher yield, and higher GHz. This is vs. AMD's lower GHz but (it was hoped) greater data efficiency using an L3 die and lower latency of on-die communications amongst cores vs. Intel's solution of die to die communication.
Can anyone confirm/deny this?
thanks
tshen83 - Tuesday, November 27, 2007 - link
Seriously, can you buy the 2360SE? Newegg doesn't even stock the 1.7GHz 2344HEs.
The same situation exists on the Phenom line of CPUs. I don't see the value of reviewing Phenom 9700, 9900s when AMD cannot deliver them. I have trouble locating Phenom 9500s.