The Secret Boost of the Opteron 2224
Socket F Opterons have a small secret weapon: the latest speed bump offers more than just a faster clock. To understand why, take a look at the table below, where we measured L2 cache bandwidth with Lavalys Everest 3.51.
Lavalys Everest 3.51 L2 Bandwidth
CPU | Read (MB/s) | Write (MB/s) | Copy (MB/s)
Dual Xeon 5160 3.0 GHz | 22019 | 17751 | 23628
Xeon E5345 2.33 GHz | 17610 | 14878 | 18291
Opteron 2224 SE 3.2 GHz | 14636 | 12636 | 14630
Opteron 8218 HE 2.6 GHz | 11891 | 10266 | 11891
The L2 cache of the Opteron 8218 at 2.6GHz is slower than the Core 2's L2 cache at 2.33GHz. At about 10-11 GB/s it barely matches the theoretical peak bandwidth that dual-channel DDR2-667 can deliver (10.6 GB/s), while its exclusive nature also forces it to exchange quite a bit of data with the L1 cache. Now combine this table with the memory bandwidth measurements below.
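For reference, the 10.6 GB/s figure is simply the rated peak of two DDR2-667 channels; here is a quick back-of-the-envelope check in C. The transfer rate and bus width below are the standard DDR2 specifications, not anything measured on our systems.

```c
#include <stdio.h>

/* Rough check of the theoretical DDR2-667 peak quoted above:
 * 64-bit channel x ~667 million transfers/s x 2 channels. */
int main(void) {
    double transfers_per_sec  = 666.67e6;  /* DDR2-667: ~666.67 million transfers/s */
    double bytes_per_transfer = 8.0;       /* 64-bit wide channel                   */
    int    channels           = 2;         /* dual-channel configuration            */

    double peak_gbs = transfers_per_sec * bytes_per_transfer * channels / 1e9;
    printf("Per channel : %.1f GB/s\n", peak_gbs / channels);  /* ~5.3 GB/s (PC2-5300) */
    printf("Dual channel: %.1f GB/s\n", peak_gbs);             /* ~10.6-10.7 GB/s      */
    return 0;
}
```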
Lavalys Everest 3.51 Memory Bandwidth
CPU | Read (MB/s) | Write (MB/s) | Copy (MB/s) | Latency (ns)
Dual Xeon 5160 3.0 GHz | 3656 | 2771 | 3800 | 112.2
Xeon E5345 2.33 GHz | 3578 | 2793 | 3665 | 114.9
Opteron 2224 SE 3.2 GHz | 7466 | 6980 | 6863 | 58.9
Opteron 8218 HE 2.6 GHz | 6944 | 6186 | 5895 | 64.0
It is no secret that a higher clocked integrated memory controller can increase the actual delivered bandwidth of the same DDR2 modules. But it also helps that the L2 cache is able to swallow the bandwidth that the memory is capable of delivering. Also notice that without the use of SSE2 instructions, the memory subsystem of the 5000P chipset delivers relatively disappointing amounts of bandwidth. As most applications do not use carefully tuned SSE2 code to get data from memory, this should reflect the real-world situation most of the time. And of course, until Intel introduces the Nehalem family, memory latency will remain one of AMD's strong points.
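To give an idea of what "carefully tuned SSE2 code" means here, below is a minimal sketch contrasting a plain copy loop with one that uses SSE2 non-temporal stores. The buffer size and function names are purely illustrative, and this is not the routine Everest itself runs.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Plain copy loop: what most compiled code does; at best 8 bytes per iteration,
 * with every store going through the cache hierarchy. */
static void copy_scalar(uint64_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* SSE2 copy with non-temporal (streaming) stores: 16 bytes per instruction,
 * and the stores bypass the cache on their way to memory. */
static void copy_sse2(void *dst, const void *src, size_t bytes) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* aligned 128-bit load       */
        _mm_stream_si128(&d[i], v);         /* non-temporal 128-bit store */
    }
    _mm_sfence();                           /* flush the write-combining buffers */
}

int main(void) {
    size_t bytes = 64 * 1024 * 1024;        /* 64 MB: far bigger than any L2 here */
    void *src, *dst;
    if (posix_memalign(&src, 16, bytes) || posix_memalign(&dst, 16, bytes))
        return 1;
    memset(src, 1, bytes);

    copy_scalar(dst, src, bytes / 8);       /* time this...    */
    copy_sse2(dst, src, bytes);             /* ...against this */

    free(src);
    free(dst);
    return 0;
}
```

The streaming stores skip the cache entirely, which is typically how hand-tuned copy routines squeeze more bandwidth out of the FSB; ordinary compiled code rarely does this, which is why the plain numbers above are the more representative ones.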
Processor Latency Comparison
CPU | L1 (cycles) | L2 (cycles) | L3 (cycles) | Min mem (cycles) | Max mem (cycles) | Absolute latency (ns)
Xeon 5160 3.0 - DDR2 533 | 3 | 14 | - | 69 | 380 | 127
Xeon 5160 3.0 - DDR2 667 | 3 | 14 | - | 67 | 338 | 113
Core 2 Duo 2.933 - DDR2 533 | 3 | 14 | - | 67 | 180 | 61
Quad Xeon E5345 2.33 - DDR2 533 | 3 | 14 | - | 80 | 280 | 120
Quad Xeon E5345 2.33 - DDR2 667 | 3 | 14 | - | 80 | 271 | 116
Xeon 7130M 3.2 - DDR2 400 | 4 | 29 | 109 | 245 | 624 | 195
Opteron 880 2.4 - DDR333 | 3 | 12 | - | 84 | 228 | 95
Opteron 2224 SE - DDR2 667 | 3 | 12 | - | 72 | 189 | 59
Opteron 2218 HE - DDR2 667 | 3 | 12 | - | 62 | 157 | 60
The latency penalty that FB-DIMMs introduce is huge. To get an idea, we added the latency measured with a Core 2 Duo 2.933 using 2x 2GB of DDR2-533. The staggering conclusion is that FB-DIMMs add - in the worst case - about 200 cycles or 66ns of latency. Sure, some of that latency can be attributed to the buffering which is necessary for server memory: buffered memory contains registers which hold the data for a full clock cycle before it is passed on. But that means ordinary registered memory should only add about 8ns (two clock cycles at the 266MHz base clock of DDR2-533); the rest of the penalty is the price of the FB-DIMM approach itself.
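As a sanity check on that arithmetic, the short program below simply reproduces the relevant numbers from the latency table; the cycle counts and clock speeds are the ones listed above, nothing new is measured.

```c
#include <stdio.h>

int main(void) {
    /* Worst-case memory latency taken from the table above (cycles, GHz). */
    double xeon_cycles  = 380.0, xeon_ghz  = 3.0;    /* Xeon 5160 with FB-DIMMs        */
    double core2_cycles = 180.0, core2_ghz = 2.933;  /* Core 2 Duo with plain DDR2-533 */

    double xeon_ns  = xeon_cycles  / xeon_ghz;       /* ~127 ns */
    double core2_ns = core2_cycles / core2_ghz;      /* ~61 ns  */
    printf("Xeon 5160 3.0 GHz with FB-DIMMs   : %.0f cycles = %.0f ns\n", xeon_cycles, xeon_ns);
    printf("Core 2 Duo 2.933 GHz with DDR2-533: %.0f cycles = %.0f ns\n", core2_cycles, core2_ns);
    printf("FB-DIMM penalty: ~%.0f cycles, ~%.0f ns\n",
           xeon_cycles - core2_cycles, xeon_ns - core2_ns);  /* ~200 cycles, ~65-66 ns */

    /* The register on an ordinary RDIMM only holds commands for about two cycles
     * of the 266 MHz DDR2-533 base clock, i.e. roughly the 8 ns mentioned above. */
    double register_ns = 2.0 / 0.266;                /* ~7.5 ns */
    printf("Register hold time: ~%.1f ns\n", register_ns);
    return 0;
}
```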
The secondary benefit of FB-DIMMs is that motherboards can use more DIMMs per channel, potentially increasing total memory capacity. However, AMD already gets around this quite easily with up to eight DIMM sockets per CPU socket, so this benefit rarely materializes in any meaningful form. The bottom line is that while FB-DIMMs may have been a good idea from a purely theoretical point of view, it is rather obvious that in practice they have some pretty unpleasant consequences.
30 Comments
2ManyOptions - Monday, August 6, 2007 - link
... for most of the benchmarks Intel chips performed better than the Opterons, don't know why Intel should get scared by these, they can safely wait for Barcelona. Didn't really understand why you have put it as AMD is still in the game with these in the 4S space.
baby5121926 - Monday, August 6, 2007 - link
Intel got scared because they don't want to see the real result from AMD + ATI. The longer Intel lets AMD live, the more danger Intel will be in.
That's why you guys can see Intel attacking AMD really, really hard right now... just to kick AMD out of the game.
Justin Case - Monday, August 6, 2007 - link
What are the units in the WinRAR results table?
coldpower27 - Monday, August 6, 2007 - link
Check Intel's own pricing lists, and you will see that Intel has already pre-empted some of these cuts with their Xeon X5355 at $744 or Xeon E5345 at $455, and the "official" Xeon X5365 should be out soon if not already... http://www.intel.com/intel/finance/pricelist/proce...
TheOtherRizzo - Monday, August 6, 2007 - link
I know nothing about 4S servers. But what's the essence of this article? Surely not that NetBurst is crap? We've known that for years. Is the real story here that Intel doesn't really give a s*** about 4S, otherwise they would have moved on to the Core 2 architecture long ago? Just guessing.
coldpower27 - Monday, August 6, 2007 - link
The Xeon 7300 series, based on the Tigerton core which is a 4-socket capable Kentsfield/Clovertown derivative, is arriving in September this year, so Intel does care about becoming more competitive in the 4S space, it is just taking some time. They decided to concentrate on the high-volume 2S sector first is all; since Intel has massive capacity, going for the high-volume sector first makes sense.
mino - Monday, August 13, 2007 - link
Yes and no; actually, having two Intel quads running on a single FSB was a serious technical problem. Therefore they had to wait for a 4-FSB chipset to be able to get them out the door. Not to mention the qualification times, which are a bit longer for 4S platforms than 2S.
AMD does not have these obstacles, as the 8xxx series are essentially the 2xxx series from a stability/reliability POV.
Calin - Monday, August 6, 2007 - link
The 5160 processor is a Core 2 unit, not a NetBurst one. Also, the 5345 is a quad core based on Core 2.
People built 3.0GHz - 3.33GHz E4300 & E4400 systems six months ago that cost roughly $135 for the CPU. Others went for an E6300 or more recently an E6320, both again under $200. They were all relatively easy overclocks.
Why does anyone with any skill in building their own computer care about an $800+ CPU again?
Calin - Monday, August 6, 2007 - link
Why don't Ford Mustangs use a small engine, overclocked to hell? Like an inline-4 2.0l with a turbo and high rpm, instead of their huge 4+ liter engines? Why do trucks use those big engines, when they could get the same power from a smaller, gasoline, turbocharged engine?
People pay $800+ for processors that work in multiprocessor systems (your run-of-the-mill Athlon64 or E4300 won't). Also, they use error checking (and usually error correcting) memory in their systems - again, the Athlon64 doesn't do this. They also use registered DDR in order to access more memory banks - your Athlon64 again falls short. On the E4300 side, the chipset is responsible for those things, so you could use such a processor in a server chassis - if the socket fits.