Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
by Johan De Gelas on March 24, 2006 12:05 AM EST - Posted in IT Computing
The Slim T1 CPU
It is very unfair of us to compare one of the eight very slim T1 cores to mammoths like the Opteron or the Xeon, which have about 10 to 20 times more transistors. Still, we are curious. We know that Sun sacrificed single-threaded performance on the altar of power consumption, multi-threaded performance and die space. How far did they go? Let us find out with LMBench 3.0a. By the way, you can find much more information about the T1 CPU in our previous article.
First, we check the cache latency and RAM latency. For fat modern superscalar cores like the Opteron and Xeon, these numbers are extremely important. The T1 CPU is less sensitive to the latency of the memory subsystem as long as it has enough threads: it swaps out threads that are waiting for memory to respond and runs threads that are ready instead.
CPU (LMBench) | OS | Clock speed (MHz) | L1 (ns) | L1 (cycles) | L2 (ns) | L2 (cycles) | RAM (ns) | RAM (cycles) |
Opteron 275 | SunOS 5.10 | 2211 | 1.357 | 3 | 5.436 | 12 | 67.5 | 149 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 1593 | 1.880 | 3 | 6 | 10 | 92.1 | 147 |
Sun T1 1 GHz | SunOS 5.10 | 980 | 3.120 | 3 | 22.1 | 22 | 107.5 | 105 |
Opteron 275 | Linux 2.6.15- | 2209 | 1.357 | 3 | 5 | 12 | 73 | 161 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 3594 | 1.110 | 4 | 8 | 28 | 48.8 | 175 |
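The cycle columns in the table follow from the nanosecond figures and the measured clock speed. A minimal sketch of that conversion, assuming the clock column is in MHz and that cycle counts are rounded to the nearest integer (the function name is ours, purely for illustration):

```python
def ns_to_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at the given clock (MHz)."""
    # cycles = seconds * Hz = (ns * 1e-9) * (MHz * 1e6) = ns * MHz / 1000
    return latency_ns * clock_mhz / 1000.0


# RAM latency figures taken from the table above: (clock in MHz, latency in ns).
ram_latency = {
    "Opteron 275 (SunOS)": (2211, 67.5),
    "Pentium-M 1.6 GHz": (1593, 92.1),
    "Sun T1 1 GHz": (980, 107.5),
}

for cpu, (mhz, ns) in ram_latency.items():
    print(f"{cpu}: {ns} ns = {round(ns_to_cycles(ns, mhz))} cycles")
```

Keep in mind that a comparison in cycles favours a slower clock: measured in nanoseconds, the T1's RAM latency is the highest in the table.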
Sun has definitely favoured power consumption here. A 3-cycle L1 latency at 1 GHz on a 90 nm process is very conservative. A 22-cycle L2-cache latency is even a bit slow, but again, the thread Gatling gun takes care of that. The built-in memory controllers pay off: RAM latency is about 105 cycles, while even the Pentium-M needs 147 cycles. This helps to keep the average latency (seen from the viewpoint of the CPU) low.
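A rough way to see why the "thread Gatling gun" makes these latencies tolerable is a simple textbook model of fine-grained multithreading. This is a sketch under stated assumptions, not Sun's own description of the T1: every thread alternates between a fixed burst of compute cycles and one memory stall, and the core can switch to any ready thread at no cost. The thread counts go up to the T1's four threads per core; the 30-cycle compute burst is an invented, illustrative value.

```python
def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles a core does useful work in a simple interleaving model.

    Each thread runs for compute_cycles, then waits stall_cycles on memory.
    While one thread waits, the others can run, so up to `threads` compute
    bursts fit into one compute+stall period. Utilization is capped at 1.0.
    """
    period = compute_cycles + stall_cycles
    return min(1.0, threads * compute_cycles / period)


RAM_STALL = 105     # T1 RAM latency in cycles, from the article
COMPUTE_BURST = 30  # assumption: cycles of work between RAM misses (illustrative only)

for n in (1, 2, 4):
    u = core_utilization(n, COMPUTE_BURST, RAM_STALL)
    print(f"{n} thread(s): {u:.0%} of cycles doing useful work")
```

In this model a single thread leaves the core idle most of the time, while each extra thread fills part of the stall window. Real behaviour depends on cache hit rates and on contention between threads, which the sketch ignores.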
Let us see if there is some integer crunching power in the little SPARC core.
CPU (LMBench) | OS | Bit (ns) | Add (ns) | Mul (ns) | Div (ns) | Mod (ns) |
Opteron 275 | SunOS 5.10 | 0.45 | 0.45 | 1.36 | 18.60 | 19.00 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 0.63 | 0.63 | 2.51 | 19.50 | 11.50 |
Sun T1 1 GHz | SunOS 5.10 | 1.01 | 1.00 | 29.10 | 104.00 | 114.00 |
Opteron 275 | Linux 2.6.15- | 0.45 | 0.45 | 1.36 | 18.60 | 19.00 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 0.28 | 0.28 | 2.79 | 17.30 | 23.30 |
The very common ADD instruction is executed in one cycle, but it takes no less than 29 cycles to multiply and 104 to divide. Faster multiplication and division would have taken up much more die space and consumed much more power. Considering that those instructions are very rare in most server workloads, this is a pretty clever trade-off. Update: the Sun documentation tells us 7-11 cycles for multiply and 72 for division.
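To make the trade-off concrete, here is a back-of-the-envelope sketch of how slow multiplies and divides affect the average integer instruction latency. The latencies are the measured LMBench values for the T1 from the table above (in ns, which is roughly cycles at the T1's clock of about 1 GHz). The instruction mixes are made-up, illustrative numbers, not measurements of any real workload, and "add" stands in for all single-cycle operations.

```python
# Measured T1 latencies from the LMBench table (ns, roughly cycles at 980 MHz).
T1_LATENCY = {"add": 1.00, "mul": 29.10, "div": 104.00}


def average_latency(mix: dict[str, float], latency: dict[str, float]) -> float:
    """Weighted average latency for an instruction mix whose fractions sum to 1."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mix fractions sum to {total}, expected 1.0")
    return sum(fraction * latency[op] for op, fraction in mix.items())


# Hypothetical instruction mixes (assumptions for illustration only).
mixes = {
    "server-like (0.5% mul, 0.1% div)": {"add": 0.994, "mul": 0.005, "div": 0.001},
    "math-heavy (10% mul, 2% div)": {"add": 0.88, "mul": 0.10, "div": 0.02},
}

for name, mix in mixes.items():
    print(f"{name}: {average_latency(mix, T1_LATENCY):.2f} ns per instruction")
```

With multiplies and divides as rare as in the first hypothetical mix, the average stays close to the single-cycle case; it only balloons when they make up a sizeable share of the instruction stream.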
Let us check out what the lonely FPU of the T1 can do.
CPU (LMBench) | OS | FADD (ns) | FMUL (ns) | FDIV (ns) |
Opteron 275 | SunOS 5.10 | 1.80 | 1.80 | 10.90 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 1.88 | 3.14 | 23.90 |
Sun T1 1 GHz | SunOS 5.10 | 26.50 | 29.30 | 54.20 |
Opteron 275 | Linux 2.6.15- | 1.81 | 1.81 | 9.58 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 1.39 | 1.95 | 12.60 |
FADD and FMUL are a little faster than what we first reported (40 cycles), and the main part of that latency might just consist of getting the data to the FPU of the T1. It is clear that the Sun T1 doesn't like FP code at all.
26 Comments
JackPack - Friday, March 24, 2006
Pleasant to read as usual, Johan.
BTW, are they letting you keep the T2000?
http://blogs.sun.com/roller/page/jonathan?entry=ni...
PandaBear - Friday, March 24, 2006
In terms of branded servers it is a good price, but as the benchmarks have shown, a dual Opteron running Linux both performs better and uses less power. I think people who buy this class of server want support and service (and build quality), and in that case Sun certainly would win over the whitebox builder no matter how good a dual Opteron is.
Nonetheless it is a good product for those who demand this kind of quality. Now Intel's solution really looks bad.
Calin - Friday, March 24, 2006
I don't know what you are talking about - if you would up the memory on the Opteron HE (2 CPUs of 2 cores) to 32GB, the power consumption would be almost the same (assuming 6W per 4GB of RAM, it would be at 234W). Close enough to be considered equal, I'd say.
Also, wouldn't populating all the possible memory slots on the Opteron decrease its performance a bit? I don't know about Opteron, but Athlon64 decreases its command rate (Help, Johan! :) ) when working with all the memory channels filled.
I agree about the better performance of the Opteron server, but regarding the power use, it is the same as Sun's recent offering. Maybe the introduction of the DDR2 Opterons would change the power envelope, but until then, the T1 might have some aces up its sleeve.
JohanAnandtech - Friday, March 24, 2006
You must calculate about 4-5 Watt per 2 GB DIMM. Based on the measurements I did and slightly guessing, I think an Opteron HE with 32 GB would definitely consume more than the T2000, as you also have to count a few Watts per memory channel.
Indeed, fully loaded DIMM channels will probably throttle back to lower speeds. I am not sure about command rate though (BTW, it increases on the Athlon 64, not decreases :-), as it is possibly less important with buffered DIMMs.
About performance, we still have to test a lot of scenarios (JSP, databases). The impression of the T2000 might still change.
Zoomer - Sunday, April 9, 2006
2xx Opterons use registered RAM, so it's not an issue like with the 1xx 939s.
Calin - Friday, March 24, 2006
I just took the difference measured between the 2x Opteron HE with 4 and 8 GB of RAM (192 and 198W), shown in the table on the last page. I know that even rounding errors might change that to anywhere between 4 and 8W, but anyway, Opterons won't use less power than the T1.
Very interesting article, and I eagerly await the sequels :D