Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
by Johan De Gelas on March 24, 2006 12:05 AM EST
Posted in: IT Computing
The Slim T1 CPU
It is very unfair of us to compare one of the eight very slim T1 cores to mammoths like the Opteron or the Xeon, which have about 10 to 20 times more transistors. Still, we are curious. We know that Sun sacrificed single-threaded performance on the altar of power consumption, multi-threaded performance and die space. How far did they go? Let us find out with LMBench 3.0a. By the way, you can find much more information about the T1 CPU in our previous article.
First, we check the cache latency and RAM latency. For fat modern superscalar cores like the Opteron and Xeon, these numbers are extremely important. The T1 CPU is less sensitive to the latency of the memory subsystem as long as it has enough threads: when a thread stalls waiting for memory, the T1 switches to another thread that is ready to run.
CPU (LMBench) | OS | Clock speed (MHz) | L1 (ns) | L1 (cycles) | L2 (ns) | L2 (cycles) | RAM (ns) | RAM (cycles) |
Opteron 275 | SunOS 5.10 | 2211 | 1.357 | 3 | 5.436 | 12 | 67.5 | 149 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 1593 | 1.880 | 3 | 6 | 10 | 92.1 | 147 |
Sun T1 1 GHz | SunOS 5.10 | 980 | 3.120 | 3 | 22.1 | 22 | 107.5 | 105 |
Opteron 275 | Linux 2.6.15- | 2209 | 1.357 | 3 | 5 | 12 | 73 | 161 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 3594 | 1.110 | 4 | 8 | 28 | 48.8 | 175 |
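The cycle columns in the table above follow from the nanosecond figures and the measured clock speed. As a rough sketch of that arithmetic (the function name and the dictionary are ours, purely for illustration, and the Pentium-M L1 figure is read as 1.880 ns):

```python
# Measured clock speed (MHz) and L1 / L2 / RAM latencies (ns), copied from the table.
MEASUREMENTS = {
    "Opteron 275 (SunOS 5.10)": (2211, 1.357, 5.436, 67.5),
    "Pentium-M 1.6 GHz": (1593, 1.880, 6.0, 92.1),
    "Sun T1 1 GHz": (980, 3.120, 22.1, 107.5),
    "Xeon Irwindale 3.6 GHz": (3594, 1.110, 8.0, 48.8),
}


def ns_to_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to clock cycles."""
    # cycles = seconds * Hz = (ns * 1e-9) * (MHz * 1e6) = ns * MHz / 1000
    return latency_ns * clock_mhz / 1000.0


for cpu, (mhz, l1_ns, l2_ns, ram_ns) in MEASUREMENTS.items():
    l1, l2, ram = (ns_to_cycles(ns, mhz) for ns in (l1_ns, l2_ns, ram_ns))
    print(f"{cpu}: L1 {l1:.1f}, L2 {l2:.1f}, RAM {ram:.1f} cycles")
```

Rounding the results gives the whole-cycle values listed in the table, which is why a CPU with a slower clock can show a higher latency in nanoseconds and still a lower one in cycles.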
Sun has definitely favoured power consumption here. A 3-cycle L1 latency at 1 GHz on a 90 nm process is very conservative. A 22-cycle L2-cache latency is even a bit slow, but again, the thread Gatling gun takes care of that. The built-in memory controllers pay off: RAM latency is about 105 cycles, while even the Pentium-M needs 147 cycles. This helps to keep the average latency (as seen from the CPU's viewpoint) low.
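To get a feel for how the "thread Gatling gun" can cover these latencies, here is a deliberately idealised back-of-the-envelope model, not a description of how the T1 actually schedules. It assumes each thread alternates between a fixed burst of useful work and a fixed stall, that switching threads is free, and that a core holds four hardware threads (Sun's figure for the T1, not stated in this section). The 20-cycle burst between misses is a purely hypothetical number.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which an idealised core has a runnable thread.

    Each thread is assumed to run for compute_cycles, then stall for
    stall_cycles, repeating forever; thread switches are assumed to be free.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    period = compute_cycles + stall_cycles  # one work-plus-stall round of a thread
    return min(1.0, threads * compute_cycles / period)


L2_LATENCY_CYCLES = 22    # measured T1 L2 latency from the table
RAM_LATENCY_CYCLES = 105  # measured T1 RAM latency from the table
HYPOTHETICAL_COMPUTE_CYCLES = 20  # assumed work between misses, for illustration only

for threads in (1, 2, 4):
    l2 = core_utilization(threads, HYPOTHETICAL_COMPUTE_CYCLES, L2_LATENCY_CYCLES)
    ram = core_utilization(threads, HYPOTHETICAL_COMPUTE_CYCLES, RAM_LATENCY_CYCLES)
    print(f"{threads} thread(s): {l2:.0%} busy on L2 hits, {ram:.0%} busy on RAM accesses")
```

Under these assumptions a single thread leaves the core idle most of the time, while a few threads are enough to hide the L2 latency completely. Real workloads will differ, since the stall pattern is neither fixed nor uniform.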
Let us see if there is some integer crunching power in the little Sparc core.
CPU (LMBench) | OS | Bit (ns) | Add (ns) | Mul (ns) | Div (ns) | Mod (ns) |
Opteron 275 | SunOS 5.10 | 0.45 | 0.45 | 1.36 | 18.60 | 19.00 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 0.63 | 0.63 | 2.51 | 19.50 | 11.50 |
Sun T1 1 GHz | SunOS 5.10 | 1.01 | 1.00 | 29.10 | 104.00 | 114.00 |
Opteron 275 | Linux 2.6.15- | 0.45 | 0.45 | 1.36 | 18.60 | 19.00 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 0.28 | 0.28 | 2.79 | 17.30 | 23.30 |
The very common ADD instruction is executed in one cycle, but it takes no less than 29 cycles to multiply and 104 to divide. Faster multiplication and division would have taken up much more die space and consumed much more power. Considering that those instructions are very rare in most server workloads, this is a pretty clever trade-off. Update: the Sun documentation tells us 7-11 cycles for multiply and 72 for division.
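How much the slow multiply and divide hurt depends on how often they occur. The sketch below weights the LMBench latencies quoted above by an instruction mix; the mix itself is entirely hypothetical, every other instruction is treated as a one-cycle ADD, and the latencies are counted as if nothing overlapped them, so the result is at best a pessimistic illustration.

```python
# T1 latencies in cycles as quoted in the text (LMBench measurements).
T1_CYCLES = {"add": 1, "mul": 29, "div": 104}


def average_cycles(mix, cycles=T1_CYCLES):
    """Weighted average latency per instruction for a given instruction mix.

    mix maps an operation name to its fraction of all executed instructions.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[op] for op, fraction in mix.items())


# Hypothetical mix, for illustration only: 1.5% multiplies, 0.5% divides.
HYPOTHETICAL_MIX = {"add": 0.98, "mul": 0.015, "div": 0.005}

print(f"Average latency: {average_cycles(HYPOTHETICAL_MIX):.3f} cycles per instruction")
```

The same function can be called with a `cycles` dictionary built from Sun's documented figures to see how much the update at the end of the paragraph changes the picture.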
Let us check out what the lonely FPU of the T1 can do.
CPU (LMBench) | OS | FADD (ns) | FMUL (ns) | FDIV (ns) |
Opteron 275 | SunOS 5.10 | 1.80 | 1.80 | 10.90 |
Pentium-M 1.6 GHz | Linux 2.6.15- | 1.88 | 3.14 | 23.90 |
Sun T1 1 GHz | SunOS 5.10 | 26.50 | 29.30 | 54.20 |
Opteron 275 | Linux 2.6.15- | 1.81 | 1.81 | 9.58 |
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 1.39 | 1.95 | 12.60 |
FADD and FMUL are a little faster than what we first reported (40 cycles), and the main part of that latency might just consist of getting the data to the FPU of the T1. It is clear that the Sun T1 doesn't like FP code at all.
26 Comments
drw - Friday, March 24, 2006
Based on the kernel versions listed, I assume that a 32-bit distro was used? If so, I am curious how a 64-bit distro would compare, as both Apache and MySQL benefit greatly from 64-bit.
JohanAnandtech - Friday, March 24, 2006
Fully 64-bit. uname -a clearly indicates 64-bit.
defter - Friday, March 24, 2006
The dual Opteron 275HE had 5% higher power consumption (198W vs 188W), but it was 5-30% faster (depending on whether or not gzip was used). These results would suggest that the dual Opteron has won the performance/watt battle in these benchmarks.
Pricing is also quite important. What's the price for a dual Opteron 275HE server with 8GB of memory? About $5000-7000?
PeterMobile - Friday, March 24, 2006
Definitely interesting to see a third-party review of the T2000. I think it could also be interesting to compare both the Sun machine and the x86 servers to an IBM p5 510Q. That's a 4-way 1.5 GHz Power5+, which including 4 GB RAM and 2 Ultra320 disks lists for $8,536.
Calin - Friday, March 24, 2006
I saw there is almost no loss of performance for compressing data... how about encrypting it?
cxl - Friday, March 24, 2006
Actually, the MOD operation can be very important for servers, as it is the basis for any hashing operation, commonly used in many server applications. E.g. to identify a variable in a script, interpreters routinely use hashtables.
114 cycles per MOD operation is a performance disaster.
Calin - Friday, March 24, 2006
The performance in the tested configuration was quite good - I wonder what other benchmarks, and maybe other "twists" of the benchmark tested, would look like.
cosmotic - Friday, March 24, 2006
Did you mean certainly NOT least?
JohanAnandtech - Friday, March 24, 2006
definitely ... Fixed. Just checking if you read it carefully :-)
cosmotic - Friday, March 24, 2006
Why no graphs? It makes reading benchmarks SO much easier.