Intel Clovertown: Quad Core for the Masses
by Jason Clark & Ross Whitehead on March 30, 2007 12:15 AM EST - Posted in IT Computing
Benchmarking Clovertown
What do you compare Clovertown to? Since there are no other quad core solutions on the market, there is no obvious peer. Do you compare it only to existing dual socket options, Woodcrest and Opteron? Or only to existing eight-way options, such as quad socket Opteron? There is no perfect answer, but we decided that comparing it to the previous Intel solution, Woodcrest, would let us explore the scalability of the quad core architecture versus the dual core architecture.
We also decided to include quad socket Opteron numbers for reference. We recognize that comparing a quad socket server to a dual socket server is a bit like comparing apples and oranges, but we decided to provide the results regardless, at least until we see K10 and can do a proper comparison of quad core technologies. Let's not lose sight of the fact that we are comparing two different technologies with totally different cost structures and power consumption profiles.
Another problem was the additional processing power Clovertown provides with two sockets and eight cores. We found that we could no longer run our previous benchmarks, the Dell DVD Store and our Forums benchmark, because we no longer had enough I/O throughput to keep up with the additional processing power. Our lab has a Promise VTrak J300s, a 12-disk SAS chassis, but 12 disks proved insufficient for our old benchmarks. We estimated we would need approximately 36-48 disks to continue running our OLTP benchmarks. We were not able to "obtain" the required chassis and spindles, so we decided to change our benchmark suite.
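As a rough illustration of where an estimate like that comes from, here is a back-of-envelope sketch in Python. The figures are illustrative assumptions rather than lab measurements; a 15K RPM SAS spindle typically sustains on the order of 150-200 random IOPS:

    # Back-of-envelope spindle estimate for a random-I/O OLTP workload.
    # Both figures are assumptions for illustration, not measurements.
    iops_per_disk = 175   # assumed random IOPS for one 15K RPM SAS spindle
    target_iops = 7000    # assumed I/O demand once eight cores stop being the bottleneck

    disks_needed = -(-target_iops // iops_per_disk)   # ceiling division
    print(disks_needed)   # 40 spindles, squarely in the 36-48 range above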
We did consider several SSD (flash-based) drives, but it seems there are more "announced" SSDs than there are "shipping" drives. Until we can significantly increase the I/O capacity of our lab, we will no longer be running OLTP-based benchmarks. Our preferred way to increase that capacity would be an SSD solution, as it would not require spinning dozens of drives in several chassis, but... neither option is available to us at this time.
56 Comments
TA152H - Monday, April 2, 2007
Viditor, are you making this stuff up, or going by what Intel has said?
Intel has said that the reason they haven't gone with an on-board memory controller, with respect to Core 2, is that they preferred to use the silicon for the cache and other things. I think a lot of it is because they sell a lot of IGPs, and didn't want the awkward arrangement of either adding another memory controller outside the processor, or having to use the processor's memory controller since the IGP doesn't have its own memory. That last part is speculation on my part; Intel said they preferred to use the transistors differently, and used cache as an example.
Your argument has now become comparative, rather than absolute, which goes back to my point about whether it helps enough. Also remember that Penryn will have larger caches, which helps mitigate this problem since you will have less contention. Together, both should make a reasonably large impact in bandwidth-restricted situations.
With regards to 2+2: actually, you're wrong on that. That's exactly what Intel said. They commented that they are able to run these parts at higher clock speeds than they could with a native quad core, since they can test the dies before they are packaged together, rather than having to down-bin or throw away a whole part if one of the dual cores is a failure or can't clock high. It's not speculation on my part.
Apps becoming more parallel is kind of a bad joke that clueless people talk about. Multithreading has been around since 1988 with OS/2, and back then I was writing multithreaded programs. Even for single processors you did this, because good programmers wanted their application to always be responsive to the user, even while it was doing things for them. Admittedly, Windows was quite a bit behind, but multithreading is nothing new, and there are limitations to it that simply can't be overcome. For some applications it works great; for others you can't use it. Multiple cores are fairly new mainly because AMD and Intel can't think of anything better to do with the transistors, but multiprocessor computers are not, and people have been writing applications for them for many, many years (myself included). ILP applies to everything; TLP does not, and it is essentially an admission from CPU makers that they are in a very, very diminishing-returns situation with regards to transistors and performance.
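The limits being described here are usually formalized as Amdahl's law: if only a fraction p of a program's work can run in parallel, n cores can never deliver more than a 1/(1-p) speedup. A minimal Python sketch, where the 60% parallel fraction is purely an illustrative assumption:

    # Amdahl's law: best-case speedup on n cores when a fraction p of the
    # work parallelizes. The 60% figure is an illustrative assumption.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.6, 2))      # ~1.43x on two cores
    print(amdahl_speedup(0.6, 8))      # ~2.11x on eight cores
    print(amdahl_speedup(0.6, 10**9))  # ~2.5x ceiling, however many cores you add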
With regards to the shared cache, you are also incorrect in saying it is why the Core 2 is so fast. It's a tradeoff, and you seem to ignore that the L2 now has four more wait states because it is shared by two cores. I'm not sure how many more they'd have to add if it were shared among four cores, but it wouldn't be a free lunch.
Also keep in mind that theory sounds great, but where the rubber meets the road, Clovertown does really well, and its main limitations have little to do with the trivialities of being a 2+2. In apps that can use it, the quad core shows a dramatic improvement over the dual. The FSB problems do show up rather vividly in these benchmarks, though, not as a percentage point or two that would hardly be noticed.
Viditor - Monday, April 2, 2007
TA152H, I don't "make stuff up", mate...
"Intel does not integrate the memory controller. One reason is that memory standards change. Current Athlon computers, for instance, don't come with DDR II memory because the integrated memory controller connects to DDR I. Intel once tried to come out with a chip, Timna, that had an integrated memory controller that hooked up to Rambus. The flop of Rambus in the market led to the untimely demise of the chip"
News.com story: http://news.com.com/2061-10791_3-6047412.html
While they also listed the large cache and space as a "reason", this was the reason they mentioned most often in interviews.
If by your insinuation you were questioning how long it takes to build a chip, I'm afraid that is just a result of many years of industry knowledge on my part (though if you ask anybody who works in the semi industry, they will confirm this for you).
Nehalem, for example, began its design almost six years ago, and has been delayed because of necessary architectural changes (similar to the way Itanium was).
Actually, the large cache doesn't help at all with the MCH bottleneck problem...in fact it makes it slightly worse. Remember that the data path for interchip communication is from cache to cache, not from system memory to cache. The larger cache (with the help of a good prefetcher) certainly helps reduce memory latency (though not as much as an on-die controller)...
Actually, multi-cores have been around for a while... the Power4 was dual core back in 2000. What's new is that mainstream consumer-level apps are being written for TLP, because single cores are to be phased out...
Not true...Intel tried to convert everything to ILP with Itanium and EPIC, but it was the market (and in many cases the software companies) that decided that it was too hard and too expensive for not enough gain. Most (if not all) software companies are now developing for greater TLP efficiency, as this allows a much smoother transition (evolutionary vs revolutionary).
Sure, multithreading has been around for a long time; I used many programs on my old Amiga that were multithreaded... but it's a matter of degree.
To use an analogy: when I was a kid, the best TV set you could buy was a 6" black and white set; today I have a 50" plasma that displays native 1080p. The degree to which software is optimized for TLP is increasing every day.
I said "one of the reasons"...
Actually, Clovertown is at the bottom when you're talking 4 cores...
For example, a 2P Woodcrest is significantly faster per core than a similarly clocked Clovertown, and they are essentially the same silicon. The reason is that the two Woodcrest dies inside a Clovertown package must share a single connection to the MCH, while the two sockets of a 2P Woodcrest each have their own connection.
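The arithmetic behind that point, assuming the 1333 MT/s, 8-byte-wide front-side bus of the top bins (a sketch of peak figures; sustained throughput is lower in practice):

    # Peak FSB bandwidth per core, assuming a 1333 MT/s, 64-bit FSB.
    fsb_peak = 1333e6 * 8                 # ~10.7 GB/s per bus

    woodcrest_per_core = fsb_peak / 2     # 2 cores share one bus: ~5.3 GB/s each
    clovertown_per_core = fsb_peak / 4    # 4 cores share one bus: ~2.7 GB/s each

    print(woodcrest_per_core / 1e9, clovertown_per_core / 1e9)

Each socket in a 2P system drives its own bus, so a 2P Woodcrest ends up with roughly twice the peak FSB bandwidth per core of a 2P Clovertown.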
TA152H - Tuesday, April 3, 2007
Actually, if you read the article, it supports much more what I am saying. It talks mostly about cache, and in the interviews I have seen, that's what Intel touts. Even this article you present as proof shows the opposite: it mentions the memory changes, and then goes on and on about the extra cache and the performance of Core 2, not how quickly Intel can adapt to memory standards. Your whole premise is illogical; you are saying that with Nehalem, all of a sudden memory changes will happen more slowly. That's plain wrong. I am saying that with Nehalem, 45nm lithography, and the diminishing returns of adding more cache, it makes more sense for Intel to add the controller. Which is more logical to you?
The larger cache makes it unnecessary for the cores to use the FSB as often, which removes a bottleneck and causes fewer collisions. This has always been the case with multiprocessor configurations. If we have a 2+2, and one pair of cores needs to access main memory while the other can work from its cache, you'll have fewer collisions than if both pairs needed to go over the FSB to main memory. With a larger cache, you'll have fewer reads to main memory from each pair of cores, and thus less contention.
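One way to put a rough number on that cache argument: an empirical rule of thumb holds that miss rate falls off roughly with the square root of cache size. Treating FSB transactions as proportional to misses gives a toy model; both the rule and the sizes below are assumptions for illustration, not measurements:

    # Toy model: FSB transactions taken as proportional to L2 misses, with
    # the empirical "square-root rule" (miss rate ~ 1/sqrt(cache size)).
    from math import sqrt

    def relative_fsb_traffic(cache_mb, baseline_mb=4.0):
        return sqrt(baseline_mb / cache_mb)

    print(relative_fsb_traffic(4.0))   # 1.00  -> 4 MB shared L2 (Woodcrest-class)
    print(relative_fsb_traffic(6.0))   # ~0.82 -> 6 MB shared L2 (Penryn-class)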
I disagree with your remarks about TLP suddenly becoming important. Have you already forgotten about Hyper-Threading? Also, as I mentioned, there were ALWAYS advantages to writing multithreaded apps, even with one processor. I gave you one example: you always want your application to respond to the user, even if only to tell them that you are doing something in the background for them. Another reason is that it is a lot more efficient, yes, even with a single processor. Even with the mighty 286 (an amazing processor for its day), the processor spent way too much time waiting on the I/O subsystems, and a multithreaded application kept the processor busy while one thread waited on the leisurely hard disk.
Yes, most programmers are hackers (a term misused now to mean someone who does bad things with code, whereas it meant someone who just couldn't write elegant code and hacked his way through it with badly written rubbish), but they still knew to write multithreaded code before dual cores, particularly as multiprocessor configurations became much more common with the P6. I'm not saying you won't see more of an effort, but the way things are being spoken about in the press, it just takes some effort and these multicores will become absolutely fantastic when the software catches up. It ain't so; it's way overblown, and there are a lot of things that will never be multithreaded because they can't be, and others that only benefit somewhat from it. Others will do great with it; it all depends on the type of application. Not every algorithm can be multithreaded effectively, and anyone who tells you otherwise reads too much press and hasn't coded a day in his or her life.
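A minimal sketch of that single-processor point, using nothing beyond the Python standard library: one thread blocks on a simulated disk read while the main thread keeps computing, so the I/O wait costs almost nothing.

    # Even on one CPU, a second thread turns disk-wait time into compute time.
    import threading, time

    def slow_disk_read():
        time.sleep(1.0)                 # stand-in for a leisurely hard disk
        print("I/O finished")

    io_thread = threading.Thread(target=slow_disk_read)
    io_thread.start()
    total = sum(range(10**6))           # foreground work overlaps the I/O wait
    print("compute finished:", total)
    io_thread.join()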
Your remarks about the Itanium are so bad I'm surprised you made them. Are you really this uninformed, or are you arguing just to argue? I think the latter. The problems with Itanium have nothing to do with ILP, although ILP was one of Intel's goals with it. The problem is, it remained a goal and has not been realized yet. Are you implying that the Itanium 2 has higher single threaded performance than the Core 2? I hope not.
If it had, say, 30% higher integer performance per core across a wide list of applications, you'd have a big point. It doesn't; it trails, in fact. First of all, I wouldn't call the Itanium a failure, because it's still premature to do so, and I don't like counting out anything that gains market share year after year (albeit at a lower than expected rate). However, to the extent it has failed to gain the anticipated acceptance, that has a lot to do with cost, Intel's failures to meet schedules, the weird VLIW instruction set that people tend to dislike almost as much as x86, and the fact that it didn't run mainstream software well. Compatibility is so important, and that's why arguably the worst instruction set (aside from Intel's 432) is still king. Motorola's 68K line was much more elegant. Alpha even ran NT and couldn't dethrone it. It's hard to move people from x86, nearly (or possibly) impossible, and if you think this is some indictment against ILP, you're not even with reality.
Six years to design a processor is absurd, and you should know better. If you want to screw around with numbers, why not start around 1991 or so, when Intel started work on the P6, and say the Nehalem took 17 years, since some of it will come from there. People love throwing around BS numbers like that because they sound impressive, but you only need to look at how quickly AMD and Intel add technology to their products to see it doesn't take six years. Look at AMD copying SSE, and Intel copying x86-64. Products now are derivative of earlier generations anyway, so you can't go six years back. The Nehalem will build on the Merced; it's not a totally from-scratch processor. The Pentium 4 was pretty close to one, and the Prescott was a massive overhaul of the processor (much more than the Athlon 64 was vis-a-vis the Athlon), and it didn't take them even close to six years.
Viditor - Tuesday, April 3, 2007
???...sigh... I never said anything of the sort. I can see that you are just trying to read into anything published or said whatever you want it to say, so I'll stop there. Everyone else can just read the article (and the CC, the other articles Intel published on the subject, etc...). But your misunderstanding becomes clear with the following:
Just to pull from a Google search at random (this one from Wikipedia: http://en.wikipedia.org/wiki/CPU_design)
"The design cost of a high-end CPU will be on the order of US $100 million. Since the design of such high-end chips nominally take about five years to complete, to stay competitive a company has to fund at least two of these large design teams to release products at the rate of 2.5 years per product generation"
It's my mistake really...I thought that since you used all of these buzz words, you actually knew the industry. I was wrong...
This is another misconception of the novice...
1. Things like x86-64 and SSE are published many years before they are built. For example, x86-64 was first published for the public in 2001 (and in fact AMD had started work on it in 1998/9) under the name LDT. In fact, it was released to the open Consortium as freely distributable in April of 2001. The first K8 chip wasn't released until 2003.
Likewise, Intel's Yamhill team began work on x86-64 in 2000/1, though they didn't admit its existence until much later because they wanted to foster support for IA64. The first EM64T chip was released in Q1 2005...
2. Intel and AMD have a comprehensive cross-licensing deal for their patents, and the patents are filed well before development begins...so even before it becomes public, they each know what technology the other is working on many years before release.
There are so many inaccuracies and misunderstandings in your posts that I suggest the following:
1. Use the quote feature so that I can understand just what it is you're responding to. Several of your points have nothing to do with what I said...
2. Try actually posting a link now and then so that we can see that what you're saying isn't just something else you've misunderstood...
TA152H - Wednesday, April 4, 2007
I think you have a problem connecting things you say with their logical foundations, and I'll help you with that defect. You said that Intel's main reason for not putting a memory controller on the chip was that changes in memory happen too quickly. Intel is putting a memory controller on-chip for Nehalem. Therefore, the logical conclusion is that this problem will not be as big a one with Nehalem, since it no longer prevents Intel from doing it. You really didn't understand that? Why am I even arguing with you when you have such gaps in reasoning? I said it was mainly for the real estate savings, and that becomes less of a problem at 45nm since you have more transistors, so mine is a logical premise, unlike yours.
It's kind of interesting that you read things but don't really understand much. First of all, you said six years; now you're down to five. You also assume a completely new design, which isn't the case anymore; they are derivatives of previous designs. How long do you think it took to do the original Alpha? Mind you, this is from brainstorming the requirements and what they wanted to do, designing the instruction set, etc. This was when superscalar was extremely unusual, superpipelining was unheard of, and a lot of the features on that processor were very new. Even then, it took less than five years. There's a good story on it in Byte magazine from August 1992.
If you could remember anything, you'd know that AMD was against using SSE and was touting 3DNow! instead. Companies get patents, but patents don't tell the whole story, or, for the purpose of designing a processor, any meaningful story. To make the transistor designs, you need to know specifics about how things will act in every situation and what the necessary behavior is. You are clueless if you think that's in the patents. You also need an actual processor to test against. You wouldn't want to be AMD and implement just based on specs, because inevitably there would be incompatibilities.
You are also using your pretzel logic with regards to Yamhill. The processors had this logic in them well before they were released, and the design was done well before that. You really don't understand that? The only positive from this is that you at least admit it's not six years, but five. You'll slowly worm your way down to a realistic number, but five isn't so bad.
With regards to what I'm responding to, I could paste your stuff, but you have logical deficiencies. You are talking about multi-core and can't make the connection to my point that multithreading has been going on forever. Even in 1992 (I got a nice batch of Byte magazines off eBay, and I am rereading a few of them), they were talking about how multiple cores were the future, in MIMD or SIMD configurations, how multithreading was going to take over the world, how programmers were working on it, etc. It's funny; people are so clueless, and they just read articles and repeat them (hey, that's what I'm doing!).
My suggestion to you is to go back and get a nice batch of Byte magazines on eBay, read them, and really try to understand what they're saying, instead of parroting stuff you don't understand to try to sound impressive.
I'm done arguing with you, you're not informed enough to even interest me, and I won't even waste my time to read your responses.
Viditor - Wednesday, April 4, 2007
You see? That's why I asked you to actually quote (I really was being quite sincere; it will help you)... that's NOT what I said.
What I said was that this was the reason Intel gave publicly, but that the real reason is that redesigning an architecture takes years, not months. This is why they couldn't fit it onto the C2D but will be able to on Nehalem...
I said Nehalem was six years and that the average is five (please go back and reread my posts... or maybe use quote?). I also said that Nehalem was changed along the way, which is WHY it took six years.
They are all derivatives of a previous design... for example, the C2D is a derivative of the P3. Did you think that Intel was just twiddling its thumbs? AMD had several years of advantage over the Netburst architecture... don't you think they would have released the C2D many years earlier if they could have?
They use both (even now), but of course they would have preferred just 3DNow! (just as Intel would have preferred everyone using just IA64). What's your point?
Sigh...
1. You need to learn the difference between "transistor design" and microarchitectural design. Both take a long time, but they are entirely different things (transistor design is part of manufacturing).
2. There are certainly ways to test as the product is being developed. For example, AMD released an AMD64 simulator and debugger to the public in 2000 (http://www.theregister.co.uk/2000/10/14/amd_ships_...)...
3. Even before initial tape-out (the first complete mask set), many sets of hand-tooled silicon are made to test the individual circuits. This is the reason it takes so long... each team works on its own specific area; then, when the chip is first taped out, they work on the processor as a whole unit.
4. Patents are often what initiate parts of the design...but I fail to see your point.
The first Intel processors to actually have the circuits in them (not activated) were the initial Prescotts. But saying the design was done is ludicrous... can you give a single reason why Intel would include the circuits (and remember that it's expensive to add those transistors) without being able to use them, other than the design not quite being finished??
I see...so instead of actually responding to what I've said, you deem it illogical and make up what I said instead?
Great idea...best one you've had. And my apologies to everyone for the length of the thread...
TA152H - Tuesday, April 3, 2007
Yikes, holy typos, Batman. I meant to say that Nehalem will build on the Merom. If it built on the Merced, maybe it does take six years, and I'm thinking AMD would have a real good chance of gaining market share.
yyrkoon - Monday, April 2, 2007
You really haven't been following processors for the last 12-14 years, have you? It has been proven, time and time again, that a faster FSB is paramount to anything else (aside from processor core speed) for performance. Faster FSB == faster CPU->L1->L2. Memory bandwidth not so much (only because nothing takes full advantage of memory bandwidth currently, and to be honest, I am not sure anything can at this point), but DEFINITELY the FSB. Since I do not see faster core speeds in the near future, the only other option for faster processors, aside from 'smarter' branch prediction, HAS to be the FSB.
Now, since I have spoken against you, I suppose I am a 'dolt', or a 'moron', right?
TA152H - Monday, April 2, 2007
Is English your first language? I keep reading your sub-literate drivel and I'm not even sure what you're saying. I think you're agreeing with me that the FSB does make a difference, but your writing is so poor it's hard to tell. Either way, you're a moron or a dolt, or whatever you choose :P.
yyrkoon - Tuesday, April 3, 2007
Yeah, OK, I am agreeing with you. Your triple negative threw me off there...