Server Guide Part 1: Introduction to the Server World
by Johan De Gelas on August 17, 2006 1:45 PM EST. Posted in IT Computing
TCO
Originally described by the Gartner Group, Total Cost of Ownership (TCO) sounds like something that does not belong on a hardware enthusiast site. It has frequently been abused by managers and financial people who understand very little of IT to delay necessary IT investments, so many view it as a pejorative term.
However, it is impossible to make a well thought-out server buying decision without understanding TCO, and many typical server hardware features are based on the idea of lowering TCO. Hardware enthusiasts mostly base their buying decisions on TCA, or Total Cost of Acquisition. The enthusiast motherboard and chipset business is a typical example of how to ignore TCO. As the products are refreshed every 6 months, many of the new features don't work properly, and you find yourself flashing the BIOS, installing new drivers and tweaking configurations before you hopefully get that RAID, firewall or sound chip to work properly. Luckily you don't have to pay yourself for all the hours you spend...
TCO is a financial estimate of the total cost of buying and using a server. Think of it as the cost that it takes to buy, deploy, support and adapt a certain server during its lifecycle. So when evaluating servers you should look at the following costs:
- The total cost of buying the server
- The time you will spend installing it in your network
- The time you will spend on configuring the software and remote management
- Facility management: the space it takes in your datacenter and the electricity it consumes
- The hours you spend on troubleshooting, reconfiguring, securing and repairing the server
- The costs associated with users waiting for the system to respond
- The costs associated with outages and failures, with users not being able to reach your server
- The upgrade costs and the time you spend on upgrading your server to meet new demands
- Cost of security breaches, etc.
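To make the list above a bit more concrete, here is a minimal sketch of how such an estimate could be added up. The class, the field names and the way the costs are grouped are our own assumptions for illustration, not a standard TCO model, and the numbers in the usage example are made up rather than measured.

```python
from dataclasses import dataclass


@dataclass
class ServerCosts:
    """Hypothetical cost inputs for one server (all names are illustrative)."""
    purchase_price: float           # total cost of buying the server
    admin_hours_per_year: float     # installing, configuring, troubleshooting, upgrading
    hourly_rate: float              # cost of one administrator hour
    power_watts: float              # average electrical draw of the server
    price_per_kwh: float            # electricity price
    downtime_hours_per_year: float  # expected time users cannot reach the server
    downtime_cost_per_hour: float   # estimated cost of one hour of outage


def estimate_tco(c: ServerCosts, years: int) -> dict:
    """Add up purchase, labor, energy and outage costs over the lifecycle."""
    if years <= 0:
        raise ValueError("years must be positive")
    labor = c.admin_hours_per_year * c.hourly_rate * years
    # Watts -> kW, times the hours in a year, times the price per kWh.
    energy = c.power_watts / 1000 * 24 * 365 * c.price_per_kwh * years
    outages = c.downtime_hours_per_year * c.downtime_cost_per_hour * years
    total = c.purchase_price + labor + energy + outages
    return {
        "acquisition": c.purchase_price,
        "labor": labor,
        "energy": energy,
        "outages": outages,
        "total": total,
        # Share of the total that is plain purchase price (the TCA).
        "acquisition_share": c.purchase_price / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up example values, only meant to show how the function is called.
    example = ServerCosts(
        purchase_price=5000, admin_hours_per_year=40, hourly_rate=50,
        power_watts=300, price_per_kwh=0.10,
        downtime_hours_per_year=4, downtime_cost_per_hour=200,
    )
    for name, value in estimate_tco(example, years=3).items():
        print(f"{name}: {value:.2f}")
```

Cooling, floor space and security breaches are left out to keep the sketch short; they would simply be extra terms in the sum.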
There are two big problems with the "hardware choice does not matter much" kind of reasoning. The first is that the TCA is still a big part of the total TCO. For example, this study[1] estimates that the price of buying the server is still about 40-50% of the TCO, while maintenance comprises a bit more than 10% and operation costs take about 40% of the TCO pie. Thus we can't help but be wary when a vendor claims that a high price is okay because the maintenance on their product is so much lower than the competition's.
Secondly, certain hardware choices have an enormous impact on the rest of the TCO picture. One example is hot-spare and hot-swappable RAID arrays, which on average significantly reduce the time that a server is unreachable. This will also become clearer as we dig deeper into the different hardware features of modern servers and the choices you will have to make.
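The first point can be put in numbers. The sketch below is our own back-of-the-envelope helper, not something taken from the study: it assumes the cost shares quoted above describe the baseline product and that a higher purchase price changes nothing else in the TCO. Under those assumptions it returns the largest price premium, as a fraction of the purchase price, that a given cut in maintenance cost can pay for.

```python
def max_justified_premium(acquisition_share: float,
                          maintenance_share: float,
                          maintenance_reduction: float) -> float:
    """Break-even price premium as a fraction of the purchase price.

    acquisition_share:     purchase price as a fraction of baseline TCO
    maintenance_share:     maintenance as a fraction of baseline TCO
    maintenance_reduction: fraction of maintenance cost the vendor claims to remove
    """
    if not 0 < acquisition_share <= 1:
        raise ValueError("acquisition_share must be in (0, 1]")
    if not 0 <= maintenance_share <= 1:
        raise ValueError("maintenance_share must be in [0, 1]")
    if not 0 <= maintenance_reduction <= 1:
        raise ValueError("maintenance_reduction must be in [0, 1]")
    # Savings (as a share of TCO) divided by what the purchase price represents.
    return maintenance_reduction * maintenance_share / acquisition_share


if __name__ == "__main__":
    # Shares in the range quoted above: 45% acquisition, 10% maintenance.
    for cut in (0.5, 1.0):
        premium = max_justified_premium(0.45, 0.10, cut)
        print(f"maintenance cut {cut:.0%} -> premium up to {premium:.1%}")
```

With acquisition at roughly 45% and maintenance at roughly 10% of TCO, even a product that needed no maintenance at all would only break even at a price premium of about 22%.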
RAS features
Studies done by IBM say that about 50% of hardware failures are related to hard disk problems and 25% are due to a power supply failure. Fans, at 8%, are a distant third, so it is clear you need power supplies and hard disks of high reliability, the R of RAS. You also want to increase availability, the A of RAS, by using some redundancy for the most vulnerable parts of your server. RAID, redundant power supplies and fans are a must for a critical server. The S in RAS stands for Serviceability, which relates to hot-swappable/pluggable drives and other areas. Do you need to shut down the server to perform maintenance? Which items can be replaced or repaired while keeping the system running? All three items are intertwined, and higher-end (and more expensive) servers will have features designed to improve all three areas.
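Why redundancy helps availability so much can be shown with a small calculation. This is a simplified sketch under two assumptions of ours: component failures are independent, and the server keeps running as long as at least one of the redundant parts works. The 99% availability figure in the example is purely illustrative and does not come from the IBM studies.

```python
HOURS_PER_YEAR = 24 * 365


def redundant_availability(single: float, copies: int) -> float:
    """Availability of a group of identical parts where one working part is enough.

    Assumes independent failures: the group is down only if every copy is down.
    """
    if not 0 <= single <= 1:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year that the part (or group) is unavailable."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single_psu = 0.99  # illustrative value, not a measured figure
    for copies in (1, 2):
        a = redundant_availability(single_psu, copies)
        print(f"{copies} power supply(ies): availability {a:.4%}, "
              f"about {downtime_hours_per_year(a):.2f} hours down per year")
```

In practice failures are not fully independent (a power surge can take out both supplies), so treat the result as an upper bound on what redundancy alone buys you.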
32 Comments
Whohangs - Thursday, August 17, 2006
Yes, but multiply that by multiple CPUs per server, multiple servers per rack, and multiple racks per server room (not to mention the extra cooling of the server room needed for that extra heat) and your costs quickly add up.
JarredWalton - Thursday, August 17, 2006
Multiple servers all consume roughly the same power and have the same cost, so you double your servers (say, spend $10000 for two $5000 servers) and your power costs double as well. That doesn't mean that the power catches up to the initial server cost faster. AC costs will also add to the electricity cost, but in a large datacenter your AC costs don't fluctuate *that* much in my experience.
Just for reference, I worked in a datacenter for a large corporation for 3.5 years. Power costs for the entire building? About $40-$70,000 per month (this was a 1.5 million square foot warehouse). Costs of the datacenter construction? About $10 million. Costs of the servers? Well over $2 million (thanks to IBM's eServers). I don't think the power draw from the computer room was more than $1000 per month, but it might have been $2000-$3000 or so. The cost of over 100,000 500W halogen lights (not to mention the 1.5 million BTU heaters in the winter) was far more than the costs of running 20 or so servers.
Obviously, a place like Novell or another company that specifically runs servers and doesn't have tons of cubicle/storage/warehouse space will be different, but I would imagine places with $100K per month electrical bills probably hold hundreds of millions of dollars of equipment. If someone has actual numbers for electrical bills from such an environment, please feel free to enlighten.
Viditor - Friday, August 18, 2006
It's the cooling (air treatment) that is more important...not just the expense of running the equipment, but the real estate required to place the AC equipment. As datacenters expand, some quickly run out of room for all of the air treatment systems on the roof. By reducing heating and power costs inside the datacenter, you increase the value for each sq ft you pay...
TaichiCC - Thursday, August 17, 2006
Great article. I believe the article also needs to include the impact of software when choosing hardware. If you look at some bleeding edge software infrastructure employed by companies like Google, Yahoo, and Microsoft, RAID and PCI-X are no longer important. Thanks to software, a down server or even a down data center means nothing. They have disk failures every day and the service is not affected by these mishaps. Remember how one of Google's data centers caught fire and there was no impact to the service? Software has allowed cheap hardware that doesn't have RAID, SATA, and/or PCI-X, etc. to function well with no downtime. That also means TCO is mad low since the hardware is cheap and maintenance is even lower since software has automated everything from replication to failovers.
Calin - Friday, August 18, 2006
I don't think Google or Microsoft runs their financial software on a big farm of small, inexpensive computers.
While the "software-based redundancy" is a great solution for some problems, other problems are totally incompatible with it.
yyrkoon - Friday, August 18, 2006
Virtualization is the way of the future. Server admins have been implementing this for years, and if you know what you're doing, it's very effective. You can in effect segregate all your different types of servers (DNS, HTTP, etc.) in separate VMs, and keep multiple snapshots just in case something does get hacked, or otherwise goes down (not to mention you can even have redundant servers in software to kick in when this does happen). While VMWare may be very good compared to VPC, Xen is probably equally as good by comparison to VMWare, the performance difference last I checked was pretty large.
Anyhow, I'm looking forward to AnandTech's virtualization part of the article, perhaps we all will learn something :)
JohanAnandtech - Thursday, August 17, 2006
Our focus is mostly on the SMBs, not Google :-). Are you talking about cluster failover? I am still exploring that field, as it is quite expensive to build it in the lab :-). I would be interested in what would be the most interesting technique: a router which simply switches to another server, or a heartbeat system, where one server monitors the other.
I don't think the TCO is that low for implementing that kind of software or solutions, or that the hardware is incredibly cheap. You are right when you are talking about "Google datacenter scale". But for a few racks? I am not sure. Working with budgets of 20.000 Euro and less, I'll have to disagree :-).
Basically what I am trying to do with this server guide is give the beginning server administrators with tight budgets an overview of their options. Too many times SMBs are led to believe they need a certain overhyped solution.
yyrkoon - Friday, August 18, 2006
Well, if the server is in house, it's no biggie, but if that server is across the country (or world), then perhaps paying extra for that 'overhyped solution' so you can remotely access your BIOS may come in handy ;) In house, a lot of people actually use inexpensive motherboards such as those offered by Asrock, paired with a Celeron / Sempron CPU. Now, if you're going to run more than a couple of VMs on this machine, then obviously you're going to have to spend more anyhow for multiple CPU sockets, and 8-16 memory slots. Blade servers, IMO, are never an option. 4,000 seems awfully low for a blade server also.
schmidtl - Thursday, August 17, 2006
The S in RAS stands for serviceability. Meaning when the server requires maintenance, repair, or upgrades, what is the impact? Does the server need to be completely shut down (like a PC), or can you replace parts while it's running (hot-pluggable)?
JarredWalton - Thursday, August 17, 2006
Thanks for the correction - can't say I'm a server buff, so I took the definitions at face value. The text on page 3 has been updated.