Virtualization - Ask the Experts #3
by Anand Lal Shimpi on September 2, 2010 9:00 PM EST
Our Ask the Experts series continues with another round of questions.
A couple of months ago we ran a webcast with Intel Fellow Rich Uhlig, VMware Chief Platform Architect Rich Brunner, and myself. The goal was to talk about the past, present and future of virtualization. In preparation for the webcast we solicited questions from all of you, but unfortunately we only had an hour during the webcast to address them. Rich Uhlig from Intel, Rich Brunner from VMware and our own Johan de Gelas all agreed to answer some of your questions in a six-part series we're calling Ask the Experts. Each week we'll showcase three questions you guys asked about virtualization and provide answers from our panel of three experts. These responses haven't been edited and come straight from the experts.
If you'd like to see your question answered here, leave it in the comments. While we can't guarantee we'll get to everything, we'll try to pick a few from the comments to answer as the weeks go on.
Question #1 by AnandTech user mpsii
I am having a hard time trying to determine the best hardware for server and desktop virtualization. There do not seem to be any benchmarks showing the performance comparison of, for example, a Phenom II X6 vs Core i7 (quad) processor.
Answer #1 by Johan de Gelas, AnandTech Senior IT Editor
A Phenom II X6 "Thuban" is almost identical to an "Istanbul" Opteron 2400. The Core i7 9xx is the little brother of the quad-core Xeon 5500. In both cases, the only real difference in a single socket system is the fact that the memory controller of the desktop CPUs runs with unbuffered memory instead of buffered ECC memory. So the desktop chips are slightly faster as the memory latency is a bit lower. We have performed quite a bit of virtualization benchmarking on both the Xeons and Opterons, running Microsoft's Hyper-V and VMware's ESX/vSphere, so you can get a rough idea of how the desktop CPUs compare running virtualized applications (http://www.anandtech.com/tag/IT).
Question #2 by Gary G.
If Type 1 virtualization runs a hypervisor on bare metal to host a guest OS, and type 2 virtualization features an OS hosting a hypervisor to host a guest OS, when will we have a type 3 virtualization? Type 3 virtualization would be physical hardware abstraction sufficient to run multiple physical hardware instruction sets. In this case, the hypervisor could run Sparc on Intel or vice versa. Is that part of our future history?
Answer #2 by Rich Brunner, VMware Chief Platform Architect
It is certainly technically feasible, but it may not run at the performance you want even with clever binary translation tricks. Folks have debated this for a while as a bridge to bring legacy, mission-critical workloads based on RISC architectures to more commonly available commodity hardware. I can't rule it out in that context, but I do not see it as a trend for new workloads.
Question #3 by Aaron P.
As the number of CPU cores increases, so does the consolidation ratio. Can you discuss what initiatives are being pursued that seek to limit the impact of a server failure for a machine hosting potentially hundreds of virtual servers?
Answer #3 by Rich Uhlig, Intel Fellow
Broadly speaking, there are a couple of ways to address the challenge: you can develop approaches to recover from and correct faults when they happen, or you can develop mechanisms to contain faults to limit their effects.
ECC memory is a well-known approach for detecting and correcting memory faults, and is a good example of the first approach. The same principle of fault recovery can be applied to other resources in the platform beyond memory, such as the system interconnect for coherency and I/O (e.g., the use of CRC to detect link-level errors and trigger packet retransmission in hardware).
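To make the CRC-and-retransmit idea concrete, here is a minimal software sketch of it. It is an illustration only, not how QPI or PCIe actually implement link-level retry: the frame layout, the function names and the flaky channel are all assumptions made up for this example, and real links do this in hardware with their own CRC polynomials and replay buffers.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the trailing CRC matches, otherwise None."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def send_with_retry(payload: bytes, channel, max_attempts: int = 4):
    """Send through `channel`, retransmitting whenever the CRC check fails.

    `channel` is any callable that takes a frame and returns the bytes
    seen by the receiver (possibly corrupted).
    Returns (payload, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = check_frame(channel(make_frame(payload)))
        if result is not None:
            return result, attempt
    raise IOError("link failed after %d attempts" % max_attempts)


def flaky_channel_factory(corrupt_first_n: int):
    """Hypothetical channel that flips one bit in the first n frames it carries."""
    state = {"sent": 0}

    def channel(frame: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            corrupted = bytearray(frame)
            corrupted[0] ^= 0x01  # single-bit error, always caught by CRC-32
            return bytes(corrupted)
        return frame

    return channel
```

With a channel that corrupts the first two frames, `send_with_retry(b"cache line", flaky_channel_factory(2))` delivers the payload intact on the third attempt, and the sender never sees the bad data.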
When faults can’t be corrected, it is still useful to contain them to support higher-level recovery algorithms. This can be done by tagging uncorrectable data errors with “poison” bits that follow the data through the system (called poison forwarding). If the poisoned data is later used, hardware raises a machine-check exception to system software (OS or hypervisor), along with information about the nature of the fault. Ideally, this kind of hardware support enables a hypervisor to perform a more targeted action in response to a fault (e.g., to shut down only the VMs affected by a given fault, rather than bringing down the entire platform and all the VMs running on it).
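The containment flow described above can be modeled in a few lines. This is a toy simulation under stated assumptions, not a description of real machine-check handling: the `Word`, `MachineCheck` and `run_vms` names are hypothetical, and the "hypervisor" here is just an exception handler that stops the one VM that consumed poisoned data.

```python
from dataclasses import dataclass


class MachineCheck(Exception):
    """Stand-in for a machine-check exception; carries the affected VM name."""

    def __init__(self, vm: str):
        super().__init__("poisoned data consumed by " + vm)
        self.vm = vm


@dataclass
class Word:
    value: int
    poisoned: bool = False  # set when an uncorrectable error was detected


def forward(word: Word) -> Word:
    """Moving data does not raise anything; the poison bit travels with it."""
    return Word(word.value, word.poisoned)


def consume(word: Word, vm: str) -> int:
    """Only actual use of poisoned data triggers the machine check."""
    if word.poisoned:
        raise MachineCheck(vm)
    return word.value


def run_vms(memory: dict, running: set) -> set:
    """Toy hypervisor loop: each VM reads its word; a fault stops only that VM."""
    for vm in sorted(running):
        try:
            consume(forward(memory[vm]), vm)
        except MachineCheck as fault:
            running.discard(fault.vm)  # targeted action, platform stays up
    return running
```

If `vm2` owns a poisoned word and `vm1` and `vm3` do not, `run_vms` returns a set containing only `vm1` and `vm3`: the fault is contained to the VM that touched the bad data.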
Intel has added a rich set of new features to our EX server product line that extend the kinds of faults that can be corrected or contained, including QPI link recovery and poison forwarding, support for PCIe advanced error reporting, and memory mirroring, among others. This collection of features is part of the “RAS” (short for “Reliability, Availability and Serviceability”) capabilities of our EX class platforms and we plan to extend and improve them over time.
The above features go to improving the reliability of a given single server, but sometimes you can lose an entire platform (e.g., due to loss of power). In this case, an interesting emerging solution is to use virtualization to maintain a replica of VM state on another platform, either by replaying its execution or by checkpointing the VM’s state as it runs. In the event of a full platform failure, workload execution can resume on another platform based on the state of the replica VM. Virtualization also pairs nicely with other established methods for high availability, like cluster-based failover solutions. In this case, a standby machine in a failover cluster can be provided by a VM, rather than having to devote a full physical machine for that purpose.
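The checkpointing variant of this idea can be sketched as follows. Everything here is a simplified assumption for illustration: the `CounterVM` "VM" is just a dictionary of state, the checkpoint interval is counted in steps, and the names are hypothetical. A real hypervisor copies memory pages and device state, and also has to deal with output that was already sent to the outside world before the failure.

```python
import copy


class CounterVM:
    """Hypothetical toy VM whose entire state is a small dict."""

    def __init__(self):
        self.state = {"steps": 0, "total": 0}

    def step(self, value: int) -> None:
        self.state["steps"] += 1
        self.state["total"] += value


class Replica:
    """Standby platform that holds the most recent checkpoint."""

    def __init__(self):
        self.checkpoint = None

    def receive(self, state: dict) -> None:
        self.checkpoint = copy.deepcopy(state)

    def resume(self) -> CounterVM:
        vm = CounterVM()
        vm.state = copy.deepcopy(self.checkpoint)
        return vm


def run_with_checkpoints(inputs, replica: Replica, interval: int, fail_after: int):
    """Run the primary, checkpoint every `interval` steps, then lose the platform."""
    primary = CounterVM()
    replica.receive(primary.state)  # initial checkpoint before any work
    for i, value in enumerate(inputs, start=1):
        primary.step(value)
        if i % interval == 0:
            replica.receive(primary.state)
        if i == fail_after:
            break  # simulated loss of the whole primary platform
    return replica.resume()
```

Running the inputs 1 through 5 with a checkpoint every two steps and a failure after step five resumes on the replica at step four with a total of 10. The work done in step five is lost and would have to be redone, which is the basic trade-off between checkpoint frequency and the amount of work at risk.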
As we see consolidation ratios increase over time, I’d expect hardware mechanisms for fault recovery and containment to co-evolve with software (hypervisor) use of those mechanisms to provide higher-level properties of service availability and system fault tolerance, both within and across physical platforms.