VM Starts from IBM VM/370 Old purpose: share the resource of the expensive mainframe machine New purpose: fault tolerance and security; run multiple different OSes on the same desktop; server consolidation and performance isolation, etc.

Virtualization tasks (isolate, secure, fast, transparent …) 1. cpu How to make sure privileged instructions (e.g., enable/disable ; change critical registers) will not be executed directly 2. memory virtualization OS thought they can control the whole physical memory. No! 3. I/O virtualization OS thought I/O devices are all under its own control. No!

Problems of 1. Some privileged instructions do not trap; they just fail/change-semantic silently. • IRET ( return) increment stack pointer by 3 words when the instruction does not change privilege level, by 5 when the instruction does. • PUSHF stores registers (including sensitive information like IF (interrupt enable/disable)) onto stack (so that os can see it) 2. segment is a problem

Categories of X86 virtualization Full-virtualization (VMWare) No change to guest OS Use to change some instructions (privilege instructions or all OS code) on the fly Para-virtualization () Change guest OS to improve performance Rewrite OS; replace privileged instructions with hypercalls Many other changes Hardware support ( VT, AMD …) 1st generation, add one level (root level) beneath ring 0 (non-root level); the execution of all privileged instructions will trap to the root level. Support for `world-change’ (change between root-level and upper level 0—3) Currently, performance is not better than VMware, because world-change has huge overhead and rigid programming mode makes it hard to reduce #world-change 2nd generation, will add support to help shadow

• VMWare, of course, is against the Xen trend. Xen is criticized for not portable, hard to maintain, paralyzed os cannot directly run on hardware. VMWare proposes VMI (virtual machine interface) Make paravirtualized OS standardized and more portable Para-virtualize os by VMI interface. VMI interface is backed up by two versions of implementation: (1) implementation to run on raw machine; (2) call underlying hypercall of supporting vmm Disco On MIPS DISCO directly runs on multi-processor machine; Oses run upon DISCO (w/ slight modification --- relinking OS to put it on mapping address range; option for more OS code change for better performance) Goal: use commodity OSes to leverage many-core cluster

VMM runs in kernel mode VM OS runs in supervisor mode VM user runs in user mode Privileged instructions will trap to VMM and be emulated

CPU virtualization No special challenge compared with CPU (seemed that all privilege instructions will trap) DISCO provide option for OS to be para-virtualized ☺. Disable/enable interrupt and accessing privileged registers can be done through special memory location, which will avoid trap emulation. (*this os change is not have-to, different from Xen) VMM maintains a process-table-entry-like data structure for each VM (virtual PC) (including registers, states, and TLB content (used for emulation …) ).

Memory Shadow page table Leverage shadow page table to ease different OSes can communicate with each other and page sharing (especially share buffer-cache) through memory mapping DISCO also uses mapping to do page migration, duplication, etc., for good performance under ccNUMA. A problem is that in MIPS user address space will go through TLB, BUT OS’ address space does NOT go through TLB. This requires to relink OS (change its code address).

I/O Virtualized device …

DISCO proposes changes to OS (HAL) for better performance (note: this is different from Xen, where changes are definitely necessary for correct virtualization) 1. replace some privileged instructions to simply accessing certain memory locations 2. have OS to call-monitor (like Xen’s hypercall) to tell resource-management related information to monitor (e.g., certain memory pages can be reclaimed now) VMWare (patent)

Two major things: 0. memory trace utility through MMU (page protection, I think) to register and trace certain page’s access/write.

1. binary translation and direct execution and the logic to decide when to do which (1) binary translation translator: translate privilege to non-privilege or callout function (to special monitor functions) translation cache: a hash table index starting physical instruction to translated version in the translation cache; the translation is discarded when memory trace indicates the source code change. Special problem: in original code sequence, exception only happens at instruction boundary, which is not the case in the translated code. Need special attention (2) decision logic binary translation if it is in CPL=0/1, if is cleared, if I/O privilege level is .., if it is in non-reversible segmenting, Actually, x86 has 4 modes (protected, real, v-…, …) making things even harder :S

2. handle segmenting and non-reversible segmenting • VMM and VM are in the same linear address space, so special attention needs to be paid

Vmware Workstation and VMWare Server (ESX)

VMWare workstation Purpose/background: run on PC (supporting different Oses is very appealing) Which means: many different devices; may not be good to require uninstall original OS to move OS upon VMM; (VMWare workstation comes out first, before VMWare server)

Design: VMApp, VM Driver and VMMonitor; still have the original OS; Install VMApp upon OS; VMDriver is loaded into the kernel; VMApp uses VMDriver to start a VMM that runs directly on hardware; New OS and applications can be run inside the VMM

VMWare ESX Server VMWare ESX Server runs directly on hardware, handle I/O by itself, for better performance (Xen uses Domain0 to handle I/O; VMWare workstation uses host OS) Use shadow page table (pmap for each VM to translate physical AD to machine AD) This paper talks about several memory management techniques in ESX 1. Use balloon driver to avoid VMM blindly makes bad paging decisions. 2. Transparent page sharing (use content hashing, etc.) 3. Working-set esitmation 4. Tax-based space sharing technique