MIT IAP Course Lecture #1: 101

Carl Waldspurger (SB SM ’89 PhD ’95) VMware R&D

January 16, 2007

Copyright © 2007 VMware, Inc. All rights reserved. What is Virtualization?

vir •tu •al (adj): existing in essence or effect, though not in actual fact

Virtual systems • Abstract physical components using logical objects • Dynamically bind logical objects to physical configurations Examples • Network – Virtual LAN (VLAN), Virtual Private Network (VPN) • Storage – Storage Area Network (SAN), LUN • Computer – (VM), simulator

Copyright © 2007 VMware, Inc. All rights reserved. 2 Overview

Virtual Machines Virtualization Approaches Processor Virtualization Additional Topics

Copyright © 2007 VMware, Inc. All rights reserved. 3 Starting Point: A Physical Machine

Physical Hardware • Processors, memory, chipset, I/O bus and devices, etc. • Physical resources often underutilized Software • Tightly coupled to hardware • Single active OS image • OS controls hardware

Copyright © 2007 VMware, Inc. All rights reserved. 4 What is a Virtual Machine?

Hardware-Level Abstraction • Virtual hardware: processors, memory, chipset, I/O devices, etc. • Encapsulates all OS and application state Virtualization Software • Extra level of indirection decouples hardware and OS • Multiplexes physical hardware across multiple “guest” VMs • Strong isolation between VMs • Manages physical resources, improves utilization

Copyright © 2007 VMware, Inc. All rights reserved. 5 VM Isolation

Secure Multiplexing • Run multiple VMs on single physical host • Processor hardware isolates VMs, e.g. MMU Strong Guarantees • Software bugs, crashes, viruses within one VM cannot affect other VMs Performance Isolation • Partition system resources • Example: VMware controls for reservation, limit, shares

Copyright © 2007 VMware, Inc. All rights reserved. 6 VM Encapsulation

Entire VM is a File • OS, applications, data • Memory and device state Snapshots and Clones • Capture VM state on the fly and restore to point-in-time • Rapid system provisioning, backup, remote mirroring Easy Content Distribution • Pre-configured apps, demos • Virtual appliances

Copyright © 2007 VMware, Inc. All rights reserved. 7 VM Compatibility

Hardware-Independent • Physical hardware hidden by virtualization layer • Standard virtual hardware exposed to VM Create Once, Run Anywhere • No configuration issues • Migrate VMs between hosts Legacy VMs • Run ancient OS on new platform • E.g. DOS VM drives virtual IDE and vLance devices, mapped to modern SAN and GigE hardware

Copyright © 2007 VMware, Inc. All rights reserved. 8 Common Virtualization Uses Today

Test and Development – Rapidly provision test and development servers; store libraries of pre-configured test machines

Server Consolidation and Containment – Eliminate server sprawl by deploying systems into virtual machines that can run safely and move transparently across shared hardware

Business Continuity – Reduce cost and complexity by encapsulating entire systems into single files that can be replicated and restored onto any target server

Enterprise Desktop – Secure unmanaged PCs without compromising end-user autonomy by layering a security policy in software around desktop virtual machines

Copyright © 2007 VMware, Inc. All rights reserved. 9 Overview

Virtual Machines Virtualization Approaches • Virtual machine monitors (VMMs) • Virtualization platform types • Alternative system virtualizations Processor Virtualization Additional Topics

Copyright © 2007 VMware, Inc. All rights reserved. 10 What is a Virtual Machine Monitor?

An Old Concept VMM Characteristics • Classic definition from • Fidelity Popek & Goldberg ’74 • Performance • IBM mainframes since ’60s • Isolation / Safety

Copyright © 2007 VMware, Inc. All rights reserved. 11 VMM Technology

So this is just like Java, right? • No, a Java VM is very different from the physical machine that runs it • A hardware-level VM reflects underlying processor architecture Like a simulator or emulator that can run old Nintendo games? • No, they emulate the behavior of different hardware architectures • Simulators generally have very high overhead • A hardware-level VM utilizes the underlying physical processor directly

Copyright © 2007 VMware, Inc. All rights reserved. 12 VMMs Past

An Old Idea • Hardware-level VMs since ’60s • IBM S/360, IBM VM/370 mainframe systems • Timeshare multiple single-user OS instances on expensive hardware Classical VMM • Run VM directly on hardware • “Trap and emulate” model From IBM VM/370 product announcement, ca . 1972 for privileged instructions • Vendors had vertical control over proprietary hardware, operating systems, VMM

Copyright © 2007 VMware, Inc. All rights reserved. 13 VMMs Present

Renewed Interest • Academic research since ’90s • VMs for commodity systems • Server consolidation VMM for x86 • Industry-standard hardware, from laptops to datacenter • Run unmodified commodity VMware Fusion for Mac OS X running WinXP, 2006 guest operating systems • Significant challenges, e.g. “non-virtualizable” instructions • Pioneered by VMware in ’98

Copyright © 2007 VMware, Inc. All rights reserved. 14 VMM Platform Types

Hosted Architecture • Install as application on existing x86 “host” OS, e.g. Windows, , OS X • Small context-switching driver • Leverage host I/O stack and resource management • Examples: VMware Player/Workstation/Server, Virtual PC/Server, Parallels Desktop Bare-Metal Architecture • “” installs directly on hardware • Acknowledged as preferred architecture for high-end servers • Examples: VMware ESX Server, , Microsoft Viridian (2008)

Copyright © 2007 VMware, Inc. All rights reserved. 15 System Virtualization Alternatives

Virtual machines abstracted using a layer at different places

Language Level OS Level Hardware Level

Copyright © 2007 VMware, Inc. All rights reserved. 16 System Virtualization Taxonomy

System Virtualization

Hardware Level High-Level Language • Java • Microsoft .NET / Mono • Smalltalk Bare-Metal/ Hosted Hypervisor • • HP Integrity VM • Microsoft Virtual PC • IBM zSeries z/VM • Parallels Desktop • VMware ESX Server • VMware Player • Xen • VMware Workstation OS Level Emulators • VMware Server • FreeBSD Jail • • HP Secure Resource • Microsoft VPC for Mac Para-virtualization Partitions • QEMU • Sun Solaris Zones • Virtutech Simics • • SWsoft Virtuozzo • VMware VMI • User-Mode Linux • Xen

Copyright © 2007 VMware, Inc. All rights reserved. 17 Overview

Virtual Machines Virtualization Approaches Processor Virtualization • Classical techniques • Software x86 VMM • Hardware-assisted x86 VMM • Para-virtualization Additional Topics

Copyright © 2007 VMware, Inc. All rights reserved. 18 Classical Instruction Virtualization

Trap and Emulate • Run guest deprivileged • All privileged instructions trap into VMM • VMM emulates instructions against virtual state e.g. disable virtual interrupts, not physical interrupts • Resume direct execution from next guest instruction Implementation Technique • This is just one technique • Popek and Goldberg criteria permit others

Copyright © 2007 VMware, Inc. All rights reserved. 19 Classical Memory Virtualization

Traditional VMM Approach VPN Extra Level of Indirection guest • Virtual →→→ “Physical” shadow Guest maps VPN to PPN page table using primary page tables • “Physical” →→→ Machine PPN hardware TLB VMM maps PPN to MPN VMM Shadow Page Table • Composite of two mappings MPN • For ordinary memory references Hardware maps VPN to MPN • Cached by physical TLB

Copyright © 2007 VMware, Inc. All rights reserved. 20 Memory Traces

Shadow Page Table • Derived from primary page table in guest • VMM must keep primary and shadow coherent Trace = Coherency Mechanism • Write-protect primary page table • Trap guest writes to primary • Update or invalidate corresponding shadow • Transparent to guest

Copyright © 2007 VMware, Inc. All rights reserved. 21 Classical VMM Performance

Native Speed Except for Traps • No overhead in direct execution • Overhead = trap frequency × average trap cost Trap Sources • Most frequent: Guest page table traces • Privileged instructions • Memory-mapped device traces

Copyright © 2007 VMware, Inc. All rights reserved. 22 Challenges

Not Classically Virtualizable • x86 ISA includes instructions that read or modify privileged state • But which don’t trap in unprivileged mode Example: POPF instruction • Pop top-of-stack into EFLAGS register • EFLAGS.IF bit privileged (interrupt enable flag) • POPF silently ignores attempts to alter EFLAGS.IF in unprivileged mode! • So no trap to return control to VMM Deprivileging not possible with x86!

Copyright © 2007 VMware, Inc. All rights reserved. 23 How to Virtualize x86?

Interpretation • Problem – too inefficient • x86 decoding slow Code Patching • Problem – not transparent • Guest can inspect its own code Binary Translation (BT) • Approach pioneered by VMware • Run any unmodified x86 OS in VM Extend x86 Architecture

Copyright © 2007 VMware, Inc. All rights reserved. 24 Software VMM: Binary Translation

Direct execute unprivileged guest application code • Will run at full speed until it traps, we get an interrupt, etc. “Binary translate” all guest kernel code, run it unprivileged • Since x86 has non-virtualizable instructions, proactively transfer control to the VMM (no need for traps) • Safe instructions are emitted without change • For “unsafe” instructions, emit a controlled emulation sequence • VMM translation cache for good performance

Copyright © 2007 VMware, Inc. All rights reserved. 25 VMware Translator Properties

Binary – input is x86 “hex”, not source Dynamic – interleave translation and execution On Demand – translate only what about to execute (lazy) System Level – makes no assumptions about guest code Subsetting – full x86 to safe subset Adaptive – adjust translations based on guest behavior

Copyright © 2007 VMware, Inc. All rights reserved. 26 BT Mechanics

Input: BB 55 ff 33 c7 03 ... Each Translator Invocation • Consume a basic block (BB) • Produce a compiled code fragment (CCF) Store CCF in Translation Cache • Future reuse translator • Capture working set of guest kernel • Amortize translation costs • Not “patching in place”

Output: CCF 55 ff 33 c7 03 ...

Copyright © 2007 VMware, Inc. All rights reserved. 27 Example: IDENT Translation

80304a69 push %ebp 25555b0 push %ebp 80403a6a push (%ebx) 25555b1 push (%ebx) 80403a6c mov (%ebx), ffffffff 25555b3 mov (%ebx), ffffffff 80403a72 mov %edx, %esp 25555b9 mov %edx, %esp 80403a74 mov %esp, 81c(%ebx) 25555bb mov %esp, 81c(%ebx) 80403a7a push %edx 25555c1 push %edx 80403a7b mov %ebp, %eax 25555c2 mov %ebp, %eax 80403a7d call 80460ba4 25555c4 push 80403a82 25555c9 int 3a BB 25555cb data: 80460ba4 CCF 25555c4: push return address 25555c9: invoke translator on callee

Copyright © 2007 VMware, Inc. All rights reserved. 28 Adaptive BT

Translation Cache Translated Code Is Fast • Mostly IDENT translations !*! • Runs “at speed” Except Writes to Traced Memory • Page fault (shown as !*!) • Decode and interpret instruction • Fire trace callbacks • Resume execution • Can take 1000’s of cycles

Invoke Translator

Copyright © 2007 VMware, Inc. All rights reserved. 29 Adaptive BT: Fast Trace Handling

Detect and Track Trace Faults Splice in TRACE Translation JMP • Execute memory access in software • Avoid page fault • No re-decoding TRACE • Faster resumption Faster Traces • 10x performance improvement • Adapts to runtime behavior

Invoke Translator

Copyright © 2007 VMware, Inc. All rights reserved. 30 Software VMM Evaluation

Benefits • Adaptation • Fast traces • Fast I/O emulation • Flexibility Costs • Running translator • Path lengthening • System call slowdown • Complexity

Copyright © 2007 VMware, Inc. All rights reserved. 31 Hardware-Assisted VMM

Recent x86 Extension • 1998 – 2005: Software-only VMMs using binary translation • 2005: Intel and AMD start extending x86 to support virtualization First-Generation Hardware • Enables classical trap-and-emulate VMMs • Intel VT, aka “Vanderpool Technology” • AMD SVM, aka “Pacifica” Performance • VT/SVM help avoid BT, but not MMU ops (actually slower!) • Main problem is efficient virtualization of MMU and I/O, Not executing the virtual instruction stream

Copyright © 2007 VMware, Inc. All rights reserved. 32 VT/SVM Architecture

Diagram CPL 3 CPL 3 • Y-axis: old school x86 privilege (CPL) • X-axis: virtualization privilege CPL 2 CPL 2 Guest Mode • Runs unmodified OS

CPL 1 CPL 1 • Sensitive operations “exit” (trap out) to host mode VMCB CPL 0 CPL 0 • Virtual Machine Control Block Host Guest • VMM-controlled, hardware-walked • Buffers simple exits

Copyright © 2007 VMware, Inc. All rights reserved. 33 Hardware-Assisted VMM

Hardware-Assisted Direct Exec Guest mode CPL 0-3

Fault, Resume Guest Trace, Interrupt, I/O ...

Host mode VMM CPL 0-3

Copyright © 2007 VMware, Inc. All rights reserved. 34 Hardware-Assisted VMM Evaluation

Benefits • Simplicity (no BT) • Fast system calls • No translator overheads Costs • Exits: 1000’s of cycles for traces and I/O • No adaptation or software flexibility • Stateless model Future • Hardware support for fast MMU virtualization • Intel EPT, AMD NPT

Copyright © 2007 VMware, Inc. All rights reserved. 35 What is ?

Full Virtualization • No modifications to guest OS • Excellent compatibility, good performance, but complex Paravirtualization Exports Simpler Architecture • Term coined by Denali project in ’01, popularized by Xen • Modify guest OS to be aware of virtualization layer • Remove non-virtualizable parts of architecture • Avoid rediscovery of knowledge in hypervisor • Excellent performance and simple, but poor compatibility Ongoing Linux Standards Work • “Paravirt Ops” interface between guest and hypervisor • Small team from VMware, Xen, IBM LTC, etc.

Copyright © 2007 VMware, Inc. All rights reserved. 36 Paravirtualization: Conceptual Diagram

System call Guest interface OS Guest OS Hypercalls (GOOD)

Hypervisor Hypervisor NOT GOOD!

Hardware Hardware

Full Virtualization Paravirtualization

Copyright © 2007 VMware, Inc. All rights reserved. 37 VMware Vision: Transparent Paravirtualization

Same OS binary

VMI VMI Windows Dom0 Linux Xeno Linux Solaris DomU Linux VMI Linux Xen 3.0.x VMware ESX

Native Native Native

Copyright © 2007 VMware, Inc. All rights reserved. 38 Further Reading

VMware Publications • www.vmware.com/academic/resources.html • A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS ’06) • Fast Transparent Migration for Virtual Machines (USENIX ’05) • Memory Resource Management in VMware ESX Server (OSDI ’02) • Virtualizing I/O Devices on VMware Workstation’s Hosted VMM (USENIX ’01) Additional Academic Publications • Xen and the Art of Virtualization (SOSP ’03) • Disco: Running Commodity Operating Systems on Scalable Multiprocessors (SOSP ’97) • Many more …

Copyright © 2007 VMware, Inc. All rights reserved. 39 Additional Topics

I/O Virtualization Memory Management

Copyright © 2007 VMware, Inc. All rights reserved. 40 I/O Virtualization Stack

Guest Device Driver Guest OS Virtual Device • Model existing device, e.g. e1000 Device Driver • Model an idealized device, e.g. vmxnet Virtualization Layer Device • Emulates the virtual device Emulation • Remaps guest and real I/O addresses I/O Stack • Multiplexes and drives physical device Device Driver • Provides additional features, e.g. transparent NIC teaming Real Device • Physical hardware, e.g. bcm5700 • Likely to be different than virtual device

Copyright © 2007 VMware, Inc. All rights reserved. 41 I/O Virtualization Implementations

Emulated I/O Passthrough I/O Hosted or Split Hypervisor Direct

Guest OS Guest OS Guest OS

Device Driver Device Driver Device Driver

Host OS/Dom0/ Parent Domain

Device Device Device Emulation Emulation Emulation

I/O Stack I/O Stack Device Device Driver Device Driver Manager

VMware Workstation, VMware Server, VMware ESX Server VMware ESX Server (for slow devices), A Future Option (storage and network) Xen, Microsoft Viridian, Virtual Server Many Challenges

Copyright © 2007 VMware, Inc. All rights reserved. 42 Passthrough I/O Virtualization

High Performance Guest OS Guest OS Guest OS • Guest drives device directly • Minimizes CPU utilization Device Driver Device Driver Device Driver Enabled by HW Assists Virtualization Device Manager • I/O-MMU for DMA isolation Layer e.g. Intel VT-d, AMD IOMMU I/O MMU • Partitionable I/O device e.g. PCI-SIG IOV spec VF VF VF Challenges • Hardware independence I/O Device PF • Migration, suspend/resume PF = Physical Function, VF = Virtual Function • Memory overcommitment

Copyright © 2007 VMware, Inc. All rights reserved. 43 Additional Topics

I/O Virtualization Memory Management

Copyright © 2007 VMware, Inc. All rights reserved. 44 Memory Management

Desirable capabilities • Efficient memory overcommitment • Accurate resource controls • Exploit sharing opportunities Challenges • Allocations should reflect both importance and working set • Best data to guide decisions known only to guest OS • Guest and meta-level policies may clash

Copyright © 2007 VMware, Inc. All rights reserved. 45 VMware Memory Management

Reclamation mechanisms • Ballooning – guest driver allocates pinned PPNs, hypervisor deallocates backing MPNs • Swapping – hypervisor transparently pages out PPNs, paged in on demand • Page sharing – hypervisor identifies identical PPNs based on content, maps to same MPN copy-on-write Allocation policies • Proportional sharing – revoke memory from VM with minimum shares-per-page ratio • Idle memory tax – charge VM more for idle pages than for active pages to prevent unproductive hoarding

Copyright © 2007 VMware, Inc. All rights reserved. 46 Ballooning

inflate balloon may page out (+ pressure) Guest OS to virtual disk balloon

Guest OS guest OS manages memory balloon implicit cooperation

may page in Guest OS from virtual disk deflate balloon (– pressure)

Copyright © 2007 VMware, Inc. All rights reserved. 47 Page Sharing

Motivation • Multiple VMs running same OS, apps • Collapse redundant copies of code, data, zeros Transparent page sharing • Map multiple PPNs to single MPN copy-on-write • Pioneered by Disco [Bugnion ’97] , but required guest OS hooks Content-based sharing • General-purpose, no guest OS changes • Background activity saves memory over time

Copyright © 2007 VMware, Inc. All rights reserved. 48 Page Sharing: Scan Candidate PPN

011010 hash page contents 110101 …2bd806af 010111 101100 VM 1 VM 2 VM 3

hint frame Machine Memory Hash: …06af VM: 3 PPN: 43f8 MPN: 123b hash table

Copyright © 2007 VMware, Inc. All rights reserved. 49 Page Sharing: Successful Match

VM 1 VM 2 VM 3

shared frame Machine Memory Hash: …06af Refs: 2 MPN: 123b hash table

Copyright © 2007 VMware, Inc. All rights reserved. 50