
ADRIAN PERRIG & TORSTEN HOEFLER
Networks and Operating Systems (252-0062-00)
Chapter 11: Virtual Machine Monitors

Our small quiz
. True or false (raise hand)
  . Spooling can be used to improve access times
  . A SIGINT in time saves a kill -9
  . Buffering can cope with device speed mismatches
  . The Linux kernel identifies devices using a single number
  . From userspace, devices in Linux are identified through files
  . Standard BSD sockets require two or more copies at the host
  . Network protocols are processed in the first level interrupt handler
  . The second level interrupt handler copies the packet data to userspace
  . Deferred procedure calls can be executed in any process context
  . mbufs (and skbufs) enable protocol-independent processing
  . Network I/O is not performance-critical
  . NAPI’s design aims to reduce the CPU load
  . NAPI uses polling to accelerate packet processing
  . TCP offload reduces the server CPU load
  . TCP offload can accelerate applications



Receive-side scaling

. Observations:
  . Too much traffic for one core to handle
  . Cores aren’t getting any faster
  → Must parallelize across cores
. Key idea: handle different flows on different cores
. But: how to determine the flow for each packet?
  . Can’t do this on a core: same problem!
. Solution: demultiplex on the NIC (sketched below)
  . DMA packets to per-flow buffers / queues
  . Send an interrupt only to the core handling the flow

[Diagram: the NIC hashes the packet header (IP src + dest, TCP src + dest, etc.) into a flow table; each flow’s state holds the ring buffer DMA address and the message-signalled interrupt / core to interrupt]
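To make the demultiplexing step concrete, here is a minimal C sketch of how a receive path could hash a packet’s 4-tuple and pick a per-flow RX queue. The structures and the FNV-style hash are illustrative assumptions; real NICs typically use a Toeplitz hash with a configurable key, programmed through the driver.

```c
#include <stdint.h>

#define NUM_RX_QUEUES 8   /* assumption: one RX queue (and MSI-X vector) per core */

/* Fields the NIC hashes to identify a flow (IPv4 + TCP/UDP ports). */
struct flow_key {
    uint32_t ip_src, ip_dst;
    uint16_t port_src, port_dst;
};

/* Illustrative hash; any reasonably uniform hash shows the principle. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;               /* FNV-1a over the 4-tuple */
    const uint8_t *p = (const uint8_t *)k;
    for (unsigned i = 0; i < sizeof *k; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* The NIC DMAs the packet into the ring buffer of the selected queue
 * and raises the interrupt bound to that queue's core. */
static unsigned select_rx_queue(const struct flow_key *k)
{
    return flow_hash(k) % NUM_RX_QUEUES;
}
```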


Receive-side scaling

. Can balance flows across cores
  . Note: doesn’t help with one big flow!
. Assumes:
  . n cores processing m flows is faster than one core
. Hence:
  . Network stack and protocol graph must scale on a multiprocessor
. Multiprocessor scaling: topic for later (see DPHPC class)

Virtual Machine Monitors

Literature:
. Barham et al.: Xen and the Art of Virtualization
. Anderson, Dahlin: Operating Systems: Principles and Practice, Chapter 14

Virtual Machine Monitors
. Basic definitions
. Why would you want one?
. Structure
. How does it work?
  . CPU
  . MMU
  . Memory
  . Devices
  . Network

Acknowledgement: thanks to Steve Hand for some of the slides!

What is a Virtual Machine Monitor?
. Virtualizes an entire (hardware) machine
  . Contrast with OS processes
. Interface provided is the “illusion of real hardware”
. Applications are therefore complete operating systems themselves
  . Terminology: Guest Operating Systems
. Old idea: IBM VM/CMS (1960s)
. Recently revived: VMware, Xen, Hyper-V, kvm, etc.


VMMs and hypervisors
. Some folks distinguish the Virtual Machine Monitor from the Hypervisor (we won’t)
[Diagram: applications run on Guest operating systems; each Guest OS sits on a VMM, which creates the illusion of hardware; the VMMs run on the Hypervisor, which runs on the real hardware]

Why would you want one?
. Server consolidation (program assumes own machine)
. Performance isolation
. Backward compatibility
. Cloud computing (unit of selling cycles)
. OS development/testing
. Something under the OS: replay, auditing, trusted computing, rootkits


Running multiple OSes on one machine
. Application compatibility
  . I use Debian for almost everything, but I edit slides in PowerPoint
  . Some people compile Barrelfish in a Debian VM over Windows 7 with Hyper-V
. Backward compatibility
  . Nothing beats a virtual machine for playing old computer games
[Diagram: applications on two different guest OSes running side by side on one hypervisor over the real hardware]

Server consolidation
. Many applications assume they have the machine to themselves
. Each machine is mostly idle
→ Consolidate servers onto a single physical machine
[Diagram: applications from several mostly idle physical machines consolidated as VMs onto one hypervisor over the real hardware]

Resource isolation
. Surprisingly, modern OSes do not have an abstraction for a single application
. Performance isolation can be critical in some enterprises
. Use virtual machines as resource containers
[Diagram: each application runs in its own VM on the hypervisor, giving it a dedicated share of the real hardware]

Cloud computing
. Selling computing capacity on demand
  . E.g. Amazon EC2, GoGrid, etc.
. Hypervisors decouple allocation of resources (VMs) from provisioning of infrastructure (physical machines)
[Diagram: many application VMs packed onto several physical machines, each running a hypervisor]


Operating System development
. Building and testing a new OS without needing to reboot real hardware
. VMM often gives you more information about faults than real hardware anyway
[Diagram: the OS under test runs in a VM alongside the development tools (Visual Studio, editor, tracer, compiler) on the same hypervisor]

Other cool applications…
. Tracing
. Debugging
. Execution replay
. Lock-step execution
. Live migration
. Rollback
. Speculation
. Etc.


How does it all work?
. Note: a hypervisor is basically an OS
  . With an “unusual API”
. Many functions quite similar:
  . Multiplexing resources
  . Scheduling, virtual memory, device drivers
. Different:
  . Creating the illusion of hardware to “applications”
  . Guest OSes are less flexible in resource requirements

Hosted VMMs
. Examples:
  . VMware Workstation
  . Linux KVM
  . Microsoft Hyper-V
  . VirtualBox
[Diagram: the VMM runs as an application on a host OS, alongside ordinary host applications; the Guest OS and its apps run inside the VMM]

Hypervisor-based VMMs
. Examples:
  . VMware ESX
  . IBM VM/CMS
  . Xen
[Diagram: a management console OS and several guest OSes (with their apps), each over its own VMM, all running on the hypervisor over the real hardware]

How to virtualize…
. The CPU(s)?
. The MMU?
. Physical memory?
. Devices (disks, etc.)?
. The Network
. and?


Virtualizing the CPU
. A CPU architecture is strictly virtualizable if it can be perfectly emulated over itself, with all non-privileged instructions executed natively
  . Privileged instructions → trap
  . Kernel mode (i.e., the VMM) emulates the instruction (a dispatch loop is sketched below)
. Examples: IBM S/390, Alpha, PowerPC

. A strictly virtualizable processor can execute a complete native Guest OS
  . Guest applications run in user mode as before
  . Guest kernel works exactly as before
  . Guest’s kernel mode is actually user mode, or another, extra privilege level (such as ring 1)
. Problem: the architecture is not virtualizable
  . About 20 instructions are sensitive but not privileged
  . Mostly segment loads and processor flag manipulation
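As a sketch of what “trap-and-emulate” means in code: the guest runs natively until a privileged instruction traps, and the VMM emulates just that instruction before resuming. All types and helpers below (vcpu_state, run_guest_until_trap, the emulate_* routines) are hypothetical names for illustration, not any real hypervisor’s API.

```c
/* Hedged sketch of a trap-and-emulate loop; all names are illustrative. */
enum trap_reason { TRAP_PRIV_INSN, TRAP_IO, TRAP_PAGE_FAULT };

struct vcpu_state {
    unsigned long regs[16];
    unsigned long rip;
    int interrupts_enabled;   /* the *virtual* IF the guest believes in */
};

extern enum trap_reason run_guest_until_trap(struct vcpu_state *v);
extern void emulate_privileged_insn(struct vcpu_state *v);
extern void emulate_io_access(struct vcpu_state *v);
extern void handle_guest_page_fault(struct vcpu_state *v);

void vcpu_main_loop(struct vcpu_state *v)
{
    for (;;) {
        /* Non-privileged guest code executes directly on the CPU. */
        switch (run_guest_until_trap(v)) {
        case TRAP_PRIV_INSN:
            emulate_privileged_insn(v);   /* e.g. CLI/STI: flip the virtual IF */
            break;
        case TRAP_IO:
            emulate_io_access(v);         /* forward to the device model */
            break;
        case TRAP_PAGE_FAULT:
            handle_guest_page_fault(v);
            break;
        }
        /* Loop: resume the guest at the next instruction. */
    }
}
```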


Non-virtualizable x86: example
. PUSHF/POPF instructions
  . Push/pop the condition code register
  . Includes the interrupt enable flag (IF)
. Unprivileged instructions: fine in user space!
  . IF is ignored by POPF in user mode, but not in kernel mode
→ VMM can’t determine whether the Guest OS wants interrupts disabled!
  . Can’t cause a trap on a (privileged) POPF
  . Prevents correct functioning of the Guest OS

Solutions
1. Emulation: emulate all kernel-mode code in software
   . Very slow, particularly for I/O-intensive workloads
   . Used by, e.g., SoftPC
2. Paravirtualization: modify the Guest OS kernel
   . Replace critical calls with an explicit trap instruction to the VMM
   . Also called a “HyperCall” (used for all kinds of things)
   . Used by, e.g., Xen
3. Binary rewriting (see the sketch below):
   . Protect kernel instruction pages, trap to the VMM on first IFetch
   . Scan the page for POPF instructions and replace them
   . Restart the instruction in the Guest OS and continue
   . Used by, e.g., VMware
4. Hardware support: Intel VT-x, AMD-V
   . Extra processor mode causes POPF to trap
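To illustrate option 3, a simplified sketch (not VMware’s actual algorithm) of rewriting a guest kernel code page: replace each single-byte POPF opcode (0x9D) with a trapping INT3 (0xCC), so the VMM regains control and can emulate POPF. A real rewriter decodes instruction boundaries instead of scanning raw bytes.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define POPF_OPCODE 0x9D   /* x86 POPF/POPFD */
#define INT3_OPCODE 0xCC   /* breakpoint: traps into the VMM */

/* Rewrite every POPF byte in a guest code page so that executing it traps.
 * Simplification: a real implementation decodes instructions and records the
 * rewrite sites so it can emulate POPF and restart the guest afterwards. */
static size_t rewrite_popf(uint8_t *page, size_t len)
{
    size_t patched = 0;
    for (size_t i = 0; i < len; i++) {
        if (page[i] == POPF_OPCODE) {
            page[i] = INT3_OPCODE;
            patched++;
        }
    }
    return patched;
}

int main(void)
{
    /* Toy "code page": push eax; popf; pop eax; ret */
    uint8_t page[] = { 0x50, 0x9D, 0x58, 0xC3 };
    printf("patched %zu POPF instruction(s)\n", rewrite_popf(page, sizeof page));
    return 0;
}
```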

Virtualizing the MMU
. Hypervisor allocates memory to VMs
. Guest assumes control over all physical memory
  . VMM can’t let the Guest OS install mappings
. Definitions needed:
  . Virtual address: a virtual address in the guest
  . Physical address: an address as seen by the guest
  . Machine address: a real physical address, as seen by the Hypervisor

Virtual/Physical/Machine
[Diagram: each guest’s virtual address space maps to its guest-physical address space, which the hypervisor in turn maps to machine memory; different guests’ physical frames end up in distinct machine frames]


MMU virtualization
. Critical for performance, challenging to make fast, especially SMP
  . Hot-unplug unnecessary virtual CPUs
  . Use multicast TLB flush paravirtualizations, etc.
. Xen supports 3 MMU virtualization modes:
  1. Direct (“writable”) page tables
  2. Shadow page tables
  3. Hardware-assisted paging
. OS paravirtualization compulsory for #1, optional (and very beneficial) for #2 & #3

Paravirtualization approach
. Guest OS creates the page tables the hardware uses
  . VMM must validate all updates to the page tables
  . Requires modifications to the Guest OS
. Not quite enough…
  . VMM must check all writes to PTEs
  . Write-protect all PTEs to the Guest kernel
  . Add a HyperCall to update PTEs
  . Batch updates to avoid trap overhead (sketched below)
. OS is now aware of machine addresses
. Significant overhead!
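To illustrate the batching point, a guest-side sketch: queue PTE writes and flush them to the VMM in a single trap. The hypercall and structure names are hypothetical; Xen’s real mmu_update/multicall interface differs in detail.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical hypercall that asks the VMM to validate and apply a batch of
 * page-table updates in one trap (illustrative only). */
struct mmu_update { uint64_t pte_machine_addr; uint64_t new_val; };
extern int hypercall_mmu_update_batch(const struct mmu_update *reqs, size_t count);

#define BATCH_MAX 32
static struct mmu_update batch[BATCH_MAX];
static size_t batch_len;

/* One trap into the VMM applies the whole batch. */
void flush_pte_updates(void)
{
    if (batch_len) {
        hypercall_mmu_update_batch(batch, batch_len);
        batch_len = 0;
    }
}

/* Instead of storing to the PTE directly (page-table pages are mapped
 * read-only, so every store would trap), queue the update. */
void queue_pte_update(uint64_t pte_machine_addr, uint64_t new_val)
{
    batch[batch_len].pte_machine_addr = pte_machine_addr;
    batch[batch_len].new_val = new_val;
    if (++batch_len == BATCH_MAX)
        flush_pte_updates();
}
```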


Paravirtualizing the MMU
. Guest OSes allocate and manage their own PTs
  . Hypercall to change the PT base
. VMM must validate PT updates before use
  . Allows incremental updates, avoids revalidation
. Validation rules applied to each PTE (a validation sketch follows the diagram):
  1. Guest may only map pages it owns
  2. Page-table pages may only be mapped read-only
. VMM traps PTE updates and emulates them, or ‘unhooks’ the PTE page for bulk updates

Writeable Page Tables: 1 – Write fault
[Diagram: the guest reads its Virtual → Machine page table directly; the first guest write to a page-table page raises a page fault into the VMM, because the hardware MMU only has a read-only mapping of the table]
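And a VMM-side sketch of validating a single PTE update against the two rules above, assuming hypothetical per-frame bookkeeping (frame ownership and a “this frame holds a page table” flag):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_FRAMES      1024
#define PTE_PRESENT     0x1u
#define PTE_WRITABLE    0x2u
#define PTE_FRAME_SHIFT 12

/* Hypothetical per-frame bookkeeping kept by the VMM. */
static uint16_t frame_owner[MAX_FRAMES];       /* which domain owns the frame */
static bool     frame_is_pagetable[MAX_FRAMES];

/* Validate one PTE value a guest wants to install.
 * Rule 1: the guest may only map machine frames it owns.
 * Rule 2: frames that hold page tables may only be mapped read-only. */
static bool pte_update_is_valid(uint16_t domain, uint64_t new_pte)
{
    if (!(new_pte & PTE_PRESENT))
        return true;                               /* unmapping is always fine */

    uint64_t mfn = new_pte >> PTE_FRAME_SHIFT;     /* machine frame number */
    if (mfn >= MAX_FRAMES || frame_owner[mfn] != domain)
        return false;                              /* rule 1 violated */

    if (frame_is_pagetable[mfn] && (new_pte & PTE_WRITABLE))
        return false;                              /* rule 2 violated */

    return true;
}

/* Hypercall-handler sketch: apply the update only if it validates.
 * In the direct ("writable") page-table mode, the guest's table is
 * also the table the hardware MMU walks. */
bool hypercall_mmu_update(uint16_t domain, uint64_t *pte_slot, uint64_t new_pte)
{
    if (!pte_update_is_valid(domain, new_pte))
        return false;
    *pte_slot = new_pte;
    return true;
}
```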

Writeable Page Tables: 2 – Emulate?
[Diagram: the guest’s first write traps to the VMM, which may simply emulate the update on the Virtual → Machine table that the hardware MMU uses]

Writeable Page Tables: 3 – Unhook
[Diagram: alternatively the VMM ‘unhooks’ the page-table page from the hardware MMU; subsequent guest writes go directly to the page with no further traps, allowing bulk updates]


Writeable Page Tables: 4 – First use
[Diagram: when the unhooked page table is next used, the hardware MMU raises a page fault into the VMM]

Writeable Page Tables: 5 – Re-hook
[Diagram: the VMM validates the guest’s batched updates and re-hooks the page-table page for use by the hardware MMU]


Writeable page tables require paravirtualization

Shadow page tables
. Guest OS sets up its own page tables
  . Not used by the hardware!
. VMM maintains shadow page tables
  . Map directly from Guest VAs to Machine Addresses
  . Hardware switched whenever the Guest reloads the PTBR
. VMM must keep the V→M table consistent with the Guest’s V→P table and its own P→M table
  . VMM write-protects all guest page tables
  . Write → trap: apply the write to the shadow table as well (sketched below)
  . Significant overhead!
[Diagram: guest virtual address spaces map via shadow tables directly into machine memory; guests directly share machine memory]
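A sketch of the write-protect-and-trap propagation for shadow page tables, with toy flat arrays standing in for the real multi-level structures: on a trapped guest PTE write, the VMM applies the guest’s update and mirrors it into the shadow (V→M) table using its own P→M map. Bounds and presence checks are omitted for brevity.

```c
#include <stdint.h>

#define NUM_PTES    512
#define PTE_FLAGS   0xFFFull
#define FRAME_SHIFT 12

/* Hypothetical tables for one guest page-table page. */
static uint64_t guest_pt[NUM_PTES];    /* V -> guest-physical (what the guest sees)  */
static uint64_t shadow_pt[NUM_PTES];   /* V -> machine (what the hardware MMU walks) */
static uint64_t p_to_m[1024];          /* VMM's guest-physical -> machine frame map  */

/* Called from the VMM's page-fault handler when the (write-protected)
 * guest page table is written. */
void shadow_pte_write(unsigned idx, uint64_t new_guest_pte)
{
    /* 1. Let the guest's own view change as it expects. */
    guest_pt[idx] = new_guest_pte;

    /* 2. Translate the guest-physical frame to a machine frame and mirror
     *    the update into the shadow table the MMU actually uses. */
    uint64_t pfn = new_guest_pte >> FRAME_SHIFT;
    uint64_t mfn = p_to_m[pfn];
    shadow_pt[idx] = (mfn << FRAME_SHIFT) | (new_guest_pte & PTE_FLAGS);
}
```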

Shadow page tables
[Diagram: the guest reads and writes its own Virtual → Guest-Physical page table; the VMM maintains the shadow Virtual → Machine page table that the hardware MMU walks; accessed and dirty bits are propagated back to the guest’s view]
. Guest (OS) changes are optional, but help with batching and knowing when to unshadow
. Latest algorithms work remarkably well


Hardware support
. “Nested page tables”
  . Relatively new in AMD (NPT) and Intel (EPT) hardware
. Two-level translation of addresses in the MMU (a toy walk is sketched below)
. Hardware knows about:
  . V→P tables (in the Guest)
  . P→M tables (in the Hypervisor)
. Tagged TLBs to avoid an expensive flush on VM entry/exit
. Very nice and easy to code to
  . One reason kvm is so small

Memory allocation
. Guest OS is not expecting physical memory to change in size!
. Two problems:
  . Hypervisor wants to overcommit RAM
  . How to reallocate (machine) memory between VMs
. Phenomenon: double paging
  . Hypervisor pages out memory
  . Guest OS decides to page out the physical frame
  . (Unwittingly) faults it in via the Hypervisor, only to write it out again
  . Significant performance overhead…
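A toy sketch of the two-level translation performed with nested page tables, using flat arrays instead of the real 4-level radix trees (bounds checks omitted): every guest-virtual access goes through the guest’s V→P table and then the hypervisor’s P→M table.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NUM_PAGES  256

/* Toy single-level tables; real x86 tables are multi-level radix trees. */
static uint32_t guest_v_to_p[NUM_PAGES];   /* maintained by the guest OS              */
static uint32_t host_p_to_m[NUM_PAGES];    /* maintained by the hypervisor (EPT/NPT)  */

/* Two-level translation as performed by an MMU with nested page tables. */
uint32_t translate(uint32_t guest_va)
{
    uint32_t gvpn = guest_va >> PAGE_SHIFT;   /* guest virtual page  */
    uint32_t gppn = guest_v_to_p[gvpn];       /* guest physical page */
    uint32_t mpn  = host_p_to_m[gppn];        /* machine page        */
    return (mpn << PAGE_SHIFT) | (guest_va & PAGE_MASK);
}
```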


Ballooning
. Technique to reclaim memory from a Guest
. Install a “balloon driver” in the Guest kernel
  . Can allocate and free kernel physical memory, just like any other part of the kernel
  . Uses HyperCalls to return frames to the Hypervisor, and to have them returned

Ballooning: taking RAM away from a VM
[Diagram: the balloon driver sits inside the guest physical address space; the Guest OS is unaware and simply allocates physical memory]

Ballooning: taking RAM away from a VM
1. VMM asks the balloon driver for memory
2. Balloon driver asks the Guest OS kernel for more frames
   . “inflates the balloon”
3. Balloon driver sends the physical frame numbers to the VMM
4. VMM translates them into machine addresses and claims the frames
(a sketch of the inflate path follows)
[Diagram sequence: the balloon grows inside the guest physical address space as the driver allocates frames; the underlying physical memory is claimed by the balloon driver and handed to the VMM]
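A sketch of the inflate path of a balloon driver. The guest-kernel allocation and hypercall functions are hypothetical; real balloon drivers (e.g., Xen’s or VMware’s) differ in their interfaces.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical guest-kernel and hypervisor interfaces, for illustration only. */
extern void    *guest_alloc_page(void);            /* like any other kernel allocation */
extern uint64_t guest_page_to_pfn(void *page);
extern int      hypercall_give_frames(const uint64_t *pfns, size_t n);

#define INFLATE_BATCH 64
static void *balloon_pages[INFLATE_BATCH];          /* pages held by the balloon */

/* Inflate: the guest OS thinks these pages are in use by a driver;
 * the hypervisor reclaims the underlying machine frames. */
size_t balloon_inflate(size_t want_pages)
{
    uint64_t pfns[INFLATE_BATCH];
    size_t got = 0;

    while (got < want_pages && got < INFLATE_BATCH) {
        void *page = guest_alloc_page();
        if (!page)
            break;                    /* guest is under memory pressure: stop inflating */
        balloon_pages[got] = page;
        pfns[got] = guest_page_to_pfn(page);
        got++;
    }
    if (got)
        hypercall_give_frames(pfns, got);   /* VMM translates PFNs and claims the frames */
    return got;
}
```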


Returning RAM to a VM
1. VMM converts a machine address into a physical address previously allocated by the balloon driver
2. VMM hands the PFN to the balloon driver
3. Balloon driver frees the physical frame back to the Guest OS kernel
   . “deflates the balloon”
[Diagram: the balloon shrinks inside the guest physical address space]

Virtualizing Devices
. Familiar by now: trap-and-emulate
  . I/O space traps
  . Protect memory and trap
  . “Device model”: software model of the device in the VMM (sketched below)
. Interrupts → upcalls to the Guest OS
  . Emulate the interrupt controller (APIC) in the Guest
  . Emulate DMA with a copy into the Guest PAS
. Significant performance overhead!
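A sketch of a trap-and-emulate device model for port I/O. The port number is loosely modeled on a UART data register; the dispatch structure, not the particular device, is the point.

```c
#include <stdint.h>
#include <stdio.h>

#define SERIAL_DATA_PORT 0x3F8   /* illustrative: COM1 data register */

/* Called by the VMM when the guest executes OUT to an emulated port.
 * The guest believes it is talking to hardware; we model it in software. */
void device_model_port_write(uint16_t port, uint8_t value)
{
    switch (port) {
    case SERIAL_DATA_PORT:
        /* "Transmit" the byte: here, append it to the VM's console log. */
        fputc(value, stderr);
        break;
    default:
        /* Unknown device: real VMMs typically ignore or log the access. */
        break;
    }
}

/* Called when the guest executes IN from an emulated port. */
uint8_t device_model_port_read(uint16_t port)
{
    if (port == SERIAL_DATA_PORT)
        return 0;          /* no pending input in this toy model */
    return 0xFF;           /* convention: float high for unmapped ports */
}
```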

Paravirtualized devices
. “Fake” device drivers which communicate efficiently with the VMM via hypercalls
. Used for block devices like disk controllers
. Network interfaces
. “VMware tools” is mostly about these
. Dramatically better performance!

Networking
. Virtual network device in the Guest VM
. Hypervisor implements a “soft switch” (forwarding sketched below)
  . Entire virtual IP/Ethernet network on a machine
. Many different addressing options:
  . Separate IP addresses
  . Separate MAC addresses
  . NAT
  . Etc.
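A sketch of the forwarding decision in a hypervisor soft switch (structures hypothetical): look up the destination MAC among the attached virtual NICs and deliver locally, otherwise send the frame out the physical NIC. Frame length checks are omitted.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define MAX_PORTS 8

/* One entry per virtual NIC attached to the soft switch. */
struct switch_port {
    uint8_t mac[6];
    void (*deliver)(const void *frame, size_t len);   /* inject into that VM */
};

static struct switch_port ports[MAX_PORTS];
static size_t num_ports;
extern void send_to_physical_nic(const void *frame, size_t len);

/* Forward one Ethernet frame based on its destination MAC (first 6 bytes). */
void soft_switch_forward(const void *frame, size_t len)
{
    const uint8_t *dst_mac = frame;
    for (size_t i = 0; i < num_ports; i++) {
        if (memcmp(ports[i].mac, dst_mac, 6) == 0) {
            ports[i].deliver(frame, len);   /* VM-to-VM traffic never leaves the host */
            return;
        }
    }
    send_to_physical_nic(frame, len);       /* unknown destination: out to the real network */
}
```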


Where are the real drivers?
1. In the Hypervisor
   . E.g., VMware ESX
   . Problem: need to rewrite device drivers (new OS)
2. In the console OS
   . Export virtual devices to other VMs
3. In “driver domains”
   . Map hardware directly into a “trusted” VM
   . Run your favorite OS just for the device driver
   . Use IOMMU hardware to protect other memory from the driver VM
4. Use “self-virtualizing devices”

Xen 3.x Architecture
[Diagram: VM0 runs the device manager and control software on XenLinux with native device drivers and device passthrough; VM1 and VM2 run unmodified user software on (SMP) XenLinux guests; VM3 runs unmodified user software on an unmodified WinXP guest; VM1–VM3 use front-end device drivers. The Xen Virtual Machine Monitor provides the control interface, safe hardware interface, event channels, virtual CPU, and virtual MMU over the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE). A sketch of front-end/back-end communication follows the diagram.]

Thanks to Steve Hand for some of these diagrams
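A sketch of how a front-end driver in a guest and a back-end driver in the driver domain might communicate: a single-producer/single-consumer request ring in shared memory plus a notification. Xen’s real ring macros, grant tables, and event channels are more involved; all names below are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 64     /* power of two */

/* One I/O request from the front-end (guest) to the back-end (driver domain). */
struct io_request {
    uint64_t sector;
    uint32_t grant_ref;   /* reference to a shared guest page holding the data */
    uint8_t  write;
};

/* Shared between the two domains via granted memory. */
struct shared_ring {
    volatile uint32_t prod;            /* written by front-end */
    volatile uint32_t cons;            /* written by back-end  */
    struct io_request req[RING_SIZE];
};

extern void notify_backend(void);      /* e.g., kick an event channel (hypothetical) */

/* Front-end: queue a request and notify the back-end. */
bool frontend_submit(struct shared_ring *r, const struct io_request *rq)
{
    uint32_t prod = r->prod;
    if (prod - r->cons == RING_SIZE)
        return false;                           /* ring full */
    r->req[prod % RING_SIZE] = *rq;
    __sync_synchronize();                       /* make the request visible first */
    r->prod = prod + 1;
    notify_backend();
    return true;
}

/* Back-end: consume the next request, if any. */
bool backend_poll(struct shared_ring *r, struct io_request *out)
{
    uint32_t cons = r->cons;
    if (cons == r->prod)
        return false;                           /* ring empty */
    *out = r->req[cons % RING_SIZE];
    __sync_synchronize();
    r->cons = cons + 1;
    return true;
}
```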


Xen 3.x Architecture
[Diagram repeated, now highlighting the virtual switch in VM0: the guests’ front-end network drivers connect through the virtual switch in the control domain to the native device drivers]

Remember this card?
[Photo of the NIC shown earlier]

SR-IOV

. Single-Root I/O Virtualization
. Key idea: dynamically create new “PCIe devices”
  . Physical Function (PF): the original device, full functionality
  . Virtual Function (VF): an extra “device”, limited functionality
  . VFs created/destroyed via PF registers (see the sketch below)
. For networking:
  . Partitions a network card’s resources
  . With direct assignment, can implement passthrough
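On Linux, VFs are commonly created by writing the desired count to the physical function’s sriov_numvfs attribute in sysfs; the small C sketch below does exactly that. The PCI address is only an example, and the write succeeds only if the NIC and its driver support SR-IOV.

```c
#include <stdio.h>

/* Example PCI address of the physical function; adjust for your NIC. */
#define PF_SYSFS "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs"

/* Ask the PF driver to create 'n' virtual functions (writing 0 destroys them). */
static int set_num_vfs(int n)
{
    FILE *f = fopen(PF_SYSFS, "w");
    if (!f) {
        perror("open sriov_numvfs");
        return -1;
    }
    fprintf(f, "%d\n", n);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Each created VF appears as its own PCIe device that can be
     * assigned directly to a VM (passthrough via the IOMMU). */
    return set_num_vfs(4);
}
```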


SR-IOV in action
[Diagram sequence: initially only the VMM’s PF driver and its virtual switch (VSwitch) talk to the NIC’s physical function over PCIe. VMs are first attached via paravirtualized VNIC drivers that go through the VSwitch. As virtual functions are created, VMs get their own VF drivers that talk directly to a virtual function on the SR-IOV NIC through the IOMMU, bypassing the VMM. The NIC’s internal virtual Ethernet bridge/switch and packet classifier demultiplex traffic between the physical function, the virtual functions, and the LAN.]


Self-virtualizing devices
. Can dynamically create up to 2048 distinct PCI devices on demand!
. Hypervisor can create a virtual NIC for each VM
. Softswitch driver programs the “master” NIC to demux packets to each virtual NIC
. PCI bus is virtualized in each VM
. Each Guest OS appears to have a “real” NIC, and talks directly to the real hardware

Next week
. Reliable storage
. Research/Future