
Xen and the Art of Virtualization

Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt & Andrew Warfield

SOSP 2003

1 Additional source: Ian Pratt on Xen (Xen source): www.cl.cam.ac.uk/research/srg/netos/papers/2005-xen-may.ppt

2 Virtualization approaches
- Full virtualization
  - OS sees exact h/w
  - OS runs unmodified
  - Requires a virtualizable architecture, or the VMM must work around the h/w
  - Example: VMware
- Paravirtualization
  - OS knows about the VMM
  - Requires porting (source-code changes)
  - Low execution overhead
  - Examples: Xen, Denali

3 The Xen approach

- Support for unmodified application binaries (but not OS) is essential
  - Important for app developers
  - The virtualized system exports the same Application Binary Interface (ABI)
- Modify the guest OS to be aware of virtualization
  - Gets around the problems of the x86 architecture
  - Allows better performance to be achieved
- Expose some effects of virtualization
  - "Translucent" VM: the OS can use this information to optimize for performance
- Keep the hypervisor layer as small and simple as possible
  - Resource management and device drivers run in the privileged VMM
  - Enhances security and resource isolation

4

- Solution to issues with the x86 instruction set
  - Don't allow the guest OS to issue sensitive instructions
  - Replace those sensitive instructions that don't trap with ones that will trap
- Guest OS makes "hypercalls" (like system calls) to interact with system resources
  - Allows the hypervisor to provide protection between VMs
- Exceptions handled by registering a handler table with Xen
  - Fast handler for OS system calls invoked directly
  - Page fault handler modified to read the faulting address from a replica location
- Guest OS changes largely confined to arch-specific code
  - Compile for ARCH=xen instead of ARCH=i686

- Original port of Linux required only 1.36% of the OS to be modified

5 Para-Virtualization in Xen
- Arch xen_x86: like x86, but Xen hypercalls are required for privileged operations (see the sketch below)
  - Avoids binary rewriting
  - Minimize the number of privilege transitions into Xen
  - Modifications relatively simple and self-contained
- Modify the kernel to understand the virtualized environment
  - Wall-clock time vs. virtual processor time
  - Xen provides both types of alarm timer
- Expose real resource availability
  - Enables the OS to optimise its behaviour
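A minimal sketch of the hypercall idea, using the trap-table registration mentioned above: instead of loading a hardware IDT with a privileged instruction, the ported kernel hands its exception handlers to Xen for validation. Struct layout follows the public Xen headers; the hypercall wrapper, GUEST_KERNEL_CS value and handler names are illustrative placeholders.

/* Paravirtualized trap registration: the guest cannot execute the
 * privileged lidt instruction, so it passes its handler table to Xen,
 * which validates and installs it. */
#include <stdint.h>

struct trap_info {
    uint8_t       vector;    /* exception/interrupt vector number       */
    uint8_t       flags;     /* bits 0-1: privilege level allowed to
                                raise this trap (3 => callable directly
                                from user space, the "fast" syscall path) */
    uint16_t      cs;        /* code segment selector of the handler    */
    unsigned long address;   /* virtual address of the handler          */
};

/* Guest-side wrapper for hypercall #0 (__HYPERVISOR_set_trap_table). */
extern long HYPERVISOR_set_trap_table(struct trap_info *table);

#define GUEST_KERNEL_CS 0x0000        /* illustrative selector value */
extern void page_fault_handler(void);
extern void syscall_handler(void);

static void register_traps(void)
{
    static struct trap_info traps[] = {
        { 14,   0, GUEST_KERNEL_CS, (unsigned long)page_fault_handler },
        { 0x80, 3, GUEST_KERNEL_CS, (unsigned long)syscall_handler    },
        { 0, 0, 0, 0 }                /* zeroed entry terminates the list */
    };
    /* Xen checks each entry (the handler must not run in ring 0, the
     * selector must be a valid guest selector) before installing it. */
    HYPERVISOR_set_trap_table(traps);
}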

6 x86 CPU virtualization
- Xen runs in ring 0 (most privileged)
- Ring 1/2 for the guest OS, ring 3 for user space
- General Protection Fault if the guest attempts to use a privileged instruction
- Xen lives in the top 64MB of the linear address space
  - Segmentation used to protect Xen, as switching page tables is too slow on standard x86
- Hypercalls jump to Xen in ring 0
- Guest OS may install a 'fast trap' handler
  - Direct user-space to guest-OS system calls
- MMU virtualisation: shadow vs. direct mode

7 x86_32

- Xen reserves the top of the 4GB VA space (Xen segment: S, ring 0)
- Segmentation protects Xen from the kernel (kernel segment: S, ring 1, above the 3GB boundary)
- System call speed unchanged
- User space: U, ring 3, from 0GB to 3GB
- Xen 3.0 now supports >4GB memory with Physical Address Extension (PAE) and 64-bit builds

8 Xen VM interface: CPU
- Guest runs at lower privilege than the VMM
- Exception handlers must be registered with the VMM
- Fast system call handler can be serviced without trapping to the VMM
- Hardware interrupts replaced by a lightweight event notification system (see the sketch below)
- Timer interface: both real and virtual time
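A sketch of the event-notification mechanism: Xen sets bits in a shared-memory bitmap and makes a single upcall into the guest, which scans the bitmap and dispatches handlers. The layout is trimmed from the public shared_info structure; the handler table and dispatch loop are illustrative.

/* Simplified event-channel delivery: hardware interrupts are replaced
 * by bits in a shared bitmap plus one upcall from Xen. */
#include <stdint.h>

#define NR_EVENTS     1024
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct vcpu_info {
    uint8_t       evtchn_upcall_pending;  /* Xen: "at least one event"   */
    uint8_t       evtchn_upcall_mask;     /* guest: temporarily mask     */
    unsigned long evtchn_pending_sel;     /* which pending words to scan */
};

struct shared_info {
    struct vcpu_info vcpu_info[1];                 /* single VCPU here   */
    unsigned long    evtchn_pending[NR_EVENTS / BITS_PER_WORD];
    unsigned long    evtchn_mask[NR_EVENTS / BITS_PER_WORD];
};

extern struct shared_info *shared;                        /* shared page */
extern void (*event_handler[NR_EVENTS])(unsigned int port);

/* Body of the upcall entry point registered with Xen. */
static void dispatch_events(void)
{
    struct vcpu_info *v = &shared->vcpu_info[0];
    v->evtchn_upcall_pending = 0;

    /* Atomically grab and clear the selector, then the pending words. */
    unsigned long sel = __sync_lock_test_and_set(&v->evtchn_pending_sel, 0);
    for (unsigned int w = 0; sel != 0; w++, sel >>= 1) {
        if (!(sel & 1))
            continue;
        unsigned long pending =
            __sync_lock_test_and_set(&shared->evtchn_pending[w], 0)
            & ~shared->evtchn_mask[w];
        for (unsigned int b = 0; pending != 0; b++, pending >>= 1) {
            if (pending & 1) {
                unsigned int port = w * BITS_PER_WORD + b;
                event_handler[port](port);    /* per-port virtual "ISR" */
            }
        }
    }
}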

9 Xen: virtualizing the CPU
- Many processor architectures provide only 2 privilege levels (0/1)
  - Guest and apps in level 1, VMM in level 0
  - Run the guest OS and apps as separate processes
  - Guest OS uses the VMM to pass control between address spaces
  - Use of a TLB with address-space tags to minimize context-switch overhead

10 Xen: virtualizing the CPU on x86
- x86 provides 4 rings (even the VAX processor provided 4)
- Leverages availability of multiple "rings"
- Intermediate rings have not been used in practice since OS/2; x86-specific
- An OS written to use only rings 0 and 3 can be ported; the kernel must be modified to run in ring 1

11 CPU virtualization
- Exceptions that occur often:
  - Software interrupts for system calls
  - Page faults
- Improvement: allow the guest to register a 'fast' exception handler for system calls that the CPU can invoke directly in ring 1, without switching to ring 0/Xen
  - The handler is validated before being installed in the hardware exception table, to make sure nothing executes at ring 0 privilege
- Doesn't work for page faults
  - Only code in ring 0 can read the faulting address from the CR2 register, so Xen copies it to a replica location the guest can read (see the sketch below)
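A sketch of the page-fault modification mentioned above: the faulting address lives in CR2, readable only at ring 0, so Xen copies it into a per-VCPU field that the ring-1 kernel can read. The field name follows the public arch_vcpu_info layout; the wrapper around it is illustrative.

/* Why page faults cannot use the "fast" path: the guest kernel in
 * ring 1 cannot read CR2, so Xen copies the faulting address into the
 * shared per-VCPU structure and the ported kernel reads it from there. */

struct arch_vcpu_info {
    unsigned long cr2;        /* faulting address, written here by Xen */
    unsigned long pad[5];
};

/* Pointer into this VCPU's slot of the shared-info page (assumed to be
 * mapped at guest start-of-day). */
extern struct arch_vcpu_info *this_vcpu_arch;

/* Paravirtualized replacement for the native read_cr2(). */
static unsigned long xen_read_cr2(void)
{
    return this_vcpu_arch->cr2;
}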

12 Xen

13 Some Xen hypercalls
- See http://lxr.xensource.com/lxr/source/xen/include/public/xen.h

#define __HYPERVISOR_set_trap_table  0
#define __HYPERVISOR_mmu_update      1
#define __HYPERVISOR_sysctl         35
#define __HYPERVISOR_domctl         36

14 Xen VM interface: Memory
- Guest cannot install highest-privilege-level segment descriptors; the top end of the linear address space is not accessible to the guest
- Guest has direct (untrapped) read access to hardware page tables; writes are trapped and handled by the VMM
- Physical memory presented to the guest is not necessarily contiguous

15 Choices
- TLB: challenging
  - A software-managed TLB can be virtualized without flushing TLB entries between VM switches
  - Hardware TLBs tagged with address-space identifiers can also be leveraged to avoid flushing the TLB between switches
  - The x86 TLB is hardware-managed and has no tags...
- Decisions:
  - Guest OSes allocate and manage their own hardware page tables, with minimal involvement of Xen, for better safety and isolation
  - The Xen VMM exists in a 64MB section at the top of every VM's address space that is not accessible from the guest

16 Xen memory management
- x86 TLB is not tagged
  - Must optimise context switches: allow the VM to see physical addresses
  - Xen mapped into each VM's address space
- PV: the guest OS manages its own page tables
  - Allocates new page tables and registers them with Xen
  - Can read them directly
  - Updates are batched, then validated and applied by Xen

17 Memory virtualization
- Guest OS has direct read access to hardware page tables, but updates are validated by the VMM
  - Through "hypercalls" into Xen (see the sketch below)
  - Also applies to segment descriptor tables
- VMM must ensure that access to the Xen 64MB section is not allowed
- Guest OS may "batch" update requests to amortize the cost of entering the hypervisor
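A minimal sketch of a single validated page-table update via the mmu_update hypercall (hypercall #1 in the list on slide 13). The struct and constants come from the public Xen interface; the hypercall wrapper and virt_to_machine() helper are assumed.

/* One PTE write, routed through Xen for validation. */
#include <stdint.h>

struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE, low bits = command */
    uint64_t val;   /* new PTE contents                               */
};
#define MMU_NORMAL_PT_UPDATE 0
#define DOMID_SELF ((uint16_t)0x7FF0)

extern long HYPERVISOR_mmu_update(struct mmu_update *req, unsigned int count,
                                  unsigned int *done, uint16_t dom);

/* Assumed helper: guest virtual address of a PTE -> machine address. */
extern uint64_t virt_to_machine(void *va);

static int set_pte(uint64_t *pte_va, uint64_t new_pte)
{
    struct mmu_update u = {
        .ptr = virt_to_machine(pte_va) | MMU_NORMAL_PT_UPDATE,
        .val = new_pte,
    };
    unsigned int done = 0;
    /* Xen validates the update (frame types, protection bits, no mapping
     * of the hypervisor region) before applying it. */
    return HYPERVISOR_mmu_update(&u, 1, &done, DOMID_SELF);
}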

18 x86_32 (recap)
- Same address-space layout as slide 7: Xen at the top of the 4GB VA space (S, ring 0), kernel in ring 1 above 3GB, user space in ring 3 below 3GB; segmentation protects Xen from the kernel; system call speed unchanged
- Xen 3.0 now supports >4GB memory with Physical Address Extension (PAE) and 64-bit builds

19 Virtualized memory management
- Each process in each VM has its own virtual address space
- Guest OS deals with real (pseudo-physical) pages; Xen maps pseudo-physical pages to machine pages (see the sketch below)
- For PV, the guest OS uses hypercalls to interact with memory
- For HVM, Xen keeps shadow page tables (VT instructions help)
(diagram: per-VM virtual → pseudo-physical → machine frame mappings for VM1 and VM2)

20 TLB when VM1 is running

VP | PID | PN
 1 |  1  |  ?
 2 |  1  |  ?
 1 |  2  |  ?
 2 |  2  |  ?
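A sketch of the two translation tables behind the pseudo-physical abstraction: the guest maintains its own physical-to-machine (p2m) table, while Xen exports a read-only machine-to-physical (m2p) table. The names mirror the pfn_to_mfn/mfn_to_pfn convention used in Xen-aware kernels, but the arrays here are illustrative.

/* Pseudo-physical ("physical") <-> machine frame translation. */

extern unsigned long *phys_to_machine_mapping;   /* per-guest p2m table   */
extern unsigned long *machine_to_phys_mapping;   /* global m2p, Xen-owned */

static unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];         /* maintained by guest   */
}

static unsigned long mfn_to_pfn(unsigned long mfn)
{
    return machine_to_phys_mapping[mfn];         /* read-only, from Xen   */
}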

21 MMU Virtualization: shadow mode

22 Shadow page tables
- Hypervisor is responsible for trapping accesses to the virtual page table
- Updates need to be propagated back and forth between the guest OS and the VMM
- Increases the cost of managing page-table flags (modified, accessed bits)
- Guest can view physical memory as contiguous
- Needed for full virtualization

23 MMU virtualization: direct mode
- Takes advantage of paravirtualization
  - The OS can be modified so Xen is involved only in page-table updates
- Restrict guest OSes to read-only access to page tables
  - Classify page frames: track which frames hold page tables
  - Once a frame is registered as a page-table frame, make it read-only (see the sketch below)
- Avoids the use of shadow page tables
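A sketch of registering ("pinning") a frame as a page table, so Xen can validate it once and then enforce read-only guest access. The command and calling pattern come from the public mmuext_op interface; the struct is simplified and the wrapper is assumed.

/* Pin a frame as an L1 page table. */
#include <stdint.h>

#define MMUEXT_PIN_L1_TABLE 0
#define DOMID_SELF ((uint16_t)0x7FF0)

struct mmuext_op {
    unsigned int cmd;
    union { uint64_t mfn; unsigned long linear_addr; } arg1;
};

extern long HYPERVISOR_mmuext_op(struct mmuext_op *op, unsigned int count,
                                 unsigned int *done, uint16_t dom);

static int pin_l1_table(uint64_t mfn)
{
    struct mmuext_op op = { .cmd = MMUEXT_PIN_L1_TABLE, .arg1.mfn = mfn };
    unsigned int done = 0;
    /* After this call the frame is type-counted as a page table: Xen
     * will refuse any writable guest mapping of it. */
    return HYPERVISOR_mmuext_op(&op, 1, &done, DOMID_SELF);
}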

24 Single PTE update

25 On PTE write: emulate
(diagram: the guest reads the virtual → machine table directly; a guest write traps to the Xen VMM, which decides whether to emulate the update on the hardware MMU)

26 Bulk update
- Useful when creating new virtual address spaces
  - New process via fork; context switch
  - Requires creation of several PTEs
- Multipurpose hypercall (see the sketch below)
  - Update PTEs
  - Update virtual-to-machine mappings
  - Flush the TLB
  - Install a new page-table base register (PTBR)
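A sketch of the bulk-update pattern: PTE writes are queued and submitted in one mmu_update hypercall, and the base-pointer switch plus TLB flush go through mmuext_op. Constants and structs are simplified from the public Xen interface; the batching queue is illustrative.

/* Batched page-table updates plus address-space switch. */
#include <stdint.h>

#define MMU_NORMAL_PT_UPDATE   0
#define MMUEXT_NEW_BASEPTR     5
#define MMUEXT_TLB_FLUSH_LOCAL 6
#define DOMID_SELF ((uint16_t)0x7FF0)

struct mmu_update { uint64_t ptr; uint64_t val; };
struct mmuext_op  { unsigned int cmd;
                    union { uint64_t mfn; unsigned long linear_addr; } arg1; };

extern long HYPERVISOR_mmu_update(struct mmu_update *req, unsigned int count,
                                  unsigned int *done, uint16_t dom);
extern long HYPERVISOR_mmuext_op(struct mmuext_op *op, unsigned int count,
                                 unsigned int *done, uint16_t dom);

#define BATCH 64
static struct mmu_update queue[BATCH];
static unsigned int queued;

static void queue_pte(uint64_t pte_maddr, uint64_t val)
{
    queue[queued].ptr = pte_maddr | MMU_NORMAL_PT_UPDATE;
    queue[queued].val = val;
    if (++queued == BATCH) {             /* flush when the batch is full */
        unsigned int done = 0;
        HYPERVISOR_mmu_update(queue, queued, &done, DOMID_SELF);
        queued = 0;
    }
}

static void switch_address_space(uint64_t new_base_mfn)
{
    unsigned int done = 0;
    struct mmuext_op ops[2] = {
        { .cmd = MMUEXT_NEW_BASEPTR,     .arg1.mfn = new_base_mfn },
        { .cmd = MMUEXT_TLB_FLUSH_LOCAL                           },
    };

    if (queued) {                        /* apply any pending PTE updates */
        HYPERVISOR_mmu_update(queue, queued, &done, DOMID_SELF);
        queued = 0;
    }
    HYPERVISOR_mmuext_op(ops, 2, &done, DOMID_SELF);
}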

27 Batched Update Interface

(diagram: the guest reads PD/PT pages via the virtual → machine mapping; guest writes pass through validation in the Xen VMM before reaching the hardware MMU)

28 Writeable Page Tables: create new entries
(diagram: the page-table page being written is unhooked (X) from the live hierarchy, so the guest can write new entries without trapping on each one)

29 Writeable Page Tables: first use — validate mapping via TLB
(diagram: the first use of the unhooked page causes a page fault into the Xen VMM, which validates the new entries)

30 Writeable Page Tables: re-hook
(diagram: after validation, Xen re-hooks the page into the hierarchy and the hardware MMU uses it again)

31 Physical memory
- Memory allocation for each VM is specified at boot
  - Statically partitioned
  - No overlap in machine memory
  - Strong isolation
- Non-contiguous (sparse) allocation
- Balloon driver
  - Add or remove machine memory from the guest OS

32 Xen memory management
- Xen does not swap out memory allocated to domains
  - Provides consistent performance for domains
  - By itself this would create an inflexible system (static memory allocation)
- Balloon driver allows guest memory to grow/shrink (see the sketch below)
  - Memory target set as a value in the XenStore
  - If the guest is above target: free/swap out pages, then release them to Xen
  - If the guest is below target: it can increase its usage
- Hypercalls allow guests to see/change the state of memory
  - Physical-to-machine mappings
  - "Defragment" allocated memory
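A sketch of the balloon driver's "inflate" step: the guest frees pages from its own allocator and returns the underlying machine frames to Xen with XENMEM_decrease_reservation. The reservation struct is simplified from the public memory_op interface; the page-allocation helper is assumed.

/* Shrink this domain's memory toward the XenStore target. */
#include <stdint.h>

#define XENMEM_decrease_reservation 1
#define DOMID_SELF ((uint16_t)0x7FF0)

struct xen_memory_reservation {
    uint64_t     *extent_start;   /* list of machine frames to release */
    unsigned long nr_extents;
    unsigned int  extent_order;   /* 0 = single 4KB pages              */
    uint16_t      domid;
};

extern long HYPERVISOR_memory_op(unsigned int cmd, void *arg);
/* Assumed helper: allocate a guest page, unmap it, return its mfn. */
extern uint64_t alloc_unused_machine_frame(void);

static void balloon_inflate(unsigned long nr_pages, uint64_t *mfns)
{
    for (unsigned long i = 0; i < nr_pages; i++)
        mfns[i] = alloc_unused_machine_frame();

    struct xen_memory_reservation r = {
        .extent_start = mfns,
        .nr_extents   = nr_pages,
        .extent_order = 0,
        .domid        = DOMID_SELF,
    };
    /* Xen reclaims these machine frames; the domain's reservation
     * shrinks toward the target published in the XenStore. */
    HYPERVISOR_memory_op(XENMEM_decrease_reservation, &r);
}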

33 Xen VM interface: I/O
- Virtual devices (device descriptors) exposed as asynchronous I/O rings to guests
- Event notification is by means of an upcall, as opposed to interrupts

34 I/O
- Handle interrupts
- Data transfer
  - Data written to I/O buffer pools in each domain
  - These page frames are pinned by Xen

35 Details: I/O
- I/O descriptor ring:

36 I/O rings

(diagram: a circular descriptor ring with four pointers — request producer (guest), request consumer (Xen), response producer (Xen), response consumer (guest))

37 I/O virtualization
- Xen does not emulate hardware devices
  - Exposes device abstractions for simplicity and performance
- I/O data transferred to/from the guest via Xen using shared-memory buffers (see the sketch below)
- Virtualized interrupts: lightweight event-delivery mechanism from Xen to guest
  - Update a bitmap in shared memory
  - Optional call-back handlers registered by the OS
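A sketch of the shared-memory I/O ring in its simplest form. The real layout is generated by the DEFINE_RING_TYPES macros in xen/include/public/io/ring.h and overlays requests and responses in a single array; two arrays and generic structs are used here only for clarity.

/* Simplified descriptor ring shared between guest (front end) and
 * Xen/backend: two producer/consumer index pairs, as in the diagram. */
#include <stdint.h>

#define RING_SIZE 32                        /* power of two */

struct request  { uint64_t id; /* ...device-specific fields... */ };
struct response { uint64_t id; int16_t status; };

struct io_ring {
    /* Guest produces requests, Xen (or the backend) consumes them. */
    uint32_t req_prod, req_cons;
    /* Xen produces responses, the guest consumes them. */
    uint32_t rsp_prod, rsp_cons;
    struct request  req[RING_SIZE];
    struct response rsp[RING_SIZE];
};

/* Guest side: queue a request if there is space.  Indices only ever
 * increase and are taken modulo RING_SIZE when indexing. */
static int ring_put_request(struct io_ring *r, const struct request *rq)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                          /* ring full */
    r->req[r->req_prod % RING_SIZE] = *rq;
    __sync_synchronize();                   /* make the slot visible first */
    r->req_prod++;                          /* then publish it */
    return 0;
}

/* Guest side: consume any responses the backend has produced. */
static int ring_get_response(struct io_ring *r, struct response *out)
{
    if (r->rsp_cons == r->rsp_prod)
        return -1;                          /* nothing pending */
    *out = r->rsp[r->rsp_cons % RING_SIZE];
    r->rsp_cons++;
    return 0;
}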

38 Network virtualization
- Xen models a virtual firewall-router (VFR) to which one or more VIFs of each domain connect
- Two I/O rings: one for send and another for receive
- Policy enforced by a special domain
  - Each direction also has rules of the form (<pattern> → <action>) that are inserted by domain 0 (management)

39 Network virtualization
- Packet transmission:
  - Guest adds a request to the I/O ring
  - Xen copies the packet header and applies matching filter rules
  - Round-robin packet scheduler

40 Network virtualization
- Packet reception:
  - Xen applies pattern-matching rules to determine the destination VIF
  - Guest OS is required to provide page frames for received packets (frames are exchanged rather than copied)
  - If no receive frame is available, the packet is dropped
  - Avoids Xen-to-guest copies

41 Disk virtualization
- Uses a split-driver approach
  - Front-end and back-end drivers
- Front end
  - Guest OSes use a simple generic driver per device class
- Back end
  - Runs in its own VM (domain 0)
  - Domain 0 provides the actual driver per device

42 Disk virtualization
- Domain 0 has access to the physical disks
  - Currently: SCSI and IDE
- All other domains are offered a virtual block device (VBD) abstraction
  - Created & configured by management software in domain 0
  - Accessed via the I/O ring mechanism
  - Possible reordering by Xen based on knowledge of the disk layout

43 Disk virtualization
- Xen maintains translation tables for each VBD
  - Used to map a VBD request (ID, offset) to the corresponding physical device and sector address
- Zero-copy data transfers take place using DMA between memory pages pinned by the requesting domain (see the sketch below)
- Scheduling: batches of requests in round-robin fashion across domains
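A sketch of a VBD request descriptor, simplified from the public blkif interface (xen/include/public/io/blkif.h). Each segment names a pinned guest buffer frame (by grant reference in current Xen), so the backend can DMA directly into guest memory with no copy.

/* One entry placed on the block-device I/O ring by the front end. */
#include <stdint.h>

#define BLKIF_OP_READ  0
#define BLKIF_OP_WRITE 1
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

struct blkif_request_segment {
    uint32_t gref;              /* grant reference of the buffer frame  */
    uint8_t  first_sect;        /* first sector used within the frame   */
    uint8_t  last_sect;         /* last sector used within the frame    */
};

struct blkif_request {
    uint8_t  operation;         /* BLKIF_OP_READ / BLKIF_OP_WRITE       */
    uint8_t  nr_segments;
    uint16_t handle;            /* which VBD this request targets       */
    uint64_t id;                /* echoed back in the response          */
    uint64_t sector_number;     /* start sector on the virtual device   */
    struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};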

44 Advanced features
- Support for HVM (hardware virtualisation support)
  - Very similar to the "classic" VM scenario
  - Uses emulated devices and shadow page tables
  - Hypervisor (VMM) still has an important role to play
  - "Hybrid" HVM paravirtualizes components (e.g. device drivers) to improve performance
- Migration of domains between machines
  - A daemon runs on each Dom0 to support this
  - Incremental copying used (60ms downtime!)

45 Xen 2.0 Architecture

(diagram: VM0 runs the Device Manager & Control s/w on a XenLinux guest with native device drivers and back-end drivers; VM1-VM3 run unmodified user software on XenLinux/XenBSD guests with front-end drivers; the Xen monitor provides the control interface, safe hardware interface, event channels, virtual CPU and virtual MMU, on top of the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE))

46 Xen Today: 2.0 features
- Secure isolation between VMs
- Resource control and QoS
- Only the guest kernel needs to be ported
  - All user-level apps and libraries run unmodified
  - Linux 2.4/2.6, NetBSD, FreeBSD, Plan 9
- Execution performance is close to native
- Supports the same hardware as Linux x86
- Live relocation of VMs between Xen nodes

47 Xen 3.0 Architecture

(diagram: as in the 2.0 architecture, but VM3 can now be an unmodified guest OS (e.g. WinXP) running with VT-x; the Xen Virtual Machine Monitor supports x86_32, x86_64 and IA64, with SMP, AGP, ACPI and PCI support, on top of the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE))

48 Xen 3.0 features

- Support for up to 32-way SMP guests
- Intel VT-x and AMD Pacifica hardware virtualization support
- PAE support for 32-bit servers with over 4GB memory
- x86/64 support for both AMD64 and EM64T
- New easy-to-use CPU scheduler including weights, caps and automatic load balancing
- Much enhanced support for unmodified ('hvm') guests, including Windows and legacy Linux systems
- Support for sparse and copy-on-write disks
- High-performance networking using segmentation offload

49 Xen protection levels in PAE

- x86_64 removed rings 1 and 2
- Xen in ring 0
- Guest OS and apps in ring 3

50 x86_64

- Large VA space makes life a lot easier, but:
  - No segment-limit support
  - ⇒ Need to use page-level protection to protect the hypervisor
(diagram: 64-bit layout — Kernel (U) at the top of the 2^64 address space, Xen (S) below it down to 2^64 - 2^47, the reserved non-canonical hole down to 2^47, and User (U) from 0 up to 2^47)

51 x86_64

- Run user space and kernel in ring 3, using different page tables
  - Two PGDs (PML4s): one with user entries only; one with user plus kernel entries
- System calls require an additional syscall/sysret transition via Xen
- Per-CPU trampoline to avoid needing GS in Xen
(diagram: User (U, ring 3), Kernel (U, ring 3), Xen (S, ring 0))

52 Additional resources on Xen

- "Xen 3.0 and the Art of Virtualization", presentation by Ian Pratt
- "Virtual Machines" by Jim Smith and Ravi Nair
- "The Definitive Guide to the Xen Hypervisor" (Kindle edition), David Chisnall
- The source code: http://lxr.xensource.com/lxr/source/xen/
