Virtualisation

Advanced Operating Systems, Lecture 17

Lecture Outline

• Virtualisation concepts
• Full system virtualisation and virtual machines
• Jails, containers, and the benefits of imperfect virtualisation

Virtualisation Concepts

• Enable resource sharing by decoupling execution environments from physical hardware
• Full system virtualisation – run multiple operating systems on a single physical host
  • Desirable for data centres and cloud hosting environments
  • Benefits for flexible management and security – use isolation to protect services
• Virtualise only the resources of interest – lightweight, but imperfect, virtualisation
  • Constrain the ability of a group of processes to access system resources
  • Enhances security within an existing operating system

Full System Virtualisation

• Partition system resources to allow more than one operating system to run on the same physical hardware
• A hypervisor supports the partition, and is the operating system for the underlying hardware
• Guest operating systems are processes that run on the hypervisor
• The hypervisor API presents a virtual machine abstraction – each guest operating system thinks it’s running on real hardware
  • i.e., the hypervisor emulates the hardware – requires CPU support for good performance

[Figure: guest OS instances running above the hypervisor, which runs directly on the hardware]
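To make the "emulates the hardware" point concrete, here is a minimal sketch of a hypervisor's per-virtual-CPU loop. Everything here (vcpu_t, run_guest(), the handler names) is hypothetical scaffolding, not any real hypervisor's API:

    /* Hypothetical types and helpers -- not a real hypervisor API. */
    typedef struct vcpu vcpu_t;                  /* one guest's virtual CPU state */
    typedef enum { TRAP_PRIV_INSN, TRAP_PAGE_FAULT, TRAP_DEVICE_IO } trap_kind_t;
    typedef struct { trap_kind_t kind; unsigned long guest_pc; } trap_t;

    extern trap_t run_guest(vcpu_t *v);          /* enter the guest; returns on a trap */
    extern void emulate_priv_insn(vcpu_t *v, unsigned long pc);
    extern void handle_shadow_page_fault(vcpu_t *v);
    extern void emulate_device_access(vcpu_t *v);

    /* Core loop: run the guest in a less privileged mode; whenever it
     * attempts a privileged operation, the CPU traps back here and the
     * hypervisor emulates the operation against the guest's virtual state. */
    void vcpu_loop(vcpu_t *v)
    {
        for (;;) {
            trap_t t = run_guest(v);
            switch (t.kind) {
            case TRAP_PRIV_INSN:                 /* e.g. guest reloaded a control register */
                emulate_priv_insn(v, t.guest_pc);
                break;
            case TRAP_PAGE_FAULT:                /* keep shadow page tables consistent */
                handle_shadow_page_fault(v);
                break;
            case TRAP_DEVICE_IO:                 /* emulate the virtual device */
                emulate_device_access(v);
                break;
            }
        }
    }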

• First implemented on the IBM System/360 mainframe in the mid-1960s
• Popular current implementations: Xen, VMWare, QEMU, VirtualBox

[Image: IBM System/360 (Wikipedia)]

Types of Hypervisor

• Type 1 (“bare metal”) hypervisor:
  • The hypervisor is the operating system for the underlying hardware
  • It requires its own device drivers, and has to be ported to new hardware platforms
  • Example: Xen
• Type 2 (“hosted”) hypervisor:
  • The hypervisor is an application running on an existing operating system
  • Examples: VMWare, VirtualBox

• Type 1 hypervisors have potentially lower overhead, but are more complex to implement and less portable
• Type 1 hypervisors can also make use of hardware virtualisation support, to offer better performance

Hardware Virtualisation

• Two approaches to virtualising the hardware: complete virtualisation and paravirtualisation
• Complete virtualisation:
  • Allow the guest OS to run unchanged, performing any operations it desires – it is entirely untrusted
  • Processors have protection rings that control which instructions may be performed – an attempt to execute a privileged instruction from a higher ring traps to the kernel for a permission check
  • Full virtualisation requires either:
    • The ability for the hypervisor to run in a more privileged mode than the guest OS, allowing it to trap and emulate instructions that would expose the virtualisation: not possible on older x86 processors, but newer processors allow this
    • The ability for the hypervisor to trap access to sensitive instructions, and dynamically rewrite the machine code to trap to the hypervisor rather than execute an unsafe instruction – possible, but expensive, on all x86 processors
• Paravirtualisation – provide a virtual machine that is similar, but not identical, to the underlying hardware
  • Prohibit the guest OS from performing operations that are difficult to virtualise
  • Provide a more convenient API for such operations
  • Requires cooperation from the guest OS – not for security, but for performance

[Image: x86 protection rings (Wikipedia)]
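To make the contrast concrete, here is a sketch of the same operation – installing a new page-table base – under each approach. The function and hypercall names are hypothetical, not a real hypervisor API: a fully virtualised guest simply executes the privileged instruction and relies on the hypervisor to trap and emulate it, while a paravirtualised guest is ported to request the operation explicitly.

    #include <stdint.h>

    /* Full virtualisation: the unmodified guest executes the privileged
     * instruction; on a CPU with hardware virtualisation support this
     * traps to the hypervisor, which emulates it against the guest's
     * virtual state. The guest never notices. */
    void guest_set_page_table_full(uint64_t pt_base)
    {
        __asm__ volatile ("mov %0, %%cr3" :: "r"(pt_base) : "memory");
    }

    /* Paravirtualisation: the guest is modified to call the hypervisor
     * directly instead of touching privileged state, so no trap-and-
     * decode machinery is needed. (Hypothetical hypercall.) */
    extern long hypercall_set_page_table(uint64_t pt_base);

    void guest_set_page_table_para(uint64_t pt_base)
    {
        hypercall_set_page_table(pt_base);
    }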

Xen

• A modern type 1 hypervisor for x86
• Supports two types of virtualisation:
  • Full virtualisation – needs hardware support
    • Older x86 CPUs had no way of trapping supervisor mode instruction calls, to allow efficient emulation by a hypervisor; modern x86 CPUs do provide this
  • Paravirtualisation
    • Guest operating system is ported to a new system architecture, which looks like x86 with problematic features adapted to ease virtualisation
    • The operating system knows it is running in a VM
    • Processes running within the OS cannot tell that virtualisation is in use
• Control operations delegated to a privileged guest operating system
  • Known as Domain0
  • Provides policy and controls the hypervisor

[First page of P. Barham et al., “Xen and the Art of Virtualization”, SOSP 2003, shown on the slide]

Paravirtualisation

• Guest operating systems must be ported to Xen

• The Xen hypervisor provides an API to mediate hardware access by guest operating systems:
  • Memory management
  • CPU
    • Privileged CPU instructions
    • Exception and interrupt handling
  • Device I/O
    • Access to network, disk, etc., hardware
• Required changes to guest systems are small
  • Guest OSes must be modified to run at a lower privilege level
  • Porting guests to an API modified to allow efficient virtualisation allows significant performance benefits
  • Simplifies hypervisor design for x86, where the processor architecture makes full virtualisation expensive
• The user-space API of guest operating systems is unchanged

Table 1: The paravirtualized x86 interface (Barham et al.)
  Segmentation – Cannot install fully-privileged segment descriptors; cannot overlap with the top end of the linear address space
  Paging – Guest OS has direct read access to hardware page tables, but updates are batched and validated by the hypervisor; a domain may be allocated discontiguous machine pages
  Protection – Guest OS must run at a lower privilege level than Xen
  Exceptions – Guest OS must register a descriptor table for exception handlers with Xen; aside from page faults, the handlers remain the same
  System calls – Guest OS may install a ‘fast’ handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirecting through Xen on every call
  Interrupts – Hardware interrupts are replaced with a lightweight event system
  Time – Each guest OS has a timer interface and is aware of both ‘real’ and ‘virtual’ time
  Device I/O – Virtual devices are elegant and simple to access; data is transferred using asynchronous I/O rings; an event mechanism replaces hardware interrupts for notifications

Table 2: The simplicity of porting commodity OSes to Xen – lines of reasonably commented and formatted code modified or added, compared with the original x86 code base (excluding device drivers)
                                    Linux    Windows XP
  Architecture-independent             78          1299
  Virtual network driver              484             –
  Virtual block-device driver        1070             –
  Xen-specific (non-driver)          1363          3321
  Total                              2995          4620
  (Portion of total x86 code base)  1.36%         0.04%
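The paper's example of this API in action is a batched set of page-table updates, validated and applied by the hypervisor in a single trap. A minimal sketch of that batching idea follows; the type and function names are illustrative, not the exact Xen hypercall interface.

    #include <stdint.h>
    #include <stddef.h>

    /* One queued update: which page-table entry, and its new value. */
    typedef struct {
        uint64_t ptr;   /* machine address of the page-table entry */
        uint64_t val;   /* new value for that entry */
    } mmu_update_t;

    /* Trap into the hypervisor, which validates each update (e.g. the
     * domain may only map pages it owns) before applying it. */
    extern long hypercall_mmu_update(const mmu_update_t *req, size_t count);

    #define BATCH_MAX 32
    static mmu_update_t batch[BATCH_MAX];
    static size_t batch_len;

    /* Flush the queue: one hypervisor entry amortised over many updates. */
    void flush_pte_updates(void)
    {
        if (batch_len > 0) {
            hypercall_mmu_update(batch, batch_len);
            batch_len = 0;
        }
    }

    /* The guest OS queues an update instead of writing the PTE directly,
     * since it has only read access to the hardware page tables. */
    void queue_pte_update(uint64_t pte_machine_addr, uint64_t new_val)
    {
        batch[batch_len++] = (mmu_update_t){ pte_machine_addr, new_val };
        if (batch_len == BATCH_MAX)
            flush_pte_updates();
    }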
Control and Management

• Xen provides mechanisms, but does not define policy
  • Virtualisation primitives are well understood and stable
  • Control mechanisms and policies evolve to match business models
  • Separation of mechanism and policy allows these to develop at separate rates
• Interactions via a mix of hypercalls and message passing
  • Hypercalls allow a guest operating system to call into the hypervisor
    • Implement the Domain0 control interface
    • Implement the paravirtualisation interface
  • Message passing allows the hypervisor to queue events to be processed by a guest
    • Callbacks to a guest, to indicate completion of a request or receipt of asynchronous data

[Figure 1 of Barham et al.: the structure of a machine running the Xen hypervisor, hosting a number of different guest operating systems, including Domain0 running control software in a XenoLinux environment]
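As a concrete illustration of the message-passing side, here is a sketch of event delivery via a shared, per-domain bitmask with a guest-registered callback, in the style the Xen paper describes. The structure and names are illustrative, not Xen's real shared-info layout.

    #include <stdint.h>

    #define EVT_NET_RX    0   /* new data received on the network */
    #define EVT_BLK_DONE  1   /* a virtual disk request has completed */
    #define EVT_SHUTDOWN  2   /* domain-termination request */

    struct shared_info {
        volatile uint32_t pending;    /* one bit per event type, set by Xen */
        volatile uint32_t defer_all;  /* guest sets this to hold off callbacks,
                                         analogous to disabling interrupts */
    };

    /* Hypervisor side: flag the event, then upcall into the guest's
     * registered handler unless the guest is deferring events. */
    void deliver_event(struct shared_info *si, int evt,
                       void (*guest_callback)(struct shared_info *))
    {
        si->pending |= 1u << evt;
        if (!si->defer_all)
            guest_callback(si);
    }

    /* Guest side: the callback resets the pending set, then responds
     * to each flagged event type. */
    void guest_event_callback(struct shared_info *si)
    {
        uint32_t events = si->pending;
        si->pending = 0;

        if (events & (1u << EVT_NET_RX))   { /* process received packets */ }
        if (events & (1u << EVT_BLK_DONE)) { /* complete the disk request */ }
        if (events & (1u << EVT_SHUTDOWN)) { /* begin an orderly shutdown */ }
    }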
Imperfect Virtualisation: Jails & Containers

• A full virtual machine is effective, but heavy-weight
• Alternatives offer lower cost, but less effective, virtualisation:
  • FreeBSD Jails
  • Linux containers – Docker, etc.
• Different implementations of the same basic idea: constrain processes to see a restricted subset of system resources
• Jailed processes run within the existing operating system, and can tell that they are running in a jail

FreeBSD Jail Subsystem

[First page of P.-H. Kamp and R. N. M. Watson, “Jails: Confining the omnipotent root”, shown on the slide]

• Unix has long had chroot()
  • Limits a process to see a subset of the filesystem
  • Used in, e.g., web servers, to prevent buggy code from serving files outside the chosen directory
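A minimal sketch of chroot() confinement, in the style a web server might use; the path and the unprivileged uid/gid are example values, and the program must start as root for chroot(2) to succeed.

    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Restrict this process's view of the filesystem to /var/www. */
        if (chroot("/var/www") == -1) {
            perror("chroot");
            return EXIT_FAILURE;
        }
        /* Move inside the new root so a stale "." cannot escape it. */
        if (chdir("/") == -1) {
            perror("chdir");
            return EXIT_FAILURE;
        }
        /* Drop root privilege before serving requests (65534 = nobody
         * on many systems; an example value). */
        if (setgid(65534) == -1 || setuid(65534) == -1) {
            perror("drop privileges");
            return EXIT_FAILURE;
        }
        /* From here on, only files under /var/www are visible. */
        execl("/bin/sh", "sh", (char *)NULL);  /* resolved inside the chroot */
        perror("execl");
        return EXIT_FAILURE;
    }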

• Jails extend this to also restrict access to other resources for jailed processes:
  • Limits a process to see a subset of the processes (the children of the initially jailed process)
  • Limits a process to see a specific network interface
  • Prevents a process from mounting/unmounting file systems, creating device nodes, loading kernel modules, changing kernel parameters, configuring network interfaces, etc. – even if running with root privilege
• Processes outside the jail can see into the jail – from outside, jailed processes look like regular processes
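For comparison with plain chroot(), a sketch of entering a jail with the original FreeBSD 4.0-era jail(2) call; the struct jail layout shown (version, path, hostname, ip_number) matches that early API but changed in later FreeBSD releases, and the path, hostname, and address are example values.

    #include <sys/types.h>
    #include <sys/param.h>
    #include <sys/jail.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        struct jail j = {
            .version   = 0,
            .path      = "/usr/jails/www",    /* jail's filesystem root */
            .hostname  = "www.example.org",   /* hostname seen inside the jail */
            .ip_number = ntohl(inet_addr("192.0.2.10")),  /* the jail's IPv4 address */
        };

        /* Enter the jail: a chroot to j.path plus restrictions on process
         * visibility, networking, and privileged operations. */
        if (jail(&j) == -1) {
            perror("jail");
            return EXIT_FAILURE;
        }

        /* Even a root-owned process is now confined to the jail. */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        return EXIT_FAILURE;
    }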

Benefits of Jails

• Extremely lightweight implementation
  • Processes in the jail are regular Unix processes, with some additional privilege checks
  • FreeBSD implementation added/modified ~1,000 lines of code in total
  • Contents of the jail are limited: it’s not a full operating system
• Straightforward system administration
  • Few new concepts: no need to learn how to operate the hypervisor
  • Visibility into the jail eases debugging and administration; no new tools
• Portable
  • No separate hypervisor, with its own device drivers, needed
  • If the base system runs, so do the jails

Disadvantages of Jails

• Virtualisation is imperfect
  • Full virtual machines can provide stronger isolation and performance guarantees – processes in a jail can compete for resources with other processes in the system, and observe the effects of that competition
  • A process can discover that it’s running in a jail – a process in a VM cannot determine that it’s in a VM
• Jails are tied to the underlying operating system
  • A virtual machine can be paused, migrated to a different physical system, and resumed, without processes in the VM being aware – a jail cannot
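The detectability point is easy to demonstrate on FreeBSD, which exposes a security.jail.jailed sysctl reporting whether the calling process is jailed; a short sketch:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int main(void)
    {
        int jailed = 0;
        size_t len = sizeof(jailed);

        /* Ask the kernel whether this process is inside a jail. */
        if (sysctlbyname("security.jail.jailed", &jailed, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return 1;
        }
        printf("running %s a jail\n", jailed ? "inside" : "outside");
        return 0;
    }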

Further Reading

• P. Barham et al., “Xen and the art of virtualization”, Proceedings of the ACM Symposium on Operating Systems Principles, October 2003. DOI:10.1145/945445.945462
  • Trade-offs of paravirtualisation vs. full virtualisation?
  • What needs to be done to port an OS to Xen?
  • Is paravirtualisation worthwhile, when compared to full system virtualisation?

• P.-H. Kamp and R. Watson, “Jails: Confining the omnipotent root”, Proceedings of the System Administration and Network Engineering Conference, May 2000. http://www.sane.nl/events/sane2000/papers/kamp.pdf
  • Trade-offs vs. complete system virtualisation?
  • Overheads vs. flexibility vs. ease of management?
  • Jails vs. Docker vs. …?
