Microkernels Meet Recursive Virtual Machines
Bryan Ford Mike Hibler Jay Lepreau Patrick Tullmann Godmar Back Stephen Clawson
Department of Computer Science, University of Utah Salt Lake City, UT 84112
[email protected] http://www.cs.utah.edu/projects/flux/
Abstract “vertically” by implementing OS functionality in stackable virtual machine monitors, each of which exports a virtual This paper describes a novel approach to providingmod- machine interface compatible with the machine interface ular and extensible operating system functionality and en- on which it runs. Traditionally, virtual machines have been capsulated environments based on a synthesis of micro- implemented on and export existing hardware architectures kernel and virtual machine concepts. We have developed so they can support “naive” operating systems (see Fig- a software-based virtualizable architecture called Fluke ure 1). For example, the most well-known virtual machine that allows recursive virtual machines (virtual machines system, VM/370 [28, 29], provides virtual memory and se- running on other virtual machines) to be implemented ef- curity between multiple concurrent virtual machines, all ficiently by a microkernel running on generic hardware. exporting the IBM S/370 hardware architecture. Further- A complete virtual machine interface is provided at each more, special virtualizable hardware architectures [22, 35] level; efficiency derives from needing to implement only have been proposed, whose design goal is to allow virtual new functionality at each level. This infrastructure allows machines to be stacked much more efficiently. common OS functionality, such as process management, demand paging, fault tolerance, and debugging support, to This paper presents a new approach to OS extensibil- be provided by cleanly modularized, independent, stack- ity which combines both microkernel and virtual machine able virtual machine monitors, implemented as user pro- concepts in one system. We have designed a “virtualiz- cesses. It can also provide uncommon or unique OS fea- able architecture” that does not attempt to emulate an ac- tures, including the above features specialized for particu- tual hardware architecture closely, but is instead designed lar applications’ needs, virtual machines transparently dis- along the lines of a traditional process model and is in- tributed cross-node, or security monitors that allow arbi- tended to be implemented in software by a microkernel. trary untrusted binaries to be executed safely. Our proto- The microkernel runs on the “raw” hardware platform and type implementation of this model indicates that it is prac- exports our software-based virtualizable architecture (see tical to modularize operating systems this way. Some types Figure 2), which we will refer to as a virtualizable pro- of virtual machine layers impose almost no overhead at all, cess or nested process architecture to avoid confusion with while others impose some overhead (typically 0Ð35%), but traditional hardware-based architectures. The virtual ma- only on certain classes of applications. chine monitors designed to run on this software architec- ture, which we call nesters, can efficiently create additional recursive virtual machines or nested processes in which ar- 1 Introduction bitrary applications or other nesters can run. Increasing operating system modularity and extensibil- Although the Fluke architecture does not closely fol- ity without excessively hurting performance is a topic of low a traditional virtual machine architecture, it is de- much ongoing research [5, 9, 18, 36, 40]. Microkernels [4, signed to preserve certain highly useful properties of re- 24] attempt to decompose operating systems “horizontally” cursive virtual machines. These properties are required to by moving traditional kernel functionality into servers run- different degrees by different nesters that take advantage ning in user mode. Recursive virtual machines [23], on of the model. For example, demand paging and check- the other hand, allow operating systems to be decomposed pointing nesters require access to and control over the pro- gram state contained in their children, grandchildren, and This research was supported in part by the Defense Advanced Re- so on, whereas process management and security monitor- search Projects Agency, monitored by the Department of the Army, under contract number DABT63Ð94ÐCÐ0058. The opinions and conclusions ing nesters primarily rely on being able to monitor and con- containedin this documentare those of the authors andshould not be inter- trol IPC-based communication across the boundary sur- preted as representing official views or policies of the U.S. Government. Application App App Nested Process Process Architecture Interface Process Process Checkpoint Operating Nester System Application Kernel App App
Process Process Virtual Machine Demand Paging Application Debug Nester Nester Operating System Virtual Machine Kernel Monitor Process Mgmt Nester Virtual Machine Virtual Machine
Hardware Virtual Machine Monitor Hardware Microkernel Interface Interface
Bare Machine Bare Machine
Figure 1: Traditional virtual machines based on hardware architec- Figure 2: Virtual machines based on an extended architecture imple- tures. Each shaded area is a separate virtual machine, and each virtual mented by a microkernel. The interface between the microkernel and the machine exports the same architecture as the base machine’s architecture. bare machineis a traditional hardware-basedmachine architecture, but the common interface between all the other layers in the system is a software- based nested process architecture. Each shaded area is a separate process. rounding the nested environment. Our microkernel’s API provides these properties effi- ciently in several ways. Address spaces are composed from nel and most do not allow control over physical memory other address spaces using hierarchical memory remapping management, just backing store. Similarly, multiuser se- primitives. For CPU resources, the kernel provides primi- curity mechanisms are not always needed, since most per- tives that support hierarchical scheduling. To allow IPC- sonal computers are dedicated to the use of a single per- based communication to short-circuit the hierarchy safely, son, and even process management and job control features the kernel provides a global capability model that supports may not be needed in single-applicationsystems such as the selective interposition on communication channels. On top proverbial “Internet appliance.” Our system demonstrates of the microkernel API, well-defined IPC interfaces pro- decomposed paging and POSIX process management by vide I/O and resource management functionalityat a higher implementing these traditional kernel functions as optional level than in traditional virtual machines. These higher- nesters which can be used only when needed, and only for level interfaces are more suited to the needs of modern ap- the parts of a system for which they are desired. plications: e.g., they provide file handles instead of device Increasing the scope of existing mechanisms: There I/O registers. are algorithms and software packages available for com- This nested process architecture can be used to apply mon operating systems to provide features such as dis- existing algorithms and techniques in more flexible ways. tributed shared memory (DSM) [10, 32], checkpoint- Some examples we demonstrate in this paper include the ing [11], and security against untrusted applications [52]. following: However, these systems only cleanly support applications Decomposing the kernel: Some features of traditional running in a single logical protection domain. In a nested operating systems are usually so tightly integrated into the process model, any process can create further nested sub- kernel that it is difficult to eliminate them in situations processes which are completely encapsulated within the in which they are not needed. A striking example is de- parent. This design allows DSM, checkpointing, security, mand paging. Although it is often possible to disable it and other mechanisms to be applied just as easily to multi- in particular situations on particular regions (e.g., using process applications or even complete operating environ- POSIX’s mlock()), all of the paging support is still in ments. Our system demonstrates this flexibility by provid- the kernel, occupying memory and increasing system over- ing a checkpointer, implemented as a nester, which can be head. Even systems that support “external pagers,” such as transparently applied to arbitrary domains such as a single Mach, contain considerable paging-related code in the ker- application, a multi-process user environment containing a process manager and multiple applications, or even the en- sections. tire system. Composing OS features: The mechanisms mentioned 2.1 Traditional Virtual Machines above are generally difficult or impossible to combine flex- ibly. One might be able to run an application and check- A virtual machine simulator, such as the Java inter- point it, or to run an untrusted application in a secure en- preter [25], is a program that runs on one hardware ar- vironment, but existing software mechanisms are insuffi- chitecture and implements in software a virtual machine cient to run a checkpointed, untrusted application without conforming to a completely different architecture. In con- implementing a new, specialized program designed to pro- trast, a virtual machine monitor or hypervisor, such as vide both functions. A nested process architecture allows VM/370 [28, 29], is a program that creates one or more one to combine such features by layering the mechanisms, virtual machines exporting the same hardware architecture since the interface between each layer is the same. In Fluke, as the machine it runs on. Hypervisors are typically much for example, a Unix-like environment can be built by run- more efficient than simulators because most of the instruc- ning a process manager within a virtual memory manager, tions in the virtual machine environment can be executed at so that the process manager and all of the processes it con- full speed on the bare hardware, and only “special” instruc- trols are paged. Alternatively, the virtual memory manager tions such as privileged instructions and accesses to I/O can be run within the process manager to provide virtual registers need to be emulated in software. Since the “up- memory to an individual process. per” and “lower” interfaces of a hypervisor are the same, We used micro benchmarks to measure the system’s per- a sufficiently complete hypervisor can even run additional formance in a variety of configurations. These measure- copies of itself, recursively. ments indicate a slowdown of about 0Ð35% per virtual Virtual machines have been used for a variety of pur- machine layer, in contrast to conventional recursive vir- poses including security, fault tolerance, and operating sys- tual machines whose slowdown is 20%Ð100% [7]. Some tem software development [23]. In their heyday, virtual nesters, such as the process manager, do not need to in- machine systems were not driven by modularity issues at terpose on performance-critical interfaces such as memory all, but instead were created to make better use of scarce, allocation or file I/O, and hence take better advantage of expensive hardware resources. For example, organizations the short-circuit communication facilities provided by the often needed to run several applications requiring differ- microkernel architecture. These nesters cause almost no ent operating systems concurrently, and possibly test new slowdown at all. Other nesters, such as the memory man- operating systems, all on only a single mainframe. There- ager and the checkpointer, must interfere more to perform fore, traditional virtual machine systems used (and needed) their function, and therefore cause some slowdown. How- only shallow hierarchies, implementing all required func- ever, even this slowdown is fairly reasonable. Our results tionality in a single hypervisor. As hardware became cheap indicate that, at least for the applications we have tested, and ubiquitous, virtual machines became less common, al- this combined virtual machine/microkernel model indeed though they remain in use in specialized contexts such as provides a practical method of increasing operating system fault tolerance [7], and safe environments for untrusted ap- modularity, flexibility, and power. plications [25]. The rest of this paper is organized as follows: In Sec- In this paper we revive the idea of using virtual machine tion 2 we compare our architecture to related work. We concepts pervasively throughout a system, but for the pur- describe the key principles upon which our work is based pose of enhancing OS modularity, flexibility, and extensi- in Section 3, and our software-based virtualizable architec- bility, rather than merely for hardware multiplexing. For ture derived from these principles in Section 4. Section 5 these purposes, virtual machines based on hardware archi- describes the implementation of the example applications tectures have several drawbacks. First, most processor ar- and nesters we designed to take advantage of the nested chitectures allow “sensitive” information such as the cur- process model. Section 6 describes the experiments and re- rent privilege level to leak into user-accessible registers, sults using the example process nesters. Finally, we con- making it impossible for a hypervisor to recreate the under- clude with a short reflective summary. lying architecture faithfullywithout extensive emulation of even unprivileged instructions. Second, a hypervisor’sper- formance worsens exponentially with stacking depth be- 2 Related Work cause each layer must trap and emulate all privileged in- In this section, we first summarize virtual machine con- structions executed in the next higher layer (see Figure 3). cepts and how our system incorporates them; second, Third, since hardware architectures are oblivious to the no- we show how our design relates to conventional process tion of stacking, all communication must be strictly parent- models, and finally, we contrast our system with other child; there is no way to support “short-circuit” commu- microkernel-based systems. We describe related work con- nication between siblings, grandparents and grandchildren, cerning the details of our design later, in the appropriate etc. Application or OS been used in the past to virtualize the activity of a process Hypervisor by interposing special software modules between an ap- Hardware plication and the actual OS on which it is running. This Privileged Instruction form of interposition can be used, for example, to trace system calls or change the process’s view of the file sys- tem [31, 33], or to provide security against an untrusted ap- perform plication [52]. However, these mechanisms can only be ap- plied easily to a single application process and generally
reflect reflect cannot be used in combination (onlyone interpositionmod- perform trap return ule can be used on a given process). Furthermore, although file system access and other system call-based activity can reflect reflect perform trap return be monitored and virtualized this way, it is difficult to vir- tualize other resources such as CPU and memory.
perform trap return (2 traps & returns) (4 traps & returns) 2.3 Other Microkernel Architectures Our nested process model shares many of the same goals Figure 3: Exponential slowdown in traditional recursive virtual ma- as those driving microkernel-based systems: flexibility,
chines caused by emulation and reflection of privileged instructions, ac-