
Are Virtual-Machine Monitors Microkernels Done Right?

Gernot Heiser
National ICT Australia∗ and University of New South Wales
Sydney, Australia
gernot@.com.au

Volkmar Uhlig
IBM T.J. Watson Research Center
Yorktown Heights, NY
vuhlig@us..com

Joshua LeVasseur
University of Karlsruhe, Germany
[email protected]

Abstract

A paper by Hand et al. at the recent HotOS workshop re-examined microkernels and contrasted them to virtual-machine monitors (VMMs). It found that the two kinds of systems share architectural commonalities but also have a number of technical differences which the paper examined. It concluded that VMMs are a special case of microkernels, “microkernels done right”. A closer examination of that paper shows that it contains a number of statements which are poorly justified or even refuted by the literature. While we believe that it is indeed timely to reexamine the merits and issues of microkernels, such an examination needs to be based on facts.

1 Introduction

At the HotOS workshop in June this year, Hand and coauthors presented a paper [HWF+05] titled “Are virtual machine monitors microkernels done right?” The paper compares and contrasts microkernels and virtual-machine monitors (VMMs) as platforms for systems design and implementation. While identifying architectural similarities, it examines the difference in the approaches, and concludes that VMMs are one specific point in the design space, the “right” one. Unstated but implied is the assertion that VMMs such as Xen [BDF+03] are the (to date) only “right” approach to building microkernels.

Taking a closer look at the main assertions made by Hand et al., we find that they are hard to justify, or even squarely at odds with the literature. While we think that reexamining the merits and failures of microkernels is a potentially valuable exercise, we strongly believe that such a discussion must be performed in accordance with established scientific principles, and most of all, be grounded in facts. As a contribution to an informed discussion, we examine the assertions made by Hand et al. in the light of the public record.

2 Background

Before addressing the specific assertions made in Hand et al.’s paper, we provide some (we hope) useful background for the discussion.

∗National ICT Australia is funded by the Australian Government’s Department of Communications, Information Technology, and the Arts and the Australian Research Council through Backing Australia’s Ability and the ICT Research Centre of Excellence programs.

2.1 History Revisited

Microkernels and virtual machine monitors both have a long history, dating back to the early 1970s [BH70, Gol74]. Given that there are significant similarities, it is useful to look at a somewhat narrow definition of both.

Goldberg [Gol74] defines a virtual machine monitor as “[...] software which transforms the single machine interface into the illusion of many. Each of these interfaces (virtual machines) is an efficient replica of the original computer system, complete with all of the processor instructions [...]”.

Liedtke [Lie96] describes the microkernel approach as “... to minimize the kernel and to implement whatever possible outside of the kernel”.

Both definitions appear sufficiently distinct to raise the question of how much commonality there can be. Examining the goals of the two approaches shows that there is more similarity than is evident from the definitions: Goldberg lists software reliability, data security, alternative system APIs, and improved and new mechanisms as benefits; Liedtke lists flexibility and extensibility, fault isolation, maintainability, and restricted interdependencies.

It seems that while VMMs and microkernels share a common set of goals, they take a different approach towards the solution. Yet both approaches consider minimality important. While for microkernels it is a key objective, Goldberg reports it as a result of the system structure: “A key principle in the analysis of software reliability is that the VMM is likely to be correct—i.e., the probability of failure is near zero. This assumption is reasonable because the VMM is likely to be a very small program [...]”.

2.2 Core primitives

In the effort to minimise kernel functionality, microkernels offer a minimal set of abstractions with a central primitive for extensibility: inter-process communication (IPC). In a microkernel, IPC serves three primary purposes:

1. IPC is the mechanism for kernel-controlled change of execution flow between protection domains;

2. IPC is the mechanism for kernel-controlled data transfer between protection domains;

3. IPC is the mechanism for resource delegation between protection domains that requires mutual agreement between multiple (potentially distrusting) parties.

Combining these three orthogonal operations into a single primitive reduces the number of security mechanisms, reduces the code complexity, and reduces the code size. A smaller code base reduces the number of errors in the privileged kernel, as well as reducing the cache footprint. An obvious key requirement for any microkernel is thus a low-overhead IPC primitive. All other operations that require a combination of the three mechanisms can be implemented via the single IPC primitive.

VMMs, in comparison, closely resemble processor hardware and offer a rich variety of primitives. Each primitive requires a dedicated set of security mechanisms, resources, and kernel code. A comprehensive list is beyond the scope of this paper, thus we only list the common subset of primitives that can be found in most VMMs:

1. synchronous switch of protection domain from guest user to guest kernel;

2. synchronous switch of protection domain from guest kernel to guest user;

3. asynchronous communication channels across domains (virtual machine (VM) to virtual machine);

4. resource allocation per VM via VMM hypercall interface;

5. resource allocation within the VM (e.g., via hardware page-table virtualisation);

6. resource re-allocation (e.g., via page flipping);

7. page-fault and exception handling via exception virtualisation;

8. asynchronous event notification across domains via virtual-interrupt signalling mechanism;

9. hardware interrupt notification via virtualized interrupt controller;

10. a set of common devices, such as NIC and disk.

The interfaces provided by the VMM have an intriguing benefit for an important class of highly complex software: existing operating systems. Available operating systems already program to the interface provided by the hardware and resembled by the VMM. Thus existing operating systems require no or only minimal changes to run on a VMM, whereas adaptation to the microkernel primitives often requires significant modifications. However, this benefit is being eroded by the increasing divergence of VMMs from pure virtualisation (faithful representation of the underlying hardware) to paravirtualisation (representation of modified hardware that lends itself better to efficient support of legacy OSes).

The diversity of interfaces also leads to structural compromises, such as centralized super-VMs that combine and colocate significant critical system functionality. Such a structure potentially decreases overall reliability and poses the risk of a single point of failure. This problem becomes even more pronounced if this super-VM runs a legacy system and thus re-introduces a large number of software bugs [CYC+01].

For extensions that are not an existing operating system, the VMM’s interfaces significantly increase the complexity of software design. As, per definition, a VMM presents an interface that is close to the underlying architecture, software developed for one VMM is inherently unportable across architectures. In contrast, a microkernel abstracts and hides the peculiarities of the hardware platform behind its common set of abstractions. For example, software that is written for an L4 microkernel [Lie95] naturally runs on nine different processor platforms, from embedded devices such as ARM, to desktop and small servers such as , up to large multiprocessor PowerPC and machines. Hence, it is possible to leverage and reuse system components across a wide variety of hardware platforms, thereby minimising porting and maintenance overhead.

3 Architectural Lessons

Now we reexamine the architectural lessons presented by Hand et al. in detail, following the headings of their paper, and clarify the role of microkernels.

3.1 Avoid Liability Inversion

The paper states that moving system services out of the kernel relaxes the dependability boundaries within the system. Applications and even the kernel depend on user-level code. This situation is called liability inversion, and an example from [YTR+87] is used to argue that “inelegant” mechanisms are required to ensure correct system operation as a consequence of the “kernel abdicating its liability”. It is further argued that one of the principal design guidelines of Xen was to avoid liability inversion.

At the workshop, Butler Lampson was quick to point out that this liability inversion is in fact an issue in Xen as well. An example of this is actually given in another paper at the same workshop by some of the same authors: the Parallax storage system [WRF+05] essentially uses external pagers to provide file service. While that paper argues that the design avoids liability inversion, Parallax is “providing a critical system service for a set of VMMs”. This is exactly what a user-level server does in a microkernel-based system. The argument is made that a failure of the Parallax server only affects its clients, which is exactly the same situation as if a server fails in an L4-based system. Hence, we fail to see the difference between a VMM and a microkernel in this respect.

Possibly this apparent conflict is a result of a lack of understanding of microkernels (even though this has been thoroughly explained in the literature [Lie96]). The confusion might in fact be the result of an invalid generalisation of a specific example (a particular design fault of Mach) onto a whole class of systems (microkernels).
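The failure-containment argument of Section 3.1 can be made concrete with a small model. The Python sketch below is illustrative only: the component names (`storage_vm`, `file_server`, and so on) and both dependency graphs are hypothetical and are not taken from Xen, Parallax, or L4. It computes which components are affected when a user-level server fails, under the assumption that a failure propagates exactly to the components that transitively depend on the failed one.

```python
from collections import deque


def affected_by_failure(depends_on, failed):
    """Return every component that transitively depends on `failed`."""
    # Invert the "depends on" edges so we can walk from a server to its clients.
    clients_of = {}
    for component, servers in depends_on.items():
        for server in servers:
            clients_of.setdefault(server, set()).add(component)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for client in clients_of.get(current, ()):
            if client not in affected:
                affected.add(client)
                queue.append(client)
    return affected


# Hypothetical VMM-style system: two guests use a storage VM, one does not.
vmm_like = {
    "guest_a": {"storage_vm"},
    "guest_b": {"storage_vm"},
    "guest_c": set(),
    "storage_vm": set(),
}

# Hypothetical microkernel-style system with the same shape:
# a user-level file server and its client tasks.
microkernel_like = {
    "task_a": {"file_server"},
    "task_b": {"file_server"},
    "task_c": set(),
    "file_server": set(),
}

vmm_affected = affected_by_failure(vmm_like, "storage_vm")
microkernel_affected = affected_by_failure(microkernel_like, "file_server")
```

In this model the set of affected components is determined by the dependency structure alone: the two clients of the failed server are affected and the independent component is not, regardless of whether the system is labelled a VMM or a microkernel.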

3.2 Make IPC performance irrelevant

Here Hand et al. argue that, while microkernel designers have spent considerable effort on optimising inter-process communication (IPC) mechanisms, this is irrelevant as it is “not a critical design concern in the construction of high-performance VMMs.” They further argue that IPC between virtual machines is much less frequent and thus not performance critical, as a consequence of the VMM scheduling and protecting complete operating systems.

This is an interesting line of argument, as it is at odds with the reality of Xen-based systems in at least two respects:

• Xen uses a separate virtual machine (called Dom0) to encapsulate legacy device drivers [FHN+04]. Hence, any I/O operation implies at least one round-trip communication between the guest VM and Dom0. The authors call this a “simple asynchronous unidirectional event mechanism”; it is nothing else than a form of asynchronous IPC.

  And performance-critical it is indeed. A recent paper [CG05] examines the CPU overhead of Dom0 drivers under high load, and finds that the CPU load generated by Dom0 accounts for almost all of the CPU load of the system under test! They also find that the Dom0 CPU time is proportional to the number of Xen’s page-flipping operations, that is, message transfers, irrespective of the message size. The clear implication of this data is that IPC costs dominate the driver overhead in Xen systems under high I/O load.

• While it is true that Xen schedules complete operating systems, this does not mean that there is no other interaction with the VMM. In fact, each guest-application exception and system call causes a trap into the VMM, which then invokes corresponding functionality in the guest OS. This is nothing but an IPC operation between the guest application and the guest OS.

  Xen provides a shortcut based on x86’s trap gates that avoids invoking the VMM on guest system calls. However, this shortcut is specifically targeted and limited to Linux’s int 0x80 system-call variant and restricts the use of segments. Protection can only be preserved if all active segment configurations explicitly exclude the VMM kernel. Since x86’s trap mechanism only reloads two of the six segment selectors, the solution is limited; Linux’s latest glibc violates the assumption and renders the shortcut useless.

A Xen-based system performs essentially the same number of IPC operations as a comparable microkernel-based system (such as [HHL+97]).

3.3 Treat the OS as a component

Under this heading, Hand et al. argue that a benefit of VMMs is that they are designed to run complete legacy systems, with familiar programming and development environments, and lending themselves to extensions such as Parallax. The (unstated) implication of such statements has to be that microkernels are somehow not suitable for such use.

This is a really surprising notion, as L4 demonstrated many years ago that it is perfectly suitable as a VMM supporting a paravirtualised Linux system with excellent performance [HHL+97], and the Dresden DROPS system [HBB+98] is built specifically on extending a paravirtualised Linux system running on a microkernel with real-time services and is in industrial use.

Again, we fail to see the claimed “significant difference” between VMMs and microkernels.

4 Conclusions

In summary, the “important differences” between microkernels and VMMs identified by Hand et al. do not seem to hold up to scrutiny. As a consequence, their conclusion “that VMMs are microkernels done right” cannot be inferred from the arguments they presented. Yet, the observation, also made by others [HPHS04], that VMMs and microkernels are closely related, deserves further attention. We believe that a systematic and objective examination of the similarities and differences of microkernels and VMMs is still outstanding, and would make a valuable contribution to OS theory and practice.

References

[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on OS Principles, pages 164–177, Bolton Landing, NY, USA, October 2003.

[BH70] Per Brinch Hansen. The nucleus of a multiprogramming operating system. Communications of the ACM, 13:238–250, 1970.

[CG05] Ludmila Cherkasova and Rob Gardner. Measuring CPU overhead for I/O processing in the Xen virtual machine monitor. In Proceedings of the 2005 USENIX Technical Conference, pages 387–390, Anaheim, CA, USA, April 2005.

[CYC+01] Andy Chou, Jun-Feng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An empirical study of operating systems errors. In Proceedings of the 18th ACM Symposium on OS Principles, pages 73–88, Lake Louise, Alta, Canada, October 2001.

[FHN+04] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. Reconstructing I/O. Technical Report UCAM-CL-TR-596, University of Cambridge, August 2004.

[Gol74] Robert P. Goldberg. Survey of virtual machine research. IEEE Computer, 7(6):34–45, June 1974.

[HBB+98] Hermann Härtig, Robert Baumgartl, Martin Borriss, Claude-Joachim Hamann, Michael Hohmuth, Frank Mehnert, Lars Reuther, Sebastian Schönberg, and Jean Wolter. Drops — OS support for distributed multimedia applications. In Proceedings of the 8th SIGOPS European Workshop, Sintra, Portugal, September 1998.

[HHL+97] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In Proceedings of the 16th ACM Symposium on OS Principles, pages 66–77, St. Malo, France, October 1997.

[HPHS04] Michael Hohmuth, Michael Peter, Hermann Härtig, and Jonathan S. Shapiro. Reducing TCB size by using untrusted components — small kernels versus virtual-machine monitors. In Proceedings of the 11th SIGOPS European Workshop, Leuven, Belgium, September 2004.

[HWF+05] Steven Hand, Andrew Warfield, Keir Fraser, Evangelos Kotsovinos, and Dan Magenheimer. Are virtual machine monitors microkernels done right? In Proceedings of the 10th Workshop on Hot Topics in Operating Systems, Santa Fe, NM, USA, June 2005. USENIX.

[Lie95] Jochen Liedtke. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Technical Report 933, GMD SET-RS, Schloß Birlinghoven, 53754 Sankt Augustin, Germany, November 1995.

[Lie96] Jochen Liedtke. Towards real microkernels. Communications of the ACM, 39(9):70–77, September 1996.

[WRF+05] Andrew Warfield, Russ Ross, Keir Fraser, Christian Limpach, and Steven Hand. Parallax: Managing storage for a million machines. In Proceedings of the 10th Workshop on Hot Topics in Operating Systems, Santa Fe, NM, USA, June 2005. USENIX.

[YTR+87] Michael Young, Avadis Tevanian, Richard Rashid, David Golub, Jeffrey Eppinger, Jonathan Chew, William Bolosky, David Black, and Robert Baron. The duality of memory and communication in the implementation of a multiprocessor operating system. In Proceedings of the 11th ACM Symposium on OS Principles, pages 63–76, 1987.
