Technische Universität Berlin

Master’s Thesis

Chair for Security in Telecommunications (SecT)
School IV — Electrical Engineering and Computer Science
Technische Universität Berlin

Improving Virtualization Support in the Fiasco.OC Microkernel

Author: Julius Werner (#310341)
Major: Computer Science (Informatik)
Email: [email protected]

Primary Advisor: Prof. Dr. Jean-Pierre Seifert
Secondary Advisor: Prof. Dr. Hans-Ulrich Heiß
Tutor: Dipl.-Inf. Matthias Lange

April – August 2012

Abstract

This thesis aims to improve the memory virtualization efficiency of the Fiasco.OC microkernel on processors with the Intel VMX hardware-assisted virtualization architecture, which is plagued by severe performance degradation. After discussing possible approaches, this goal is accomplished by implementing support for Intel’s nested paging mechanism (EPT). Fiasco.OC’s memory management system is enhanced to allow a new kind of task whose address space is stored in the EPT page attribute format while staying compatible with the existing shared memory mapping interface through dynamic page attribute translation. This enhancement is complemented with support for tagged TLB entries (VPIDs), independently providing another minor performance increase. The kernel’s external virtualization API and the experimental userland VMM Karma are adapted to support these changes. The final implementation is evaluated and analyzed in the context of several benchmarks, achieving near-native performance on real-world workloads while only constituting a minor increase in complexity for the microkernel.

Zusammenfassung

Diese Arbeit zielt darauf ab, die Speichervirtualisierungseffizienz des Fiasco.OC Mikrokerns auf Prozessoren mit der hardwareunterstützten Virtualisierungsarchitektur Intel VMX zu verbessern, welche von schweren Performanzverlusten geplagt wird. Nach der Diskussion möglicher Lösungsansätze wird dieses Ziel mit der Unterstützung von Intels Mechanismus für verschachtelte Seitentabellen (EPT) umgesetzt. Fiasco.OCs Speicherverwaltungssystem wird um eine neue Art von Prozess erweitert, dessen Adressraum im EPT-Seitenattributsformat gespeichert wird, der aber durch dynamische Übersetzung der Seitenattribute dennoch kompatibel mit der existierenden Schnittstelle für geteilte Speicherabbildungen bleibt. Diese Verbesserung wird durch die Unterstützung von ausgezeichneten TLB-Einträgen (VPIDs) komplementiert, welche unabhängig davon einen weiteren kleinen Performanzvorteil einbringen. Die externe Virtualisierungs-API des Kerns und der experimentelle Userland-VMM Karma werden angepasst, um diese Änderungen zu unterstützen. Die endgültige Implementation wird im Rahmen mehrerer Benchmarks ausgewertet und analysiert, wobei sie nahezu native Performanz bei realistischen Arbeitspaketen erreicht und nur einen geringen Komplexitätszuwachs für den Mikrokern bedeutet.

Die selbstständige und eigenhändige Ausfertigung versichert an Eides statt:

Berlin, den 30. August 2012 Julius Werner


Fakultät Elektrotechnik und Informatik

Aufgabenstellung für die Masterarbeit

Name des Studenten: Julius Werner
Studiengang: Informatik
Matrikel: 310341

Thema: Verbesserung der Virtualisierungsunterstützung auf dem Fiasco-Mikrokern

Zielstellung:

Die Anforderungen an Betriebssysteme hinsichtlich Verfügbarkeit, Robustheit gegen Angriffe und Unterstützung zahlreicher Legacy-Anwendungen steigen kontinuierlich. Die Implementierung von Virtualisierungsunterstützung auf Mikrokernsystemen konnte diese Anforderungen überzeugend befriedigen.

Fiasco.OC ist ein Mikrokern der dritten Generation und bietet Unterstützung für die Virtualisierungsfunktionen aktueller AMD- (SVM) und Intel-Prozessoren (VT-x). Die aktuelle Implementierung für VT-x leidet jedoch unter Leistungsdefiziten, insbesondere bei speicherintensiven Anwendungen. Karma ist ein VMM, der als Userlevel-Anwendung auf Fiasco.OC läuft. Momentan wird nur Linux als Gastsystem unterstützt.

In dieser Masterarbeit soll untersucht werden, wie die bestehenden Leistungs- und Effizienzdefizite in der derzeitigen VT-x-Implementierung beseitigt werden können. Als Arbeitslast ist dabei ein Linux-System anzunehmen. Die einzelnen Varianten sind dabei hinsichtlich ihrer Auswirkungen auf Leistungsumfang, Entwicklungsaufwand und Kompatibilität zu bestehender Software zu bewerten. Die am besten beurteilte Variante soll implementiert und anhand von Testszenarien bewertet werden.

Verantwortlicher Hochschullehrer: Prof. Dr. Jean-Pierre Seifert

Betreuer: Dipl.-Inf. Matthias Lange
Institut: Security in Telecommunications (FG SecT)

Berlin, 08.05.2012 Unterschrift des verantwortlichen Hochschullehrers

Contents

1 Introduction
  1.1 Goal
  1.2 Structure

2 Background
  2.1 Virtualization
    2.1.1 Hardware-Assisted Processor Virtualization
    2.1.2 Memory Virtualization and Nested Paging
  2.2 Microkernels
    2.2.1 Fiasco.OC

3 Related Work
  3.1 Existing Implementation
    3.1.1 Fiasco.OC
    3.1.2 Karma
    3.1.3 SVM Kernel Shadow Paging
  3.2 Comparable Projects
    3.2.1 L4Ka::Pistachio
    3.2.2 NOVA
    3.2.3 KVM
    3.2.4 Turtles

4 Design
  4.1 General Approach
  4.2 Fiasco.OC
    4.2.1 Handling
    4.2.2 VPID Support
  4.3 Karma

5 Implementation
  5.1 Fiasco.OC
    5.1.1 Feature Reporting and Sanitization
    5.1.2 Page Table Handling
    5.1.3 VPID Support
  5.2 Karma
    5.2.1 Unpaged Mode Emulation

6 Evaluation
  6.1 Measurements
    6.1.1 Microbenchmark: Fibonacci
    6.1.2 Microbenchmark: Forkwait
    6.1.3 Microbenchmark: Sockxmit
    6.1.4 Microbenchmark: Touchmem
    6.1.5 Macrobenchmark: Compiling the Linux Kernel
  6.2 Analysis
  6.3 Complexity

7 Conclusion
  7.1 Future Work

Appendix A: Code

Glossary

Bibliography

List of Figures

1 Comparison of Type I and Type II hypervisor models
2 Comparison of page table setup in nested and shadow paging
3 Simplified depiction of memory hierarchy in an exemplary Fiasco.OC-based system
4 Comparison of traditional and Intel Extended page table entry format
5 Evaluation results from the Fibonacci microbenchmark
6 Evaluation results from the Forkwait microbenchmark
7 Evaluation results from the Sockxmit microbenchmark
8 Evaluation results from the Touchmem microbenchmark
9 Evaluation results from the Linux kernel compilation macrobenchmark

List of Tables

1 Virtual machine feature reporting bit field format
2 Opcode in bits 11 to 0 of VPID flush control VMCS field

1 Introduction

As computer systems keep becoming more prevalent in our daily life, their security and reliability are more important than ever. Be it personal computers, mobile and embedded devices, or industrial and financial information systems: most of today’s computers are connected to hostile networks or otherwise exposed to attacks. In addition to that, the multitude of software and hardware components from different vendors with varying quality assurance concepts that work together in a system form a large surface for possible vulnerabilities or defects. Although security and reliability are always dependent on the whole system stack, the operating system is the most critical component: while a vulnerability or defect in a single application may cause that component to fail or be subverted, good operating systems isolate their components from one another, thus making the difference between a single loss of functionality and breakdown or subversion of the whole system.

Sadly, most mass-market operating systems of today are severely lacking in regard to this isolation. While memory protection between userland processes has long been commonplace, other domains such as the file system allow attacks to easily spread through the system: bound by conventions from a time before malware became prevalent, they often allow every process write access to all resources of its user, regardless of whether it would actually need to do so under normal operation. This clearly violates the principle of least privilege, but the coarse-grained access control lists usually employed by such systems are inherently unsuited to efficiently enforce minimalist permissions on a process level. Therefore, even the most mundane programs often constitute prime targets for attack, as they can be used as a stepping stone to infect more important applications.

Realizing this danger, more and more established operating systems expand their security models to provide tighter isolation on request. An increasingly common strategy for this is sandboxing, ranging from the already more than a decade-old jail mechanism in FreeBSD to the recently added Seatbelt framework in Mac OS X [1]. Systems like these artificially restrict the resources a process may use, effectively expanding the access control list principle to a per-process basis. However, even this level of granularity may not be precise enough with attack scenarios such as the confused deputy problem: service processes with broad access rights can be tricked by malicious requests to use a valid permission for a different purpose than it was intended for. Preventing this kind of attack requires delegating permissions on a per-request level, as it is possible with other security models such as object capabilities. While there are attempts to graft that model onto existing operating systems, such as the Capsicum framework for FreeBSD [2], they will always be encumbered by the fact that they are late and unexpected additions to an existing environment. Both security and performance considerations favor systems that were consistently and thoroughly designed around this model from the start.

Adding to these concerns about userland processes, the kernel itself plays a major role in operating system security. Since all kernel components execute at the highest privilege level by definition, it is impossible to enforce any kind of isolation upon them. Attacks or failures in kernel code can easily

bypass all memory protection and affect any part of the system. The logical solution to mitigate this problem is therefore to keep the amount and complexity of these components as small as possible: many established operating systems keep a vast amount of device drivers and assorted utility functionality as part of a big monolithic kernel, even though the right architecture would allow those to function with limited privileges. Shifting that functionality into separate userland processes until only the absolutely necessary parts remain ultimately leads to the microkernel concept, which has been a well-established computer science research target for decades. Initially suffering from bad performance, more recent advances pioneered by Jochen Liedtke [3] have produced newer implementations that can hold their ground against monolithic systems. Today, fast and flexible microkernels based on modern security models offer userland execution environments with both fine-grained permission controls and a small trusted computing base.

Combining a sound design with competitive performance, modern microkernels have gained a good track record in security-critical specialty applications, such as the Green Hills INTEGRITY-178B real-time operating system used in military avionics [4]. The personal computer market is still dominated by monolithic kernels, however, and all attempts at establishing a microkernel-based general purpose operating system have proven unsuccessful to date. The main reason for this lies in legacy software: the broad spectrum of existing applications on personal computers was written for well-established operating systems and designed to fit their APIs and security models. The interface of most microkernel systems is usually so alien in comparison that porting software to it becomes a complex and time-consuming task.

A promising idea to remedy part of that problem is virtualization: the microkernel system can run one or several instances of traditional operating systems as virtual machines, which provide separate execution environments for unmodified legacy applications. While those instances do not become more secure by themselves, the microkernel can isolate them from each other and tightly regulate communication between them. One can also implement sensitive components with an increased need for security as dedicated applications for the microkernel system and have them provide secure interfaces to the virtual machines, thus gaining isolation where it is important while still avoiding the need to port all components of the system to the microkernel interface. An example of this can be seen in the L4Android project, which proposed to use virtualization to move security-critical mobile payment services out of an untrusted smartphone operating system [5].

1.1 Goal

Fiasco.OC is one of these modern microkernels. It has already accrued almost a decade of history in virtualizing legacy operating systems, starting with the L4Linux kernel rehosting project [6]. More recently, it was enhanced with support for true virtual machines after the advent of hardware-assisted virtualization architectures for x86 processors, AMD SVM and Intel VMX. Both are similar but independent instruction set extensions that offer operating systems a hardware interface to facilitate the hosting of virtual machines. However, both architectures initially only focussed on privilege and device isolation, offering no explicit answer for virtualized memory management. As both the host operating system and its virtual machine require access to the processor’s memory translation mechanisms, this forces the virtualization environment to employ shadow paging, a complicated emulation scheme to compensate for the lacking

hardware. While that provides a functionally correct solution to the problem, it requires frequent interruptions to the virtual machine’s execution, severely degrading its performance. Recognizing this shortcoming, both processor manufacturers later added optional extensions to their virtualization architectures that provided separate memory translation features for the virtual machine and its host. This technique is called nested paging and circumvents the problem altogether, removing the tightest architectural bottleneck for x86 virtual machines and permitting them to execute at near-native performance.

Completing Fiasco.OC to form a full-featured virtualization environment, the experimental virtual machine manager Karma aims to provide a lightweight and flexible platform for further research. Its initial design was focussed on providing fast and secure I/O virtualization, and sidestepped the memory problem altogether by relying solely on nested paging with the AMD SVM technology. After initial successes, an alternative shadow paging implementation was added to broaden its application spectrum, as not all processors include the nested paging extension. This in turn allowed the program to be ported to Intel’s VMX architecture, since the largely software-controlled shadow paging mechanism works almost completely independent from the processor’s hardware-assisted virtualization interface. The usage of Intel’s own nested paging hardware, however, has remained impossible to date, since some major differences to AMD’s approach make it particularly unfit for Fiasco.OC’s internal memory architecture.

As a result of these problems, Fiasco.OC’s virtualization performance on Intel processors continues to be disappointing. While AMD systems with nested paging have always achieved near-native performance without effort, Karma’s still very basic VMX implementation performs poorly even when compared to other shadow paging virtualization environments and has no means to use high-end processors to their full potential.

The goal of this thesis is therefore to design and implement ways to improve Fiasco.OC’s virtualization performance for VMX processors, aiming to achieve near-native performance comparable to the current SVM implementation with nested paging. An obvious path to reach that goal would be to somehow integrate support for Intel’s own nested paging mechanism in Fiasco.OC, but this may require far-reaching changes to its internal memory management code in order to overcome the aforementioned incompatibility. Recent research on SVM systems also suggests that a variety of optimizations can dramatically improve shadow paging performance in Fiasco.OC, to the point where it is reasonably competitive on real-world workloads — a possible alternative solution that should be analyzed and weighed against the first approach. Additionally, some other optional features in Intel’s VMX architecture might still be unused in the current implementation, providing hitherto untapped potential for minor performance increases.

1.2 Structure

Leading up to this goal, chapter 2 starts by recapitulating some of the fundamental concepts touched on in this thesis in further detail. It outlines the technical intricacies of virtualization and revisits basics of the microkernel concept as it is realized in Fiasco.OC. The subsequent chapter describes existing implementations in the area, starting with the current state of Fiasco.OC and Karma, and proceeding to present some alternative virtualization environments for comparison.


Chapter 4 defines the general approach of the solution with abstract interfaces and discusses several design choices. The actual implementation including specific techniques to solve some of the finer problems is described in chapter 5. Finally, the sixth chapter contains evaluation results of several benchmarks performed on the finalized solution and selected comparison targets. The thesis is concluded in chapter 7, which recapitulates its results and outlines possibilities for future work.

2 Background

The established technical basics and principles applied in this thesis are presented in this chapter. It will give a brief overview of the history and definition of virtualization and its associated concepts. Afterwards, the current solutions for hardware-assisted virtualization on x86 processors are described in further detail, focussing on the Intel VMX architecture used in this thesis. In particular, the different techniques to achieve memory virtualization are outlined and compared. The second section gives a broad introduction to the microkernel concept before focussing on the Fiasco.OC kernel, providing a summary of its general concepts and features that are relevant to the thesis. The actual virtualization subsystem is intentionally left out, and will be described in further detail in the next chapter.

2.1 Virtualization

Virtualization describes a technique in which a computer system (called the host) provides one or more software execution environments, which themselves simulate a whole computer system. These are called virtual machines, and they are ideally indistinguishable from an environment provided by physical computer hardware to the software that they execute. They generally run a whole operating system originally intended for a physical machine (called the guest), which includes its own stack of drivers, services and applications. A virtual machine is generally differentiated from an emulator by providing an indistinguishable environment not only in functionality but also in timing: its execution time and latency must not differ significantly from a physical machine, which means that a large majority of its instructions must be executed directly by the physical processor, without being interrupted by traps or exceptions. This requires that the host processor can interpret the guest’s instruction set, which makes cross-platform virtualization generally impossible without specialized techniques such as binary translation. The earliest experiments with virtualization reach back to the 1960s: the IBM/CP-40 research system is generally considered to be the first implementation of true virtual machines [7]. The concept was soon expanded and commercialized with the CP-67 and its successors in the IBM VM operating system family, the latest iterations of which are still used to this day. Thus, virtualization already became an established and notable research topic in the age of mainframes, with many important theoretical models and concepts reaching back to that time. One of its most important theoretical pioneers was Robert P. Goldberg, who consolidated most of that early research into his PhD thesis in 1973 [8]. It establishes many virtual machine patterns and design principles that are still valid today. Using slightly different terminology, he defines a virtual machine in a similar manner as above:


A virtual computer system is a hardware-software duplicate of a real existing computer system in which a statistically dominant subset of the virtual processor’s instructions execute on the host processor in native mode.

He further introduces a formal model that abstracts a virtual machine as a map of virtual to physical resources (cf. [8] chapter 4.2). The job of the virtualization environment is then to apply this mapping whenever a virtual resource is accessed and perform that operation on the corresponding physical resource while ensuring that any transformations or restrictions set up in the environment are enforced. The term virtual resource is left intentionally broad to make the model as general as possible, but practical examples usually fall into the following three groups: virtual processors, virtual memory, and virtual I/O devices.

Virtualizing a processor is not as simple as executing guest instructions as if they were just another process in the host operating system. It must behave like a full-featured physical processor in every way, which includes delivering interrupts and being able to switch between different privilege modes. These features are usually not just mapped directly onto a physical processor and instead carefully emulated by the virtualization environment. In fact, a virtual processor does not necessarily have a definitive physical counterpart — the host system might migrate it between different processors as it sees fit and may also multiplex several virtual processors on a single physical one. When interrupting or descheduling the virtual machine, its processor state must be saved similar to a normal task switch, with the distinction that this does not just include the register file but also all special architectural state that may often not be directly accessible to software (such as the automatic blocking of nested non-maskable interrupts on the x86 processor, cf. volume 3, chapter 6.7 of its Developer’s Manual [9]).

A virtual machine’s memory can generally be mapped to any unoccupied physical memory area. If the guest operating system is not specifically aware of the nature of its environment, the host will need to emulate the architecture’s normal method of memory enumeration (on x86, this is usually done with the 0xE820-map that can be queried through software interrupts in 16-bit real address mode). Faithful virtualization may require a predefined starting address (often 0x0), certain memory holes, or pre-populated areas with system information, as defined by the respective architecture. Since the host will generally not be able to reserve exactly those needed physical address ranges for every guest, it must use memory translation mechanisms such as paging to map other areas to them. As guest operating systems will often want to make use of the architecture’s available translation mechanisms themselves, their virtual address spaces will thus be built upon several layers of translation. This is a major problem in memory virtualization, which will be discussed in further detail in section 2.1.2.

Virtual I/O devices come in just as many forms as their physical counterparts. Most architectures even require some devices to always be present, such as timers or interrupt controllers. Others may be optional and will be detected at runtime by the guest operating system, but are still necessary to enhance the virtual machine with useful capabilities such as audio or video output, networking, or persistent storage.
When faithfully emulating real existing devices, the guest accesses them through the same means as it would on a physical machine, such as memory mappings or dedicated I/O ports. However, many physical device interfaces require frequent interaction with small payloads, which performs exceptionally badly in virtualization environments since every emulated interaction requires constant overhead. This has led to the rise of paravirtualization as an alternative technique, where the guest uses device drivers that are specifically aware of the virtualization environment and can batch their accesses in the most efficient way.


Even more so than processors, virtual devices do not need to be backed by a physical counterpart. Small architectural devices like timers and interrupt controllers are generally always emulated completely in software. More prominent devices might also be completely virtual: several virtual machines could be connected through virtual network interfaces, even though the host might not really have a physical network interface card. Others may be multiplexed onto a single physical device, such as virtual machines sharing the same hard drive through partitioning. Of course, there is also the most simple solution of direct pass-through to a physical device, with little or no command translation or sanitization. Although device virtualization is a huge field of its own, it will not be further discussed in this thesis, since its primary goal lies in optimizing memory virtualization.

The host operating system’s software stack that forms the virtualization environment can generally be divided into two sections: the component that directly interacts with the virtual machine and manages its resources is called the Virtual Machine Monitor (VMM). It will handle all interrupts, exceptions and traps that are intercepted from the virtual machine, as well as other events such as privilege mode changes or device interactions. It will either provide the device emulation itself or forward any accesses to the appropriate backends and inject the responses back into the virtual machine. Every virtual machine is usually controlled by its own separate VMM instance, and to further increase isolation these instances often run in unprivileged mode on the host machine. The second component is called the hypervisor and exists only once per host system. It is a central part of the host kernel executing in privileged mode and performs the bookkeeping and scheduling necessary to multiplex all virtual machines on the system. When the VMMs run unprivileged and thus cannot directly perform some necessary operations, the hypervisor also provides an API that allows them to access and manipulate their delegated virtual machine in a secure and isolated manner.

Reaching back to Goldberg’s time (cf. chapter 2.1 of his thesis [8]), hypervisors have traditionally been divided into two classes, as visualized in figure 1: Type I hypervisors make up host operating systems whose sole purpose is forming a virtualization environment. In this case, the hypervisor is the host kernel, although it may still have some rudimentary support for host-level processes which may be used to implement simple device drivers or management interfaces. Other Type I systems opt to run a special virtual machine with extra privileges as an environment for device drivers or user interfaces, among them the well-established virtualization platform Xen [10]. Type II hypervisors such as the Linux module KVM [11], on the other hand, are part of a larger general purpose operating system. They are usually realized as an optional kernel module which hooks into parts of the scheduler and processor driver to allow virtual machines alongside regular processes. These systems can rely on their existing device drivers and user interfaces to support the virtualization environment.

2.1.1 Hardware-Assisted Processor Virtualization

As noted above, virtual processors must provide all features of the corresponding processor architecture. However, guests cannot simply be allowed to access all privilege modes, install their own interrupt tables, and modify settings such as addressing modes or interrupt masking at will — these mechanisms are already occupied by the host operating system, and providing unrestricted access to them would allow the guest to circumvent its isolation. Instead, virtualization environments have to emulate these features to trick guests into believing that they have direct control, when their accesses are actually intercepted by the VMM through traps and memory protection.


Figure 1: Comparison of Type I and Type II hypervisor models

This is only possible on processor architectures where all instructions that are sensitive to these emulated parts of the machine state (i.e. their effects can differ depending on privilege mode, memory translation state, etc.) are also privileged (i.e. they always cause a trap in the lowest privilege mode), as proven by Popek and Goldberg in 1973 [12]. Unfortunately, many architectures do not satisfy this restriction, among them x86 [13]. A well-known example for this is the POPF instruction that modifies the processor’s flags register: it can be executed in all privilege modes, but changes to the interrupt masking flag are silently ignored in lower privilege levels. A VMM thus would have no chance to detect when its guest tries to set this flag and cannot emulate its effect correctly.

Due to x86’s ubiquity, there have been many attempts to circumvent this problem in the past. One approach is extensive paravirtualization (i.e. modifying guest operating systems to specifically support being virtualized) in order to replace all actions that would need to be intercepted by explicit trap instructions. While this is feasible, the required modifications are often tediously numerous, and licensing restrictions might prevent source code modifications altogether. Others have succeeded in using automated binary translation to replace all sensitive instructions with traps at runtime, but this can prove very difficult to master for some guest operating systems and also incurs a performance impact [14].

As the desire to run efficient virtual machines on commodity hardware kept growing throughout the last decade, processor manufacturers acknowledged the need to improve x86’s virtualizability. Intel led the way with the release of their Virtual Machine Extensions (VMX) in 2005 [9], and AMD quickly followed suit with their independently developed Secure Virtual Machine (SVM) technology [15]. Both are x86 instruction set extensions that differ in implementation details but follow the same principle: they add another layer of privilege modes (usually called root and non-root execution) orthogonal to the existing privilege levels, which are thus free for use by the guest operating system. Execution in non-root mode can be configured to enforce the necessary memory and I/O protection even at the highest traditional privilege level, while all instructions sensitive to this new distinction cause a trap into root mode — thus finally making the x86 architecture fully virtualizable according to Popek and Goldberg [12].
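The POPF case mentioned above can be made concrete with a few lines of user-mode code. The following is a minimal sketch, assuming GCC inline assembly on an x86 Linux system with the default I/O privilege level of 0: it attempts to clear the interrupt flag from ring 3 and observes that the write neither traps nor takes effect.

    #include <stdio.h>

    int main(void)
    {
        unsigned long flags;

        /* Read the current flags register; IF (bit 9) is normally set in ring 3. */
        __asm__ volatile("pushf; pop %0" : "=r"(flags));
        printf("IF before POPF: %lu\n", (flags >> 9) & 1);

        /* Try to clear IF and write the flags back with POPF. The instruction
         * executes without any trap, but with IOPL 0 the change to IF is
         * silently dropped -- a VMM relying on traps never sees the attempt. */
        flags &= ~(1UL << 9);
        __asm__ volatile("push %0; popf" : : "r"(flags) : "cc");

        /* Read the flags again: IF is still set. */
        __asm__ volatile("pushf; pop %0" : "=r"(flags));
        printf("IF after POPF:  %lu\n", (flags >> 9) & 1);
        return 0;
    }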


The exact behavior in VMX non-root mode can be tightly configured through the use of a Virtual Machine Control Structure (VMCS) — an opaque data structure whose address and contents are set through the use of special instructions, which allow different processor models to decide on their own what parts they will cache internally. It contains state-save areas for both guest and host, including the hidden parts of processor state that have traditionally been inaccessible. The processor can enter non-root mode and exchange its complete state with those fields in one atomic operation (a virtual machine entry), and perform the reverse process when it encounters an exit condition. These exits are caused by certain instructions (sometimes conditional on the current processor state) that can be configured through bit fields in the VMCS. Other reasons are interrupts or I/O accesses, which can be further differentiated by the specific interrupt or port number through additional bit fields. Interrupts that do not exit are processed normally through the guest’s interrupt descriptor table — in addition to that, the host can also inject fabricated interrupts into the guest (e.g. in order to emulate a virtual device). Finally, VMX also offers special fields to disguise some parts of the processor state without actually changing them, such as the control registers, the timestamp counter or the local APIC. This can be used to alter the guest’s perceived reality, such as hiding the clock ticks used up during a virtual machine exit or pretending to run in unpaged mode even though the control register bit for paging is actually enabled (and enforced by the host).

With these features VMX goes to great lengths to ensure that almost all operations which do not access restricted resources can be executed without VMM intervention. The few missing cases mostly concern instructions that are very rarely used during normal operation or legacy features that are no longer relevant to modern operating systems. One such issue is the old hardware-assisted task switch, which has never been widely established due to portability and performance problems and was only used in niche operating systems (most notably OS/2). A more prominent example is the execution of 16-bit code in real address mode: although long obsolete for actual program execution, it is still a necessary step in the traditional x86 boot process. While the recently introduced Extensible Firmware Interface might eventually supersede this as well, current support in operating systems is mostly experimental, and the old BIOS interface will still stay relevant for many years. Both of these issues can, of course, be solved by VMM emulation (with an associated performance penalty). In the latter case this requires either full-blown instruction-level emulation or a complicated setup built around the venerable virtual 8086 mode, which had originally only been intended for application software.

In order to address this problem (and thus reduce a significant portion of VMM complexity), Intel has extended VMX with a feature called unrestricted guest in their latest processor generations: finally allowing virtual machines to run in real address or unpaged protected mode, this closed the last major hole in VMX’s virtualization capabilities and also allows old 16-bit operating systems like MS-DOS to run without intervention.
However, although the hardware problem has thus been solved, full 16-bit compatibility also requires supporting the interrupt-accessed BIOS software API that is implemented in memory-mapped ROM on physical machines, which can be either pre-placed in the virtual machine’s address space before boot or, once again, emulated through VMM intercepts.
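To illustrate the exit-driven control flow that results from this configuration, the following sketch shows how a hypervisor running in 64-bit root mode might read the basic exit reason after a virtual machine exit and dispatch on it. The handler functions and the struct vm type are hypothetical placeholders; the field encodings and exit-reason numbers are taken from the Intel SDM but should be verified against the manual.

    #include <stdint.h>

    struct vm;                                      /* hypothetical per-VM state   */
    void emulate_cpuid(struct vm *vm);              /* handlers implemented        */
    void block_vcpu_until_interrupt(struct vm *vm); /* elsewhere in the hypervisor */
    void forward_to_device_model(struct vm *vm);
    void handle_nested_page_fault(struct vm *vm);
    void inject_fault_or_stop(struct vm *vm, uint32_t reason);

    /* Thin wrappers around the VMREAD/VMWRITE instructions (root mode only). */
    static inline uint64_t vmread(uint64_t field)
    {
        uint64_t value;
        __asm__ volatile("vmread %1, %0" : "=r"(value) : "r"(field) : "cc");
        return value;
    }

    static inline void vmwrite(uint64_t field, uint64_t value)
    {
        __asm__ volatile("vmwrite %1, %0" : : "r"(field), "r"(value) : "cc");
    }

    /* Selected VMCS field encodings and basic exit reasons (Intel SDM). */
    enum { VM_EXIT_REASON = 0x4402, VM_EXIT_INSN_LEN = 0x440C, GUEST_RIP = 0x681E };
    enum { EXIT_CPUID = 10, EXIT_HLT = 12, EXIT_IO = 30, EXIT_EPT_VIOLATION = 48 };

    void handle_exit(struct vm *vm)
    {
        uint32_t reason = vmread(VM_EXIT_REASON) & 0xffff;   /* basic exit reason */

        switch (reason) {
        case EXIT_CPUID:
            emulate_cpuid(vm);
            /* Skip the intercepted instruction before resuming the guest. */
            vmwrite(GUEST_RIP, vmread(GUEST_RIP) + vmread(VM_EXIT_INSN_LEN));
            break;
        case EXIT_HLT:
            block_vcpu_until_interrupt(vm);   /* wait for an interrupt to inject */
            break;
        case EXIT_IO:
            forward_to_device_model(vm);      /* e.g. hand off to the userland VMM */
            break;
        case EXIT_EPT_VIOLATION:
            handle_nested_page_fault(vm);
            break;
        default:
            inject_fault_or_stop(vm, reason);
        }
    }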


2.1.2 Memory Virtualization and Nested Paging

Even though VMX makes a processor fully virtualizable, efficient memory virtualization still brings its own set of problems. When Popek and Goldberg wrote their historic paper [12], they chose a simplified processor model with an old-style relocation-bounds register as its only form of memory protection. Unfortunately, the far more powerful memory translation mechanisms in modern architectures make them much harder to virtualize, since the guest operating system must be expected to use them to their full potential, indiscriminate of the fact that the host has already set up its own translations.

The x86 architecture uses two translation mechanisms chained together: segmentation is a legacy feature that uses a set of internally cached base address and limit registers, similar to the antiquated relocation-bounds systems. A descriptor table in memory defines the available segments (including attributes and access rights), and userland code can explicitly load them at will. As segmentation comes first in the address translation chain, there is no harm in allowing the guest operating system full control during its execution. Memory protection can be enforced through the latter mechanism, and virtual machine exits will restore all segment registers to known-good values from the VMCS.

Paging, on the other hand, is a much more powerful system and the primary form of translation employed by current operating systems: the whole virtual address space is divided into 4 KiB pages, each of which can be arbitrarily mapped to a corresponding physical page frame. This mapping is defined in the page table, a radix tree with several levels of depth, which each map the remaining most significant bits of a virtual address to the next lower-level table or ultimately to the resulting physical frame address. A single level of the tree itself occupies exactly one 4 KiB frame, which results in two levels each mapping 10 bits of the virtual address through a field of 1024 32-bit wide entries. Later generations of x86 processors introduced a different format with 64-bit wide entries to support larger physical addresses through the Physical Address Extension (PAE) feature and later the full 64-bit addressing mode, resulting in fields of 512 entries each corresponding to 9 bits of the virtual address. Since the amount of memory occupied by translation tables can be significant on large systems, an attribute bit in individual entries can tell the processor to ignore the last translation level and directly map the whole area to a single, large frame. As translation lengths per level differ depending on the paging mode, these superpages are 4 MiB large with 32-bit and only 2 MiB with 64-bit page table entries.

In a virtualization environment, the hypervisor creates a page table to form a virtual address space with the size and layout expected by the guest operating system in an unused area of the host’s physical memory. The virtual machine interprets this as its physical address space, so host virtual addresses essentially become guest physical addresses. During its operation, the guest creates its own page tables that build guest virtual address spaces on top of this guest physical memory. Virtual addresses referenced from guest code therefore need to be translated into guest physical addresses through the guest page table, which (being host virtual addresses) must then be translated again into host physical addresses through the host page table. Unfortunately, traditional x86 processors can only use one page table at any given time.
There is a solution to this problem that can be implemented purely in software: interpreting both page tables as address-to-address translation functions h and g, the corresponding composition h ◦ g is also an address-to-address function and can thus be represented as a page table itself. This is called a shadow page table, which can be constructed on the fly by the VMM after parsing the existing guest and host tables.
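The following sketch makes this composition concrete: to fill in one shadow entry, the VMM walks the guest table (g) and then the host table (h) and installs the combined translation. All names are hypothetical; walk_page_table() stands in for a real software page-table walker, and attribute handling is reduced to a simple intersection of permissions.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t addr_t;
    struct page_table;   /* opaque; the real structure is the hardware radix tree */

    /* Hypothetical helpers: a software page-table walker and an entry writer. */
    bool walk_page_table(const struct page_table *pt, addr_t va, addr_t *pa_out,
                         unsigned *attrs_out);
    void set_entry(struct page_table *pt, addr_t va, addr_t pa, unsigned attrs);

    /*
     * Build one shadow entry for a guest virtual page:
     *   guest virtual --g--> guest physical (= host virtual) --h--> host physical
     * The shadow table maps guest virtual addresses directly to host physical
     * frames, so the hardware can use it as the single active page table.
     */
    bool shadow_map_page(struct page_table *shadow,
                         const struct page_table *guest_pt,   /* g */
                         const struct page_table *host_pt,    /* h */
                         addr_t guest_va)
    {
        addr_t guest_pa, host_pa;
        unsigned g_attrs, h_attrs;

        if (!walk_page_table(guest_pt, guest_va, &guest_pa, &g_attrs))
            return false;   /* guest page fault: reflect it back into the guest */

        if (!walk_page_table(host_pt, guest_pa, &host_pa, &h_attrs))
            return false;   /* host page fault: the VMM must map this frame first */

        /* The effective permissions are the intersection of both mappings. */
        set_entry(shadow, guest_va, host_pa, g_attrs & h_attrs);
        return true;
    }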

Figure 2: Comparison of page table setup in nested and shadow paging

However, every time the guest subsequently changes its paging configuration, the VMM needs to intercept and reflect that change — most notably in guest task switches, which require the whole shadow table to be rebuilt from scratch. On top of that, modifications to single entries in the guest table must also be caught, which generally requires the VMM to intercept all guest page faults. Each of these intercepts naturally requires its own virtual machine exit, incurring a processor-dependent but generally high performance overhead and additional software processing in the hypervisor. While there are several techniques using lazy updates or even paravirtualization to reduce the overall amount of intercepts, shadow paging implementations still always perform notably worse than physical machines on real-world workloads. Ever since the introduction of paged memory this has been the most prominent bottleneck to efficient virtualization.

Even back in the age of mainframes, solutions were proposed (cf. chapter 4.5 of Goldberg’s thesis [8]) and eventually implemented that circumvent the problem completely through the design of special processor hardware that can perform two consecutive address translations for every memory access. While this leads to slightly higher overall access times (as the processor must walk two page tables to find the ultimate physical address), it completely eliminates the software maintenance overhead for memory virtualization: the hypervisor can just load the host page table into this special second-level paging register during virtual machine entries and let the guest deal with page table setup and page fault handling on its own. Figure 2 illustrates address translation using this technique (known as nested paging) and contrasts it with the shadow paging approach.

Although nested paging as a concept had long been known, none of the x86 virtualization extensions initially supported it. AMD was the first to enhance their SVM technology accordingly in 2007. The simple and straightforward implementation just expects a second page table base address to be supplied by the hypervisor, using the same format as page tables for regular host processes. A year later, Intel

released a similar enhancement for VMX called Extended Page Table (EPT). Unfortunately, they decided to completely abandon the traditional x86 page table entry format and devise a new, incompatible set of page attributes that allows slightly finer control over access rights. Since the guest operating system and regular host processes still use the traditional format, this requires host operating systems to carefully differentiate between these use cases when writing page tables.

As memory translation always applies to all referenced addresses, the processor would theoretically have to do a full page table walk on every memory access, which would lead to disastrous performance. Therefore, processor architectures that use paging have always included a special cache for recently used page translations, the Translation Lookaside Buffer (TLB). When page table entries change, the operating system must manually invalidate the corresponding TLB entries, or stale entries could lead to bugs and even circumvent memory protection. x86 processors can invalidate individual entries using the INVLPG instruction and will automatically flush the whole TLB whenever a new page table is loaded. However, while it might have been an insignificant loss to flush all translation entries on every task switch 27 years ago, it can be very harmful today: TLB misses can be a major performance bottleneck and modern processors often have very powerful TLBs to mitigate this factor. Many processes only need to cache a small working set of page entries during the majority of their lifetime, therefore a significant number of translations from previous processes could potentially stay valid until they get scheduled again. In order to avoid letting this potential go to waste, recent x86 processors can tag their TLB entries with a Process Context Identifier (PCID) that uniquely identifies a process. Tagged TLB entries are not automatically flushed and simply lie dormant until the same PCID is loaded into the processor again.

As virtual machines need to cache their translations too, these mechanisms had to be expanded: with EPT, the TLB mostly contains combined mappings that directly translate guest virtual to host physical addresses, but can also cache mere guest physical translations (as they are needed during guest page table walks). When the guest operating system modifies its guest page table entries, its INVLPG calls will invalidate the former kind of mappings as expected. Changes to the host page table need to invalidate both kinds, which can be flushed globally with the new INVEPT instruction. Task switches within the guest can invalidate entries or use PCIDs as normal — however, in order to prevent one virtual machine from using the cached translations of another, virtual machine entries and exits always flush the complete TLB. This mechanism can lead to a similar performance bottleneck as the blanket invalidation on task switches: if a virtual machine gets scheduled, processes a clock tick with no actions, and goes back to a halt state, chances are that it did not touch most of the TLB. The solution to this problem, which Intel added as an extension to VMX, goes analogous to PCIDs: translation entries cached by a virtual machine can be tagged with a Virtual Processor Identifier (VPID) read from the VMCS. These tagged TLB entries get preserved across virtual machine entries and exits, and simply lie dormant when their VPID does not match the current execution context.
Invalidations and flushes only affect entries of the current VPID — the host can also use the new INVVPID instruction to flush other or all VPIDs when necessary.
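The incompatibility between the two entry formats mentioned above can be illustrated with a short sketch that derives an EPT leaf entry from a traditional 64-bit page table entry. The bit positions follow the Intel SDM (present/writable/user in bits 0–2 and the no-execute bit in bit 63 of a traditional entry; read/write/execute in bits 0–2 of an EPT entry), but the helper itself is a hypothetical illustration — it ignores memory-type and accessed/dirty details and is not the attribute translation scheme implemented later in this thesis.

    #include <stdint.h>

    /* Selected bits of a traditional (PAE/64-bit) page table entry. */
    #define PTE_PRESENT   (1ULL << 0)
    #define PTE_WRITABLE  (1ULL << 1)
    #define PTE_USER      (1ULL << 2)
    #define PTE_NX        (1ULL << 63)
    #define PTE_ADDR_MASK 0x000ffffffffff000ULL   /* frame address, bits 51:12 */

    /* Selected bits of an EPT entry. */
    #define EPT_READ      (1ULL << 0)
    #define EPT_WRITE     (1ULL << 1)
    #define EPT_EXEC      (1ULL << 2)

    /* Hypothetical sketch: derive an EPT leaf entry from a traditional entry.
     * Real code additionally has to fill in the EPT memory type and may want
     * to track accessed/dirty state; both are omitted here. */
    static uint64_t pte_to_ept(uint64_t pte)
    {
        uint64_t ept = 0;

        if (!(pte & PTE_PRESENT))
            return 0;                      /* not present: no rights at all */

        ept |= pte & PTE_ADDR_MASK;        /* the frame address layout is shared */
        ept |= EPT_READ;                   /* "present" implies readable         */
        if (pte & PTE_WRITABLE)
            ept |= EPT_WRITE;
        if (!(pte & PTE_NX))
            ept |= EPT_EXEC;               /* EPT expresses execute positively   */
        return ept;
    }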


2.2 Microkernels

The kernel of an operating system designates the most central part that always executes at the highest privilege level. It therefore necessarily contains all functionality whose implementation requires these privileges, such as the core components of process and memory management. Userland applications can use an API of system calls to access this functionality in a secure and isolated manner, usually through execution of a trap instruction. Traditionally, most kernels also contained a lot of functionality that is not strictly dependent on a high privilege level but still convenient to be easily accessed by all applications, such as hardware drivers, file system services, or networking. In the early days of computing, isolation and security were not considered as important as they are today, and there was little interest in moving code away from the kernel.

Distributed computing, however, was a popular research topic in the 1970s, and researchers at the University of Rochester developed an operating system that could be distributed across a large network of machines by making its services accessible through a unified, network-capable inter-process communication scheme [16]. As their work progressed, they even made all kernel services available through special communication ports, until there were no ordinary system calls left except for inter-process communication itself. The system was redesigned and derived from several times and eventually spawned Mach, its portable, UNIX-compatible and fully multiprocessing-capable successor at Carnegie Mellon University. After several years of development on Mach, researchers investigated the idea of moving some of its services out of the kernel: since other applications already accessed them through inter-process communication channels, this was as simple as redirecting one of those channels from kernel-internal handling to a newly designed userland process.

This simple idea formed the basis of what would later be known as a microkernel system: all operating system functionality that does not absolutely require kernel privileges is moved into separate userland processes called servers — their services are made available to other servers and end-user applications through inter-process communication facilities. This results in a fundamental increase in system security and robustness: when a server crashes or becomes unresponsive, it can often simply be restarted without affecting system stability as a whole. Servers can run at reduced privilege modes and with tightly controlled access rights, being only allowed to perform actions that are required to fulfill their purpose and thus implementing the principle of least privilege [17]. This helps trim the amount of code whose corruption could undermine the system as a whole (the trusted computing base [18]) to a minimum, which reduces the chance of security-critical bugs in that code and simplifies its formal verification.

Despite these advantages, the microkernel concept faced severe practical difficulties: the overall performance of Mach and other attempts of its time period could not compete with traditional monolithic kernels. Thinking that the message passing approach was simply too slow for some critical services, developers tried to move selected functionality back into the kernel but could not solve the overall problem. For a time, many researchers lost faith in the concept and believed it to be inherently unsuitable for efficient systems.
More specific benchmarking later showed that the inter-process communication latency was the main culprit behind the performance problems. German operating systems researcher Jochen Liedtke conjectured that this was caused by inefficient implementation rather than fundamental problems. Mach’s powerful inter-process communication mechanisms offered many optional services and features,


Figure 3: Simplified depiction of memory hierarchy in an exemplary Fiasco.OC-based system

such as port access rights and asynchronous delivery through buffered message queues, which bloated the message passing code to unnecessarily large proportions. Liedtke advocated a radical return to the principle of absolute minimalism, offering nothing but fast, dumb, synchronous message passing. He proceeded to implement a demonstration kernel on the x86 architecture whose hand-crafted assembly routines for inter-process communication severely outclassed prior systems, and came close enough to traditional latencies to prove the microkernel concept’s general viability [19].
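From a client’s point of view, this kind of fast, synchronous message passing boils down to a single blocking call. The following sketch shows what such a call looks like with the l4sys C bindings shipped with L4Re; the function names and signatures reflect my reading of those headers and should be treated as assumptions to verify, and the server capability is obtained elsewhere.

    #include <l4/sys/ipc.h>
    #include <l4/sys/utcb.h>
    #include <l4/sys/types.h>

    /* Sketch of a synchronous L4 IPC call: place two words into the UTCB
     * message registers and block until the server has replied. */
    long ping_server(l4_cap_idx_t server, l4_umword_t op, l4_umword_t arg)
    {
        l4_msg_regs_t *mr = l4_utcb_mr();
        mr->mr[0] = op;
        mr->mr[1] = arg;

        /* label 0, two untyped words, no capability items, no flags */
        l4_msgtag_t tag = l4_ipc_call(server, l4_utcb(),
                                      l4_msgtag(0, 2, 0, 0),
                                      L4_IPC_NEVER);
        if (l4_ipc_error(tag, l4_utcb()))
            return -1;                     /* IPC failed (e.g. invalid capability) */

        return mr->mr[0];                  /* first word of the reply */
    }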

2.2.1 Fiasco.OC

Capitalizing on this work, Liedtke proceeded to implement the successful L4 microkernel [3], which gained international acclaim. Initially developed exclusively in x86 assembly, its API was later reimplemented in C++ to prove that it would still be competitive with a platform-independent design. It would go through several revisions and enhancements in the following decade as multiple independent projects forked off its code or created reimplementations, forming a whole family of L4 kernels. The variant discussed in this thesis started off under the name L4/Fiasco, developed at the Technische Universität Dresden as an L4 successor tailored for preemptibility and hard real-time guarantees. It has been under constant development ever since, leading to the current iteration Fiasco.OC [20].

Userland processes in an L4 system are modeled by kernel objects called tasks, which represent isolated protection domains including address space and access rights. They form the environment for thread objects that represent execution contexts including register state and instruction pointer. During boot, the kernel creates a special task called σ0 that contains all available physical memory, except for parts reserved by the kernel itself. All later tasks are initially created empty, and need to explicitly receive memory to be usable: L4 tasks offer map and grant operations to share or transfer their memory to other tasks. These mappings can be transitive and are recorded in a hierarchical mapping database that tracks all memory regions as tree-like lineages with σ0 at their root. Tasks further up the hierarchy always have the right to unilaterally revoke their mappings with an unmap operation, which removes them not only from the specified child task but also recursively revokes all further mappings of that region it might have created.

These three primitives are the only memory management operations offered by the kernel. Any further required functionality must be implemented in userland tasks by employing these concepts. For example, the traditional POSIX malloc() call could be backed by a server task that receives a large mapping from σ0 on boot and offers an interface through which other tasks can request a page from that allotment. Since memory can be mapped to different target virtual address ranges (up to page granularity), the client tasks could receive continuous memory regions even if the server task’s address space is fragmented.
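A small sketch of these map and unmap primitives, as such a server task might use them through the l4sys C bindings, is shown below. The task capabilities are assumed to be held by the caller already; the function names and parameters follow the l4sys API as distributed with L4Re, but their exact signatures are an assumption to be checked against the headers.

    #include <l4/sys/task.h>
    #include <l4/sys/consts.h>
    #include <l4/sys/types.h>

    /* A pager-like server pushes one of its own pages into a client task
     * and can later revoke it again. */
    void share_page(l4_cap_idx_t own_task, l4_cap_idx_t client_task,
                    l4_addr_t local_page, l4_addr_t client_addr)
    {
        /* Describe the page to send: one page, read-write, in our address space. */
        l4_fpage_t fp = l4_fpage(local_page, L4_PAGESHIFT, L4_FPAGE_RW);

        /* Map it into the client task at client_addr (a shared mapping, not a grant). */
        l4_task_map(client_task, own_task, fp, client_addr);
    }

    void revoke_page(l4_cap_idx_t own_task, l4_addr_t local_page)
    {
        /* Recursively remove the mapping from all tasks that received it,
         * but keep our own copy (the unmap operation from the text). */
        l4_fpage_t fp = l4_fpage(local_page, L4_PAGESHIFT, L4_FPAGE_RW);
        l4_task_unmap(own_task, fp, L4_FP_OTHER_SPACES);
    }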


Figure 3 visualizes a very simple memory configuration where most tasks were spawned with a fixed memory region mapped by their parent task — however, the Con task mapped some of its memory into an unrelated task so that it could be used as a shared virtual framebuffer.

Security and isolation in Fiasco.OC are based on the object capability model [21]: a capability is an unforgeable reference to any kind of kernel object, including the aforementioned tasks and threads as well as communication gates necessary to pass messages to other processes. Possession of a reference automatically implies the authority to access it, e.g. mapping memory into a task object or sending a message through a gate to another process. This is an intentional departure from the common model of access control lists that usually requires both an identifier (such as a file path) and a permission entry to access a resource. Tasks in Fiasco.OC hold capabilities in an object address space analogous to their memory address space, and can use the same mapping and granting primitives to pass them to other tasks (thereby sharing or transferring their authority over the referenced object).

Fiasco.OC itself is developed in parallel with its reference userland environment L4Re. This offers applications an assortment of shared libraries that wrap the L4 system call invocation and provide convenience interfaces for the available kernel objects. L4Re also provides reference implementations for required infrastructure programs such as σ0.

It further offers additional convenience features and extensions, such as libraries that facilitate memory management with the L4 interface or provide basic building blocks to create a server application. A small compatibility layer for certain POSIX features is also included.

3 Related Work

This chapter provides a reference frame for the thesis by presenting the current state of the art in hardware-assisted virtualization on microkernels. It starts out by describing the existing implementation of virtual machines in Fiasco.OC and the Karma VMM, which lay the foundation for the following implementation. Afterwards, a number of comparable virtualization projects are presented to provide a broader overview of the field.

3.1 Existing Implementation

3.1.1 Fiasco.OC

The first attempts to run other operating systems on top of Fiasco.OC were made through rehosting. The most prominent example for this is the L4Linux project [6], which began on the original L4 kernel and has since been ported to some of its successors. In this model, the guest kernel is modified to run as a normal userland task in the microkernel-based operating system and act as pager and exception handler for its user processes. Initial attempts required complicated workarounds to emulate some non-trivial guest operating system concepts (such as signalling and critical sections) with the primitives offered by the microkernel.

In order to provide a more suitable interface for task execution in rehosted operating systems, Fiasco.OC was augmented with a software abstraction of virtual processors, called vCPUs [22]: an alternative to traditional threads, they offer the added functionality of being interruptible through external events. A vCPU can asynchronously receive a message while executing, which will save the current execution state and jump to a predefined handler routine. It may even cross address space boundaries to do that, which allows events from guest user processes to be handled in the kernel task and return to the saved execution context afterwards. Event delivery may be voluntarily suspended to ensure that critical sections are atomic within the context of a single vCPU.

Support for true, hardware-assisted virtualization was initially added to Fiasco.OC to be used in combination with an L4Linux-based VMM [23]. In accordance with the microkernel principle, most aspects of virtual machine handling are pushed out of the kernel: the VMM process maintains a dummy VMCS in an extended state-save area of a vCPU. When Fiasco.OC’s scheduler tries to resume that state, it copies and sanitizes the relevant fields into an internal, real VMCS and adds the page table base address of the vCPU’s user task. It will then perform the actual virtual machine entry and, after the eventual exit, return to the VMM with an updated dummy VMCS through the normal vCPU handler mechanism. The downside of this approach is that all virtual machines on a processor share the same physical VMCS, whose guest state area must be completely reinitialized on each virtual machine entry. This effectively nullifies the advantage of VMCS caching facilities integrated into some Intel processors.
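The copy-and-sanitize step of this entry path can be pictured roughly as follows. This is a purely conceptual sketch with entirely hypothetical names, not Fiasco.OC’s actual code: it iterates over the field encodings, takes guest state verbatim from the VMM-maintained software VMCS, and clamps control fields to values the kernel is willing to allow.

    #include <stdint.h>
    #include <stdbool.h>

    struct sw_vmcs;          /* the dummy VMCS the VMM fills in userland        */

    /* Hypothetical helpers provided elsewhere in the kernel. */
    uint64_t sw_vmcs_read(const struct sw_vmcs *s, uint32_t field);
    void     hw_vmcs_write(uint32_t field, uint64_t value);   /* wraps VMWRITE  */
    bool     field_is_guest_state(uint32_t field);
    uint64_t sanitize_control(uint32_t field, uint64_t requested);
    extern const uint32_t vmcs_fields[];
    extern const unsigned vmcs_field_count;

    /* Before resuming a vCPU in VMX non-root mode, copy the guest-state and
     * control fields from the software VMCS into the real, per-processor VMCS.
     * Host state and the page table pointer of the VM task are owned by the
     * kernel alone and set separately. */
    void load_guest_state(const struct sw_vmcs *sw)
    {
        for (unsigned i = 0; i < vmcs_field_count; i++) {
            uint32_t f = vmcs_fields[i];
            uint64_t v = sw_vmcs_read(sw, f);

            if (!field_is_guest_state(f))
                v = sanitize_control(f, v);   /* never trust VMM-supplied controls */
            hw_vmcs_write(f, v);
        }
        /* ...followed by loading host state and executing VMLAUNCH/VMRESUME. */
    }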


3.1.2 Karma

Karma is a lightweight VMM designed to run natively on Fiasco.OC without the need for L4Linux. The original version by Steffen Liebergeld [24] only supported AMD SVM with nested paging, which simplified the memory virtualization problem: it just allocates a new task object for the virtual machine, maps some of its memory into that address space at the locations where it would be expected on a physical computer, and leaves the page table setup to Fiasco.OC. Karma also provides a useful set of virtual devices, including video framebuffer, hard drive, Ethernet network and a serial port. It offers full multiprocessor support through the use of vCPUs and emulation of inter-processor interrupts. In later work, Karma was augmented with shadow paging support to allow its use on less sophisticated processors [25]. In this mode, it uses a second child task to keep track of the guest virtual memory layout, which then provides Fiasco.OC with a page table to use. Due to the theoretical limitations discussed in section 2.1.2, Karma must intercept every guest page fault and completely rebuild the memory map on every guest task switch, which severely limits the performance of this approach. Nevertheless, it provides a working solution for older AMD processors and also allowed Intel VMX to be supported for the first time. The code uses C++ templates to effectively segregate implementation details between the two hardware architectures and the two memory virtualization approaches. Guest operating system specifics are encapsulated in a similar manner, although only Linux is currently supported. The virtual devices do not always emulate existing physical hardware, so that custom paravirtualized device drivers need to be built for the guest. Karma itself needs custom code to bootstrap a specific guest, because it does not provide the full 16-bit execution environment that x86 operating systems expect to start in and instead needs to jump directly into the 32-bit part of the guest kernel. Support for 64-bit systems is currently not included.

3.1.3 SVM Kernel Shadow Paging

Despite its functional correctness, the performance of Karma's shadow paging implementation is severely lacking even when compared to other virtualization systems with the same hardware requirements. A major reason for this is the additional context switch required to get from Fiasco.OC to Karma during virtual machine exit handling. Due to the high number of paging-related intercepts, the idea arose to move the shadow paging handlers into the kernel, avoiding the additional detour through host userland unless the virtual machine exit was device-related. This promises a significant overhead reduction but comes at the cost of moving more complexity into the kernel, which runs contrary to the microkernel concept. The approach was realized by Jan C. Nordholz for the SVM architecture as part of his diploma thesis in 2010 [25]. While the initial move into the kernel already resulted in a major performance gain, a multitude of additional optimizations was implemented to further refine the setup. Some of these required paravirtualization of the guest's paging code, such as explicit, batched notification of page table updates through hypercalls. Even though this complicates portability to other guest operating systems, it notably reduces the amount of virtual machine exits and avoids the need to intercept guest page faults altogether. A final improvement went as far as employing virtual machine specific TLB tagging to differentiate between guest tasks, effectively giving the benefit of PCIDs to guest operating systems even when they are not available on the platform itself. While this technically gives the guest a

performance benefit over native execution, it is still more than offset by the additional delays caused by virtual machine exits themselves, heavy optimizations notwithstanding. The final results of the implementation were quite promising but still a few notches away from native execution or nested paging. The produced code is heavily intertwined with Fiasco.OC's SVM-specific classes, and an analogous implementation for VMX would require significant effort to either completely reimplement it or refactor it into abstract interfaces that could fit both hardware architectures. As of the writing of this thesis, the code has not yet been merged into the Fiasco.OC mainline repository.

3.2 Comparable Projects

3.2.1 L4Ka::Pistachio

After the initial assembly language implementation of L4 was discontinued, several independent projects emerged to reimplement its API. Before the kernel that would later become Fiasco.OC was started in Dresden, Liedtke himself worked with the System Architecture Group at the Universität Karlsruhe (TH) on a project that resulted in L4Ka::Pistachio [26]. While there are subtle differences in its approach compared to Fiasco.OC, mostly related to the latter’s focus on preemptibility to ensure the fast interrupt response times required for real-time systems, both kernels are still closely related and thus ideal for comparison. From its early days, virtualization research was one of L4Ka::Pistachio’s strong points: even before the advent of SVM and VMX, it was used to develop a concept called pre-virtualization, which uses automatic source code modifications to build virtualization support into a guest operating system [27]. A custom compiler fuses part of the VMM into the guest source and replaces virtualization-sensitive instructions with calls into that layer, which can either emulate them internally or batch them into concentrated, efficient hypercalls. Later versions of L4Ka::Pistachio were augmented with basic VMX support in a similar fashion to Fiasco.OC. However, there has been little apparent progress in recent years, as the implementation still depends on shadow paging and cannot use EPT — presumably due to difficulties similar to those encountered in this thesis. L4Ka::Pistachio has also never offered support for SVM, instead choosing to focus solely on technology by its official sponsor Intel. On the other hand, it offers a very sophisticated reference VMM (Marzipan), which still spearheads research in pre-virtualization.

3.2.2 NOVA

NOVA is an experimental microhypervisor developed at the Technische Universität Dresden [28] — an operating system kernel dedicated exclusively to the hosting of virtual machines. Its design focusses on security and small size, and is heavily influenced by third-generation microkernels: in fact, its split of protection domains (tasks) and execution contexts (vCPUs), its use of a hierarchical mapping database to propagate memory and object capabilities between those domains, and its focus on keeping privileged code as small as possible all seem directly derived from modern L4-based kernels. Despite its decent support of basic microkernel features, NOVA is not a general purpose operating system. It is meticulously tailored to its hypervisor role and includes specific virtualization features such as shadow paging right in the kernel. Less performance critical jobs like device virtualization are delegated to the userland VMM Vancouver, which runs in the protection domain of its respective virtual


machine and is isolated from the rest of the system, similar to the relationship between Fiasco.OC and Karma. It communicates with the hypervisor through a platform-independent high-level API, which precludes the use of some exotic implementation-specific features (such as Intel’s model-specific register shadowing) but allows NOVA to keep VMCS implementation details hidden and make use of processor-specific caching support. Vancouver also provides support for guests executing in 16-bit real address mode through the use of a software-based instruction emulator and simulated BIOS functions, which allows NOVA to faithfully virtualize almost any operating system without modifications. As it supports both shadow and nested paging for SVM and VMX, NOVA has to deal with the problem of EPT’s custom page attribute format. Recognizing this requirement from the start allowed its memory management to be built with a clear separation between extended and traditional page tables at every point where it matters. In addition, NOVA always creates two page tables per protection domain by design, a traditional one for the use of host applications like the VMM and a separate instance in whatever format is required by the nested paging hardware.

3.2.3 KVM

As a well-established and well-documented monolithic operating system kernel, Linux provides an appropriate counterpoint to the microkernel-based examples above. Since Linux was already a very mature system long before hardware-assisted virtualization became prevalent, several competing commercial and open-source hypervisor solutions have been developed for it over time. However, the Kernel-based Virtual Machine (KVM [11]) has emerged as the most generally accepted reference implementation and has been integrated into the main kernel source tree. It offers mature and feature-rich support for both SVM and VMX with nested and shadow paging, and can be set up at runtime as a loadable kernel module.

Even though Linux (as a monolithic kernel) runs all of its host device drivers in privileged mode, it still uses a userland VMM component (QEMU) to provide emulated guest devices. As virtual machines do not behave equivalently to normal Linux processes, the regular shared memory APIs are not used to provide their memory. Instead, the VMM (and other processes) can use custom system calls to explicitly create guest memory from their address space. Internally, KVM uses a high-level representation for these memory slots and constructs an explicit nested or shadow page table from that as needed.

3.2.4 Turtles

The IBM Turtles project is a relatively recent experimental enhancement to KVM [29]. It aims to implement effective nested virtualization, i.e. executing another hypervisor that can host virtual machines of its own within a virtualization environment. This feature may become increasingly relevant as more operating systems employ virtualization hardware for everyday use cases, such as recent iterations of Microsoft Windows that can transparently virtualize older versions of themselves as a compatibility feature.

Turtles was designed to work with the existing virtualization architectures of x86 processors, currently only supporting VMX. As the hardware was primarily designed with a single layer of virtualization in mind, additional layers must be collapsed into one through straightforward application of the traditional trap-and-emulate approach: all virtual machine exits are first handled by the topmost hypervisor, which can then delegate them by entering the intermediate guest hypervisors as appropriate.


The lower-level hypervisors control their guests through the usual VMX instructions, but these are transparently trapped and emulated by the top-level host. Guests setting up new virtual machines are intercepted by the host, which will create the nested guest itself using the resources assigned to the intermediate hypervisor. Through this scheme, Turtles can create completely faithful virtualization environments to an arbitrary nesting depth. Even with nested paging support, x86 processors can only walk two page tables for any given memory access. In order to allow virtualized memory at arbitrary nesting depth, Turtles uses techniques analogous to shadow paging to collapse all host level page tables into a single, composite EPT. Only the final leaf guest uses the processor’s traditional paging system and can still modify its page tables without causing virtual machine exits. All intermediate-level hypervisors trigger the same costly intercepts usually associated with shadow paging when making paging-related changes, but this is of reduced importance in practice since most virtualization systems rarely modify their page tables after the initial setup of a virtual machine. During their final evaluation, Turtles’ developers analyzed the traits needed to make a hypervisor particularly suited to be an efficient guest. As all VMX instructions in a guest require emulation through an individual virtual machine exit, limiting the amount of such instructions in frequently executed code is essential for efficient nested virtualization. As Fiasco.OC individually reinitializes all VMCS fields (which can only be accessed through VMX instructions) from the userland dummy VMCS on every virtual machine entry, it is unfortunately particularly unsuited as a guest hypervisor to Turtles and would presumably perform far worse than any virtualization environment tested by the project to date. As an optional optimization, however, Turtles has also experimented with using binary translation to replace the costly VMCS-accessing guest instructions, which might be able to mitigate this disadvantage.

4 Design

The overall design concept for improving Fiasco.OC’s memory virtualization performance on Intel processors is developed in this chapter. It comprises adding support for nested paging and VPIDs, as justified in the first section. The following sections then define the architectural and interface changes that need to be conducted throughout all components of the virtualization stack in order to realize this approach. More details about specific issues and problems encountered during the actual implementation are described in the next chapter.

4.1 General Approach

Both theoretical considerations and existing empirical data make it obvious that the shadow paging implementation in Karma is the most severe performance bottleneck in the current setup: the inherent overhead incurred on page table manipulations, page faults and task switches can easily be demonstrated through microbenchmarks and results in significant performance degradations for real-world workloads [25]. Prior work for SVM has shown that integrating all shadow page table handling into the kernel can lead to significant improvements, and an analogous approach could be taken for VMX. However, adding such a considerable amount of logic to the kernel just to improve performance for niche functionality runs contrary to the microkernel approach, and even the most efficient implementation would still be constrained by the theoretical limits of shadow paging efficiency. Instead, this thesis prefers to aim for the best achievable performance by making use of all possible architectural features, and therefore opts to completely circumvent this bottleneck through the use of nested paging.

The existing nested paging support for SVM processors had been very simple to implement — so simple, in fact, that it had been written even before any of the shadow paging solutions. It consisted of little more than taking the page table base address from a virtual machine task object and loading it as the nested page table pointer into the processor on virtual machine entries. Unfortunately, the implementation for VMX will not be that simple: the page table Fiasco.OC builds and maintains for the virtual machine task uses the traditional x86 format, which is incompatible with EPT. Figure 4 illustrates the different attribute formats for page table entries. Even though some bits have reasonably similar effects in both formats, the memory type values are irreconcilable: the x86 default memory type write back is encoded as the value 6 in the memory type field (bits 5 to 3) of an EPT entry, whereas in the traditional format the cache disable and write through bits in that region would need to be zero to match. It is therefore impossible to just pass a traditional page table as an EPT and hope for the best, which necessitates the ability to create dedicated EPT tasks in Fiasco.OC. This requires in-depth modifications to its address space management and mapping functionality in order to preserve the system's core ability of creating memory mappings between arbitrary tasks.

As a related but independent measure, this thesis also aims to implement support for VPIDs. The equivalent SVM feature called Address Space Identifier (ASID) had already been supported by existing code.


[Figure 4 compares the two entry layouts. Traditional format: physical address in bits 63/31 to 12; ignored bits 11 to 9; G (global) bit 8; PS (page size) bit 7; D (dirty) bit 6; A (accessed) bit 5; CD (cache disable) bit 4; WT (write through) bit 3; U (user accessible) bit 2; W (writeable) bit 1; P (present) bit 0. EPT format: physical address in bits 63 to 12; ignored bits 11 to 8; PS (page size) bit 7; IP (ignore PAT) bit 6; MT (memory type) bits 5 to 3; X (executable) bit 2; W (writeable) bit 1; R (readable) bit 0.]

Figure 4: Comparison of traditional and Intel Extended page table entry format

The tagged TLB entries should give a small but steady performance boost to all virtual machine setups, as regular virtual machine exits for interrupt handling and device emulation are necessary even with nested paging. The effect should be more pronounced in shadow paging environments since they inherently require much more frequent virtual machine exits. However, even though VPID support keeps cached translation entries valid across switches between guest and host, overall TLB efficiency is also impeded by global flushes within these domains. Eliminating those would additionally require the use of PCIDs, which are currently not implemented in Fiasco.OC on the host side and cannot be used by the guest operating system as long as Karma does not support 64-bit addressing mode.

4.2 Fiasco.OC

As both page table handling and executing VMX instructions are necessarily privileged operations, the bulk of functionality required to reach the above-stated goals must be added to the kernel itself. In order to implement the envisioned EPT task object, the existing kernel API must first be amended to allow creation of such objects distinct from tasks with traditional page tables. In Fiasco.OC, userland applications can request new virtual machine task objects by invoking the capability to a factory object in the kernel with the respective opcode. This invocation returns the new task's capability or an error code and traditionally needs no parameters, as the differentiation between VMX and SVM objects is done automatically through runtime processor identification. It is possible to treat this matter the same way and simply return an EPT object whenever an Intel processor could support it. However, it may still be desirable to have the option of using shadow paging even though the processor is EPT capable — if only to allow easier side-by-side testing and evaluation of both paradigms. Therefore, the virtual machine creation interface will instead be changed slightly by adding a new parameter byte that contains flags, the only one defined for now being the nested paging flag in the least significant bit.

In addition to Fiasco.OC itself, this change must also be propagated to the L4Re library that wraps this interface on the userland side, and by extension to all application programs that use it. This is a backwards-incompatible change but it is very simple to conform to, and as Fiasco.OC's virtualization support as a whole is still considered experimental at this point, the few applications using it for now should be expecting to deal with interface revisions.


Bit  L4Re constant    Denotes support for
0    L4_VM_KIP_SVM    AMD Secure Virtual Machine
1    L4_VM_KIP_VMX    Intel Virtual Machine Extensions
4    L4_VM_KIP_NP     Nested paging (both SVM and VMX)
8    L4_VM_KIP_VPID   Virtual Processor Identifier (VMX only)
9    L4_VM_KIP_UG     Unrestricted guest: unpaged and 16-bit real address mode (VMX only)
10   L4_VM_KIP_MSR    Direct guest access to machine-specific registers (VMX only, currently unsupported)
11   L4_VM_KIP_APIC   Hardware support to virtualize local APIC (VMX only, currently unsupported)
12   L4_VM_KIP_PAT    Use guest-specific Page Attribute Table (VMX only)
13   L4_VM_KIP_EFER   Use guest-specific Extended Feature Enables Register (VMX only)
14   L4_VM_KIP_PERF   Use guest-specific Global Performance Counter Control register (VMX only)
15   L4_VM_KIP_TMR    Limit execution time with virtual machine preemption timer (VMX only)
16   L4_VM_KIP_DTE    Virtual machine exit on access to processor descriptor table registers (VMX only)
17   L4_VM_KIP_WBIE   Virtual machine exit on WBINVD instruction (VMX only)
18   L4_VM_KIP_PLE    Virtual machine exit after configurable amount of PAUSE instructions (VMX only)

Table 1: Virtual machine feature reporting bit field format

In order to make use of this new flag, the userland VMM must first be able to tell whether EPT support is available on the current system — information that is only obtainable from a model-specific register requiring privileged access. Fiasco.OC already uses a special kernel interface page (read-only mapped into all address spaces) to make information about certain system features and parameters available to userland applications, which would serve as an ideal location to publish virtual machine feature availability. A new bit field will therefore be added to this page subsequent to the processor information field (i.e. at offset 0x68 for 32-bit and 0xD0 for 64-bit x86 architectures), denoting support of VMX in general as well as EPT, VPID, unrestricted guest and some other minor optional features in particular, as defined by table 1.

Adding a nested paging code path to the actual virtual machine handling routines will be nearly trivial once a pointer to the correct kind of page table is available — however, a few minor adjustments still need to be performed. Some of the existing VMCS sanitization restrictions should be relaxed to allow the guest to make use of its newly gained paging independence: most importantly, loading a new page table base address no longer needs to cause a mandatory virtual machine exit. Additionally, 64-bit host systems can allow virtual machines to execute in both 64-bit and 32-bit mode. If the unrestricted guest feature is available too, even unpaged and 16-bit real address mode guests can be supported.
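To illustrate how a VMM could combine the two interface additions, the following minimal sketch first reads the proposed feature bit field from the kernel interface page (assuming the 32-bit offset 0x68 stated above) and then requests a virtual machine task with the nested paging flag set whenever both VMX and EPT are reported. The function names and the factory wrapper are illustrative placeholders, not the actual L4Re API.

#include <cstdint>

// Feature bits as listed in table 1 and the proposed creation flag.
constexpr std::uint32_t L4_VM_KIP_VMX = 1u << 1;   // Intel Virtual Machine Extensions
constexpr std::uint32_t L4_VM_KIP_NP  = 1u << 4;   // nested paging supported
constexpr std::uint8_t  VM_CREATE_NESTED_PAGING = 1u << 0;  // least significant flag bit

std::uint32_t kip_vm_features(const char *kip)
{
  // 32-bit x86: the new bit field follows the processor information field.
  return *reinterpret_cast<const std::uint32_t *>(kip + 0x68);
}

long create_vm_task(std::uint8_t flags)
{
  // Placeholder: the real code would invoke the kernel's factory capability
  // with the virtual machine opcode and the new flags byte.
  (void)flags;
  return 0;
}

long create_best_vm_task(const char *kip)
{
  std::uint32_t feats = kip_vm_features(kip);
  bool ept = (feats & L4_VM_KIP_VMX) && (feats & L4_VM_KIP_NP);
  // Fall back to a shadow paging task when EPT is not available.
  return create_vm_task(ept ? VM_CREATE_NESTED_PAGING : 0);
}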

4.2.1 Page Table Handling

Fiasco.OC uses two data structures to keep track of memory mappings: one of them is the Mapdb class, which contains an abstract representation of the mapping relations between task objects. It can be queried to determine the parent and child tasks of a virtual memory area in its tree-like mapping hierarchy and is mainly used to recursively revoke transitive mappings during unmap operations. Memory areas are stored at superpage granularity and further broken up into individual pages when required.


The other data structure is the Mem_space class, one instance of which always corresponds to one task object. This is where the actual page table that is read by the processor's paging hardware is kept and maintained. It is controlled by the mapping infrastructure through an abstract API that basically offers insert, lookup and delete operations. Unfortunately, there are still a few platform-specific parameters exposed in this interface: it only operates on areas of the system's page or superpage size, and page table attribute bits are directly manipulated without abstracting from the architecture's representation. On the bright side, the actual page table construction and walking is wrapped by a very convenient template that can easily be adapted for different entry sizes or page walk depths.

When adapting the system to allow EPT tasks, the only true concern is modifying the page table format. A minimally invasive approach can therefore be achieved by completely ignoring the Mapdb class and its surrounding logic, and only adapting Mem_space while keeping its external interfaces unchanged. This means that page attributes are received in the traditional format and must be translated on the fly when writing EPT entries — likewise, read EPT attributes are transparently translated to the traditional format during lookup operations. Another problem is the system-wide superpage size, which can be twice as large as EPT superpages on 32-bit systems. In this case, the class interface will still only accept mappings of the larger size and convert them internally into two EPT-sized superpages, while making sure that both are later treated as a single entity during lookup or delete operations.

As normal tasks and virtual machines with EPT exist side by side on the system, the enhanced Mem_space class must decide at runtime what page table format to use for any given request. There are two possible solutions for this: the most primitive approach would be to add a Boolean attribute to the class which is checked in every function sensitive to this distinction. However, C++ also offers the more advanced technique of polymorphism, which uses a derived class with overloaded methods to encapsulate the alternative code paths. It incurs a minor function call overhead to select the right method at runtime, which is just slightly greater than the cost of checking a Boolean attribute in the first approach. At any rate, only methods directly involved in page table entry manipulation will incur that overhead — the functions needed for a normal task switch, which are by far the most frequently called in the class, stay unaffected. The performance hit to normal operation should therefore be negligible, and the polymorphism concept results in much cleaner code that clearly separates the new additions from the existing base, thus also making it much easier to completely exclude at compile time when necessary. Particularly when taking Fiasco.OC's focus on security into account, readability and ease of maintenance clearly trump negligible performance gains, which is why the EPT handling will be encapsulated in the new derived class Mem_space_ept.
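The following sketch outlines the dynamic attribute translation that such a derived class would perform, based on the entry layouts of figure 4. It is a simplified illustration: the bit constants follow the standard x86 and EPT definitions, present pages are assumed to be readable and executable in the guest-physical space, and none of the identifiers are the actual Fiasco.OC names.

#include <cstdint>

namespace pt {                                    // traditional x86 attribute bits
  constexpr std::uint64_t P  = 1u << 0;           // present
  constexpr std::uint64_t W  = 1u << 1;           // writeable
  constexpr std::uint64_t WT = 1u << 3;           // write through
  constexpr std::uint64_t CD = 1u << 4;           // cache disable
  constexpr std::uint64_t A  = 1u << 5;           // accessed
  constexpr std::uint64_t D  = 1u << 6;           // dirty
}

namespace ept {                                   // EPT attribute bits
  constexpr std::uint64_t R       = 1u << 0;      // readable
  constexpr std::uint64_t W       = 1u << 1;      // writeable
  constexpr std::uint64_t X       = 1u << 2;      // executable
  constexpr std::uint64_t MT_MASK = 7u << 3;      // memory type field, bits 5 to 3
  constexpr std::uint64_t MT_WB   = std::uint64_t(6) << 3;  // write back
  constexpr std::uint64_t MT_UC   = 0;            // uncacheable
}

// Traditional -> EPT, applied when an insert call writes a new entry.
std::uint64_t to_ept_attribs(std::uint64_t attr)
{
  std::uint64_t e = 0;
  if (attr & pt::P)
    e |= ept::R | ept::X;                         // simplification: present implies readable and executable
  if (attr & pt::W)
    e |= ept::W;
  e |= (attr & (pt::CD | pt::WT)) ? ept::MT_UC : ept::MT_WB;
  return e;
}

// EPT -> traditional, applied when a lookup result is returned to the caller.
std::uint64_t to_traditional_attribs(std::uint64_t e)
{
  std::uint64_t attr = 0;
  if (e & ept::R)
    attr |= pt::P;
  if (e & ept::W)
    attr |= pt::W;
  if ((e & ept::MT_MASK) != ept::MT_WB)
    attr |= pt::CD;                               // anything but write back: report caching as disabled
  attr |= pt::A | pt::D;                          // EPT carries no accessed/dirty information
  return attr;
}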

4.2.2 VPID Support

VPIDs can simply be set through a corresponding field in the VMCS. However, each VPID in use must be a per-processor unique identifier for its corresponding virtual machine — if multiple guests were running with the same VPID, they could possibly gain access to each other’s address spaces. As the userland is generally untrusted, the VPID must therefore not be supplied by the VMM like most other VMCS parameters: otherwise, a malicious VMM could intentionally try to guess the VPID of another virtual machine and attempt to sniff or corrupt its memory. To avoid this critical security risk, the kernel itself must generate and assign unique, persistent VPIDs to all virtual machines.


Value  Effect
0      No effect (do not flush any TLB entries during this virtual machine entry).
1      Flush the individual TLB entry with my VPID for the page denoted in bits 63/31 to 12 of this field.
2      Flush all TLB entries with my VPID.
3      Flush all TLB entries with my VPID that do not have the global page attribute.

Table 2: Opcode in bits 11 to 0 of the VPID flush control VMCS field

It would still be an option to allow VMMs to opt out of using VPIDs at all. However, while it is technically possible to run virtual machines with and without VPID on the same system, it leads to unfortunate performance implications: in systems with round robin scheduling, a single guest without VPID is enough to cause a global TLB flush in every scheduling cycle, and thereby destroy the cached translations of all virtual machines that do use VPIDs before they have a chance to reuse them. As a single userland VMM could therefore severely disrupt the benefit of the feature for all virtual machines on the system, and there are few conceivable reasons to do so besides intentional denial of service, the use of VPIDs will instead be automatically enforced whenever the processor supports it. When making modifications to the guest page table, VMMs must make sure that no stale translation entries from before that change are left in the TLB after the following virtual machine entry. Since using VPIDs prevents automatic global flushes, they must be manually invalidated, but the required INVVPID instruction is privileged. For this reason, the kernel will provide a special interface to invoke it using a VMCS pseudo-field at index 0x6018. The virtual machine entry routine in the kernel will not load this field from the dummy VMCS into the processor but instead trigger an individual or global TLB flush for the corresponding VPID as specified in table 2. It will also automatically clear the field after use as a convenience feature, just like the VMX hardware clears the valid bit of the interrupt-information field used for event injection.
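A sketch of how the kernel's virtual machine entry path could interpret this pseudo-field is given below. The descriptor layout and invalidation types of the INVVPID instruction follow the Intel architecture; the mapping from the opcodes of table 2 to these types and all identifier names are illustrative assumptions rather than the final Fiasco.OC code.

#include <cstdint>

struct Invvpid_desc {                      // 128-bit descriptor expected by INVVPID
  std::uint64_t vpid;                      // bits 15:0 hold the VPID, the rest must be zero
  std::uint64_t linear_address;            // only used for individual-address flushes
};

static inline void invvpid(unsigned long type, std::uint16_t vpid,
                           std::uint64_t linear_address = 0)
{
  Invvpid_desc desc = { vpid, linear_address };
  asm volatile("invvpid %0, %1" : : "m"(desc), "r"(type) : "cc", "memory");
}

// Interprets the value the VMM left in the pseudo-field at index 0x6018;
// called during virtual machine entry, after which the field is cleared.
void handle_flush_control(std::uint64_t field, std::uint16_t vpid)
{
  switch (field & 0xfff)                   // opcode in bits 11 to 0 (table 2)
    {
    case 1:                                // flush one page translation for this VPID
      invvpid(0, vpid, field & ~std::uint64_t(0xfff));
      break;
    case 2:                                // flush all translations for this VPID
      invvpid(1, vpid);
      break;
    case 3:                                // flush all non-global translations for this VPID
      invvpid(3, vpid);
      break;
    default:                               // opcode 0: nothing to flush
      break;
    }
}

Karma, in turn, would write this field into the dummy VMCS whenever its shadow paging code emulates an INVLPG instruction or a guest task switch, as described in section 4.3.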

4.3 Karma

All the conceptual changes of this design are targeted at Fiasco.OC — all that is needed in Karma is to add support for them when using the affected kernel interfaces. The most visible addition is making use of the new virtual machine feature reporting in the kernel interface page: Karma can now automatically fall back to shadow paging when it is the only supported memory virtualization mode. It will still default to nested paging otherwise, and the existing --vtlb command line parameter can now optionally be used to manually enforce shadow paging even on high-end processors. Calls that create virtual machine kernel objects are modified to set the chosen mode with the new flags parameter, and the VMCS setup templates are likewise adapted. Lastly, the new TLB flush control mechanism must be supported when the kernel enforces VPID use: as both task switches and INVLPG instructions are intercepted and emulated in shadow paging mode, Karma needs to manually trigger the resulting TLB invalidations through this interface.

5 Implementation

This chapter outlines specific mechanisms and implementation details chosen to implement the behavior specified in the previous chapter. The different parts of the virtualization stack are discussed in order, including minor changes and optimizations that were not directly connected to the main research goal. It will also explain unforeseen consequences and problems encountered during the implementation and what measures were used to mitigate them.

5.1 Fiasco.OC

Overall, many minor improvements of all sorts were made in Fiasco.OC's VMX classes along the way. Code was restructured and slightly modified to maximize inlining, increase readability and reduce memory footprint. Constant hexadecimal literals for VMCS field indexes were replaced with descriptive enumerations. The assembly routine handling virtual machine entries was improved to correctly return error information on failure, and overall error handling was tightened to make it easier to differentiate between kernel bugs and bad VMM behavior. Code duplication potential was reduced by moving additional abstract base class functionality into Fiasco.OC's platform-independent code tree, including the nested paging selection mechanism (which is a likely feature in all architectures with hardware-assisted virtualization that use paged memory).

5.1.1 Feature Reporting and Sanitization

As most VMCS values are supplied by the untrusted userland VMM, Fiasco.OC applies a sanitization system to them that prevents potentially compromising settings from reaching the processor: it maintains a pair of bitmasks for each critical field that governs which bits are enforced to zero or enforced to one, and applies these restrictions to the userland’s dummy VMCS before every virtual machine entry. They are initially read from similar bitmasks in machine-specific registers, which the processor uses to communicate optional VMX feature availability. Fiasco.OC then adds its own set of restrictions to them, preventing dangerous features such as direct I/O pass-through, and enforcing the necessary settings to stay in control like mandatory virtual machine exits on external interrupts. In order to accommodate the new requirements of nested paging setups, there is now a second set of bitmasks governing that case. They are mostly identical but honor the differences outlined in section 4.2, and the type of virtual machine automatically decides which one is applied. Since they contain processor information about optional features, they are also used to initialize the new userland-visible virtual machine feature information in the kernel interface page, which is additionally printed out to the kernel console during boot. A minor feature also supported after this overhaul is control register shadowing: all bit field control registers are represented by an actual value, a bitmask and a read shadow in the VMCS. While the


actual value is loaded into the processor on virtual machine entry, the mask decides which bits may be modified by the guest operating system. If the guest tries to write a masked bit it will cause a virtual machine exit, and reads of those bits are transparently redirected to the read shadow — thereby, the register can appear different to the guest than what is actually in use by the processor. The sanitization enforces some critical bits in the actual value and mask fields but leaves the read shadow unchanged. This way, neither the VMM nor its virtual machine can write bits essential to security and isolation, but the VMM has full control over how the registers appear to its guest and has a chance to emulate functionality of the restricted bits through intercepts.
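The core of this mechanism can be summarized in a few lines. The structure and function names below are simplified stand-ins for the actual kernel code, and the guest-visible register composition merely restates the VMX shadowing semantics described above.

#include <cstdint>

struct Field_restriction {
  std::uint64_t must_be_one;     // bits the kernel forces to one
  std::uint64_t must_be_zero;    // bits the kernel forces to zero
};

// Applied to every critical field of the userland dummy VMCS before entry.
inline std::uint64_t sanitize(std::uint64_t vmm_value, const Field_restriction &r)
{
  return (vmm_value | r.must_be_one) & ~r.must_be_zero;
}

// For shadowed control registers, guest reads see the read shadow for all
// masked bits and the actual value everywhere else.
inline std::uint64_t guest_visible_cr(std::uint64_t actual, std::uint64_t mask,
                                      std::uint64_t read_shadow)
{
  return (actual & ~mask) | (read_shadow & mask);
}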

5.1.2 Page Table Handling

Even though its overall design proved to be sound, the implementation of the new Mem_space_ept class got stalled by some technical issues that required creative solutions to work around. In order to keep tight control over memory management, Fiasco.OC mostly shuns the usual dynamic memory allocation mechanism of C++, instead using the in-place new operator to instantiate objects into statically allocated locations. Unfortunately, this technique is also used for the Mem_space class, and its target memory is allocated before information about its intended use is available. Mem_space_ept objects therefore cannot occupy more space than instances of their superclass, which means the class may not introduce new attributes. As the EPT base address pointer has to be stored somewhere, however, the only remaining possibility besides major redesign is to reuse Mem_space's existing page table base address pointer for the traditional format and solve the resulting type mismatches through relentless application of the reinterpret_cast operator. Repurposing superclass attributes like that is generally not considered good programming style, but in this case it was ensured to be safe at the binary level, and the extraordinary circumstances made it the least destructive solution available.

The dynamic translation of page table attributes naturally led to some problems when the two formats did not align: the traditional format contains bits where the processor can mark a page as accessed or dirty, both of which are absent in EPT attributes. Since information about this kind of page access state is completely unavailable in EPTs, both bits are always set in translated lookup results to prevent bugs from code that might rely on pages being untouched. Page attribute handling gets even more difficult when they are updated through insert calls on already mapped regions, since the peculiar interface specification requires a bitwise OR between the architecture-specific representations of the old and new attribute values instead of outright replacement. This leads to obvious problems for the bits governing memory type, where cleared bits in traditional format correspond to set bits in EPTs. This issue cannot be perfectly resolved without disproportionate increases in code complexity and performance overhead, and due to the fact that Fiasco.OC's memory mapping system currently does not use the interface in these ways, it was ignored for now. However, future developers should be wary of this when working on paging-related functionality and might even want to consider a complete redesign of the page attribute interface towards an architecture-independent and less confusing model.

Just like any other operating system kernel, Fiasco.OC must invalidate cached translations for page table entries that are being removed, such as during an unmap operation. However, being a multiprocessing-capable system where the target address space could be concurrently used on a different processor, it must ensure that those processors invalidate the respective translations as well.

Fiasco.OC's existing algorithm for this purpose is not very sophisticated: any successful unmap in multiprocessing mode is followed by an unspecific broadcast inter-processor interrupt causing all other processors to completely flush their TLB. In order to retain functional correctness in all cases after the addition of EPT tasks, this model was extended to also invalidate all EPT guest-physical and combined mappings (regardless of VPID use). However, depending on the amount of unmap operations, this model can result in significant performance drops. It might have originally been justified when Fiasco.OC was laying waste to its translation cache at regular intervals anyway due to its lack of support for tagged TLBs — but the inclusion of VPIDs (and, formerly, SVM ASIDs) has changed that, and the likely inclusion of PCID support at a later date will change it even further. As Fiasco.OC keeps moving forward towards more efficient TLB usage (and larger multiprocessing systems), it may find itself increasingly constrained by performance losses due to unnecessary global flushes and inter-processor interrupts in unmap-heavy environments. For the eventual day when developers finally upgrade its TLB synchronization to a more surgically precise algorithm, the Mem_space_ept class has already been prepared to provide wrappers for TLB invalidation at all available granularities.

5.1.3 VPID Support

As justified in the previous chapter, VPIDs must be generated and assigned by the kernel to ensure security and isolation between untrusted userland VMMs. An analogous problem had already been solved with ASIDs in the SVM implementation: newly created virtual machines receive their ASID from a global next_asid variable, which is subsequently incremented. This works well as long as there are more ASIDs available than virtual machines created since the last reboot — however, as the number of valid bits in ASIDs is processor-dependent and potentially very small, the possibility of overflow should not be ignored. For this reason, there is another parameter called asid_generation that is checked on every virtual machine entry: if the machine's generation does not match the current global generation, it receives a newly generated ASID. When the next_asid value overflows, the global generation is incremented, thereby causing all still running virtual machines of the former generation to eventually renew their ASID.

The same general concept was applied to VPIDs. Intel designed them to always use 16 bits, which makes overflows during normal operation extremely unlikely — however, when a malicious VMM intentionally tries to break the system by creating large amounts of short-lived virtual machines, it is still a possibility. Therefore, the concept of a vpid_generation was adopted as well.

For non-obvious reasons, the SVM implementation stores the global next_asid and asid_generation values once per processor. That in turn necessitates every virtual machine to store an additional copy of both attributes per processor since it may be executed by vCPUs running on any of them. This seems like an unnecessary waste of kernel memory and code complexity for no apparent gain — therefore, the VMX VPID implementation chose a different path and stores its global values only once in the system. This, of course, requires synchronized access to the next_vpid value through atomic exchange-and-add operations — but the minuscule overhead caused by that should be absolutely negligible, as it is only necessary when creating a new virtual machine. The more frequent access to vpid_generation during every virtual machine entry still achieves correct results with a normal, unsynchronized read operation.
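A condensed sketch of this allocation scheme is shown below. It uses C++ standard atomics in place of the kernel's own primitives, the overflow handling is only hinted at in a comment, and all names are illustrative rather than the actual implementation.

#include <atomic>
#include <cstdint>

static std::atomic<std::uint32_t> next_vpid{1};         // VPID 0 is reserved for the host
static std::atomic<std::uint32_t> vpid_generation{0};

struct Vm_state {
  std::uint16_t vpid = 0;
  std::uint32_t generation = ~0u;    // forces assignment on the first entry
};

// Called on every virtual machine entry before the VPID is written to the VMCS.
std::uint16_t current_vpid(Vm_state &vm)
{
  std::uint32_t gen = vpid_generation.load(std::memory_order_relaxed);
  if (vm.generation != gen)
    {
      // Synchronized allocation via fetch-and-add; this only runs when a
      // virtual machine first enters or its generation has been invalidated,
      // so the atomic operation is not on the hot path.
      vm.vpid = static_cast<std::uint16_t>(
          next_vpid.fetch_add(1, std::memory_order_relaxed));
      vm.generation = gen;
      // Omitted here: when the 16-bit counter overflows, vpid_generation is
      // incremented so that all running machines eventually renew their VPID.
    }
  return vm.vpid;
}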


5.2 Karma

Adapting Karma to the new kernel interfaces was, for the most part, as uncomplicated as expected during the design phase. When the VMCS initialization code was enhanced to support nested paging, it became clear that its location in the operating system specific file guest_linux.hpp was not ideal to handle this kind of distinction. To improve code structure and reduce duplication potential, the paging related (and thus operating system agnostic) VMCS settings were extracted from there and are now set up by the paging mode initialization routines of the virtual machine driver.

5.2.1 Unpaged Mode Emulation

Testing the resulting implementation with different VMX feature sets revealed an unforeseen problem: while Karma has always skipped the guest's 16-bit real address mode initialization code, it still boots from a 32-bit location where paging is supposed to be disabled. Even though the host's second-level paging would still be active and providing memory protection at that point, VMX categorically forbids processors to execute without guest-level paging. The optional unrestricted guest feature makes it possible to circumvent that restriction, but it is a very recent addition to VMX and not all EPT capable processors support it. For those that do not, a workaround had to be developed that emulates a guest running in unpaged mode during the first few instructions after boot, until it can set up and activate paging itself.

In order for paging to be active even though the operating system expects all its accesses to work directly on physical addresses, the page table must represent an identity map where every virtual address is identical to the physical address it maps to. Karma can prepare such a map in the guest-physical address space before boot and load it in the guest page table base address field of the VMCS. It then sets the guest control register bit for paging but clears the corresponding read shadow bit, so that paging appears to be inactive to guest code. When a write access to that bit is intercepted, it can recognize that the guest finally chose to activate paging itself and end the emulation.

Before a guest operating system sets the paging bit, it would normally construct an initial page table in memory and load it into the processor. Karma therefore also needs to intercept accesses to the page table base register, because the guest's new table must not actually be loaded until the paging control bit gets set, but its address still needs to be stored for later use. In a similar manner, the guest might try to modify some other paging-related control bits, such as Physical Address Extension or Supervisor Mode Execution Prevention — all of which must be held back and stored, so that they cannot disrupt the unpaged mode emulation but can later be applied after the guest takes control of its paging.

While the processor state can thus be sufficiently emulated to maintain functional correctness, the problem remains that the required identity-mapped dummy page table must be located inside the guest-physical address space. As a guest operating system would generally expect to be able to use its memory exclusively, the table's location must be chosen carefully to prevent it from being overwritten by unaware guest code. Thankfully, the use of superpages allows a whole page table representing the identity mapping of the complete 4096 MiB virtual address space to be built inside a single 4 KiB frame on 32-bit systems. After considering several options, the memory area 0xFF000–0xFFFFF was chosen as its hiding spot: on physical x86 computers, this is traditionally hardwired to ROM as part of the BIOS code. Because Karma does not provide BIOS services, it was formerly empty, but operating systems should usually not expect to find writeable memory there.
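The dummy page directory itself is trivial to construct, as the following sketch shows: 1024 entries with the page size bit set each map one 4 MiB superpage one-to-one, covering the whole 32-bit address space within a single 4 KiB frame. The entry bits are the standard x86 page directory flags, while the helper name and the way the frame is reached from the VMM's own address space are assumptions.

#include <cstdint>

constexpr std::uint32_t PDE_PRESENT   = 1u << 0;
constexpr std::uint32_t PDE_WRITEABLE = 1u << 1;
constexpr std::uint32_t PDE_USER      = 1u << 2;
constexpr std::uint32_t PDE_PAGE_SIZE = 1u << 7;      // 4 MiB superpage
constexpr std::uint32_t IDENTITY_MAP_GPA = 0xFF000;   // hidden in the BIOS area

// 'table' points to the 4 KiB frame at guest-physical 0xFF000 as mapped
// into the VMM's own address space.
void build_identity_map(std::uint32_t *table)
{
  for (std::uint32_t i = 0; i < 1024; ++i)
    table[i] = (i << 22)                              // 4 MiB-aligned physical base address
               | PDE_PRESENT | PDE_WRITEABLE | PDE_USER | PDE_PAGE_SIZE;
}

For the 4 MiB entries to take effect, the emulation would also have to keep the PSE bit of the guest's actual CR4 value enabled while the identity map is in use.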


Incidentally, the established virtualization system KVM uses a very similar approach to solve the same VMX shortcoming: it even provides an interface to configure the identity table hiding address (cf. Documentation/virtual/kvm/api.txt:KVM_SET_IDENTITY_MAP_ADDR in the Linux kernel sources [30]). When none is supplied, it defaults to 0xFFFBC000, which is also part of a reserved BIOS area. This was not considered for use in Karma because most virtual machines do not have enough memory to reach that address range naturally, and adding a new, noncontiguous and otherwise unusable page to the guest-physical address space for this use alone was considered an unnecessary complication.

6 Evaluation

The resulting implementation's functionality and performance are evaluated and analyzed in this chapter. The results of several benchmarks are compared between different settings of Fiasco.OC's upgraded virtualization stack and a few external comparison targets. In addition, the resulting increase in code complexity is discussed and evaluated.

Development and early functional testing of the implementation were not conducted on a physical machine but on the Bochs x86 PC emulator [31]. It provides nearly faithful instruction-level emulation of an Intel Core i5 processor (among others) with almost all VMX features supported by that model, including EPT, VPID and unrestricted guest. It therefore represents probably the simplest solution to quickly try out and verify the functionality of this implementation without access to the necessary hardware and setup for a physical test run, since few other emulation and virtualization solutions provide the required degree of VMX support. As a minor setback, the correctness of some of its timing sources seems lacking: it proved impossible to correctly boot Fiasco.OC when compiled for the legacy Programmable Interval Timer or the local APIC, leaving only the Real Time Clock to run a working system. This precludes the emulation of multiprocessing environments, since only the APIC timer is available per processor — however, uniprocessor environments could be emulated flawlessly and proved vital to this thesis' completion.

6.1 Measurements

All of the following performance benchmarks were conducted on the same physical machine with an Intel Core i5-750 processor sporting 4 cores without simultaneous multithreading support, running at 2666 MHz. All test environments were executed completely from memory using Linux' initial RAM file system, and the memory available to the operating system under test was restricted to 1024 MiB. Host and guest were both forced to uniprocessor mode unless otherwise noted.

Fiasco.OC and Karma were benchmarked in both shadow and nested paging mode, each with and without active VPID support. These results were compared against native, unvirtualized execution on the bare physical machine and against a Linux KVM setup from the Ubuntu 11.10 distribution with all available optimizations (including nested paging and VPIDs) as a representative of existing, established virtualization environments. Shadow paging without VPIDs represents the baseline performance that was obtainable before this thesis, while nested paging with VPIDs should showcase the greatest achieved speed-up. Native performance should naturally be the fastest in all tests, while KVM is expected to be roughly on par with Fiasco.OC's nested paging and VPID setup because it makes use of the same underlying hardware features.

The test suite contains four microbenchmarks specifically designed to demonstrate the performance differences between nested and shadow paging. To achieve the best possible comparability, three of these were adopted from Jan C. Nordholz' thesis on shadow paging enhancements to Fiasco.OC's SVM implementation [25]. Data points representing the results of the most optimized kernel-internal shadow paging implementation from that thesis (measured relative to native performance) were added to the following charts to provide a rough estimate of how an analogous VMX implementation would have performed. However, as those results were generated in a different test batch on a different processor architecture, they should obviously be considered cautious projections rather than hard data.


[Figure 5 is a bar chart titled "Fibonacci Microbenchmark" giving the duration in seconds for each setup: shadow paging 3.599, shadow paging with VPID 3.596, nested paging 3.597, nested paging with VPID 3.595, native 3.590, SVM kernel shadow paging (projected) n/a, Linux KVM 3.617.]

Figure 5: Evaluation results from the Fibonacci microbenchmark

6.1.1 Microbenchmark: Fibonacci

This benchmark was newly added for this thesis. It uses a simple iterative algorithm to compute the first 400 million Fibonacci numbers and is intended as a purely processor-bound benchmark with as little operating system involvement as possible. Since it does not require data storage beyond the processor registers, its address space should only consist of a few pages of program code and thus provide no notable slowdown in a shadow paging environment. It therefore serves as a kind of control group for the following benchmarks, proving that there are no systemic inefficiencies in any of the evaluated virtualization environments.

Figure 5 illustrates that all environments perform virtually identically to native execution, down to a few milliseconds. Intriguingly, KVM shows the notably worst result. The reason for this is subject to speculation.


[Figure 6 is a bar chart titled "Forkwait Microbenchmark" giving the duration in seconds for each setup: shadow paging 16.03, shadow paging with VPID 14.13, nested paging 1.71, nested paging with VPID 1.70, native 1.51, SVM kernel shadow paging projected at 225% of native, Linux KVM 1.75.]

Figure 6: Evaluation results from the Forkwait microbenchmark

Although the KVM setup was based on an Ubuntu installation stripped to the bare minimum, some background processes (or even processing overhead in the host Linux kernel itself) might have siphoned this tiny edge of performance off the virtual machine.

6.1.2 Microbenchmark: Forkwait

The Forkwait benchmark focusses on straining the Linux system call interface as well as address space creation and destruction. It creates a child process (that will immediately exit again), waits for its death, and repeats that in a purely sequential loop of 40000 iterations with no inherent parallelism. Even though the program itself is thus dead simple, it is perhaps the most complex microbenchmark in this suite due to Linux' sophisticated process management.

Conforming to the time-honored POSIX fork()/exec() model, new processes are never created from scratch but always instantiated as a copy of the calling process. This also includes a complete and independent copy of the parent's memory, but physically duplicating the complete address space would usually be a disastrous waste of time and space. Linux therefore employs the copy-on-write technique by mapping both processes' virtual memory to the same physical page frames and marking the respective page table entries as read-only. Whenever a parent or child tries to write to a page, it triggers a page fault which directs the kernel to finally create a true, writeable copy of that page for each process.
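The structure of the benchmark corresponds to the following approximate reconstruction (the original benchmark source is not reproduced in this document):

#include <sys/wait.h>
#include <unistd.h>

int main()
{
  for (int i = 0; i < 40000; ++i)
    {
      pid_t pid = fork();          // copy-on-write duplicate of this process
      if (pid == 0)
        _exit(0);                  // child: exit immediately
      waitpid(pid, nullptr, 0);    // parent: wait for the child's death
    }
  return 0;
}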


[Figure 7 is a bar chart titled "Sockxmit Microbenchmark" giving the duration in seconds for each setup: shadow paging 32.36, shadow paging with VPID 28.70, nested paging 1.80, nested paging with VPID 1.79, native 1.76, SVM kernel shadow paging projected at 225% of native, Linux KVM 1.89.]

Figure 7: Evaluation results from the Sockxmit microbenchmark

Every single page table entry that is changed to read-only will cause an individual virtual machine exit in shadow paging environments. Additionally, the guest operating system must invalidate the corresponding TLB entries, each triggering another VMM intercept. After the child process is set up, the following task switch causes a complete rebuild of the shadow page table. Even though the child immediately exits again, it pushes at least one return value (from the fork() wrapper function) to the stack, triggering a page fault that is also intercepted in shadow paging systems. The following exit will cause another task switch back to the parent, with the corresponding overhead. The combination of these factors makes process creation a very inefficient event for shadow paging environments and therefore a prime example to showcase their drawbacks.

Virtualization with nested paging, on the other hand, should demonstrate no significant overhead compared to native performance, as none of the above steps causes a virtual machine exit. However, the results in figure 6 show that this is not quite the whole truth — while the overall trend matches the expectations, a notable overhead of about 12% with even the most performant nested paging setup cannot be denied.

6.1.3 Microbenchmark: Sockxmit

The sole focus of the Sockxmit benchmark is task switching. It spawns two processes connected by a local socket that pass an empty message back and forth in a tight loop of 2^20 (roughly one million) iterations.


[Figure 8 is a bar chart titled "Touchmem Microbenchmark" giving the duration in seconds for each setup: shadow paging 28.93, shadow paging with VPID 23.98, nested paging 3.19, nested paging with VPID 3.13, native 3.00, SVM kernel shadow paging projected at 270% of native, Linux KVM 3.40.]

Figure 8: Evaluation results from the Touchmem microbenchmark

This forces the processor to constantly switch between them. Shadow paging VMMs need to intercept every switch and rebuild the next task's shadow page table, although the lack of program data should keep the address spaces of both processes relatively small. Nested paging virtualization is once again unaffected and only hindered by the same system call latency and mandatory TLB flush as native execution. Figure 7 evidences this to be the benchmark with the strongest difference between both approaches.
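The ping-pong loop of the benchmark roughly corresponds to the following sketch; the original source is not reproduced here, and a one-byte token stands in for the empty message so that the receive calls block:

#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  int fd[2];
  char token = 0;
  socketpair(AF_UNIX, SOCK_STREAM, 0, fd);   // local socket connecting both processes

  if (fork() == 0)
    {                                        // child: echo every message back
      for (int i = 0; i < (1 << 20); ++i)
        {
          read(fd[1], &token, 1);
          write(fd[1], &token, 1);
        }
      _exit(0);
    }

  for (int i = 0; i < (1 << 20); ++i)
    {                                        // parent: send and wait for the echo,
      write(fd[0], &token, 1);               // forcing a task switch per round trip
      read(fd[0], &token, 1);
    }
  wait(nullptr);
  return 0;
}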

6.1.4 Microbenchmark: Touchmem

The Touchmem benchmark is the only one in this suite to work on a sizeable address space. It maps a 4 MiB file into its memory using the Linux mmap() system call and proceeds to access every one of its 1024 pages once. Afterwards, the mapping is torn down and recreated, repeating the whole process 5000 times.

As Linux uses lazy mappings to implement mmap(), the process' address space is not updated right away. Instead, every access to a new page will result in a page fault that causes the kernel to retroactively add a page table entry for it. Shadow paging environments need to intercept both the page fault itself and the following page table update, resulting in a strong performance degradation. As before, nested paging can accommodate these operations without additional virtual machine exits, leading to near-native performance as illustrated by figure 8. Touchmem is also notable for incurring the worst performance from the existing SVM kernel shadow paging implementation — this is likely due to the fact that it employs several targeted optimizations to reduce the degradation caused by task switches as they were encountered during the previous tests, while it could do relatively little about the shadow page table updates triggered by this benchmark.
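The benchmark's main loop corresponds to the following approximate reconstruction; the file name and protection flags are assumptions, while the access pattern follows the description above:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
  const std::size_t size = 4u << 20;                  // 4 MiB file, 1024 pages
  int fd = open("testfile", O_RDONLY);
  if (fd < 0)
    return 1;

  volatile char sink = 0;
  for (int round = 0; round < 5000; ++round)
    {
      void *m = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (m == MAP_FAILED)
        return 1;
      const char *p = static_cast<const char *>(m);
      for (std::size_t off = 0; off < size; off += 4096)
        sink = p[off];                                // fault in every page exactly once
      munmap(m, size);                                // tear the mapping down again
    }
  close(fd);
  return 0;
}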


[Figure 9 is a bar chart titled "Linux Compilation Benchmark on 1, 2, 3 and 4 Processors" plotting the duration in seconds for each setup and processor count; the measured values range from 231.69 s down to 40.25 s, setups that could not be booted are marked n/a, and the SVM kernel shadow paging projection is given as 104% of native.]

Figure 9: Evaluation results from the Linux kernel compilation macrobenchmark

6.1.5 Macrobenchmark: Compiling the Linux Kernel

In order to evaluate performance with unsynthetic real-world workloads, a final test run was performed on the traditional showcase example for this purpose: compiling the Linux kernel. Just like the earlier SVM shadow paging thesis that provides comparison data [25], only the fs/ subtree of the kernel was compiled to save time while still taking long enough to provide accurate data. As the shadow-paging-penalizing operations stressed by the microbenchmarks only make up a minor fraction of most real-world workloads, the results for this measurement are expected to lie far closer together. Even though multiprocessing generally has little effect on the overhead added by virtualization systems, the benchmark was repeated with up to four physical and virtual processors running in parallel to confirm that this assumption holds true in practice. While the newly implemented additions performed as expected, there was one unfortunate bug in the existing code base that impaired the evaluation: in shadow paging environments, a race condition would sometimes stall the guest kernel


during boot as a processor failed to react to an inter-processor function call. It appeared rarely in all multiprocessing setups but always in the setup with exactly two processors. Even though nested paging environments were not affected, this is most likely an indirect consequence of timing differences rather than a true fix. Preliminary debugging efforts revealed Karma’s local APIC emulation or interrupt injection as the most likely culprits, but a full investigation would have exceeded the scope of this thesis. Therefore, figure 9 can only show data for setups that could be correctly booted.

6.2 Analysis

The functional correctness of this implementation has been confirmed by all tests on the emulator as well as on the physical machine. Even though the evaluation uncovered one bug, the fact that it affects shadow paging environments with and without VPIDs implies that it was already part of the existing code base and is unrelated to the newly added features.

The Fibonacci control benchmark yielded the expected results, demonstrating that none of the setups contain unaccounted sources of overhead and that any differences in the following measurements are caused purely by the respective benchmark program itself. It also provides a nice confirmation that practically native performance really can be achieved by all virtualization techniques when using benign workloads.

The other microbenchmarks clearly illustrated the inherent differences between shadow and nested paging: page faults, task switches and page table manipulations all severely strain shadow paging environments, with benchmark durations ranging from 8 up to 16 times longer than native execution. The comparatively low overhead in the Forkwait benchmark probably stems from it being less focussed than the other ones: even though process creation comprises all three kinds of intercepted operations, it also includes a lot of unrelated bookkeeping logic that does not require VMM interference. The Sockxmit benchmark, on the other hand, can show an exceptionally high overhead since it consists of almost nothing but task switches.

With nested paging, the results range from a barely visible 2% overhead on the Sockxmit benchmark to a mildly noticeable 12% with Forkwait. This is peculiar because none of the benchmarks should force any virtual machine exits, implying that there should be no significant differences between them. The only other sources of variable overhead are virtual machine exits from device accesses and the increased overall page walk length inherent to nested paging. The latter would mostly show in benchmarks with bad TLB utilization, because the processor performs a full page walk only when a translation is not already cached. However, since neither a reason for device accesses nor a source of outstandingly frequent TLB misses is readily apparent in the Forkwait setup, the cause of this discrepancy remains a mystery.

The addition of VPIDs provides a tiny bonus across the board to nested paging, but the benefit is so small that it is hardly distinguishable from measurement errors. Virtual machines with shadow paging, on the other hand, receive the expected far greater boost due to their higher prevalence of virtual machine exits. Corresponding to the amplified stress in the microbenchmarks, VPIDs can improve performance by about 13% for the first two and by over 20% for the Touchmem benchmark. This coincides with theoretical considerations, as the conserved TLB entries only reduce the performance degradation associated with the virtual machine exits themselves. The other benchmarks include task switches, requiring actual effort within the VMM for rebuilding the shadow page table, while Touchmem solely focusses on intercepts that are trivial to handle, which means a larger part of the overhead stems from the virtual machine exit itself.

The macrobenchmark demonstrates that the implementation truly reaches near-native performance on real-world workloads, achieving an overhead of only 2%. This is in stark contrast to the more than 50% overhead of shadow paging without VPIDs, which was Fiasco.OC’s maximum achievable virtualization performance on VMX before this thesis. The comparison with the kernel shadow paging solution for SVM is much closer, as it reportedly incurred only 4% overhead over native performance. However, while that result gets remarkably close, its overhead is still twice as large, and it is uncertain whether an analogous implementation for VMX could have achieved the same performance. The comparison with KVM, which incurred 5% overhead on the macrobenchmark, demonstrates that this implementation is on par with or even faster than established virtualization solutions.

When taking the multiprocessing results into account, however, that picture changes slightly: KVM draws level at two processors and leads by about 4% at four processors, where Fiasco.OC’s performance has decreased to almost 10% overhead compared to native execution. Since memory virtualization is done per processor and therefore not really affected by the number of processors, this can probably be attributed to existing inefficiencies in Karma or in Fiasco.OC’s multiprocessing support itself, which is still considered experimental. A similar (though less pronounced) effect had also been observed when the nested paging support for SVM was initially developed and evaluated [24].
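As a side note on the page walk length argument above, the usual textbook figures (not measurements from this evaluation) illustrate why only TLB-miss-heavy workloads should notice it: with g guest page table levels and h nested (EPT) levels, a completely uncached translation costs

    g * h + g + h memory accesses,

because each of the g guest page table entries is itself addressed guest-physically and therefore needs its own h-level EPT walk, and the final guest-physical data address needs one more. For the four-level tables used here this amounts to 4 * 4 + 4 + 4 = 24 accesses, compared to 4 for a native walk; a large factor per miss, but invisible as long as translations are served from the TLB.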

6.3 Complexity

As Fiasco.OC is a security-oriented microkernel, the additional value of every new feature needs to be carefully balanced against the increase in complexity of the trusted computing base. Although it is architecturally impossible to securely implement nested page table handling anywhere but in the kernel, the availability of functionally identical shadow paging solutions still implies that it is essentially an optional performance optimization rather than completely new functionality. This means that the performance gained through nested paging would need to be truly essential for Fiasco.OC’s competitiveness to warrant its inclusion — and the more than 50% overhead that the ordinary userland shadow paging implementation incurs even on untampered workloads attests just that. In addition, the use of polymorphism to cleanly encapsulate the bulk of the added complexity makes it possible to offer VMX nested paging as an optional feature of Fiasco.OC, enabling users who require efficient virtualization to include it without forcing others to share this additional burden on the trusted computing base.

As a complexity increase for better virtualization performance has thus been justified in general, the question remains whether the specific approach followed by this thesis is itself justifiable. The most credible alternative would have been kernel-internal shadow page table handling analogous to the existing work with SVM [25]. Despite minor differences in the hardware interfaces of the two virtualization architectures, it can be presumed that a corresponding VMX implementation would have required roughly the same complexity. Using the open source tool SLOCCount (originally developed to analyze the composition of the Linux kernel [32]) to measure the number of logical, non-comment lines of code, the SVM kernel shadow paging implementation contains 1349 more lines than the baseline repository it was forked from, while the code from this thesis adds only 868 lines to Fiasco.OC. On top of that, some of the former’s special optimizations require additional modifications to the guest operating system to paravirtualize parts of its page table handling, which is unnecessary with nested paging. Therefore, even though its hardware requirements are more demanding, the smaller complexity increase to the trusted computing base and the removal of at least half of the performance overhead on real-world workloads make nested paging on VMX a useful addition to Fiasco.OC’s virtualization repertoire.
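To illustrate the kind of encapsulation referred to above, the following schematic C++ sketch shows how an EPT-backed address space can hide its page attribute format behind the same mapping interface as an ordinary task (which would implement the interface with the regular IA-32 format). The class and method names are purely illustrative and do not correspond to Fiasco.OC’s actual identifiers; only the EPT permission bit positions (read, write, execute in bits 0 to 2) follow the Intel manual [9].

    // Schematic sketch only, not Fiasco.OC's real class hierarchy: the shared
    // mapping interface stays unchanged while the EPT attribute translation is
    // confined to one subclass that can be compiled out when it is not needed.
    #include <cstdint>

    using Virt_addr = std::uintptr_t;
    using Phys_addr = std::uint64_t;

    struct Page_attribs { bool writable; bool executable; };

    class Address_space
    {
    public:
        virtual ~Address_space() = default;
        virtual bool map(Virt_addr va, Phys_addr pa, Page_attribs a) = 0;
        virtual void unmap(Virt_addr va) = 0;
    };

    class Ept_space : public Address_space        // guest task backed by EPT
    {
    public:
        bool map(Virt_addr va, Phys_addr pa, Page_attribs a) override
        {
            // Translate the generic attributes into the EPT entry format.
            std::uint64_t entry = pa & ~0xfffULL;
            entry |= 0x1;                         // bit 0: read permission
            if (a.writable)   entry |= 0x2;       // bit 1: write permission
            if (a.executable) entry |= 0x4;       // bit 2: execute permission
            // Walking the EPT hierarchy, memory type bits and TLB/EPT
            // invalidation are omitted in this sketch.
            (void)va; (void)entry;
            return true;
        }
        void unmap(Virt_addr va) override { (void)va; /* analogous */ }
    };

Compiling such an EPT-specific subclass only when VMX nested paging is selected keeps the additional code out of the trusted computing base of configurations that do not need it.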

7 Conclusion

The goal of this thesis was to improve the memory virtualization performance of Fiasco.OC and the Karma VMM on processors using the Intel VMX architecture. It was achieved by implementing support for VMX’s nested paging feature and overcoming the obstacles in Fiasco.OC’s architecture that had prevented this in the past. This thesis has demonstrated that this approach is indeed feasible and can be realized through a small number of surgical modifications to Fiasco.OC’s memory management code with only a minor increase in kernel complexity. Additional supporting facilities and other minor enhancements were designed around that mechanism, resulting in an implementation that reaches near-native VMX virtualization performance and is both faster and less complex than the projected alternative approach of building an optimized shadow paging framework within the kernel. The design has been cleanly encapsulated in self-contained modules, allowing it to be handled as an optional feature that can be removed at compile time when it is not needed.

However, even though both x86 virtualization architectures are now supported by fast and uncomplicated nested paging implementations, shadow paging is far from obsolete: many existing processors still lack the necessary hardware support for nested paging and may continue to do so in future models, such as the low-voltage Intel Atom family. Implementing highly optimized in-kernel shadow paging support for VMX would still be a useful addition to Fiasco.OC’s virtualization capabilities that would complement rather than compete with the execution model enabled by this thesis. It might even be possible to refactor and reuse the majority of the existing shadow paging code for SVM, further reducing the overall complexity of the trusted computing base in kernels compiled for generic distribution.

7.1 Future Work

While the memory virtualization performance in VMX nested paging mode should now be nearly optimal, the virtualization environment still offers potential for optimization in other areas: the existing approach of using a sanitized dummy VMCS to communicate configuration options from userland may be convenient, but reinitializing the whole data structure on every virtual machine entry is cumbersome and precludes the use of any processor-internal VMCS caching. Switching to a new interface that allows more targeted updates to VMCS fields and keeps a persistent copy per virtual machine in kernel memory may further reduce VMM intercept delays.
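A purely illustrative sketch of what such a more targeted interface could look like is given below. Neither the structures nor the function belong to an existing Fiasco.OC or Karma API; the whitelist, names and calling convention are assumptions. Only the two VMCS field encodings used as examples (guest RSP and RIP) and the VMWRITE instruction follow the Intel manual [9].

    // Illustrative sketch only: userland submits a small batch of (field, value)
    // pairs; the kernel checks each field against a whitelist and writes it into
    // the persistent, kernel-owned VMCS that is current on this processor.
    #include <cstdint>
    #include <cstddef>

    struct Vmcs_update { std::uint32_t field; std::uint64_t value; };

    static bool field_allowed(std::uint32_t field)
    {
        switch (field) {
        case 0x681c:                              // guest RSP (Intel SDM encoding)
        case 0x681e:                              // guest RIP (Intel SDM encoding)
            return true;
        default:                                  // everything else is rejected
            return false;
        }
    }

    static long apply_vmcs_updates(const Vmcs_update *u, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i++) {
            if (!field_allowed(u[i].field))
                return -1;
            unsigned long field = u[i].field;     // natural-width fields assumed
            unsigned long value = static_cast<unsigned long>(u[i].value);
            // Requires VMX root mode with the target VMCS current; the VMWRITE
            // error flags are not checked in this sketch.
            asm volatile("vmwrite %1, %0" :: "r"(field), "r"(value) : "cc", "memory");
        }
        return 0;
    }

Keeping the authoritative VMCS in kernel memory in this way would also allow the processor’s internal VMCS caching to remain effective across virtual machine entries.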

There are also some functional VMX features that Fiasco.OC currently does not support, such as direct I/O pass-through or hardware-assisted APIC emulation — while these are not safe to be completely ceded to the untrusted VMM, a proper sanitization mechanism could make them usable in a secure and isolated setting.

On Karma’s side, the potential for future work is not yet exhausted either: as the unforeseen problems during this thesis’ evaluation painfully showed, there are still bugs to be fixed in the existing code base. In addition, the current limitation to 32-bit addressing mode somewhat restricts its applicability for real-world use. While Fiasco.OC’s part of the virtualization stack, including the modules produced in this thesis, is already fully compatible with 64-bit execution, Karma will still require a major overhaul to measure up to it in that regard.

Nevertheless, while all these factors show that the virtualization environment is still very much in its experimental stage, this thesis has produced another essential component on the path towards a mature, flexible and performant virtualization solution that can solve real-world problems. In addition, other microkernels with a similar memory architecture (such as L4Ka::Pistachio) might struggle with the same underlying problem when implementing VMX nested page tables and could find a useful blueprint in the solution presented herein.

Appendix A: Code

The code produced in this thesis is hosted in Git repositories on the servers of the Chair for Security in Telecommunications (SecT) at the Technische Universität Berlin. Reproducing the evaluation setup presented in chapter 6 requires code from the following sources, in order:

• Download the complete Fiasco.OC/L4Re base repository from Technische Universität Dresden, revision 38: http://svn.tudos.org/repos/oc/tudos/trunk

• Replace the kernel/fiasco subtree with the master branch of git+ssh://git.sec.t-labs.tu-berlin.de/julius_werner/fiasco, commit cf07ab893b551bc7ae4a4b5154461202775affa3

• Replace the l4/pkg/l4sys subtree with the l4sys/master branch of git+ssh://git.sec.t-labs.tu-berlin.de/julius_werner/l4re, commit 57b042607ebf169d83fb7e585fdfd6919508dc43

• Add the contents of the julius_werner/karma_3.4 branch from the following repository as subdirectory l4/pkg/karma: git+ssh://git.sec.t-labs.tu-berlin.de/karma/karma, commit 12027207237236d3b4501c9be1541f283b9b5f87

• Build a paravirtualized Linux kernel from the sources at branch karma_3.4 in: git://git.karma-vmm.org/karma-linux, commit c9c915fa33de86b774b5b32666e481bd12c07808

• The initial RAM file system image containing sources and binaries for the evaluation benchmarks can be found in branch julius_werner/thesis of the following repository, alongside the raw evaluation protocol and sources for this document and the corresponding slide presentation: git+ssh://git.sec.t-labs.tu-berlin.de/karma/karma, commit 09d08ae3b1b537531dbb07f8c8800b793ab0e59d

While all source code modifications produced in the course of this thesis are hereby released under the GNU General Public License Version 2, authorization to access some of these repositories may be restricted by the respective host. Please refer all questions regarding access and licensing issues to the Chair for Security in Telecommunications (SecT) at the Technische Universität Berlin.

Glossary

AMD Advanced Micro Devices — second leading manufacturer of x86 processors.

API Application Programming Interface — specification defining how other components can communicate with or request services from a given software module.

APIC Advanced Programmable Interrupt Controller — x86 device essential for multiprocessing.

ASID Address Space Identifier — SVM extension that allows tagged TLB entries.

BIOS Basic Input/Output System — legacy firmware of most x86-based computers.

CP-40 Control Program 40 — research prototype system that invented virtualization. [7]

CP-67 Control Program 67 — successor of CP-40, first commercial virtualization environment.

C++ Object-oriented programming language based on C.

EPT Extended Page Table — VMX extension that allows nested paging.

Fiasco.OC L4-based microkernel focussed on preemptibility and hard real-time guarantees. [20]

IBM International Business Machines — leading manufacturer of mainframe computers.

Intel Intel Corporation — inventor of the x86 architecture.

I/O Input/Output — communication between a processor and its peripheral devices.

Karma Experimental VMM for Fiasco.OC. [24]

KiB Kibibyte — 1024 bytes.

KVM Linux Kernel Virtual Machine — well-established Type II hypervisor. [11]

Linux Well-established open source operating system kernel, led by Linus Torvalds. [30]

L4 Original second-generation microkernel created by Jochen Liedtke. [3]

L4Android Port of the Android smartphone operating system to the Fiasco.OC microkernel. [5]

L4Linux Port of the Linux operating system kernel to L4-based microkernels. [6]

L4Re L4 Runtime Environment — collection of userland modules that complement Fiasco.OC.

Mach Most prominent first generation microkernel. [16]

MiB Mebibyte — 1048576 bytes.

NOVA Experimental stand-alone Type I hypervisor built on microkernel concepts. [28]

PCID Process Context Identifier — feature in x86 64-bit mode that allows tagged TLB entries.

L4Ka::Pistachio L4-based microkernel focussed on high performance and portability. [26]

POSIX Portable Operating System Interface for UNIX — common userland API specification.

RAM Random Access Memory — volatile, writeable primary data storage in computers.

ROM Read-Only Memory — persistent data storage usually housing firmware.

σ0 Special task that owns all userland memory in L4.

SVM AMD Secure Virtual Machine — x86 extension for hardware-assisted virtualization. [15]

TLB Translation Lookaside Buffer — processor-internal cache for page translations.

Turtles Experimental KVM extension that allows efficient nested virtualization. [29]

vCPU Alternative to Fiasco.OC’s threads that can receive asynchronous interrupts. [22]

VMCS Virtual Machine Control Structure — data structure used to configure VMX.

VMM Virtual Machine Monitor — host userland program that supervises a virtual machine.

VMX Intel Virtual Machine Extensions — x86 extension for hardware-assisted virtualization. [9]

VPID Virtual Processor Identifier — VMX extension that allows tagged TLB entries.

x86 Successful processor architecture family started by the Intel 8086, also called IA-32. [9]

Bibliography

[1] Dionysus Blazakis. The Apple sandbox, 2011. URL https://media.blackhat.com/bh-dc-11/Blazakis/BlackHat_DC_2011_Blazakis_Apple_Sandbox-wp.pdf. Black Hat DC 2011.

[2] Robert N. M. Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. Capsicum: practical capabilities for UNIX. In Security ’10: Proceedings of the 19th USENIX Security Symposium, Berkeley, CA, USA, 2010. USENIX Association.

[3] Jochen Liedtke. On µ-kernel construction. In SOSP ’95: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 237–250, New York, NY, USA, 1995. ACM.

[4] Green Hills Software. INTEGRITY-178B RTOS, 2012. URL http://www.ghs.com/products/safety_critical/integrity-do-178b.html.

[5] Matthias Lange, Steffen Liebergeld, Adam Lackorzynski, Alexander Warg, and Michael Peter. L4Android: A generic operating system framework for secure smartphones. In SPSM ’11: Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, pages 39–50, New York, NY, USA, 2011. ACM.

[6] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In SOSP ’97: Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, pages 66–77, New York, NY, USA, 1997. ACM.

[7] Robin J. Adair, Richard U. Bayles, Les W. Comeau, and Robert J. Creasy. A virtual machine system for the 360/40. Cambridge Scientific Center Report 320-2007, 1966.

[8] Robert P. Goldberg. Architectural Principles for Virtual Computer Systems. PhD thesis, Harvard University, 1973. DTIC accession no. AD0772809.

[9] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual, 2011.

[10] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP ’03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 164–177, New York, NY, USA, 2003. ACM.

[11] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. kvm: the Linux virtual machine monitor. In Proceedings of the Linux Symposium, volume 1, pages 225–230, 2007.

[12] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. In SOSP ’73: Proceedings of the Fourth ACM Symposium on Operating System Principles, pages 121–130, New York, NY, USA, 1973. ACM.

[13] John S. Robin and Cynthia I. Irvine. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In Proceedings of the 9th USENIX Security Symposium, pages 229–240, Berkeley, CA, USA, 2000. USENIX Association.

[14] Mendel Rosenblum and Tal Garfinkel. Virtual machine monitors: Current technology and future trends. Computer, 2005.

[15] Advanced Micro Devices, Inc. Secure Virtual Machine Architecture Reference Manual, 2005.

[16] Richard F. Rashid. From RIG to Accent to Mach: The evolution of a network operating system. In ACM ’86: Proceedings of 1986 ACM Fall Joint Computer Conference, pages 1128–1137, New York, NY, USA, 1986. ACM.

[17] Jerome H. Saltzer. Protection and the control of information sharing in Multics. Communications of the ACM, 17(7):388–402, 1974.

[18] John M. Rushby. Design and verification of secure systems. In SOSP ’81: Proceedings of the Eighth ACM Symposium on Operating Systems Principles, pages 12–21, New York, NY, USA, 1981. ACM.

[19] Jochen Liedtke. Improving IPC by kernel design. In SOSP ’93: Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 175–188, New York, NY, USA, 1993. ACM.

[20] Operating Systems Group, Technische Universität Dresden. Fiasco.OC, revision 38, 2011. URL http://os.inf.tu-dresden.de/fiasco.

[21] Adam Lackorzynski and Alexander Warg. Taming subsystems — capabilities as universal resource access control in L4. In IIES ’09: Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems, pages 25–30, New York, NY, USA, 2009. ACM.

[22] Adam Lackorzynski, Alexander Warg, and Michael Peter. Virtual processors as kernel interface. In Proceedings of the Twelfth Real-Time Linux Workshop, Nairobi, Kenya, 2010.

[23] Michael Peter, Henning Schild, Adam Lackorzynski, and Alexander Warg. Virtual machines jailed — virtualization in systems with small trusted computing bases. In VDTS ’09: Proceedings of the 1st EuroSys Workshop on Virtualization Technology for Dependable Systems, pages 18–23, New York, NY, USA, 2009. ACM.

[24] Steffen Liebergeld. Lightweight Virtualization on Microkernel-based Systems. Diploma thesis, Technische Universität Dresden, 2010.

[25] Jan C. Nordholz. Efficient Virtualization on Hardware with Limited Virtualization Support. Diploma thesis, Technische Universität Berlin, 2011.

[26] System Architecture Group, University of Karlsruhe. L4Ka::Pistachio, 2010. URL http://www.l4ka.org/pistachio.

[27] Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman, Peter Chubb, Ben Leslie, and Gernot Heiser. Pre-virtualization: Slashing the cost of virtualization. Technical Report 2005-30, Fakultät für Informatik, Universität Karlsruhe (TH), November 2005.

[28] Udo Steinberg and Bernhard Kauer. NOVA: A microhypervisor-based secure virtualization architecture. In EuroSys ’10: Proceedings of the Fifth European Conference on Computer Systems, pages 209–222, New York, NY, USA, 2010. ACM.

[29] Muli Ben-Yehuda, Michael D. Day, Zvi Dubitzky, Michael Factor, Nadav Har’El, Abel Gordon, Anthony Liguori, Orit Wasserman, and Ben-Ami Yassour. The Turtles project: Design and implementation of nested virtualization. In OSDI ’10: Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, pages 423–436, Berkeley, CA, USA, 2010. USENIX Association.

[30] Linus Torvalds et al. Linux, version 3.4.0, 2012. URL http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.4.tar.bz2.

[31] Bryce Denny, Greg Alexander, Todd Fries, Donald Becker, Tim Butler, et al. Bochs, version 2.5.1, 2012. URL http://bochs.sourceforge.net.

[32] David A. Wheeler. More than a gigabuck: Estimating GNU/Linux’s size, 2002. URL http://www.dwheeler.com/sloc.

(All web sources were last accessed on August 19, 2012.)
