Technische Universität Berlin

Master’s Thesis

Chair for Security in Telecommunications (SecT)
School IV — Electrical Engineering and Computer Science
Technische Universität Berlin

Improving Virtualization Support in the Fiasco.OC Microkernel

Author: Julius Werner (#310341)
Major: Computer Science (Informatik)
Email: [email protected]

Primary Advisor: Prof. Dr. Jean-Pierre Seifert
Secondary Advisor: Prof. Dr. Hans-Ulrich Heiß
Tutor: Dipl.-Inf. Matthias Lange

April – August 2012

Abstract

This thesis aims to improve the memory virtualization efficiency of the Fiasco.OC microkernel on processors with the Intel VMX hardware-assisted virtualization architecture, which is plagued by severe performance degradation. After discussing possible approaches, this goal is accomplished by implementing support for Intel’s nested paging mechanism (EPT). Fiasco.OC’s memory management system is enhanced to allow a new kind of task whose address space is stored in the EPT page attribute format while staying compatible with the existing shared memory mapping interface through dynamic page attribute translation. This enhancement is complemented with support for tagged TLB entries (VPIDs), independently providing another minor performance increase. The kernel’s external virtualization API and the experimental userland VMM Karma are adapted to support these changes. The final implementation is evaluated and analyzed in the context of several benchmarks, achieving near-native performance on real-world workloads while only constituting a minor increase in complexity for the microkernel.

Zusammenfassung

Diese Arbeit zielt darauf ab, die Speichervirtualisierungseffizienz des Fiasco.OC Mikrokerns auf Prozessoren mit der hardwareunterstützten Virtualisierungsarchitektur Intel VMX zu verbessern, welche von schweren Performanzverlusten geplagt wird. Nach der Diskussion möglicher Lösungsansätze wird dieses Ziel mit der Unterstützung von Intels Mechanismus für verschachtelte Seitentabellen (EPT) umgesetzt. Fiasco.OCs Speicherverwaltungssystem wird um eine neue Art von Prozess erweitert, dessen Adressraum im EPT-Seitenattributsformat gespeichert wird, der aber durch dynamische Übersetzung der Seitenattribute dennoch kompatibel mit der existierenden Schnittstelle für geteilte Speicherabbildungen bleibt. Diese Verbesserung wird durch die Unterstützung von ausgezeichneten TLB-Einträgen (VPIDs) komplementiert, welche unabhängig davon einen weiteren kleinen Performanzvorteil einbringen. Die externe Virtualisierungs-API des Kerns und der experimentelle Userland-VMM Karma werden angepasst, um diese Änderungen zu unterstützen. Die endgültige Implementation wird im Rahmen mehrerer Benchmarks ausgewertet und analysiert, wobei sie nahezu native Performanz bei realistischen Arbeitspaketen erreicht und nur einen geringen Komplexitätszuwachs für den Mikrokern bedeutet.

Die selbstständige und eigenhändige Ausfertigung versichert an Eides statt:

Berlin, den 30. August 2012 Julius Werner


Fakultät Elektrotechnik und Informatik

Aufgabenstellung für die Masterarbeit

Name des Studenten: Julius Werner
Studiengang: Informatik
Matrikel: 310341

Thema: Verbesserung der Virtualisierungsunterstützung auf dem Fiasco-Mikrokern

Zielstellung:

Die Anforderungen an Betriebssysteme hinsichtlich Verfügbarkeit, Robustheit gegen Angriffe und Unterstützung zahlreicher Legacy-Anwendungen steigen kontinuierlich. Die Implementierung von Virtualisierungsunterstützung auf Mikrokernsystemen konnte diese Anforderungen überzeugend befriedigen.

Fiasco.OC ist ein Mikrokern der dritten Generation und bietet Unterstützung für die Virtualisierungsfunktionen aktueller AMD- (SVM) und Intel-Prozessoren (VT-x). Die aktuelle Implementierung für VT-x leidet jedoch unter Leistungsdefiziten, insbesondere bei speicherintensiven Anwendungen. Karma ist ein VMM, der als Userlevel-Anwendung auf Fiasco.OC läuft. Momentan wird nur Linux als Gastsystem unterstützt.

In dieser Masterarbeit soll untersucht werden, wie die bestehenden Leistungs- und Effizienzdefizite in der derzeitigen VT-x-Implementierung beseitigt werden können. Als Arbeitslast ist dabei ein Linux-System anzunehmen. Die einzelnen Varianten sind dabei hinsichtlich ihrer Auswirkungen auf Leistungsumfang, Entwicklungsaufwand und Kompatibilität zu bestehender Software zu bewerten. Die am besten beurteilte Variante soll implementiert und anhand von Testszenarien bewertet werden.

Verantwortlicher Hochschullehrer: Prof. Dr. Jean-Pierre Seifert

Betreuer: Dipl.-Inf. Matthias Lange
Institut: Security in Telecommunications (FG SecT)

Berlin, 08.05.2012 Unterschrift des verantwortlichen Hochschullehrers

Contents

1 Introduction
  1.1 Goal
  1.2 Structure

2 Background
  2.1 Virtualization
    2.1.1 Hardware-Assisted Processor Virtualization
    2.1.2 Memory Virtualization and Nested Paging
  2.2 Microkernels
    2.2.1 Fiasco.OC

3 Related Work
  3.1 Existing Implementation
    3.1.1 Fiasco.OC
    3.1.2 Karma
    3.1.3 SVM Kernel Shadow Paging
  3.2 Comparable Projects
    3.2.1 L4Ka::Pistachio
    3.2.2 NOVA
    3.2.3 KVM
    3.2.4 Turtles

4 Design
  4.1 General Approach
  4.2 Fiasco.OC
    4.2.1 Handling
    4.2.2 VPID Support
  4.3 Karma

5 Implementation
  5.1 Fiasco.OC
    5.1.1 Feature Reporting and Sanitization
    5.1.2 Page Table Handling
    5.1.3 VPID Support
  5.2 Karma
    5.2.1 Unpaged Mode Emulation

6 Evaluation
  6.1 Measurements
    6.1.1 Microbenchmark: Fibonacci
    6.1.2 Microbenchmark: Forkwait
    6.1.3 Microbenchmark: Sockxmit
    6.1.4 Microbenchmark: Touchmem
    6.1.5 Macrobenchmark: Compiling the Linux Kernel
  6.2 Analysis
  6.3 Complexity

7 Conclusion
  7.1 Future Work

Appendix A: Code

Glossary

Bibliography

List of Figures

1 Comparison of Type I and Type II hypervisor models
2 Comparison of page table setup in nested and shadow paging
3 Simplified depiction of memory hierarchy in an exemplary Fiasco.OC-based system
4 Comparison of traditional and Intel Extended page table entry format
5 Evaluation results from the Fibonacci microbenchmark
6 Evaluation results from the Forkwait microbenchmark
7 Evaluation results from the Sockxmit microbenchmark
8 Evaluation results from the Touchmem microbenchmark
9 Evaluation results from the Linux kernel compilation macrobenchmark

List of Tables

1 Virtual machine feature reporting bit field format
2 Opcode in bits 11 to 0 of VPID flush control VMCS field

1 Introduction

As computer systems keep becoming more prevalent in our daily life, their security and reliability are more important than ever. Be it personal computers, mobile and embedded devices, or industrial and financial information systems: most of today’s computers are connected to hostile networks or otherwise exposed to attacks. In addition to that, the multitude of software and hardware components from different vendors with varying quality assurance concepts that work together in a system form a large surface for possible vulnerabilities or defects. Although security and reliability are always dependent on the whole system stack, the operating system is the most critical component: while a vulnerability or defect in a single application may cause that component to fail or be subverted, good operating systems isolate their components from one another, thus making the difference between a single loss of functionality and breakdown or subversion of the whole system.

Sadly, most mass-market operating systems of today are severely lacking in regard to this isolation. While memory protection between userland processes has long been commonplace, other domains such as the file system allow attacks to easily spread through the system: bound by conventions from a time before malware became prevalent, they often allow every process write access to all resources of its user, regardless of whether it would actually need to do so under normal operation. This clearly violates the principle of least privilege, but the coarse-grained access control lists usually employed by such systems are inherently unsuited to efficiently enforce minimalist permissions on a process level. Therefore, even the most mundane programs often constitute prime targets for attack, as they can be used as a stepping stone to infect more important applications.

Realizing this danger, more and more established operating systems expand their security models to provide tighter isolation on request. An increasingly common strategy for this is sandboxing, ranging from the already more than a decade-old jail mechanism in FreeBSD to the recently added Seatbelt framework in Mac OS X [1]. Systems like these artificially restrict the resources a process may use, effectively expanding the access control list principle to a per-process basis. However, even this level of granularity may not be precise enough with attack scenarios such as the confused deputy problem: service processes with broad access rights can be tricked by malicious requests to use a valid permission for a different purpose than it was intended for. Preventing this kind of attack requires delegating permissions on a per-request level, as it is possible with other security models such as object capabilities. While there are attempts to graft that model onto existing operating systems, such as the Capsicum framework for FreeBSD [2], they will always be encumbered by the fact that they are late and unexpected additions to an existing environment. Both security and performance considerations favor systems that were consistently and thoroughly designed around this model from the start.

Adding to these concerns about userland processes, the kernel itself plays a major role in operating system security. Since all kernel components execute at the highest privilege level by definition, it is impossible to enforce any kind of isolation upon them. Attacks or failures in kernel code can easily

bypass all memory protection and affect any part of the system. The logical solution to mitigate this problem is therefore to keep the amount and complexity of these components as small as possible: many established operating systems keep a vast amount of device drivers and assorted utility functionality as part of a big monolithic kernel, even though the right architecture would allow those to function with limited privileges. Shifting that functionality into separate userland processes until only the absolutely necessary parts remain ultimately leads to the microkernel concept, which has been a well-established computer science research target for decades. Initially suffering from bad performance, more recent advances pioneered by Jochen Liedtke [3] have produced newer implementations that can hold their ground against monolithic systems. Today, fast and flexible microkernels based on modern security models offer userland execution environments with both fine-grained permission controls and a small trusted computing base.

Combining a sound design with competitive performance, modern microkernels have gained a good track record in security-critical specialty applications, such as the Green Hills INTEGRITY-178B real-time operating system used in military avionics [4]. The personal computer market is still dominated by monolithic kernels, however, and all attempts at establishing a microkernel-based general purpose operating system have proven unsuccessful to date. The main reason for this lies in legacy software: the broad spectrum of existing applications on personal computers was written for well-established operating systems and designed to fit their APIs and security models. The interface of most microkernel systems is usually so alien in comparison that porting software to it becomes a complex and time-consuming task.

A promising idea to remedy part of that problem is virtualization: the microkernel system can run one or several instances of traditional operating systems as virtual machines, which provide separate execution environments for unmodified legacy applications. While those instances do not become more secure by themselves, the microkernel can isolate them from each other and tightly regulate communication between them. One can also implement sensitive components with an increased need for security as dedicated applications for the microkernel system and have them provide secure interfaces to the virtual machines, thus gaining isolation where it is important while still avoiding the need to port all components of the system to the microkernel interface. An example of this can be seen in the L4Android project, which proposed to use virtualization to move security-critical mobile payment services out of an untrusted smartphone operating system [5].

1.1 Goal

Fiasco.OC is one of these modern microkernels. It has already accrued almost a decade of history in virtualizing legacy operating systems, starting with the L4Linux kernel rehosting project [6]. More recently, it was enhanced with support for true virtual machines after the advent of hardware-assisted virtualization architectures for x86 processors, AMD SVM and Intel VMX. Both are similar but independent instruction set extensions that offer operating systems a hardware interface to facilitate the hosting of virtual machines. However, both architectures initially only focussed on privilege and device isolation, offering no explicit answer for virtualized memory management. As both the host operating system and its virtual machine require access to the processor’s memory translation mechanisms, this forces the virtualization environment to employ shadow paging, a complicated emulation scheme to compensate for the lacking

hardware. While that provides a functionally correct solution to the problem, it requires frequent interruptions to the virtual machine’s execution, severely degrading its performance. Recognizing this shortcoming, both processor manufacturers later added optional extensions to their virtualization architectures that provided separate memory translation features for the virtual machine and its host. This technique is called nested paging and circumvents the problem altogether, removing the tightest architectural bottleneck for x86 virtual machines and permitting them to execute at near-native performance.

Completing Fiasco.OC to form a full-featured virtualization environment, the experimental virtual machine manager Karma aims to provide a lightweight and flexible platform for further research. Its initial design was focussed on providing fast and secure I/O virtualization, and sidestepped the memory problem altogether by relying solely on nested paging with the AMD SVM technology. After initial successes, an alternative shadow paging implementation was added to broaden its application spectrum, as not all processors include the nested paging extension. This in turn allowed the program to be ported to Intel’s VMX architecture, since the largely software-controlled shadow paging mechanism works almost completely independent from the processor’s hardware-assisted virtualization interface. The usage of Intel’s own nested paging hardware, however, has remained impossible to date, since some major differences to AMD’s approach make it particularly unfit for Fiasco.OC’s internal memory architecture.

As a result of these problems, Fiasco.OC’s virtualization performance on Intel processors continues to be disappointing. While AMD systems with nested paging have always achieved near-native performance without effort, Karma’s still very basic VMX implementation performs poorly even when compared to other shadow paging virtualization environments and has no means to use high-end processors to their full potential.

The goal of this thesis is therefore to design and implement ways to improve Fiasco.OC’s virtualization performance for VMX processors, aiming to achieve near-native performance comparable to the current SVM implementation with nested paging. An obvious path to reach that goal would be to somehow integrate support for Intel’s own nested paging mechanism in Fiasco.OC, but this may require far-reaching changes to its internal memory management code in order to overcome the aforementioned incompatibility. Recent research on SVM systems also suggests that a variety of optimizations can dramatically improve shadow paging performance in Fiasco.OC, to the point where it is reasonably competitive on real-world workloads — a possible alternative solution that should be analyzed and weighed against the first approach. Additionally, some other optional features in Intel’s VMX architecture might still be unused in the current implementation, providing hitherto untapped potential for minor performance increases.

1.2 Structure

Leading up to this goal, chapter 2 starts by recapitulating some of the fundamental concepts touched on in this thesis in further detail. It outlines the technical intricacies of virtualization and revisits basics of the microkernel concept as it is realized in Fiasco.OC. The subsequent chapter describes existing implementations in the area, starting with the current state of Fiasco.OC and Karma, and proceeding to present some alternative virtualization environments for comparison.


Chapter 4 defines the general approach of the solution with abstract interfaces and discusses several design choices. The actual implementation including specific techniques to solve some of the finer problems is described in chapter 5. Finally, the sixth chapter contains evaluation results of several benchmarks performed on the finalized solution and selected comparison targets. The thesis is concluded in chapter 7, which recapitulates its results and outlines possibilities for future work.

2 Background

The established technical basics and principles applied in this thesis are presented in this chapter. It will give a brief overview of the history and definition of virtualization and its associated concepts. Afterwards, the current solutions for hardware-assisted virtualization on x86 processors are described in further detail, focussing on the Intel VMX architecture used in this thesis. In particular, the different techniques to achieve memory virtualization are outlined and compared. The second section gives a broad introduction to the microkernel concept before focussing on the Fiasco.OC kernel, providing a summary of its general concepts and features that are relevant to the thesis. The actual virtualization subsystem is intentionally left out, and will be described in further detail in the next chapter.

2.1 Virtualization

Virtualization describes a technique in which a computer system (called the host) provides one or more software execution environments, which themselves simulate a whole computer system. These are called virtual machines, and they are ideally indistinguishable from an environment provided by physical computer hardware to the software that they execute. They generally run a whole operating system originally intended for a physical machine (called the guest), which includes its own stack of drivers, services and applications. A virtual machine is generally differentiated from an emulator by providing an indistinguishable environment not only in functionality but also in timing: its execution time and latency must not differ significantly from a physical machine, which means that a large majority of its instructions must be executed directly by the physical processor, without being interrupted by traps or exceptions. This requires that the host processor can interpret the guest’s instruction set, which makes cross-platform virtualization generally impossible without specialized techniques such as binary translation. The earliest experiments with virtualization reach back to the 1960s: the IBM/CP-40 research system is generally considered to be the first implementation of true virtual machines [7]. The concept was soon expanded and commercialized with the CP-67 and its successors in the IBM VM operating system family, the latest iterations of which are still used to this day. Thus, virtualization already became an established and notable research topic in the age of mainframes, with many important theoretical models and concepts reaching back to that time. One of its most important theoretical pioneers was Robert P. Goldberg, who consolidated most of that early research into his PhD thesis in 1973 [8]. It establishes many virtual machine patterns and design principles that are still valid today. Using slightly different terminology, he defines a virtual machine in a similar manner as above:


A virtual computer system is a hardware-software duplicate of a real existing computer system in which a statistically dominant subset of the virtual processor’s instructions execute on the host processor in native mode.

He further introduces a formal model that abstracts a virtual machine as a map of virtual to physical resources (cf. [8] chapter 4.2). The job of the virtualization environment is then to apply this mapping whenever a virtual resource is accessed and perform that operation on the corresponding physical resource while ensuring that any transformations or restrictions set up in the environment are enforced. The term virtual resource is left intentionally broad to make the model as general as possible, but practical examples usually fall into the following three groups: virtual processors, virtual memory, and virtual I/O devices.

Virtualizing a processor is not as simple as executing guest instructions as if they were just another process in the host operating system. It must behave like a full-featured physical processor in every way, which includes delivering interrupts and being able to switch between different privilege modes. These features are usually not just mapped directly onto a physical processor and instead carefully emulated by the virtualization environment. In fact, a virtual processor does not necessarily have a definitive physical counterpart — the host system might migrate it between different processors as it sees fit and may also multiplex several virtual processors on a single physical one. When interrupting or descheduling the virtual machine, its processor state must be saved similar to a normal task switch, with the distinction that this does not just include the register file but also all special architectural state that may often not be directly accessible to software (such as the automatic blocking of nested non-maskable interrupts on the x86 processor, cf. volume 3, chapter 6.7 of its Developer’s Manual [9]).

A virtual machine’s memory can generally be mapped to any unoccupied physical memory area. If the guest operating system is not specifically aware of the nature of its environment, the host will need to emulate the architecture’s normal method of memory enumeration (on x86, this is usually done with the 0xE820-map that can be queried through software interrupts in 16-bit real address mode). Faithful virtualization may require a predefined starting address (often 0x0), certain memory holes, or pre-populated areas with system information, as defined by the respective architecture. Since the host will generally not be able to reserve exactly those needed physical address ranges for every guest, it must use memory translation mechanisms such as paging to map other areas to them. As guest operating systems will often want to make use of the architecture’s available translation mechanisms themselves, their virtual address spaces will thus be built upon several layers of translation. This is a major problem in memory virtualization, which will be discussed in further detail in section 2.1.2.

Virtual I/O devices come in just as many forms as their physical counterparts. Most architectures even require some devices to always be present, such as timers or interrupt controllers. Others may be optional and will be detected at runtime by the guest operating system, but are still necessary to enhance the virtual machine with useful capabilities such as audio or video output, networking, or persistent storage.
When faithfully emulating real existing devices, the guest accesses them through the same means as it would on a physical machine, such as memory mappings or dedicated I/O ports. However, many physical device interfaces require frequent interaction with small payloads, which performs exceptionally badly in virtualization environments since every emulated interaction requires constant overhead. This has led to the rise of paravirtualization as an alternative technique, where the guest uses device drivers that are specifically aware of the virtualization environment and can batch their accesses in the most efficient way.


Even more so than processors, virtual devices do not need to be backed by a physical counterpart. Small architectural devices like timers and interrupt controllers are generally always emulated completely in software. More prominent devices might also be completely virtual: several virtual machines could be connected through virtual network interfaces, even though the host might not really have a physical network interface card. Others may be multiplexed onto a single physical device, such as virtual machines sharing the same hard drive through partitioning. Of course, there is also the most simple solution of direct pass-through to a physical device, with little or no command translation or sanitization. Although device virtualization is a huge field of its own, it will not be further discussed in this thesis, since its primary goal lies in optimizing memory virtualization.

The host operating system’s software stack that forms the virtualization environment can generally be divided into two sections: the component that directly interacts with the virtual machine and manages its resources is called the Virtual Machine Monitor (VMM). It will handle all interrupts, exceptions and traps that are intercepted from the virtual machine, as well as other events such as privilege mode changes or device interactions. It will either provide the device emulation itself or forward any accesses to the appropriate backends and inject the responses back into the virtual machine. Every virtual machine is usually controlled by its own separate VMM instance, and to further increase isolation these instances often run in unprivileged mode on the host machine. The second component is called the hypervisor and exists only once per host system. It is a central part of the host kernel executing in privileged mode and performs the bookkeeping and scheduling necessary to multiplex all virtual machines on the system. When the VMMs run unprivileged and thus cannot directly perform some necessary operations, the hypervisor also provides an API that allows them to access and manipulate their delegated virtual machine in a secure and isolated manner.

Reaching back to Goldberg’s time (cf. chapter 2.1 of his thesis [8]), hypervisors have traditionally been divided into two classes, as visualized in figure 1: Type I hypervisors make up host operating systems whose sole purpose is forming a virtualization environment. In this case, the hypervisor is the host kernel, although it may still have some rudimentary support for host-level processes which may be used to implement simple device drivers or management interfaces. Other Type I systems opt to run a special virtual machine with extra privileges as an environment for device drivers or user interfaces, among them the well-established virtualization platform Xen [10]. Type II hypervisors such as the Linux module KVM [11], on the other hand, are part of a larger general purpose operating system. They are usually realized as an optional kernel module which hooks into parts of the scheduler and processor driver to allow virtual machines alongside regular processes. These systems can rely on their existing device drivers and user interfaces to support the virtualization environment.

2.1.1 Hardware-Assisted Processor Virtualization

As noted above, virtual processors must provide all features of the corresponding processor architecture. However, guests cannot simply be allowed to access all privilege modes, install their own interrupt tables, and modify settings such as addressing modes or interrupt masking at will — these mechanisms are already occupied by the host operating system, and providing unrestricted access to them would allow the guest to circumvent its isolation. Instead, virtualization environments have to emulate these features to trick guests into believing that they have direct control, when their accesses are actually intercepted by the VMM through traps and memory protection.


Figure 1: Comparison of Type I and Type II hypervisor models

This is only possible on processor architectures where all instructions that are sensitive to these emulated parts of the machine state (i.e. their effects can differ depending on privilege mode, memory translation state, etc.) are also privileged (i.e. they always cause a trap in the lowest privilege mode), as proven by Popek and Goldberg in 1973 [12]. Unfortunately, many architectures do not satisfy this restriction, among them x86 [13]. A well-known example for this is the POPF instruction that modifies the processor’s flags register: it can be executed in all privilege modes, but changes to the interrupt masking flag are silently ignored in lower privilege levels. A VMM thus would have no chance to detect when its guest tries to set this flag and cannot emulate its effect correctly.

Due to x86’s ubiquity, there have been many attempts to circumvent this problem in the past. One approach is extensive paravirtualization (i.e. modifying guest operating systems to specifically support being virtualized) in order to replace all actions that would need to be intercepted by explicit trap instructions. While this is feasible, the required modifications are often tediously numerous, and licensing restrictions might prevent source code modifications altogether. Others have succeeded in using automated binary translation to replace all sensitive instructions with traps at runtime, but this can prove very difficult to master for some guest operating systems and also incurs a performance impact [14].

As the desire to run efficient virtual machines on commodity hardware kept growing throughout the last decade, processor manufacturers acknowledged the need to improve x86’s virtualizability. Intel led the way with the release of their Virtual Machine Extensions (VMX) in 2005 [9], and AMD quickly followed suit with their independently developed Secure Virtual Machine (SVM) technology [15]. Both are x86 instruction set extensions that differ in implementation details but follow the same principle: they add another layer of privilege modes (usually called root and non-root execution) orthogonal to the existing privilege levels, which are thus free for use by the guest operating system. Execution in non-root mode can be configured to enforce the necessary memory and I/O protection even at the highest traditional privilege level, while all instructions sensitive to this new distinction cause a trap into root mode — thus finally making the x86 architecture fully virtualizable according to Popek and Goldberg [12].
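The POPF case mentioned above can be made concrete with a few lines of user-mode code. The following is a minimal sketch, assuming GCC inline assembly on an x86 Linux system with the default I/O privilege level of 0: it attempts to clear the interrupt flag from ring 3 and observes that the write neither traps nor takes effect.

    #include <stdio.h>

    int main(void)
    {
        unsigned long flags;

        /* Read the current flags register; IF (bit 9) is normally set in ring 3. */
        __asm__ volatile("pushf; pop %0" : "=r"(flags));
        printf("IF before POPF: %lu\n", (flags >> 9) & 1);

        /* Try to clear IF and write the flags back with POPF. The instruction
         * executes without any trap, but with IOPL 0 the change to IF is
         * silently dropped -- a VMM relying on traps never sees the attempt. */
        flags &= ~(1UL << 9);
        __asm__ volatile("push %0; popf" : : "r"(flags) : "cc");

        /* Read the flags again: IF is still set. */
        __asm__ volatile("pushf; pop %0" : "=r"(flags));
        printf("IF after POPF:  %lu\n", (flags >> 9) & 1);
        return 0;
    }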


The exact behavior in VMX non-root mode can be tightly configured through the use of a Virtual Machine Control Structure (VMCS) — an opaque data structure whose address and contents are set through the use of special instructions, which allow different processor models to decide on their own what parts they will cache internally. It contains state-save areas for both guest and host, including the hidden parts of processor state that have traditionally been inaccessible. The processor can enter non-root mode and exchange its complete state with those fields in one atomic operation (a virtual machine entry), and perform the reverse process when it encounters an exit condition. These exits are caused by certain instructions (sometimes conditional on the current processor state) that can be configured through bit fields in the VMCS. Other reasons are interrupts or I/O accesses, which can be further differentiated by the specific interrupt or port number through additional bit fields. Interrupts that do not exit are processed normally through the guest’s interrupt descriptor table — in addition to that, the host can also inject fabricated interrupts into the guest (e.g. in order to emulate a virtual device). Finally, VMX also offers special fields to disguise some parts of the processor state without actually changing them, such as the control registers, the timestamp counter or the local APIC. This can be used to alter the guest’s perceived reality, such as hiding the clock ticks used up during a virtual machine exit or pretending to run in unpaged mode even though the control register bit for paging is actually enabled (and enforced by the host).

With these features VMX goes to great lengths to ensure that almost all operations which do not access restricted resources can be executed without VMM intervention. The few missing cases mostly concern instructions that are very rarely used during normal operation or legacy features that are no longer relevant to modern operating systems. One such issue is the old hardware-assisted task switch, which has never been widely established due to portability and performance problems and was only used in niche operating systems (most notably OS/2). A more prominent example is the execution of 16-bit code in real address mode: although long obsolete for actual program execution, it is still a necessary step in the traditional x86 boot process. While the recently introduced Extensible Firmware Interface might eventually supersede this as well, current support in operating systems is mostly experimental, and the old BIOS interface will still stay relevant for many years. Both of these issues can, of course, be solved by VMM emulation (with an associated performance penalty). In the latter case this requires either full-blown instruction-level emulation or a complicated setup built around the venerable virtual 8086 mode, which had originally only been intended for application software.

In order to address this problem (and thus reduce a significant portion of VMM complexity), Intel has extended VMX with a feature called unrestricted guest in their latest processor generations: finally allowing virtual machines to run in real address or unpaged protected mode, this closed the last major hole in VMX’s virtualization capabilities and also allows old 16-bit operating systems like MS-DOS to run without intervention.
However, although the hardware problem has thus been solved, full 16-bit compatibility also requires supporting the interrupt-accessed BIOS software API that is implemented in memory-mapped ROM on physical machines, which can be either pre-placed in the virtual machine’s address space before boot or, once again, emulated through VMM intercepts.
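To illustrate the exit-driven control flow that results from this configuration, the following sketch shows how a hypervisor running in 64-bit root mode might read the basic exit reason after a virtual machine exit and dispatch on it. The handler functions and the struct vm type are hypothetical placeholders; the field encodings and exit-reason numbers are taken from the Intel SDM but should be verified against the manual.

    #include <stdint.h>

    struct vm;                                      /* hypothetical per-VM state   */
    void emulate_cpuid(struct vm *vm);              /* handlers implemented        */
    void block_vcpu_until_interrupt(struct vm *vm); /* elsewhere in the hypervisor */
    void forward_to_device_model(struct vm *vm);
    void handle_nested_page_fault(struct vm *vm);
    void inject_fault_or_stop(struct vm *vm, uint32_t reason);

    /* Thin wrappers around the VMREAD/VMWRITE instructions (root mode only). */
    static inline uint64_t vmread(uint64_t field)
    {
        uint64_t value;
        __asm__ volatile("vmread %1, %0" : "=r"(value) : "r"(field) : "cc");
        return value;
    }

    static inline void vmwrite(uint64_t field, uint64_t value)
    {
        __asm__ volatile("vmwrite %1, %0" : : "r"(field), "r"(value) : "cc");
    }

    /* Selected VMCS field encodings and basic exit reasons (Intel SDM). */
    enum { VM_EXIT_REASON = 0x4402, VM_EXIT_INSN_LEN = 0x440C, GUEST_RIP = 0x681E };
    enum { EXIT_CPUID = 10, EXIT_HLT = 12, EXIT_IO = 30, EXIT_EPT_VIOLATION = 48 };

    void handle_exit(struct vm *vm)
    {
        uint32_t reason = vmread(VM_EXIT_REASON) & 0xffff;   /* basic exit reason */

        switch (reason) {
        case EXIT_CPUID:
            emulate_cpuid(vm);
            /* Skip the intercepted instruction before resuming the guest. */
            vmwrite(GUEST_RIP, vmread(GUEST_RIP) + vmread(VM_EXIT_INSN_LEN));
            break;
        case EXIT_HLT:
            block_vcpu_until_interrupt(vm);   /* wait for an interrupt to inject */
            break;
        case EXIT_IO:
            forward_to_device_model(vm);      /* e.g. hand off to the userland VMM */
            break;
        case EXIT_EPT_VIOLATION:
            handle_nested_page_fault(vm);
            break;
        default:
            inject_fault_or_stop(vm, reason);
        }
    }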


2.1.2 Memory Virtualization and Nested Paging

Even though VMX makes a processor fully virtualizable, efficient memory virtualization still brings its own set of problems. When Popek and Goldberg wrote their historic paper [12], they chose a simplified processor model with an old-style relocation-bounds register as its only form of memory protection. Unfortunately, the far more powerful memory translation mechanisms in modern architectures make them much harder to virtualize, since the guest operating system must be expected to use them to their full potential, indiscriminate of the fact that the host has already set up its own translations.

The x86 architecture uses two translation mechanisms chained together: segmentation is a legacy feature that uses a set of internally cached base address and limit registers, similar to the antiquated relocation-bounds systems. A descriptor table in memory defines the available segments (including attributes and access rights), and userland code can explicitly load them at will. As segmentation comes first in the address translation chain, there is no harm in allowing the guest operating system full control during its execution. Memory protection can be enforced through the latter mechanism, and virtual machine exits will restore all segment registers to known-good values from the VMCS.

Paging, on the other hand, is a much more powerful system and the primary form of translation employed by current operating systems: the whole virtual address space is divided into 4 KiB pages, each of which can be arbitrarily mapped to a corresponding physical page frame. This mapping is defined in the page table, a radix tree with several levels of depth, which each map the remaining most significant bits of a virtual address to the next lower-level table or ultimately to the resulting physical frame address. A single level of the tree itself occupies exactly one 4 KiB frame, which results in two levels each mapping 10 bits of the virtual address through a field of 1024 32-bit wide entries. Later generations of x86 processors introduced a different format with 64-bit wide entries to support larger physical addresses through the Physical Address Extension (PAE) feature and later the full 64-bit addressing mode, resulting in fields of 512 entries each corresponding to 9 bits of the virtual address. Since the amount of memory occupied by translation tables can be significant on large systems, an attribute bit in individual entries can tell the processor to ignore the last translation level and directly map the whole area to a single, large frame. As translation lengths per level differ depending on the paging mode, these superpages are 4 MiB large with 32-bit and only 2 MiB with 64-bit page table entries.

In a virtualization environment, the hypervisor creates a page table to form a virtual address space with the size and layout expected by the guest operating system in an unused area of the host’s physical memory. The virtual machine interprets this as its physical address space, so host virtual addresses essentially become guest physical addresses. During its operation, the guest creates its own page tables that build guest virtual address spaces on top of this guest physical memory. Virtual addresses referenced from guest code therefore need to be translated into guest physical addresses through the guest page table, which (being host virtual addresses) must then be translated again into host physical addresses through the host page table. Unfortunately, traditional x86 processors can only use one page table at any given time.
There is a solution to this problem that can be implemented purely in software: interpreting both page tables as address-to-address translation functions h and g, the corresponding composition h ◦ g is also an address-to-address function and can thus be represented as a page table itself. This is called a shadow page table, which can be constructed on the fly by the VMM after parsing the existing guest and host tables.
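The following sketch makes this composition concrete: to fill in one shadow entry, the VMM walks the guest table (g) and then the host table (h) and installs the combined translation. All names are hypothetical; walk_page_table() stands in for a real software page-table walker, and attribute handling is reduced to a simple intersection of permissions.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t addr_t;
    struct page_table;   /* opaque; the real structure is the hardware radix tree */

    /* Hypothetical helpers: a software page-table walker and an entry writer. */
    bool walk_page_table(const struct page_table *pt, addr_t va, addr_t *pa_out,
                         unsigned *attrs_out);
    void set_entry(struct page_table *pt, addr_t va, addr_t pa, unsigned attrs);

    /*
     * Build one shadow entry for a guest virtual page:
     *   guest virtual --g--> guest physical (= host virtual) --h--> host physical
     * The shadow table maps guest virtual addresses directly to host physical
     * frames, so the hardware can use it as the single active page table.
     */
    bool shadow_map_page(struct page_table *shadow,
                         const struct page_table *guest_pt,   /* g */
                         const struct page_table *host_pt,    /* h */
                         addr_t guest_va)
    {
        addr_t guest_pa, host_pa;
        unsigned g_attrs, h_attrs;

        if (!walk_page_table(guest_pt, guest_va, &guest_pa, &g_attrs))
            return false;   /* guest page fault: reflect it back into the guest */

        if (!walk_page_table(host_pt, guest_pa, &host_pa, &h_attrs))
            return false;   /* host page fault: the VMM must map this frame first */

        /* The effective permissions are the intersection of both mappings. */
        set_entry(shadow, guest_va, host_pa, g_attrs & h_attrs);
        return true;
    }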

Figure 2: Comparison of page table setup in nested and shadow paging

However, every time the guest subsequently changes its paging configuration, the VMM needs to intercept and reflect that change — most notably in guest task switches, which require the whole shadow table to be rebuilt from scratch. On top of that, modifications to single entries in the guest table must also be caught, which generally requires the VMM to intercept all guest page faults. Each of these intercepts naturally requires its own virtual machine exit, incurring a processor-dependent but generally high performance overhead and additional software processing in the hypervisor. While there are several techniques using lazy updates or even paravirtualization to reduce the overall amount of intercepts, shadow paging implementations still always perform notably worse than physical machines on real-world workloads. Ever since the introduction of paged memory this has been the most prominent bottleneck to efficient virtualization.

Even back in the age of mainframes, solutions were proposed (cf. chapter 4.5 of Goldberg’s thesis [8]) and eventually implemented that circumvent the problem completely through the design of special processor hardware that can perform two consecutive address translations for every memory access. While this leads to slightly higher overall access times (as the processor must walk two page tables to find the ultimate physical address), it completely eliminates the software maintenance overhead for memory virtualization: the hypervisor can just load the host page table into this special second-level paging register during virtual machine entries and let the guest deal with page table setup and page fault handling on its own. Figure 2 illustrates address translation using this technique (known as nested paging) and contrasts it with the shadow paging approach.

Although nested paging as a concept had long been known, none of the x86 virtualization extensions initially supported it. AMD was the first to enhance their SVM technology accordingly in 2007. The simple and straightforward implementation just expects a second page table base address to be supplied by the hypervisor, using the same format as page tables for regular host processes. A year later, Intel

released a similar enhancement for VMX called Extended Page Table (EPT). Unfortunately, they decided to completely abandon the traditional x86 page table entry format and devise a new, incompatible set of page attributes that allows slightly finer control over access rights. Since the guest operating system and regular host processes still use the traditional format, this requires host operating systems to carefully differentiate between these use cases when writing page tables.

As memory translation always applies to all referenced addresses, the processor would theoretically have to do a full page table walk on every memory access, which would lead to disastrous performance. Therefore, processor architectures that use paging have always included a special cache for recently used page translations, the Translation Lookaside Buffer (TLB). When page table entries change, the operating system must manually invalidate the corresponding TLB entries, or stale entries could lead to bugs and even circumvent memory protection. x86 processors can invalidate individual entries using the INVLPG instruction and will automatically flush the whole TLB whenever a new page table is loaded. However, while it might have been an insignificant loss to flush all translation entries on every task switch 27 years ago, it can be very harmful today: TLB misses can be a major performance bottleneck and modern processors often have very powerful TLBs to mitigate this factor. Many processes only need to cache a small working set of page entries during the majority of their lifetime, therefore a significant number of translations from previous processes could potentially stay valid until they get scheduled again. In order to avoid letting this potential go to waste, recent x86 processors can tag their TLB entries with a Process Context Identifier (PCID) that uniquely identifies a process. Tagged TLB entries are not automatically flushed and simply lie dormant until the same PCID is loaded into the processor again.

As virtual machines need to cache their translations too, these mechanisms had to be expanded: with EPT, the TLB mostly contains combined mappings that directly translate guest virtual to host physical addresses, but can also cache mere guest physical translations (as they are needed during guest page table walks). When the guest operating system modifies its guest page table entries, its INVLPG calls will invalidate the former kind of mappings as expected. Changes to the host page table need to invalidate both kinds, which can be flushed globally with the new INVEPT instruction. Task switches within the guest can invalidate entries or use PCIDs as normal — however, in order to prevent one virtual machine from using the cached translations of another, virtual machine entries and exits always flush the complete TLB. This mechanism can lead to a similar performance bottleneck as the blanket invalidation on task switches: if a virtual machine gets scheduled, processes a clock tick with no actions, and goes back to a halt state, chances are that it did not touch most of the TLB. The solution to this problem, which Intel added as an extension to VMX, goes analogous to PCIDs: translation entries cached by a virtual machine can be tagged with a Virtual Processor Identifier (VPID) read from the VMCS. These tagged TLB entries get preserved across virtual machine entries and exits, and simply lie dormant when their VPID does not match the current execution context.
Invalidations and flushes only affect entries of the current VPID — the host can also use the new INVVPID instruction to flush other or all VPIDs when necessary.
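The incompatibility between the two entry formats mentioned above can be illustrated with a short sketch that derives an EPT leaf entry from a traditional 64-bit page table entry. The bit positions follow the Intel SDM (present/writable/user in bits 0–2 and the no-execute bit in bit 63 of a traditional entry; read/write/execute in bits 0–2 of an EPT entry), but the helper itself is a hypothetical illustration — it ignores memory-type and accessed/dirty details and is not the attribute translation scheme implemented later in this thesis.

    #include <stdint.h>

    /* Selected bits of a traditional (PAE/64-bit) page table entry. */
    #define PTE_PRESENT   (1ULL << 0)
    #define PTE_WRITABLE  (1ULL << 1)
    #define PTE_USER      (1ULL << 2)
    #define PTE_NX        (1ULL << 63)
    #define PTE_ADDR_MASK 0x000ffffffffff000ULL   /* frame address, bits 51:12 */

    /* Selected bits of an EPT entry. */
    #define EPT_READ      (1ULL << 0)
    #define EPT_WRITE     (1ULL << 1)
    #define EPT_EXEC      (1ULL << 2)

    /* Hypothetical sketch: derive an EPT leaf entry from a traditional entry.
     * Real code additionally has to fill in the EPT memory type and may want
     * to track accessed/dirty state; both are omitted here. */
    static uint64_t pte_to_ept(uint64_t pte)
    {
        uint64_t ept = 0;

        if (!(pte & PTE_PRESENT))
            return 0;                      /* not present: no rights at all */

        ept |= pte & PTE_ADDR_MASK;        /* the frame address layout is shared */
        ept |= EPT_READ;                   /* "present" implies readable         */
        if (pte & PTE_WRITABLE)
            ept |= EPT_WRITE;
        if (!(pte & PTE_NX))
            ept |= EPT_EXEC;               /* EPT expresses execute positively   */
        return ept;
    }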


2.2 Microkernels

The kernel of an operating system designates the most central part that always executes at the highest privilege level. It therefore necessarily contains all functionality whose implementation requires these privileges, such as the core components of process and memory management. Userland applications can use an API of system calls to access this functionality in a secure and isolated manner, usually through execution of a trap instruction. Traditionally, most kernels also contained a lot of functionality that is not strictly dependent on a high privilege level but still convenient to be easily accessed by all applications, such as hardware drivers, file system services, or networking. In the early days of computing, isolation and security were not considered as important as they are today, and there was little interest in moving code away from the kernel.

Distributed computing, however, was a popular research topic in the 1970s, and researchers at the University of Rochester developed an operating system that could be distributed across a large network of machines by making its services accessible through a unified, network-capable inter-process communication scheme [16]. As their work progressed, they even made all kernel services available through special communication ports, until there were no ordinary system calls left except for inter-process communication itself. The system was redesigned and derived from several times and eventually spawned Mach, its portable, UNIX-compatible and fully multiprocessing-capable successor at Carnegie Mellon University. After several years of development on Mach, researchers investigated the idea of moving some of its services out of the kernel: since other applications already accessed them through inter-process communication channels, this was as simple as redirecting one of those channels from kernel-internal handling to a newly designed userland process.

This simple idea formed the basis of what would later be known as a microkernel system: all operating system functionality that does not absolutely require kernel privileges is moved into separate userland processes called servers — their services are made available to other servers and end-user applications through inter-process communication facilities. This results in a fundamental increase in system security and robustness: when a server crashes or becomes unresponsive, it can often simply be restarted without affecting system stability as a whole. Servers can run at reduced privilege modes and with tightly controlled access rights, being only allowed to perform actions that are required to fulfill their purpose and thus implementing the principle of least privilege [17]. This helps trim the amount of code whose corruption could undermine the system as a whole (the trusted computing base [18]) to a minimum, which reduces the chance of security-critical bugs in that code and simplifies its formal verification.

Despite these advantages, the microkernel concept faced severe practical difficulties: the overall performance of Mach and other attempts of its time period could not compete with traditional monolithic kernels. Thinking that the message passing approach was simply too slow for some critical services, developers tried to move selected functionality back into the kernel but could not solve the overall problem. For a time, many researchers lost faith in the concept and believed it to be inherently unsuitable for efficient systems.
More specific benchmarking later showed that the inter-process communication latency was the main culprit behind the performance problems. German operating systems researcher Jochen Liedtke conjectured that this was caused by inefficient implementation rather than fundamental problems. Mach’s powerful inter-process communication mechanisms offered many optional services and features,


Figure 3: Simplified depiction of memory hierarchy in an exemplary Fiasco.OC-based system

such as port access rights and asynchronous delivery through buffered message queues, which bloated the message passing code to unnecessarily large proportions. Liedtke advocated a radical return to the principle of absolute minimalism, offering nothing but fast, dumb, synchronous message passing. He proceeded to implement a demonstration kernel on the x86 architecture whose hand-crafted assembly routines for inter-process communication severely outclassed prior systems, and came close enough to traditional latencies to prove the microkernel concept’s general viability [19].
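From a client’s point of view, this kind of fast, synchronous message passing boils down to a single blocking call. The following sketch shows what such a call looks like with the l4sys C bindings shipped with L4Re; the function names and signatures reflect my reading of those headers and should be treated as assumptions to verify, and the server capability is obtained elsewhere.

    #include <l4/sys/ipc.h>
    #include <l4/sys/utcb.h>
    #include <l4/sys/types.h>

    /* Sketch of a synchronous L4 IPC call: place two words into the UTCB
     * message registers and block until the server has replied. */
    long ping_server(l4_cap_idx_t server, l4_umword_t op, l4_umword_t arg)
    {
        l4_msg_regs_t *mr = l4_utcb_mr();
        mr->mr[0] = op;
        mr->mr[1] = arg;

        /* label 0, two untyped words, no capability items, no flags */
        l4_msgtag_t tag = l4_ipc_call(server, l4_utcb(),
                                      l4_msgtag(0, 2, 0, 0),
                                      L4_IPC_NEVER);
        if (l4_ipc_error(tag, l4_utcb()))
            return -1;                     /* IPC failed (e.g. invalid capability) */

        return mr->mr[0];                  /* first word of the reply */
    }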

2.2.1 Fiasco.OC

Capitalizing on this work, Liedtke proceeded to implement the successful L4 microkernel [3], which gained international acclaim. Initially developed exclusively in x86 assembly, its API was later reimplemented in C++ to prove that it would still be competitive with a platform-independent design. It would go through several revisions and enhancements in the following decade as multiple independent projects forked off its code or created reimplementations, forming a whole family of L4 kernels. The variant discussed in this thesis started off under the name L4/Fiasco, developed at the Technische Universität Dresden as an L4 successor tailored for preemptibility and hard real-time guarantees. It has been under constant development ever since, leading to the current iteration Fiasco.OC [20].

Userland processes in an L4 system are modeled by kernel objects called tasks, which represent isolated protection domains including address space and access rights. They form the environment for thread objects that represent execution contexts including register state and instruction pointer. During boot, the kernel creates a special task called σ0 that contains all available physical memory, except for parts reserved by the kernel itself. All later tasks are initially created empty, and need to explicitly receive memory to be usable: L4 tasks offer map and grant operations to share or transfer their memory to other tasks. These mappings can be transitive and are recorded in a hierarchical mapping database that tracks all memory regions as tree-like lineages with σ0 at their root. Tasks further up the hierarchy always have the right to unilaterally revoke their mappings with an unmap operation, which removes them not only from the specified child task but also recursively revokes all further mappings of that region it might have created.

These three primitives are the only memory management operations offered by the kernel. Any further required functionality must be implemented in userland tasks by employing these concepts. For example, the traditional POSIX malloc() call could be backed by a server task that receives a large mapping from σ0 on boot and offers an interface through which other tasks can request a page from that allotment. Since memory can be mapped to different target virtual address ranges (up to page granularity), the client tasks could receive continuous memory regions even if the server task’s address space is fragmented.
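A small sketch of these map and unmap primitives, as such a server task might use them through the l4sys C bindings, is shown below. The task capabilities are assumed to be held by the caller already; the function names and parameters follow the l4sys API as distributed with L4Re, but their exact signatures are an assumption to be checked against the headers.

    #include <l4/sys/task.h>
    #include <l4/sys/consts.h>
    #include <l4/sys/types.h>

    /* A pager-like server pushes one of its own pages into a client task
     * and can later revoke it again. */
    void share_page(l4_cap_idx_t own_task, l4_cap_idx_t client_task,
                    l4_addr_t local_page, l4_addr_t client_addr)
    {
        /* Describe the page to send: one page, read-write, in our address space. */
        l4_fpage_t fp = l4_fpage(local_page, L4_PAGESHIFT, L4_FPAGE_RW);

        /* Map it into the client task at client_addr (a shared mapping, not a grant). */
        l4_task_map(client_task, own_task, fp, client_addr);
    }

    void revoke_page(l4_cap_idx_t own_task, l4_addr_t local_page)
    {
        /* Recursively remove the mapping from all tasks that received it,
         * but keep our own copy (the unmap operation from the text). */
        l4_fpage_t fp = l4_fpage(local_page, L4_PAGESHIFT, L4_FPAGE_RW);
        l4_task_unmap(own_task, fp, L4_FP_OTHER_SPACES);
    }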


Figure 3 visualizes a very simple memory configuration where most tasks were spawned with a fixed memory region mapped by their parent task — however, the Con task mapped some of its memory into an unrelated task so that it could be used as a shared virtual framebuffer.

Security and isolation in Fiasco.OC are based on the object capability model [21]: a capability is an unforgeable reference to any kind of kernel object, including the aforementioned tasks and threads as well as communication gates necessary to pass messages to other processes. Possession of a reference automatically implies the authority to access it, e.g. mapping memory into a task object or sending a message through a gate to another process. This is an intentional departure from the common model of access control lists that usually requires both an identifier (such as a file path) and a permission entry to access a resource. Tasks in Fiasco.OC hold capabilities in an object address space analogous to their memory address space, and can use the same mapping and granting primitives to pass them to other tasks (thereby sharing or transferring their authority over the referenced object).

Fiasco.OC itself is developed in parallel with its reference userland environment L4Re. This offers applications an assortment of shared libraries that wrap the L4 system call invocation and provide convenience interfaces for the available kernel objects. L4Re also provides reference implementations for required infrastructure programs such as σ0.

It further offers additional convenience features and extensions, such as libraries that facilitate memory management with the L4 interface or provide basic building blocks to create a server application. A small compatibility layer for certain POSIX features is also included.

3 Related Work

This chapter provides a reference frame for the thesis by presenting the current state of the art in hardware-assisted virtualization on microkernels. It starts out by describing the existing implementation of virtual machines in Fiasco.OC and the Karma VMM, which lay the foundation for the following implementation. Afterwards, a number of comparable virtualization projects are presented to provide a broader overview of the field.

3.1 Existing Implementation

3.1.1 Fiasco.OC

The first attempts to run other operating systems on top of Fiasco.OC were made through rehosting. The most prominent example for this is the L4Linux project [6], which began on the original L4 kernel and has since been ported to some of its successors. In this model, the guest kernel is modified to run as a normal userland task in the microkernel-based operating system and act as pager and exception handler for its user processes. Initial attempts required complicated workarounds to emulate some non-trivial guest operating system concepts (such as signalling and critical sections) with the primitives offered by the microkernel.

In order to provide a more suitable interface for task execution in rehosted operating systems, Fiasco.OC was augmented with a software abstraction of virtual processors, called vCPUs [22]: an alternative to traditional threads, they offer the added functionality of being interruptible through external events. A vCPU can asynchronously receive a message while executing, which will save the current execution state and jump to a predefined handler routine. It may even cross address space boundaries to do that, which allows events from guest user processes to be handled in the kernel task and return to the saved execution context afterwards. Event delivery may be voluntarily suspended to ensure that critical sections are atomic within the context of a single vCPU.

Support for true, hardware-assisted virtualization was initially added to Fiasco.OC to be used in combination with an L4Linux-based VMM [23]. In accordance with the microkernel principle, most aspects of virtual machine handling are pushed out of the kernel: the VMM process maintains a dummy VMCS in an extended state-save area of a vCPU. When Fiasco.OC’s scheduler tries to resume that state, it copies and sanitizes the relevant fields into an internal, real VMCS and adds the page table base address of the vCPU’s user task. It will then perform the actual virtual machine entry and, after the eventual exit, return to the VMM with an updated dummy VMCS through the normal vCPU handler mechanism. The downside of this approach is that all virtual machines on a processor share the same physical VMCS, whose guest state area must be completely reinitialized on each virtual machine entry. This effectively nullifies the advantage of VMCS caching facilities integrated into some Intel processors.
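The copy-and-sanitize step of this entry path can be pictured roughly as follows. This is a purely conceptual sketch with entirely hypothetical names, not Fiasco.OC’s actual code: it iterates over the field encodings, takes guest state verbatim from the VMM-maintained software VMCS, and clamps control fields to values the kernel is willing to allow.

    #include <stdint.h>
    #include <stdbool.h>

    struct sw_vmcs;          /* the dummy VMCS the VMM fills in userland        */

    /* Hypothetical helpers provided elsewhere in the kernel. */
    uint64_t sw_vmcs_read(const struct sw_vmcs *s, uint32_t field);
    void     hw_vmcs_write(uint32_t field, uint64_t value);   /* wraps VMWRITE  */
    bool     field_is_guest_state(uint32_t field);
    uint64_t sanitize_control(uint32_t field, uint64_t requested);
    extern const uint32_t vmcs_fields[];
    extern const unsigned vmcs_field_count;

    /* Before resuming a vCPU in VMX non-root mode, copy the guest-state and
     * control fields from the software VMCS into the real, per-processor VMCS.
     * Host state and the page table pointer of the VM task are owned by the
     * kernel alone and set separately. */
    void load_guest_state(const struct sw_vmcs *sw)
    {
        for (unsigned i = 0; i < vmcs_field_count; i++) {
            uint32_t f = vmcs_fields[i];
            uint64_t v = sw_vmcs_read(sw, f);

            if (!field_is_guest_state(f))
                v = sanitize_control(f, v);   /* never trust VMM-supplied controls */
            hw_vmcs_write(f, v);
        }
        /* ...followed by loading host state and executing VMLAUNCH/VMRESUME. */
    }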


3.1.2 Karma

Karma is a lightweight VMM designed to run natively on Fiasco.OC without the need for L4Linux. The original version by Steffen Liebergeld [24] only supported AMD SVM with nested paging, which simplified the memory virtualization problem: it just allocates a new task object for the virtual machine, maps some of its memory into that address space at the locations where it would be expected on a physical computer, and leaves the page table setup to Fiasco.OC. Karma also provides a useful set of virtual devices, including video framebuffer, hard drive, Ethernet network and a serial port. It offers full multiprocessor support through the use of vCPUs and emulation of inter-processor interrupts. In later work, Karma was augmented with shadow paging support to allow its use on less sophisticated processors [25]. In this mode, it uses a second child task to keep track of the guest virtual memory layout, which then provides Fiasco.OC with a page table to use. Due to the theoretical limitations discussed in section 2.1.2, Karma must intercept every guest page fault and completely rebuild the memory map on every guest task switch, which severely limits the performance of this approach. Nevertheless, it provides a working solution for older AMD processors and also allowed Intel VMX to be supported for the first time. The code uses C++ templates to effectively segregate implementation details between the two hardware architectures and the two memory virtualization approaches. Guest operating system specifics are encapsulated in a similar manner, although only Linux is currently supported. The virtual devices do not always emulate existing physical hardware, so that custom paravirtualized device drivers need to be built for the guest. Karma itself needs custom code to bootstrap a specific guest, because it does not provide the full 16-bit execution environment that x86 operating systems expect to start in and instead needs to jump directly into the 32-bit part of the guest kernel. Support for 64-bit systems is currently not included.

3.1.3 SVM Kernel Shadow Paging

Despite its functional correctness, the performance of Karma's shadow paging implementation is severely lacking even when compared to other virtualization systems with the same hardware requirements. A major reason for this is the additional context switch required to get from Fiasco.OC to Karma during virtual machine exit handling. Due to the high number of paging-related intercepts, the idea arose to move the shadow paging handlers into the kernel, avoiding the additional detour through host userland unless the virtual machine exit was device-related. This promises a significant overhead reduction but comes at the cost of moving more complexity into the kernel, which runs contrary to the microkernel concept. The approach was realized by Jan C. Nordholz for the SVM architecture as part of his diploma thesis in 2010 [25]. While the initial move into the kernel already resulted in a major performance gain, a multitude of additional optimizations was implemented to further refine the setup. Some of these required paravirtualization of the guest's paging code, such as explicit, batched notification of page table updates through hypercalls. Even though this complicates portability to other guest operating systems, it notably reduces the amount of virtual machine exits and avoids the need to intercept guest page faults altogether. A final improvement went as far as employing virtual machine specific TLB tagging to differentiate between guest tasks, effectively giving the benefit of PCIDs to guest operating systems even when they are not available on the platform itself. While this technically gives the guest a

performance benefit over native execution, it is still more than offset by the additional delays caused by virtual machine exits themselves, heavy optimizations notwithstanding. The final results of the implementation were quite promising but still a few notches away from native execution or nested paging. The produced code is heavily intertwined with Fiasco.OC's SVM-specific classes, and an analogous implementation for VMX would require significant effort to either completely reimplement it or refactor it into abstract interfaces that could fit both hardware architectures. As of the writing of this thesis, the code has not yet been merged into the Fiasco.OC mainline repository.

3.2 Comparable Projects

3.2.1 L4Ka::Pistachio

After the initial assembly language implementation of L4 was discontinued, several independent projects emerged to reimplement its API. Before the kernel that would later become Fiasco.OC was started in Dresden, Liedtke himself worked with the System Architecture Group at the Universität Karlsruhe (TH) on a project that resulted in L4Ka::Pistachio [26]. While there are subtle differences in its approach compared to Fiasco.OC, mostly related to the latter’s focus on preemptibility to ensure the fast interrupt response times required for real-time systems, both kernels are still closely related and thus ideal for comparison. From its early days, virtualization research was one of L4Ka::Pistachio’s strong points: even before the advent of SVM and VMX, it was used to develop a concept called pre-virtualization, which uses automatic source code modifications to build virtualization support into a guest operating system [27]. A custom compiler fuses part of the VMM into the guest source and replaces virtualization-sensitive instructions with calls into that layer, which can either emulate them internally or batch them into concentrated, efficient hypercalls. Later versions of L4Ka::Pistachio were augmented with basic VMX support in a similar fashion to Fiasco.OC. However, there has been little apparent progress in recent years, as the implementation still depends on shadow paging and cannot use EPT — presumably due to difficulties similar to those encountered in this thesis. L4Ka::Pistachio has also never offered support for SVM, instead choosing to focus solely on technology by its official sponsor Intel. On the other hand, it offers a very sophisticated reference VMM (Marzipan), which still spearheads research in pre-virtualization.

3.2.2 NOVA

NOVA is an experimental microhypervisor developed at the Technische Universität Dresden [28] — an operating system kernel dedicated exclusively to the hosting of virtual machines. Its design focusses on security and small size, and is heavily influenced by third-generation microkernels: in fact, its split of protection domains (tasks) and execution contexts (vCPUs), its use of a hierarchical mapping database to propagate memory and object capabilities between those domains, and its focus on keeping privileged code as small as possible all seem directly derived from modern L4-based kernels. Despite its decent support of basic microkernel features, NOVA is not a general purpose operating system. It is meticulously tailored to its hypervisor role and includes specific virtualization features such as shadow paging right in the kernel. Less performance critical jobs like device virtualization are delegated to the userland VMM Vancouver, which runs in the protection domain of its respective virtual


machine and is isolated from the rest of the system, similar to the relationship between Fiasco.OC and Karma. It communicates with the hypervisor through a platform-independent high-level API, which precludes the use of some exotic implementation-specific features (such as Intel’s model-specific register shadowing) but allows NOVA to keep VMCS implementation details hidden and make use of processor-specific caching support. Vancouver also provides support for guests executing in 16-bit real address mode through the use of a software-based instruction emulator and simulated BIOS functions, which allows NOVA to faithfully virtualize almost any operating system without modifications. As it supports both shadow and nested paging for SVM and VMX, NOVA has to deal with the problem of EPT’s custom page attribute format. Recognizing this requirement from the start allowed its memory management to be built with a clear separation between extended and traditional page tables at every point where it matters. In addition, NOVA always creates two page tables per protection domain by design, a traditional one for the use of host applications like the VMM and a separate instance in whatever format is required by the nested paging hardware.

3.2.3 KVM

As a well-established and well-documented monolithic operating system kernel, Linux provides an appropriate counterpoint to the microkernel-based examples above. Since Linux was already a very mature system long before hardware-assisted virtualization became prevalent, several competing commercial and open-source hypervisor solutions have been developed for it over time. However, the Kernel-based Virtual Machine (KVM [11]) has emerged as the most generally accepted reference implementation and has been integrated into the main kernel source tree. It offers mature and feature-rich support for both SVM and VMX with nested and shadow paging, and can be set up at runtime as a loadable kernel module.

Even though Linux (as a monolithic kernel) runs all of its host device drivers in privileged mode, it still uses a userland VMM component (QEMU) to provide emulated guest devices. As virtual machines do not behave equivalently to normal Linux processes, the regular shared memory APIs are not used to provide their memory. Instead, the VMM (and other processes) can use custom system calls to explicitly create guest memory from their address space. Internally, KVM uses a high-level representation for these memory slots and constructs an explicit nested or shadow page table from that as needed.

3.2.4 Turtles

The IBM Turtles project is a relatively recent experimental enhancement to KVM [29]. It aims to implement effective nested virtualization, i.e. executing another hypervisor that can host virtual machines of its own within a virtualization environment. This feature may become increasingly relevant as more operating systems employ virtualization hardware for everyday use cases, such as recent iterations of Microsoft Windows that can transparently virtualize older versions of themselves as a compatibility feature.

Turtles was designed to work with the existing virtualization architectures of x86 processors, currently only supporting VMX. As the hardware was primarily designed with a single layer of virtualization in mind, additional layers must be collapsed into one through straightforward application of the traditional trap-and-emulate approach: all virtual machine exits are first handled by the topmost hypervisor, which can then delegate them by entering the intermediate guest hypervisors as appropriate.


The lower-level hypervisors control their guests through the usual VMX instructions, but these are transparently trapped and emulated by the top-level host. Guests setting up new virtual machines are intercepted by the host, which will create the nested guest itself using the resources assigned to the intermediate hypervisor. Through this scheme, Turtles can create completely faithful virtualization environments to an arbitrary nesting depth. Even with nested paging support, x86 processors can only walk two page tables for any given memory access. In order to allow virtualized memory at arbitrary nesting depth, Turtles uses techniques analogous to shadow paging to collapse all host level page tables into a single, composite EPT. Only the final leaf guest uses the processor’s traditional paging system and can still modify its page tables without causing virtual machine exits. All intermediate-level hypervisors trigger the same costly intercepts usually associated with shadow paging when making paging-related changes, but this is of reduced importance in practice since most virtualization systems rarely modify their page tables after the initial setup of a virtual machine. During their final evaluation, Turtles’ developers analyzed the traits needed to make a hypervisor particularly suited to be an efficient guest. As all VMX instructions in a guest require emulation through an individual virtual machine exit, limiting the amount of such instructions in frequently executed code is essential for efficient nested virtualization. As Fiasco.OC individually reinitializes all VMCS fields (which can only be accessed through VMX instructions) from the userland dummy VMCS on every virtual machine entry, it is unfortunately particularly unsuited as a guest hypervisor to Turtles and would presumably perform far worse than any virtualization environment tested by the project to date. As an optional optimization, however, Turtles has also experimented with using binary translation to replace the costly VMCS-accessing guest instructions, which might be able to mitigate this disadvantage.

4 Design

The overall design concept for improving Fiasco.OC’s memory virtualization performance on Intel processors is developed in this chapter. It comprises adding support for nested paging and VPIDs, as justified in the first section. The following sections then define the architectural and interface changes that need to be conducted throughout all components of the virtualization stack in order to realize this approach. More details about specific issues and problems encountered during the actual implementation are described in the next chapter.

4.1 General Approach

Both theoretical considerations and existing empirical data make it obvious that the shadow paging implementation in Karma is the most severe performance bottleneck in the current setup: the inherent overhead incurred on page table manipulations, page faults and task switches can easily be demonstrated through microbenchmarks and results in significant performance degradations for real-world workloads [25]. Prior work for SVM has shown that integrating all shadow page table handling into the kernel can lead to significant improvements, and an analogous approach could be taken for VMX. However, adding such a considerable amount of logic to the kernel just to improve performance for niche functionality runs contrary to the microkernel approach, and even the most efficient implementation would still be constrained by the theoretical limits of shadow paging efficiency. Instead, this thesis prefers to aim for the best achievable performance by making use of all possible architectural features, and therefore opts to completely circumvent this bottleneck through the use of nested paging.

The existing nested paging support for SVM processors had been very simple to implement — so simple, in fact, that it had been written even before any of the shadow paging solutions. It consisted of little more than taking the page table base address from a virtual machine task object and loading it as the nested page table pointer into the processor on virtual machine entries. Unfortunately, the implementation for VMX will not be that simple: the page table Fiasco.OC builds and maintains for the virtual machine task uses the traditional x86 format, which is incompatible with EPT. Figure 4 illustrates the different attribute formats for page table entries. Even though some bits have reasonably similar effects in both formats, the memory type values are irreconcilable: the x86 default memory type write back is encoded as the value 6 in the memory type field (bits 5 to 3) of an EPT entry, whereas in the traditional format the cache disable and write through bits in that region would need to be zero to match. It is therefore impossible to just pass a traditional page table as an EPT and hope for the best, which necessitates the ability to create dedicated EPT tasks in Fiasco.OC. This requires in-depth modifications to its address space management and mapping functionality in order to preserve the system's core ability of creating memory mappings between arbitrary tasks.

As a related but independent measure, this thesis also aims to implement support for VPIDs. The equivalent SVM feature called Address Space Identifier (ASID) had already been supported by existing code.


[Figure 4 compares the two entry layouts. Traditional format: physical address in bits 63/31 to 12; ignored bits 11 to 9; G (global) bit 8; PS (page size) bit 7; D (dirty) bit 6; A (accessed) bit 5; CD (cache disable) bit 4; WT (write through) bit 3; U (user accessible) bit 2; W (writeable) bit 1; P (present) bit 0. EPT format: physical address in bits 63 to 12; ignored bits 11 to 8; PS (page size) bit 7; IP (ignore PAT) bit 6; MT (memory type) bits 5 to 3; X (executable) bit 2; W (writeable) bit 1; R (readable) bit 0.]

Figure 4: Comparison of traditional and Intel Extended page table entry format

The tagged TLB entries should give a small but steady performance boost to all virtual machine setups, as regular virtual machine exits for interrupt handling and device emulation are necessary even with nested paging. The effect should be more pronounced in shadow paging environments since they inherently require much more frequent virtual machine exits. However, even though VPID support keeps cached translation entries valid across switches between guest and host, overall TLB efficiency is also impeded by global flushes within these domains. Eliminating those would additionally require the use of PCIDs, which are currently not implemented in Fiasco.OC on the host side and cannot be used by the guest operating system as long as Karma does not support 64-bit addressing mode.

4.2 Fiasco.OC

As both page table handling and executing VMX instructions are necessarily privileged operations, the bulk of functionality required to reach the above-stated goals must be added to the kernel itself. In order to implement the envisioned EPT task object, the existing kernel API must first be amended to allow creation of such objects distinct from tasks with traditional page tables. In Fiasco.OC, userland applications can request new virtual machine task objects by invoking the capability to a factory object in the kernel with the respective opcode. This invocation returns the new task's capability or an error code and traditionally needs no parameters, as the differentiation between VMX and SVM objects is done automatically through runtime processor identification. It is possible to treat this matter the same way and simply return an EPT object whenever an Intel processor could support it. However, it may still be desirable to have the option of using shadow paging even though the processor is EPT capable — if only to allow easier side-by-side testing and evaluation of both paradigms. Therefore, the virtual machine creation interface will instead be changed slightly by adding a new parameter byte that contains flags, the only one defined for now being the nested paging flag in the least significant bit.

In addition to Fiasco.OC itself, this change must also be propagated to the L4Re library that wraps this interface on the userland side, and by extension to all application programs that use it. This is a backwards-incompatible change but it is very simple to conform to, and as Fiasco.OC's virtualization support as a whole is still considered experimental at this point, the few applications using it for now should be expecting to deal with interface revisions.


Bit  L4Re constant    Denotes support for
0    L4_VM_KIP_SVM    AMD Secure Virtual Machine
1    L4_VM_KIP_VMX    Intel Virtual Machine Extensions
4    L4_VM_KIP_NP     Nested paging (both SVM and VMX)
8    L4_VM_KIP_VPID   Virtual Processor Identifier (VMX only)
9    L4_VM_KIP_UG     Unrestricted guest: unpaged and 16-bit real address mode (VMX only)
10   L4_VM_KIP_MSR    Direct guest access to machine-specific registers (VMX only, currently unsupported)
11   L4_VM_KIP_APIC   Hardware support to virtualize local APIC (VMX only, currently unsupported)
12   L4_VM_KIP_PAT    Use guest-specific Page Attribute Table (VMX only)
13   L4_VM_KIP_EFER   Use guest-specific Extended Feature Enables Register (VMX only)
14   L4_VM_KIP_PERF   Use guest-specific Global Performance Counter Control register (VMX only)
15   L4_VM_KIP_TMR    Limit execution time with virtual machine preemption timer (VMX only)
16   L4_VM_KIP_DTE    Virtual machine exit on access to processor descriptor table registers (VMX only)
17   L4_VM_KIP_WBIE   Virtual machine exit on WBINVD instruction (VMX only)
18   L4_VM_KIP_PLE    Virtual machine exit after configurable amount of PAUSE instructions (VMX only)

Table 1: Virtual machine feature reporting bit field format

In order to make use of this new flag, the userland VMM must first be able to tell whether EPT support is available on the current system — information that is only obtainable from a model-specific register requiring privileged access. Fiasco.OC already uses a special kernel interface page (read-only mapped into all address spaces) to make information about certain system features and parameters available to userland applications, which would serve as an ideal location to publish virtual machine feature availability. A new bit field will therefore be added to this page subsequent to the processor information field (i.e. at offset 0x68 for 32-bit and 0xD0 for 64-bit x86 architectures), denoting support of VMX in general as well as EPT, VPID, unrestricted guest and some other minor optional features in particular, as defined by table 1.

Adding a nested paging code path to the actual virtual machine handling routines will be nearly trivial once a pointer to the correct kind of page table is available — however, a few minor adjustments still need to be performed. Some of the existing VMCS sanitization restrictions should be relaxed to allow the guest to make use of its newly gained paging independence: most importantly, loading a new page table base address no longer needs to cause a mandatory virtual machine exit. Additionally, 64-bit host systems can allow virtual machines to execute in both 64-bit and 32-bit mode. If the unrestricted guest feature is available too, even unpaged and 16-bit real address mode guests can be supported.
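To illustrate how a VMM could combine the two interface additions, the following minimal sketch first reads the proposed feature bit field from the kernel interface page (assuming the 32-bit offset 0x68 stated above) and then requests a virtual machine task with the nested paging flag set whenever both VMX and EPT are reported. The function names and the factory wrapper are illustrative placeholders, not the actual L4Re API.

#include <cstdint>

// Feature bits as listed in table 1 and the proposed creation flag.
constexpr std::uint32_t L4_VM_KIP_VMX = 1u << 1;   // Intel Virtual Machine Extensions
constexpr std::uint32_t L4_VM_KIP_NP  = 1u << 4;   // nested paging supported
constexpr std::uint8_t  VM_CREATE_NESTED_PAGING = 1u << 0;  // least significant flag bit

std::uint32_t kip_vm_features(const char *kip)
{
  // 32-bit x86: the new bit field follows the processor information field.
  return *reinterpret_cast<const std::uint32_t *>(kip + 0x68);
}

long create_vm_task(std::uint8_t flags)
{
  // Placeholder: the real code would invoke the kernel's factory capability
  // with the virtual machine opcode and the new flags byte.
  (void)flags;
  return 0;
}

long create_best_vm_task(const char *kip)
{
  std::uint32_t feats = kip_vm_features(kip);
  bool ept = (feats & L4_VM_KIP_VMX) && (feats & L4_VM_KIP_NP);
  // Fall back to a shadow paging task when EPT is not available.
  return create_vm_task(ept ? VM_CREATE_NESTED_PAGING : 0);
}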

4.2.1 Page Table Handling

Fiasco.OC uses two data structures to keep track of memory mappings: one of them is the Mapdb class, which contains an abstract representation of the mapping relations between task objects. It can be queried to determine the parent and child tasks of a virtual memory area in its tree-like mapping hierarchy and is mainly used to recursively revoke transitive mappings during unmap operations. Memory areas are stored at superpage granularity and further broken up into individual pages when required.


The other data structure is the Mem_space class, one instance of which always corresponds to one task object. This is where the actual page table that is read by the processor's paging hardware is kept and maintained. It is controlled by the mapping infrastructure through an abstract API that basically offers insert, lookup and delete operations. Unfortunately, there are still a few platform-specific parameters exposed in this interface: it only operates on areas of the system's page or superpage size, and page table attribute bits are directly manipulated without abstracting from the architecture's representation. On the bright side, the actual page table construction and walking is wrapped by a very convenient template that can easily be adapted for different entry sizes or page walk depths.

When adapting the system to allow EPT tasks, the only true concern is modifying the page table format. A minimally invasive approach can therefore be achieved by completely ignoring the Mapdb class and its surrounding logic, and only adapting Mem_space while keeping its external interfaces unchanged. This means that page attributes are received in the traditional format and must be translated on the fly when writing EPT entries — likewise, read EPT attributes are transparently translated to the traditional format during lookup operations. Another problem is the system-wide superpage size, which can be twice as large as EPT superpages on 32-bit systems. In this case, the class interface will still only accept mappings of the larger size and convert them internally into two EPT-sized superpages, while making sure that both are later treated as a single entity during lookup or delete operations.

As normal tasks and virtual machines with EPT exist side by side on the system, the enhanced Mem_space class must decide at runtime what page table format to use for any given request. There are two possible solutions for this: the most primitive approach would be to add a Boolean attribute to the class which is checked in every function sensitive to this distinction. However, C++ also offers the more advanced technique of polymorphism, which uses a derived class with overloaded methods to encapsulate the alternative code paths. It incurs a minor function call overhead to select the right method at runtime, which is just slightly greater than the cost of checking a Boolean attribute in the first approach. At any rate, only methods directly involved in page table entry manipulation will incur that overhead — the functions needed for a normal task switch, which are by far the most frequently called in the class, stay unaffected. The performance hit to normal operation should therefore be negligible, and the polymorphism concept results in much cleaner code that clearly separates the new additions from the existing base, thus also making it much easier to completely exclude at compile time when necessary. Particularly when taking Fiasco.OC's focus on security into account, readability and ease of maintenance clearly trump negligible performance gains, which is why the EPT handling will be encapsulated in the new derived class Mem_space_ept.
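The following sketch outlines the dynamic attribute translation that such a derived class would perform, based on the entry layouts of figure 4. It is a simplified illustration: the bit constants follow the standard x86 and EPT definitions, present pages are assumed to be readable and executable in the guest-physical space, and none of the identifiers are the actual Fiasco.OC names.

#include <cstdint>

namespace pt {                                    // traditional x86 attribute bits
  constexpr std::uint64_t P  = 1u << 0;           // present
  constexpr std::uint64_t W  = 1u << 1;           // writeable
  constexpr std::uint64_t WT = 1u << 3;           // write through
  constexpr std::uint64_t CD = 1u << 4;           // cache disable
  constexpr std::uint64_t A  = 1u << 5;           // accessed
  constexpr std::uint64_t D  = 1u << 6;           // dirty
}

namespace ept {                                   // EPT attribute bits
  constexpr std::uint64_t R       = 1u << 0;      // readable
  constexpr std::uint64_t W       = 1u << 1;      // writeable
  constexpr std::uint64_t X       = 1u << 2;      // executable
  constexpr std::uint64_t MT_MASK = 7u << 3;      // memory type field, bits 5 to 3
  constexpr std::uint64_t MT_WB   = std::uint64_t(6) << 3;  // write back
  constexpr std::uint64_t MT_UC   = 0;            // uncacheable
}

// Traditional -> EPT, applied when an insert call writes a new entry.
std::uint64_t to_ept_attribs(std::uint64_t attr)
{
  std::uint64_t e = 0;
  if (attr & pt::P)
    e |= ept::R | ept::X;                         // simplification: present implies readable and executable
  if (attr & pt::W)
    e |= ept::W;
  e |= (attr & (pt::CD | pt::WT)) ? ept::MT_UC : ept::MT_WB;
  return e;
}

// EPT -> traditional, applied when a lookup result is returned to the caller.
std::uint64_t to_traditional_attribs(std::uint64_t e)
{
  std::uint64_t attr = 0;
  if (e & ept::R)
    attr |= pt::P;
  if (e & ept::W)
    attr |= pt::W;
  if ((e & ept::MT_MASK) != ept::MT_WB)
    attr |= pt::CD;                               // anything but write back: report caching as disabled
  attr |= pt::A | pt::D;                          // EPT carries no accessed/dirty information
  return attr;
}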

4.2.2 VPID Support

VPIDs can simply be set through a corresponding field in the VMCS. However, each VPID in use must be a per-processor unique identifier for its corresponding virtual machine — if multiple guests were running with the same VPID, they could possibly gain access to each other’s address spaces. As the userland is generally untrusted, the VPID must therefore not be supplied by the VMM like most other VMCS parameters: otherwise, a malicious VMM could intentionally try to guess the VPID of another virtual machine and attempt to sniff or corrupt its memory. To avoid this critical security risk, the kernel itself must generate and assign unique, persistent VPIDs to all virtual machines.


Value  Effect
0      No effect (do not flush any TLB entries during this virtual machine entry).
1      Flush the individual TLB entry with my VPID for the page denoted in bits 63/31 to 12 of this field.
2      Flush all TLB entries with my VPID.
3      Flush all TLB entries with my VPID that do not have the global page attribute.

Table 2: Opcode in bits 11 to 0 of the VPID flush control VMCS field

It would still be an option to allow VMMs to opt out of using VPIDs at all. However, while it is technically possible to run virtual machines with and without VPID on the same system, it leads to unfortunate performance implications: in systems with round robin scheduling, a single guest without VPID is enough to cause a global TLB flush in every scheduling cycle, and thereby destroy the cached translations of all virtual machines that do use VPIDs before they have a chance to reuse them. As a single userland VMM could therefore severely disrupt the benefit of the feature for all virtual machines on the system, and there are few conceivable reasons to do so besides intentional denial of service, the use of VPIDs will instead be automatically enforced whenever the processor supports it. When making modifications to the guest page table, VMMs must make sure that no stale translation entries from before that change are left in the TLB after the following virtual machine entry. Since using VPIDs prevents automatic global flushes, they must be manually invalidated, but the required INVVPID instruction is privileged. For this reason, the kernel will provide a special interface to invoke it using a VMCS pseudo-field at index 0x6018. The virtual machine entry routine in the kernel will not load this field from the dummy VMCS into the processor but instead trigger an individual or global TLB flush for the corresponding VPID as specified in table 2. It will also automatically clear the field after use as a convenience feature, just like the VMX hardware clears the valid bit of the interrupt-information field used for event injection.
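A sketch of how the kernel's virtual machine entry path could interpret this pseudo-field is given below. The descriptor layout and invalidation types of the INVVPID instruction follow the Intel architecture; the mapping from the opcodes of table 2 to these types and all identifier names are illustrative assumptions rather than the final Fiasco.OC code.

#include <cstdint>

struct Invvpid_desc {                      // 128-bit descriptor expected by INVVPID
  std::uint64_t vpid;                      // bits 15:0 hold the VPID, the rest must be zero
  std::uint64_t linear_address;            // only used for individual-address flushes
};

static inline void invvpid(unsigned long type, std::uint16_t vpid,
                           std::uint64_t linear_address = 0)
{
  Invvpid_desc desc = { vpid, linear_address };
  asm volatile("invvpid %0, %1" : : "m"(desc), "r"(type) : "cc", "memory");
}

// Interprets the value the VMM left in the pseudo-field at index 0x6018;
// called during virtual machine entry, after which the field is cleared.
void handle_flush_control(std::uint64_t field, std::uint16_t vpid)
{
  switch (field & 0xfff)                   // opcode in bits 11 to 0 (table 2)
    {
    case 1:                                // flush one page translation for this VPID
      invvpid(0, vpid, field & ~std::uint64_t(0xfff));
      break;
    case 2:                                // flush all translations for this VPID
      invvpid(1, vpid);
      break;
    case 3:                                // flush all non-global translations for this VPID
      invvpid(3, vpid);
      break;
    default:                               // opcode 0: nothing to flush
      break;
    }
}

Karma, in turn, would write this field into the dummy VMCS whenever its shadow paging code emulates an INVLPG instruction or a guest task switch, as described in section 4.3.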

4.3 Karma

All the conceptual changes of this design are targeted at Fiasco.OC — all that is needed in Karma is to add support for them when using the affected kernel interfaces. The most visible addition is making use of the new virtual machine feature reporting in the kernel interface page: Karma can now automatically fall back to shadow paging when it is the only supported memory virtualization mode. It will still default to nested paging otherwise, and the existing --vtlb command line parameter can now optionally be used to manually enforce shadow paging even on high-end processors. Calls that create virtual machine kernel objects are modified to set the chosen mode with the new flags parameter, and the VMCS setup templates are likewise adapted. Lastly, the new TLB flush control mechanism must be supported when the kernel enforces VPID use: as both task switches and INVLPG instructions are intercepted and emulated in shadow paging mode, Karma needs to manually trigger the resulting TLB invalidations through this interface.

5 Implementation

This chapter outlines specific mechanisms and implementation details chosen to implement the behavior specified in the previous chapter. The different parts of the virtualization stack are discussed in order, including minor changes and optimizations that were not directly connected to the main research goal. It will also explain unforeseen consequences and problems encountered during the implementation and what measures were used to mitigate them.

5.1 Fiasco.OC

Overall, many minor improvements of all sorts were made in Fiasco.OC's VMX classes along the way. Code was restructured and slightly modified to maximize inlining, increase readability and reduce memory footprint. Constant hexadecimal literals for VMCS field indexes were replaced with descriptive enumerations. The assembly routine handling virtual machine entries was improved to correctly return error information on failure, and overall error handling was tightened to make it easier to differentiate between kernel bugs and bad VMM behavior. Code duplication potential was reduced by moving additional abstract base class functionality into Fiasco.OC's platform-independent code tree, including the nested paging selection mechanism (which is a likely feature in all architectures with hardware-assisted virtualization that use paged memory).

5.1.1 Feature Reporting and Sanitization

As most VMCS values are supplied by the untrusted userland VMM, Fiasco.OC applies a sanitization system to them that prevents potentially compromising settings from reaching the processor: it maintains a pair of bitmasks for each critical field that governs which bits are enforced to zero or enforced to one, and applies these restrictions to the userland’s dummy VMCS before every virtual machine entry. They are initially read from similar bitmasks in machine-specific registers, which the processor uses to communicate optional VMX feature availability. Fiasco.OC then adds its own set of restrictions to them, preventing dangerous features such as direct I/O pass-through, and enforcing the necessary settings to stay in control like mandatory virtual machine exits on external interrupts. In order to accommodate the new requirements of nested paging setups, there is now a second set of bitmasks governing that case. They are mostly identical but honor the differences outlined in section 4.2, and the type of virtual machine automatically decides which one is applied. Since they contain processor information about optional features, they are also used to initialize the new userland-visible virtual machine feature information in the kernel interface page, which is additionally printed out to the kernel console during boot. A minor feature also supported after this overhaul is control register shadowing: all bit field control registers are represented by an actual value, a bitmask and a read shadow in the VMCS. While the


actual value is loaded into the processor on virtual machine entry, the mask decides which bits may be modified by the guest operating system. If the guest tries to write a masked bit it will cause a virtual machine exit, and reads of those bits are transparently redirected to the read shadow — thereby, the register can appear different to the guest than what is actually in use by the processor. The sanitization enforces some critical bits in the actual value and mask fields but leaves the read shadow unchanged. This way, neither the VMM nor its virtual machine can write bits essential to security and isolation, but the VMM has full control over how the registers appear to its guest and has a chance to emulate functionality of the restricted bits through intercepts.
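The core of this mechanism can be summarized in a few lines. The structure and function names below are simplified stand-ins for the actual kernel code, and the guest-visible register composition merely restates the VMX shadowing semantics described above.

#include <cstdint>

struct Field_restriction {
  std::uint64_t must_be_one;     // bits the kernel forces to one
  std::uint64_t must_be_zero;    // bits the kernel forces to zero
};

// Applied to every critical field of the userland dummy VMCS before entry.
inline std::uint64_t sanitize(std::uint64_t vmm_value, const Field_restriction &r)
{
  return (vmm_value | r.must_be_one) & ~r.must_be_zero;
}

// For shadowed control registers, guest reads see the read shadow for all
// masked bits and the actual value everywhere else.
inline std::uint64_t guest_visible_cr(std::uint64_t actual, std::uint64_t mask,
                                      std::uint64_t read_shadow)
{
  return (actual & ~mask) | (read_shadow & mask);
}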

5.1.2 Page Table Handling

Even though its overall design proved to be sound, the implementation of the new Mem_space_ept class got stalled by some technical issues that required creative solutions to work around. In order to keep tight control over memory management, Fiasco.OC mostly shuns the usual dynamic memory allocation mechanism of C++, instead using the in-place new operator to instantiate objects into statically allocated locations. Unfortunately, this technique is also used for the Mem_space class, and its target memory is allocated before information about its intended use is available. Mem_space_ept objects therefore cannot occupy more space than instances of their superclass, which means the class may not introduce new attributes. As the EPT base address pointer has to be stored somewhere, however, the only remaining possibility besides major redesign is to reuse Mem_space's existing page table base address pointer for the traditional format and solve the resulting type mismatches through relentless application of the reinterpret_cast operator. Repurposing superclass attributes like that is generally not considered good programming style, but in this case it was ensured to be safe at the binary level, and the extraordinary circumstances made it the least destructive solution available.

The dynamic translation of page table attributes naturally led to some problems when the two formats did not align: the traditional format contains bits where the processor can mark a page as accessed or dirty, both of which are absent in EPT attributes. Since information about this kind of page access state is completely unavailable in EPTs, both bits are always set in translated lookup results to prevent bugs from code that might rely on pages being untouched. Page attribute handling gets even more difficult when they are updated through insert calls on already mapped regions, since the peculiar interface specification requires a bitwise OR between the architecture-specific representations of the old and new attribute values instead of outright replacement. This leads to obvious problems for the bits governing memory type, where cleared bits in traditional format correspond to set bits in EPTs. This issue cannot be perfectly resolved without disproportionate increases in code complexity and performance overhead, and due to the fact that Fiasco.OC's memory mapping system currently does not use the interface in these ways, it was ignored for now. However, future developers should be wary of this when working on paging-related functionality and might even want to consider a complete redesign of the page attribute interface towards an architecture-independent and less confusing model.

Just like any other operating system kernel, Fiasco.OC must invalidate cached translations for page table entries that are being removed, such as during an unmap operation. However, being a multiprocessing-capable system where the target address space could be concurrently used on a different processor, it must ensure that those processors invalidate the respective translations as well.

Fiasco.OC's existing algorithm for this purpose is not very sophisticated: any successful unmap in multiprocessing mode is followed by an unspecific broadcast inter-processor interrupt causing all other processors to completely flush their TLB. In order to retain functional correctness in all cases after the addition of EPT tasks, this model was extended to also invalidate all EPT guest-physical and combined mappings (regardless of VPID use). However, depending on the amount of unmap operations, this model can result in significant performance drops. It might have originally been justified when Fiasco.OC was laying waste to its translation cache at regular intervals anyway due to its lack of support for tagged TLBs — but the inclusion of VPIDs (and, formerly, SVM ASIDs) has changed that, and the likely inclusion of PCID support at a later date will change it even further. As Fiasco.OC keeps moving forward towards more efficient TLB usage (and larger multiprocessing systems), it may find itself increasingly constrained by performance losses due to unnecessary global flushes and inter-processor interrupts in unmap-heavy environments. For the eventual day when developers finally upgrade its TLB synchronization to a more surgically precise algorithm, the Mem_space_ept class has already been prepared to provide wrappers for TLB invalidation at all available granularities.

5.1.3 VPID Support

As justified in the previous chapter, VPIDs must be generated and assigned by the kernel to ensure security and isolation between untrusted userland VMMs. An analogous problem had already been solved with ASIDs in the SVM implementation: newly created virtual machines receive their ASID from a global next_asid variable, which is subsequently incremented. This works well as long as there are more ASIDs available than virtual machines created since the last reboot — however, as the number of valid bits in ASIDs is processor-dependent and potentially very small, the possibility of overflow should not be ignored. For this reason, there is another parameter called asid_generation that is checked on every virtual machine entry: if the machine's generation does not match the current global generation, it receives a newly generated ASID. When the next_asid value overflows, the global generation is incremented, thereby causing all still running virtual machines of the former generation to eventually renew their ASID.

The same general concept was applied to VPIDs. Intel designed them to always use 16 bits, which makes overflows during normal operation extremely unlikely — however, when a malicious VMM intentionally tries to break the system by creating large amounts of short-lived virtual machines, it is still a possibility. Therefore, the concept of a vpid_generation was adopted as well.

For non-obvious reasons, the SVM implementation stores the global next_asid and asid_generation values once per processor. That in turn necessitates every virtual machine to store an additional copy of both attributes per processor since it may be executed by vCPUs running on any of them. This seems like an unnecessary waste of kernel memory and code complexity for no apparent gain — therefore, the VMX VPID implementation chose a different path and stores its global values only once in the system. This, of course, requires synchronized access to the next_vpid value through atomic exchange-and-add operations — but the minuscule overhead caused by that should be absolutely negligible, as it is only necessary when creating a new virtual machine. The more frequent access to vpid_generation during every virtual machine entry still achieves correct results with a normal, unsynchronized read operation.
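A condensed sketch of this allocation scheme is shown below. It uses C++ standard atomics in place of the kernel's own primitives, the overflow handling is only hinted at in a comment, and all names are illustrative rather than the actual implementation.

#include <atomic>
#include <cstdint>

static std::atomic<std::uint32_t> next_vpid{1};         // VPID 0 is reserved for the host
static std::atomic<std::uint32_t> vpid_generation{0};

struct Vm_state {
  std::uint16_t vpid = 0;
  std::uint32_t generation = ~0u;    // forces assignment on the first entry
};

// Called on every virtual machine entry before the VPID is written to the VMCS.
std::uint16_t current_vpid(Vm_state &vm)
{
  std::uint32_t gen = vpid_generation.load(std::memory_order_relaxed);
  if (vm.generation != gen)
    {
      // Synchronized allocation via fetch-and-add; this only runs when a
      // virtual machine first enters or its generation has been invalidated,
      // so the atomic operation is not on the hot path.
      vm.vpid = static_cast<std::uint16_t>(
          next_vpid.fetch_add(1, std::memory_order_relaxed));
      vm.generation = gen;
      // Omitted here: when the 16-bit counter overflows, vpid_generation is
      // incremented so that all running machines eventually renew their VPID.
    }
  return vm.vpid;
}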


5.2 Karma

Adapting Karma to the new kernel interfaces was, for the most part, as uncomplicated as expected during the design phase. When the VMCS initialization code was enhanced to support nested paging, it became clear that its location in the operating system specific file guest_linux.hpp was not ideal to handle this kind of distinction. To improve code structure and reduce duplication potential, the paging related (and thus operating system agnostic) VMCS settings were extracted from there and are now set up by the paging mode initialization routines of the virtual machine driver.

5.2.1 Unpaged Mode Emulation

Testing the resulting implementation with different VMX feature sets revealed an unforeseen problem: while Karma has always skipped the guest's 16-bit real address mode initialization code, it still boots from a 32-bit location where paging is supposed to be disabled. Even though the host's second-level paging would still be active and providing memory protection at that point, VMX categorically forbids processors to execute without guest-level paging. The optional unrestricted guest feature makes it possible to circumvent that restriction, but it is a very recent addition to VMX and not all EPT capable processors support it. For those that do not, a workaround had to be developed that emulates a guest running in unpaged mode during the first few instructions after boot, until it can set up and activate paging itself.

In order for paging to be active even though the operating system expects all its accesses to work directly on physical addresses, the page table must represent an identity map where every virtual address is identical to the physical address it maps to. Karma can prepare such a map in the guest-physical address space before boot and load it in the guest page table base address field of the VMCS. It then sets the guest control register bit for paging but clears the corresponding read shadow bit, so that paging appears to be inactive to guest code. When a write access to that bit is intercepted, it can recognize that the guest finally chose to activate paging itself and end the emulation.

Before a guest operating system sets the paging bit, it would normally construct an initial page table in memory and load it into the processor. Karma therefore also needs to intercept accesses to the page table base register, because the guest's new table must not actually be loaded until the paging control bit gets set, but its address still needs to be stored for later use. In a similar manner, the guest might try to modify some other paging-related control bits, such as Physical Address Extension or Supervisor Mode Execution Prevention — all of which must be held back and stored, so that they cannot disrupt the unpaged mode emulation but can later be applied after the guest takes control of its paging.

While the processor state can thus be sufficiently emulated to maintain functional correctness, the problem remains that the required identity-mapped dummy page table must be located inside the guest-physical address space. As a guest operating system would generally expect to be able to use its memory exclusively, the table's location must be chosen carefully to prevent it from being overwritten by unaware guest code. Thankfully, the use of superpages allows a whole page table representing the identity mapping of the complete 4096 MiB virtual address space to be built inside a single 4 KiB frame on 32-bit systems. After considering several options, the memory area 0xFF000–0xFFFFF was chosen as its hiding spot: on physical x86 computers, this is traditionally hardwired to ROM as part of the BIOS code. Because Karma does not provide BIOS services, it was formerly empty, but operating systems should usually not expect to find writeable memory there.
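The dummy page directory itself is trivial to construct, as the following sketch shows: 1024 entries with the page size bit set each map one 4 MiB superpage one-to-one, covering the whole 32-bit address space within a single 4 KiB frame. The entry bits are the standard x86 page directory flags, while the helper name and the way the frame is reached from the VMM's own address space are assumptions.

#include <cstdint>

constexpr std::uint32_t PDE_PRESENT   = 1u << 0;
constexpr std::uint32_t PDE_WRITEABLE = 1u << 1;
constexpr std::uint32_t PDE_USER      = 1u << 2;
constexpr std::uint32_t PDE_PAGE_SIZE = 1u << 7;      // 4 MiB superpage
constexpr std::uint32_t IDENTITY_MAP_GPA = 0xFF000;   // hidden in the BIOS area

// 'table' points to the 4 KiB frame at guest-physical 0xFF000 as mapped
// into the VMM's own address space.
void build_identity_map(std::uint32_t *table)
{
  for (std::uint32_t i = 0; i < 1024; ++i)
    table[i] = (i << 22)                              // 4 MiB-aligned physical base address
               | PDE_PRESENT | PDE_WRITEABLE | PDE_USER | PDE_PAGE_SIZE;
}

For the 4 MiB entries to take effect, the emulation would also have to keep the PSE bit of the guest's actual CR4 value enabled while the identity map is in use.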


Incidentally, the established virtualization system KVM uses a very similar approach to solve the same VMX shortcoming: it even provides an interface to configure the identity table hiding address (cf. Documentation/virtual/kvm/api.txt:KVM_SET_IDENTITY_MAP_ADDR in the Linux kernel sources [30]). When none is supplied, it defaults to 0xFFFBC000, which is also part of a reserved BIOS area. This was not considered for use in Karma because most virtual machines do not have enough memory to reach that address range naturally, and adding a new, noncontiguous and otherwise unusable page to the guest-physical address space for this use alone was considered an unnecessary complication.

6 Evaluation

The resulting implementation's functionality and performance are evaluated and analyzed in this chapter. The results of several benchmarks are compared between different settings of Fiasco.OC's upgraded virtualization stack and a few external comparison targets. In addition, the resulting increase in code complexity is discussed and evaluated.

Development and early functional testing of the implementation were not conducted on a physical machine but on the Bochs x86 PC emulator [31]. It provides nearly faithful instruction-level emulation of an Intel Core i5 processor (among others) with almost all VMX features supported by that model, including EPT, VPID and unrestricted guest. It therefore represents probably the simplest solution to quickly try out and verify the functionality of this implementation without access to the necessary hardware and setup for a physical test run, since few other emulation and virtualization solutions provide the required degree of VMX support. As a minor setback, the correctness of some of its timing sources seems lacking: it proved impossible to correctly boot Fiasco.OC when compiled for the legacy Programmable Interval Timer or the local APIC, leaving only the Real Time Clock to run a working system. This precludes the emulation of multiprocessing environments, since only the APIC timer is available per processor — however, uniprocessor environments could be emulated flawlessly and proved vital to this thesis' completion.

6.1 Measurements

All of the following performance benchmarks were conducted on the same physical machine with an Intel Core i5-750 processor sporting 4 cores without simultaneous multithreading support, running at 2666 MHz. All test environments were executed completely from memory using Linux' initial RAM file system, and the memory available to the operating system under test was restricted to 1024 MiB. Host and guest were both forced to uniprocessor mode unless otherwise noted.

Fiasco.OC and Karma were benchmarked in both shadow and nested paging mode, each with and without active VPID support. These results were compared against native, unvirtualized execution on the bare physical machine and against a Linux KVM setup from the Ubuntu 11.10 distribution with all available optimizations (including nested paging and VPIDs) as a representative of existing, established virtualization environments. Shadow paging without VPIDs represents the baseline performance that was obtainable before this thesis, while nested paging with VPIDs should showcase the greatest achieved speed-up. Native performance should naturally be the fastest in all tests, while KVM is expected to be roughly on par with Fiasco.OC's nested paging and VPID setup because it makes use of the same underlying hardware features.

The test suite contains four microbenchmarks specifically designed to demonstrate the performance differences between nested and shadow paging. To achieve the best possible comparability, three of these were adopted from Jan C. Nordholz' thesis on shadow paging enhancements to Fiasco.OC's SVM implementation [25]. Data points representing the results of the most optimized kernel-internal shadow paging implementation from that thesis (measured relative to native performance) were added to the following charts to provide a rough estimate of how an analogous VMX implementation would have performed. However, as those results were generated in a different test batch on a different processor architecture, they should obviously be considered cautious projections rather than hard data.


[Figure 5 is a bar chart titled "Fibonacci Microbenchmark" giving the duration in seconds for each setup: shadow paging 3.599, shadow paging with VPID 3.596, nested paging 3.597, nested paging with VPID 3.595, native 3.590, SVM kernel shadow paging (projected) n/a, Linux KVM 3.617.]

Figure 5: Evaluation results from the Fibonacci microbenchmark

6.1.1 Microbenchmark: Fibonacci

This benchmark was newly added for this thesis. It uses a simple iterative algorithm to compute the first 400 million Fibonacci numbers and is intended as a purely processor-bound benchmark with as little operating system involvement as possible. Since it does not require data storage beyond the processor registers, its address space should only consist of a few pages of program code and thus provide no notable slowdown in a shadow paging environment. It therefore serves as a kind of control group for the following benchmarks, proving that there are no systemic inefficiencies in any of the evaluated virtualization environments.

Figure 5 illustrates that all environments perform virtually identically to native execution, down to a few milliseconds. Intriguingly, KVM shows the notably worst result. The reason for this is subject to speculation.


[Figure 6 is a bar chart titled "Forkwait Microbenchmark" giving the duration in seconds for each setup: shadow paging 16.03, shadow paging with VPID 14.13, nested paging 1.71, nested paging with VPID 1.70, native 1.51, SVM kernel shadow paging projected at 225% of native, Linux KVM 1.75.]

Figure 6: Evaluation results from the Forkwait microbenchmark

Although the KVM setup was based on an Ubuntu installation stripped to the bare minimum, some background processes (or even processing overhead in the host Linux kernel itself) might have siphoned this tiny edge of performance off the virtual machine.

6.1.2 Microbenchmark: Forkwait

The Forkwait benchmark focusses on straining the Linux system call interface as well as address space creation and destruction. It creates a child process (that will immediately exit again), waits for its death, and repeats that in a purely sequential loop of 40000 iterations with no inherent parallelism. Even though the program itself is thus dead simple, it is perhaps the most complex microbenchmark in this suite due to Linux' sophisticated process management.

Conforming to the time-honored POSIX fork()/exec() model, new processes are never created from scratch but always instantiated as a copy of the calling process. This also includes a complete and independent copy of the parent's memory, but physically duplicating the complete address space would usually be a disastrous waste of time and space. Linux therefore employs the copy-on-write technique by mapping both processes' virtual memory to the same physical page frames and marking the respective page table entries as read-only. Whenever a parent or child tries to write to a page, it triggers a page fault which directs the kernel to finally create a true, writeable copy of that page for each process.
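The structure of the benchmark corresponds to the following approximate reconstruction (the original benchmark source is not reproduced in this document):

#include <sys/wait.h>
#include <unistd.h>

int main()
{
  for (int i = 0; i < 40000; ++i)
    {
      pid_t pid = fork();          // copy-on-write duplicate of this process
      if (pid == 0)
        _exit(0);                  // child: exit immediately
      waitpid(pid, nullptr, 0);    // parent: wait for the child's death
    }
  return 0;
}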


[Figure 7 is a bar chart titled "Sockxmit Microbenchmark" giving the duration in seconds for each setup: shadow paging 32.36, shadow paging with VPID 28.70, nested paging 1.80, nested paging with VPID 1.79, native 1.76, SVM kernel shadow paging projected at 225% of native, Linux KVM 1.89.]

Figure 7: Evaluation results from the Sockxmit microbenchmark

Every single page table entry that is changed to read-only will cause an individual virtual machine exit in shadow paging environments. Additionally, the guest operating system must invalidate the corresponding TLB entries, each triggering another VMM intercept. After the child process is set up, the following task switch causes a complete rebuild of the shadow page table. Even though the child immediately exits again, it pushes at least one return value (from the fork() wrapper function) to the stack, triggering a page fault that is also intercepted in shadow paging systems. The following exit will cause another task switch back to the parent, with the corresponding overhead. The combination of these factors makes process creation a very inefficient event for shadow paging environments and therefore a prime example to showcase their drawbacks.

Virtualization with nested paging, on the other hand, should demonstrate no significant overhead compared to native performance, as none of the above steps causes a virtual machine exit. However, the results in figure 6 show that this is not quite the whole truth — while the overall trend matches the expectations, a notable overhead of about 12% with even the most performant nested paging setup cannot be denied.

6.1.3 Microbenchmark: Sockxmit

The sole focus of the Sockxmit benchmark is task switching. It spawns two processes connected by a local socket that pass an empty message back and forth in a tight loop of 2^20 (roughly one million) iterations.


[Figure 8 is a bar chart titled "Touchmem Microbenchmark" giving the duration in seconds for each setup: shadow paging 28.93, shadow paging with VPID 23.98, nested paging 3.19, nested paging with VPID 3.13, native 3.00, SVM kernel shadow paging projected at 270% of native, Linux KVM 3.40.]

Figure 8: Evaluation results from the Touchmem microbenchmark

This forces the processor to constantly switch between them. Shadow paging VMMs need to intercept every switch and rebuild the next task's shadow page table, although the lack of program data should keep the address spaces of both processes relatively small. Nested paging virtualization is once again unaffected and only hindered by the same system call latency and mandatory TLB flush as native execution. Figure 7 evidences this to be the benchmark with the strongest difference between both approaches.
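The ping-pong loop of the benchmark roughly corresponds to the following sketch; the original source is not reproduced here, and a one-byte token stands in for the empty message so that the receive calls block:

#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  int fd[2];
  char token = 0;
  socketpair(AF_UNIX, SOCK_STREAM, 0, fd);   // local socket connecting both processes

  if (fork() == 0)
    {                                        // child: echo every message back
      for (int i = 0; i < (1 << 20); ++i)
        {
          read(fd[1], &token, 1);
          write(fd[1], &token, 1);
        }
      _exit(0);
    }

  for (int i = 0; i < (1 << 20); ++i)
    {                                        // parent: send and wait for the echo,
      write(fd[0], &token, 1);               // forcing a task switch per round trip
      read(fd[0], &token, 1);
    }
  wait(nullptr);
  return 0;
}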

6.1.4 Microbenchmark: Touchmem

The Touchmem benchmark is the only one in this suite to work on a sizeable address space. It maps a 4 MiB file into its memory using the Linux mmap() system call and proceeds to access every one of its 1024 pages once. Afterwards, the mapping is torn down and recreated, repeating the whole process 5000 times.

As Linux uses lazy mappings to implement mmap(), the process' address space is not updated right away. Instead, every access to a new page will result in a page fault that causes the kernel to retroactively add a page table entry for it. Shadow paging environments need to intercept both the page fault itself and the following page table update, resulting in a strong performance degradation. As before, nested paging can accommodate these operations without additional virtual machine exits, leading to near-native performance as illustrated by figure 8. Touchmem is also notable for incurring the worst performance from the existing SVM kernel shadow paging implementation — this is likely due to the fact that it employs several targeted optimizations to reduce the degradation caused by task switches as they were encountered during the previous tests, while it could do relatively little about the shadow page table updates triggered by this benchmark.
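The benchmark's main loop corresponds to the following approximate reconstruction; the file name and protection flags are assumptions, while the access pattern follows the description above:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
  const std::size_t size = 4u << 20;                  // 4 MiB file, 1024 pages
  int fd = open("testfile", O_RDONLY);
  if (fd < 0)
    return 1;

  volatile char sink = 0;
  for (int round = 0; round < 5000; ++round)
    {
      void *m = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (m == MAP_FAILED)
        return 1;
      const char *p = static_cast<const char *>(m);
      for (std::size_t off = 0; off < size; off += 4096)
        sink = p[off];                                // fault in every page exactly once
      munmap(m, size);                                // tear the mapping down again
    }
  close(fd);
  return 0;
}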


[Figure 9 is a bar chart titled "Linux Compilation Benchmark on 1, 2, 3 and 4 Processors" plotting the duration in seconds for each setup and processor count; the measured values range from 231.69 s down to 40.25 s, setups that could not be booted are marked n/a, and the SVM kernel shadow paging projection is given as 104% of native.]

Figure 9: Evaluation results from the Linux kernel compilation macrobenchmark

6.1.5 Macrobenchmark: Compiling the Linux Kernel

In order to evaluate performance with unsynthetic real-world workloads, a final test run was performed on the traditional showcase example for this purpose: compiling the Linux kernel. Just like the earlier SVM shadow paging thesis that provides comparison data [25], only the fs/ subtree of the kernel was compiled to save time while still taking long enough to provide accurate data. As the shadow-paging-penalizing operations stressed by the microbenchmarks only make up a minor fraction of most real-world workloads, the results for this measurement are expected to lie far closer together. Even though multiprocessing generally has little effect on the overhead added by virtualization systems, the benchmark was repeated with up to four physical and virtual processors running in parallel to confirm that this assumption holds true in practice. While the newly implemented additions performed as expected, there was one unfortunate bug in the existing code base that impaired the evaluation: in shadow paging environments, a race condition would sometimes stall the guest kernel


during boot as a processor failed to react to an inter-processor function call. It appeared rarely in all multiprocessing setups but always in the setup with exactly two processors. Even though nested paging environments were not affected, this is most likely an indirect consequence of timing differences rather than a true fix. Preliminary debugging efforts revealed Karma’s local APIC emulation or interrupt injection as the most likely culprits, but a full investigation would have exceeded the scope of this thesis. Therefore, figure 9 can only show data for setups that could be correctly booted.

6.2 Analysis

The functional correctness of this implementation has been confirmed by all tests on the emulator as well as on the physical machine. Even though the evaluation uncovered one bug, the fact that it affects shadow paging environments with and without VPIDs implies that it was already part of the existing code base and is unrelated to the newly added features.

The Fibonacci control benchmark yielded the expected results, demonstrating that none of the setups contain unaccounted sources of overhead and that any differences in the following measurements are caused purely by the respective benchmark program itself. It also provides a nice confirmation that practically native performance really can be achieved by all virtualization techniques when using benign workloads.

The other microbenchmarks clearly illustrated the inherent differences between shadow and nested paging: page faults, task switches and page table manipulations all severely strain shadow paging environments, with benchmark durations ranging from 8 up to 16 times longer than native execution. The comparatively low overhead in the Forkwait benchmark probably stems from it being less focussed than the other ones: even though process creation comprises all three kinds of intercepted operations, it also includes a lot of unrelated bookkeeping logic that does not require VMM interference. The Sockxmit benchmark, on the other hand, can show an exceptionally high overhead since it consists of almost nothing but task switches.

With nested paging, the results range from a barely visible 2% overhead on the Sockxmit benchmark to a mildly noticeable 12% with Forkwait. This is peculiar because none of the benchmarks should force any virtual machine exits, implying that there should be no significant differences between them. The only other sources of variable overhead are virtual machine exits from device accesses and the increased overall page walk length inherent to nested paging. The latter would mostly show in benchmarks with bad TLB utilization, because the processor performs a full page walk only when a translation is not already cached. However, since neither a reason for device accesses nor a source of outstandingly frequent TLB misses is readily apparent in the Forkwait setup, the cause of this discrepancy remains a mystery.

The addition of VPIDs provides a tiny bonus across the board to nested paging, but the benefit is so small that it is hardly distinguishable from measurement errors. Virtual machines with shadow paging, on the other hand, receive the expected far greater boost due to their higher prevalence of virtual machine exits. Corresponding to the amplified stress in the microbenchmarks, VPIDs can improve performance by about 13% for the first two and by over 20% for the Touchmem benchmark. This coincides with theoretical considerations, as the conserved TLB entries only reduce the performance degradation associated with the virtual machine exits themselves. The other benchmarks include task switches, requiring actual effort within the VMM for rebuilding the shadow page table, while Touchmem solely focusses on intercepts that are trivial to handle, which means a larger part of the overhead stems from the virtual machine exit itself.

The macrobenchmark demonstrates that the implementation truly reaches near-native performance on real-world workloads, achieving an overhead of only 2%. This is in stark contrast to the more than 50% overhead of shadow paging without VPIDs, which was Fiasco.OC’s maximum achievable virtualization performance on VMX before this thesis. The comparison with the kernel shadow paging solution for SVM is much closer, as it reportedly incurred only 4% overhead over native performance. However, while that result gets remarkably close, its overhead is still twice as large, and it is uncertain whether an analogous implementation for VMX could have achieved the same performance. The comparison with KVM, which incurred 5% overhead on the macrobenchmark, demonstrates that this implementation is on par with or even faster than established virtualization solutions.

When taking the multiprocessing results into account, however, that picture changes slightly: KVM draws level at two processors and leads by about 4% at four processors, where Fiasco.OC’s performance has decreased to almost 10% overhead compared to native execution. Since memory virtualization is done per processor and therefore not really affected by the number of processors, this can probably be attributed to existing inefficiencies in Karma or in Fiasco.OC’s multiprocessing support itself, which is still considered experimental. A similar (though less pronounced) effect had also been observed when the nested paging support for SVM was initially developed and evaluated [24].
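As a side note on the page walk length argument above, the usual textbook figures (not measurements from this evaluation) illustrate why only TLB-miss-heavy workloads should notice it: with g guest page table levels and h nested (EPT) levels, a completely uncached translation costs

    g * h + g + h memory accesses,

because each of the g guest page table entries is itself addressed guest-physically and therefore needs its own h-level EPT walk, and the final guest-physical data address needs one more. For the four-level tables used here this amounts to 4 * 4 + 4 + 4 = 24 accesses, compared to 4 for a native walk; a large factor per miss, but invisible as long as translations are served from the TLB.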

6.3 Complexity

As Fiasco.OC is a security-oriented microkernel, the additional value of every new feature needs to be carefully balanced against the increase in complexity of the trusted computing base. Although it is architecturally impossible to securely implement nested page table handling anywhere but in the kernel, the availability of functionally identical shadow paging solutions still implies that it is essentially an optional performance optimization rather than completely new functionality. This means that the performance gained through nested paging would need to be truly essential for Fiasco.OC’s competitiveness to warrant its inclusion — and the more than 50% overhead that the ordinary userland shadow paging implementation incurs even on untampered workloads attests just that. In addition, the use of polymorphism to cleanly encapsulate the bulk of the added complexity makes it possible to offer VMX nested paging as an optional feature of Fiasco.OC, enabling users who require efficient virtualization to include it without forcing others to share this additional burden on the trusted computing base.

As a complexity increase for better virtualization performance has thus been justified in general, the question remains whether the specific approach followed by this thesis is itself justifiable. The most credible alternative would have been kernel-internal shadow page table handling analogous to the existing work with SVM [25]. Despite minor differences in the hardware interfaces of the two virtualization architectures, it can be presumed that a corresponding VMX implementation would have required roughly the same complexity. Using the open source tool SLOCCount (originally developed to analyze the composition of the Linux kernel [32]) to measure the number of logical, non-comment lines of code, the SVM kernel shadow paging implementation contains 1349 more lines than the baseline repository it was forked from, while the code from this thesis adds only 868 lines to Fiasco.OC. On top of that, some of the former’s special optimizations require additional modifications to the guest operating system to paravirtualize parts of its page table handling, which is unnecessary with nested paging. Therefore, even though its hardware requirements are more demanding, the smaller complexity increase to the trusted computing base and the removal of at least half of the performance overhead on real-world workloads make nested paging on VMX a useful addition to Fiasco.OC’s virtualization repertoire.
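To illustrate the kind of encapsulation referred to above, the following schematic C++ sketch shows how an EPT-backed address space can hide its page attribute format behind the same mapping interface as an ordinary task (which would implement the interface with the regular IA-32 format). The class and method names are purely illustrative and do not correspond to Fiasco.OC’s actual identifiers; only the EPT permission bit positions (read, write, execute in bits 0 to 2) follow the Intel manual [9].

    // Schematic sketch only, not Fiasco.OC's real class hierarchy: the shared
    // mapping interface stays unchanged while the EPT attribute translation is
    // confined to one subclass that can be compiled out when it is not needed.
    #include <cstdint>

    using Virt_addr = std::uintptr_t;
    using Phys_addr = std::uint64_t;

    struct Page_attribs { bool writable; bool executable; };

    class Address_space
    {
    public:
        virtual ~Address_space() = default;
        virtual bool map(Virt_addr va, Phys_addr pa, Page_attribs a) = 0;
        virtual void unmap(Virt_addr va) = 0;
    };

    class Ept_space : public Address_space        // guest task backed by EPT
    {
    public:
        bool map(Virt_addr va, Phys_addr pa, Page_attribs a) override
        {
            // Translate the generic attributes into the EPT entry format.
            std::uint64_t entry = pa & ~0xfffULL;
            entry |= 0x1;                         // bit 0: read permission
            if (a.writable)   entry |= 0x2;       // bit 1: write permission
            if (a.executable) entry |= 0x4;       // bit 2: execute permission
            // Walking the EPT hierarchy, memory type bits and TLB/EPT
            // invalidation are omitted in this sketch.
            (void)va; (void)entry;
            return true;
        }
        void unmap(Virt_addr va) override { (void)va; /* analogous */ }
    };

Compiling such an EPT-specific subclass only when VMX nested paging is selected keeps the additional code out of the trusted computing base of configurations that do not need it.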

7 Conclusion

The goal of this thesis was to improve the memory virtualization performance of Fiasco.OC and the Karma VMM on processors using the Intel VMX architecture. It was achieved by implementing support for VMX’s nested paging feature and overcoming the obstacles in Fiasco.OC’s architecture that had prevented this in the past. This thesis has demonstrated that this approach is indeed feasible and can be realized through a small number of surgical modifications to Fiasco.OC’s memory management code with only a minor increase in kernel complexity. Additional supporting facilities and other minor enhancements were designed around that mechanism, resulting in an implementation that reaches near-native VMX virtualization performance and is both faster and less complex than the projected alternative approach of building an optimized shadow paging framework within the kernel. The design has been cleanly encapsulated in self-contained modules, allowing it to be handled as an optional feature that can be removed at compile time when it is not needed.

However, even though both x86 virtualization architectures are now supported by fast and uncomplicated nested paging implementations, shadow paging is far from obsolete: many existing processors still lack the necessary hardware support for nested paging and may continue to do so in future models, such as the low-voltage Intel Atom family. Implementing highly optimized in-kernel shadow paging support for VMX would still be a useful addition to Fiasco.OC’s virtualization capabilities that would complement rather than compete with the execution model enabled by this thesis. It might even be possible to refactor and reuse the majority of the existing shadow paging code for SVM, further reducing the overall complexity of the trusted computing base in kernels compiled for generic distribution.

7.1 Future Work

While the memory virtualization performance in VMX nested paging mode should now be nearly optimal, the virtualization environment still offers potential for optimization in other areas: the existing approach of using a sanitized dummy VMCS to communicate configuration options from userland may be convenient, but reinitializing the whole data structure on every virtual machine entry is cumbersome and precludes the use of any processor-internal VMCS caching. Switching to a new interface that allows more targeted updates to VMCS fields and keeps a persistent copy per virtual machine in kernel memory may further reduce VMM intercept delays.
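A purely illustrative sketch of what such a more targeted interface could look like is given below. Neither the structures nor the function belong to an existing Fiasco.OC or Karma API; the whitelist, names and calling convention are assumptions. Only the two VMCS field encodings used as examples (guest RSP and RIP) and the VMWRITE instruction follow the Intel manual [9].

    // Illustrative sketch only: userland submits a small batch of (field, value)
    // pairs; the kernel checks each field against a whitelist and writes it into
    // the persistent, kernel-owned VMCS that is current on this processor.
    #include <cstdint>
    #include <cstddef>

    struct Vmcs_update { std::uint32_t field; std::uint64_t value; };

    static bool field_allowed(std::uint32_t field)
    {
        switch (field) {
        case 0x681c:                              // guest RSP (Intel SDM encoding)
        case 0x681e:                              // guest RIP (Intel SDM encoding)
            return true;
        default:                                  // everything else is rejected
            return false;
        }
    }

    static long apply_vmcs_updates(const Vmcs_update *u, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i++) {
            if (!field_allowed(u[i].field))
                return -1;
            unsigned long field = u[i].field;     // natural-width fields assumed
            unsigned long value = static_cast<unsigned long>(u[i].value);
            // Requires VMX root mode with the target VMCS current; the VMWRITE
            // error flags are not checked in this sketch.
            asm volatile("vmwrite %1, %0" :: "r"(field), "r"(value) : "cc", "memory");
        }
        return 0;
    }

Keeping the authoritative VMCS in kernel memory in this way would also allow the processor’s internal VMCS caching to remain effective across virtual machine entries.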

There are also some functional VMX features that Fiasco.OC currently does not support, such as direct I/O pass-through or hardware-assisted APIC emulation — while these are not safe to be completely ceded to the untrusted VMM, a proper sanitization mechanism could make them usable in a secure and isolated setting.

On Karma’s side, the potential for future work is not yet exhausted either: as the unforeseen problems during this thesis’ evaluation painfully showed, there are still bugs to be fixed in the existing code base. In addition, the current limitation to 32-bit addressing mode somewhat restricts its applicability for real-world use. While Fiasco.OC’s part of the virtualization stack, including the modules produced in this thesis, is already fully compatible with 64-bit execution, Karma will still require a major overhaul to measure up to it in that regard.

Nevertheless, while all these factors show that the virtualization environment is still very much in its experimental stage, this thesis has produced another essential component on the path towards a mature, flexible and performant virtualization solution that can solve real-world problems. In addition, other microkernels with a similar memory architecture (such as L4Ka::Pistachio) might struggle with the same underlying problem when implementing VMX nested page tables and could find a useful blueprint in the solution presented herein.

Appendix A: Code

The code produced in this thesis is hosted in Git repositories on the servers of the Chair for Security in Telecommunications (SecT) at the Technische Universität Berlin. Reproducing the evaluation setup presented in chapter 6 requires code from the following sources, in order:

• Download the complete Fiasco.OC/L4Re base repository from Technische Universität Dresden, revision 38: http://svn.tudos.org/repos/oc/tudos/trunk

• Replace the kernel/fiasco subtree with the master branch of git+ssh://git.sec.t-labs.tu-berlin.de/julius_werner/fiasco, commit cf07ab893b551bc7ae4a4b5154461202775affa3

• Replace the l4/pkg/l4sys subtree with the l4sys/master branch of git+ssh://git.sec.t-labs.tu-berlin.de/julius_werner/l4re, commit 57b042607ebf169d83fb7e585fdfd6919508dc43

• Add the contents of the julius_werner/karma_3.4 branch from the following repository as subdirectory l4/pkg/karma: git+ssh://git.sec.t-labs.tu-berlin.de/karma/karma, commit 12027207237236d3b4501c9be1541f283b9b5f87

• Build a paravirtualized Linux kernel from the sources at branch karma_3.4 in: git://git.karma-vmm.org/karma-linux, commit c9c915fa33de86b774b5b32666e481bd12c07808

• The initial RAM file system image containing sources and binaries for the evaluation benchmarks can be found in branch julius_werner/thesis of the following repository, alongside the raw evaluation protocol and sources for this document and the corresponding slide presentation: git+ssh://git.sec.t-labs.tu-berlin.de/karma/karma, commit 09d08ae3b1b537531dbb07f8c8800b793ab0e59d

While all source code modifications produced in the course of this thesis are hereby released under the GNU General Public License Version 2, authorization to access some of these repositories may be restricted by the respective host. Please refer all questions regarding access and licensing issues to the Chair for Security in Telecommunications (SecT) at the Technische Universität Berlin.

Glossary

AMD Advanced Micro Devices — second leading manufacturer of x86 processors.

API Application Programming Interface — specification defining how other components can communicate with or request services from a given software module.

APIC Advanced Programmable Interrupt Controller — x86 device essential for multiprocessing.

ASID Address Space Identifier — SVM extension that allows tagged TLB entries.

BIOS Basic Input/Output System — legacy firmware of most x86-based computers.

CP-40 Control Program 40 — research prototype system that invented virtualization. [7]

CP-67 Control Program 67 — successor of CP-40, first commercial virtualization environment.

C++ Object-oriented programming language based on C.

EPT Extended Page Table — VMX extension that allows nested paging.

Fiasco.OC L4-based microkernel focussed on preemptibility and hard real-time guarantees. [20]

IBM International Business Machines — leading manufacturer of mainframe computers.

Intel Intel Corporation — inventor of the x86 architecture.

I/O Input/Output — communication between a processor and its peripheral devices.

Karma Experimental VMM for Fiasco.OC. [24]

KiB Kibibyte — 1024 bytes.

KVM Linux Kernel Virtual Machine — well-established Type II hypervisor. [11]

Linux Well-established open source operating system kernel, led by Linus Torvalds. [30]

L4 Original second-generation microkernel created by Jochen Liedtke. [3]

L4Android Port of the Android smartphone operating system to the Fiasco.OC microkernel. [5]

L4Linux Port of the Linux operating system kernel to L4-based microkernels. [6]

L4Re L4 Runtime Environment — collection of userland modules that complement Fiasco.OC.

Mach Most prominent first generation microkernel. [16]

MiB Mebibyte — 1048576 bytes.

NOVA Experimental stand-alone Type I hypervisor built on microkernel concepts. [28]

PCID Process Context Identifier — feature in x86 64-bit mode that allows tagged TLB entries.

L4Ka::Pistachio L4-based microkernel focussed on high performance and portability. [26]

POSIX Portable Operating System Interface for UNIX — common userland API specification.

RAM Random Access Memory — volatile, writeable primary data storage in computers.

ROM Read-Only Memory — persistent data storage usually housing firmware.

σ0 Special task that owns all userland memory in L4.

SVM AMD Secure Virtual Machine — x86 extension for hardware-assisted virtualization. [15]

TLB Translation Lookaside Buffer — processor-internal cache for page translations.

Turtles Experimental KVM extension that allows efficient nested virtualization. [29]

vCPU Alternative to Fiasco.OC’s threads that can receive asynchronous interrupts. [22]

VMCS Virtual Machine Control Structure — data structure used to configure VMX.

VMM Virtual Machine Monitor — host userland program that supervises a virtual machine.

VMX Intel Virtual Machine Extensions — x86 extension for hardware-assisted virtualization. [9]

VPID Virtual Processor Identifier — VMX extension that allows tagged TLB entries.

x86 Successful processor architecture family started by the Intel 8086, also called IA-32. [9]

Bibliography

[1] Dionysus Blazakis. The Apple sandbox, 2011. URL https://media.blackhat.com/bh-dc-11/Blazakis/BlackHat_DC_2011_Blazakis_Apple_Sandbox-wp.pdf. Black Hat DC 2011.

[2] Robert N. M. Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. Capsicum: practical capabilities for UNIX. In Security ’10: Proceedings of the 19th USENIX Security Symposium, Berkeley, CA, USA, 2010. USENIX Association.

[3] Jochen Liedtke. On µ-kernel construction. In SOSP ’95: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 237–250, New York, NY, USA, 1995. ACM.

[4] Green Hills Software. INTEGRITY-178B RTOS, 2012. URL http://www.ghs.com/products/safety_critical/integrity-do-178b.html.

[5] Matthias Lange, Steffen Liebergeld, Adam Lackorzynski, Alexander Warg, and Michael Peter. L4Android: A generic operating system framework for secure smartphones. In SPSM ’11: Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, pages 39–50, New York, NY, USA, 2011. ACM.

[6] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In SOSP ’97: Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, pages 66–77, New York, NY, USA, 1997. ACM.

[7] Robin J. Adair, Richard U. Bayles, Les W. Comeau, and Robert J. Creasy. A virtual machine system for the 360/40. Cambridge Scientific Center Report 320-2007, 1966.

[8] Robert P. Goldberg. Architectural Principles for Virtual Computer Systems. PhD thesis, Harvard University, 1973. DTIC accession no. AD0772809.

[9] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual, 2011.

[10] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP ’03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 164–177, New York, NY, USA, 2003. ACM.

[11] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. kvm: the Linux virtual machine monitor. In Proceedings of the Linux Symposium, volume 1, pages 225–230, 2007.

[12] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. In SOSP ’73: Proceedings of the Fourth ACM Symposium on Operating System Principles, pages 121–130, New York, NY, USA, 1973. ACM.

[13] John S. Robin and Cynthia I. Irvine. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In Proceedings of the 9th USENIX Security Symposium, pages 229–240, Berkeley, CA, USA, 2000. USENIX Association.

[14] Mendel Rosenblum and Tal Garfinkel. Virtual machine monitors: Current technology and future trends. Computer, 2005.

[15] Advanced Micro Devices, Inc. Secure Virtual Machine Architecture Reference Manual, 2005.

[16] Richard F. Rashid. From RIG to Accent to Mach: The evolution of a network operating system. In ACM ’86: Proceedings of 1986 ACM Fall Joint Computer Conference, pages 1128–1137, New York, NY, USA, 1986. ACM.

[17] Jerome H. Saltzer. Protection and the control of information sharing in Multics. Communications of the ACM, 17(7):388–402, 1974.

[18] John M. Rushby. Design and verification of secure systems. In SOSP ’81: Proceedings of the Eighth ACM Symposium on Operating Systems Principles, pages 12–21, New York, NY, USA, 1981. ACM.

[19] Jochen Liedtke. Improving IPC by kernel design. In SOSP ’93: Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 175–188, New York, NY, USA, 1993. ACM.

[20] Operating Systems Group, Technische Universität Dresden. Fiasco.OC, revision 38, 2011. URL http://os.inf.tu-dresden.de/fiasco.

[21] Adam Lackorzynski and Alexander Warg. Taming subsystems — capabilities as universal resource access control in L4. In IIES ’09: Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems, pages 25–30, New York, NY, USA, 2009. ACM.

[22] Adam Lackorzynski, Alexander Warg, and Michael Peter. Virtual processors as kernel interface. In Proceedings of the Twelfth Real-Time Linux Workshop, Nairobi, Kenya, 2010.

[23] Michael Peter, Henning Schild, Adam Lackorzynski, and Alexander Warg. Virtual machines jailed — virtualization in systems with small trusted computing bases. In VDTS ’09: Proceedings of the 1st EuroSys Workshop on Virtualization Technology for Dependable Systems, pages 18–23, New York, NY, USA, 2009. ACM.

[24] Steffen Liebergeld. Lightweight Virtualization on Microkernel-based Systems. Diploma thesis, Technische Universität Dresden, 2010.

[25] Jan C. Nordholz. Efficient Virtualization on Hardware with Limited Virtualization Support. Diploma thesis, Technische Universität Berlin, 2011.

[26] System Architecture Group, University of Karlsruhe. L4Ka::Pistachio, 2010. URL http://www.l4ka.org/pistachio.

[27] Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman, Peter Chubb, Ben Leslie, and Gernot Heiser. Pre-virtualization: Slashing the cost of virtualization. Technical Report 2005-30, Fakultät für Informatik, Universität Karlsruhe (TH), November 2005.

[28] Udo Steinberg and Bernhard Kauer. NOVA: A microhypervisor-based secure virtualization architecture. In EuroSys ’10: Proceedings of the Fifth European Conference on Computer Systems, pages 209–222, New York, NY, USA, 2010. ACM.

[29] Muli Ben-Yehuda, Michael D. Day, Zvi Dubitzky, Michael Factor, Nadav Har’El, Abel Gordon, Anthony Liguori, Orit Wasserman, and Ben-Ami Yassour. The Turtles project: Design and implementation of nested virtualization. In OSDI ’10: Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, pages 423–436, Berkeley, CA, USA, 2010. USENIX Association.

[30] Linus Torvalds et al. Linux, version 3.4.0, 2012. URL http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.4.tar.bz2.

[31] Bryce Denny, Greg Alexander, Todd Fries, Donald Becker, Tim Butler, et al. Bochs, version 2.5.1, 2012. URL http://bochs.sourceforge.net.

[32] David A. Wheeler. More than a gigabuck: Estimating GNU/Linux’s size, 2002. URL http://www.dwheeler.com/sloc.

(All web sources were last accessed on August 19, 2012.)
