DISS. ETH NO. 24811

On the Construction of Dynamic and Adaptive Operating Systems

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

presented by

Gerd Zellweger

Master of Science ETH in Computer Science, ETH Zurich

born on 26.06.1987

citizen of Switzerland

accepted on the recommendation of

Prof. Dr. Timothy Roscoe (ETH Zurich), examiner Prof. Dr. Gustavo Alonso (ETH Zurich), co-examiner Prof. Dr. Jonathan Appavoo (Boston University), co-examiner

2017

Abstract

Trends in hardware development indicate that computer architectures will go through considerable changes in the near future. One reason for this is the end of Moore’s Law, which implies that CPUs can no longer simply become faster or more complex by virtue of more and smaller transistors on each new chip. Instead, applications either have to use multiple cores or specialized hardware to achieve performance gains. Another reason is the end of Dennard scaling, which means that as transistors get smaller, the power consumed per unit of chip area no longer remains constant. The implications are that large areas of a chip have to be powered down most of the time, and system software has to dynamically enable and disable the hardware that applications want to use. Operating systems as of today were not designed for such a future; rather, they assume homogeneous hardware that remains in the same static configuration during its runtime. In this dissertation, we study how operating systems must change to handle dynamic hardware, where cores and other devices in the system can power on and off individually. Furthermore, we examine how operating systems can be made more adaptive to suit the needs of applications better, or to specialize themselves for the underlying hardware. First, we present Barrelfish/DC, which decouples physical cores from the OS, and the OS itself from application state. The resulting design allows the OS to treat all available cores in the system as fully dynamic. Next, we present Badis, an OS architecture that enables applications to execute on top of adapted OS kernels and services. We show that the flexibility to run specialized kernels helps applications by optimizing the OS for the workload requirements, but also allows the OS to optimize for the complexity and heterogeneity present in modern and future machines. Finally, we propose a mechanism that promotes virtual address spaces to first-class citizens, thus enabling a process to attach to, detach from, and switch between multiple virtual address spaces. The system enables applications to quickly and dynamically change their logical view of the memory system, and we show how this can increase performance considerably.

Zusammenfassung

Trends in hardware development indicate that computer architectures will undergo considerable changes in the near future. One reason for this is the end of Moore’s Law: future CPUs can no longer simply become faster or more complex by relying on ever smaller, and therefore more numerous, transistors. Instead, programs must use multiple CPUs or specialized hardware to improve their performance. A further reason is the end of Dennard scaling, which means that as transistors shrink, power consumption no longer depends only on the occupied chip area. The consequence is that, in the future, large areas of a chip must remain powered off, and system software must dynamically power on whichever hardware an application wants to use. Today’s operating systems were designed for homogeneous hardware that remains in the same static configuration throughout its runtime. In this dissertation, we study how operating systems must change to cope with dynamic hardware in which CPUs and other devices are individually powered on and off. Furthermore, we analyze how operating systems can be made more adaptive, to better meet the requirements of applications or to better adapt to the underlying hardware. First, we present Barrelfish/DC, a system that decouples physical CPUs from the OS, and the OS itself from application state. The resulting design allows the OS to treat all CPU cores of a system as dynamic. Next, we present Badis, an OS architecture that allows applications to run on top of adapted OS kernels and system services. We show how this flexibility helps to better satisfy the requirements of applications, while at the same time allowing the OS to be optimized for complex and heterogeneous hardware. Finally, we introduce a mechanism that treats virtual address spaces as first-class objects and allows processes to create and share them, and to switch between them. The mechanism allows applications to quickly and dynamically change their view of the available memory, which in turn enables a variety of performance improvements.

Acknowledgments

The work presented in this dissertation was shaped, created, and described through collaboration and interactions with many wonderful and exceptional friends and colleagues. First of all, I would like to express my gratitude to my advisor Timothy Roscoe for always believing in me, supporting me, and mentoring me during my master’s and doctoral studies. Your optimism and joyful approach to everything you do make it truly a pleasure to work with you. Likewise, I thank Gustavo Alonso for being a fantastic and encouraging co-advisor who always gave insightful advice on my research and this dissertation. Finally, thanks to you, Jonathan, for taking part in my committee and for all the reassuring feedback you have given me. During my studies and internships, I had the opportunity to collaborate with many incredibly smart and dedicated colleagues who directly and indirectly contributed to this dissertation: Adrian Schüpbach, Alexander Merritt, Besmira Nushi, Dejan Milojicic, Denny Lin, Gabriel Kliot, Izzat El Hajj, Jana Giceva, Kornilios Kourtis, Paolo Faraboschi, Reto Achermann, Simon Gerber and Wen-mei Hwu. A special thanks also goes to all the friends I have gained during my time in the Systems Group: Akhi, Andreas, Anja, Besmira, Claude, Darko, Georgios, Gitalee, Ingo, Lefteris, Lukas, Marco, Moritz, Nina, Pratanu, Pravin, Reto, Renato, Roni, Simon and Stefan. Our ski trips, vacations, and events are unforgettable and made the last five years in the group a pleasure. Finally, for all the great things I have been able to do in life, I thank my parents Irène and Max and my brother Urs. They have been supportive like no one else and always encouraged me to do what I enjoy.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Contributions
  1.3 Background: Barrelfish
    1.3.1 CPU Driver and Monitor
    1.3.2 Capabilities
    1.3.3 Scheduling
    1.3.4 Device management
  1.4 Evaluation methodology
  1.5 Overview

2 Decoupling Cores, Kernels and Operating Systems
  2.1 Motivation
    2.1.1 Hardware
    2.1.2 Software
  2.2 Related work
    2.2.1 CPU Hotplug
    2.2.2 Kernel updates
    2.2.3 Multikernels
    2.2.4 Virtualization
  2.3 Design and Implementation
    2.3.1 Booting a new core
    2.3.2 Per-core state
    2.3.3 Capabilities in Barrelfish/DC
    2.3.4 Kernel Control Blocks
    2.3.5 Replacing a kernel
    2.3.6 Kernel sharing and core shutdown
    2.3.7 Dealing with time
    2.3.8 Dealing with interrupts
    2.3.9 Application support
    2.3.10 Discussion
  2.4 Evaluation
    2.4.1 Core management operations
    2.4.2 Applications
      2.4.2.1 Ethernet driver
      2.4.2.2 Web server
      2.4.2.3 PostgreSQL
  2.5 Concluding remarks

3 A Framework for an Adaptive OS Architecture
  3.1 Motivation
    3.1.1 Use-case: Coordinated parallel data processing
    3.1.2 Use-case: Eliminating OS noise
  3.2 Related work
    3.2.1 Customization
    3.2.2 High-performance computing
    3.2.3 Scheduling parallel workloads
    3.2.4 OS abstractions for parallel execution
  3.3 Customization Goals
    3.3.1 Run-to-completion execution
    3.3.2 Co-scheduling
    3.3.3 Spatial isolation of tasks and threads
    3.3.4 OS interfaces
    3.3.5 Data aware task placement
  3.4 Design and Implementation
    3.4.1 Control plane
    3.4.2 Compute plane
    3.4.3 Discussion
  3.5 Basslet: A kernel based, task-parallel runtime system
    3.5.1 Task-parallel compute plane kernel
    3.5.2 Compute plane configuration
    3.5.3 Basslet runtime libraries
      3.5.3.1 Porting pthreads to Basslet
      3.5.3.2 Porting OpenMP to Basslet
    3.5.4 Basslet code size
  3.6 bfrt: A real-time OS kernel
  3.7 Evaluation
    3.7.1 Basslet runtime
      3.7.1.1 Interference between a pair of parallel jobs
      3.7.1.2 System throughput scale-out
      3.7.1.3 Standalone runtime comparison
    3.7.2 Performance isolation with bfrt
    3.7.3 Badis OS architecture
      3.7.3.1 Control plane applications
      3.7.3.2 Overhead of Badis enqueuing
  3.8 Concluding remarks

4 Using Multiple Address Spaces in Applications
  4.1 Motivation
    4.1.1 Memory technology
    4.1.2 Preserving pointer-based data structures
    4.1.3 Large-scale sharing of memory
    4.1.4 Problems with legacy methods
  4.2 Related work
    4.2.1 Operating systems
    4.2.2
    4.2.3 Communication and Sharing
    4.2.4 Hardware
  4.3 Design
    4.3.1 Lockable Segments
    4.3.2 Multiple Virtual Address Spaces
  4.4 Implementation
    4.4.1 Barrelfish
    4.4.2 DragonFly BSD
    4.4.3 Runtime library
    4.4.4 Discussion
  4.5 Evaluation
    4.5.1 Microbenchmarks
    4.5.2 GUPS: Addressing Large Memory
    4.5.3 Redis with Multiple Address Spaces
    4.5.4 SAMTools: In-Memory Data Structures
  4.6 Concluding remarks

5 Conclusion
  5.1 Specialized hardware
  5.2 Rack-scale systems
  5.3 Near-data processing

1 Introduction

This dissertation argues that operating systems face a significant challenge in adapting to future hardware and the diverse set of workloads we execute on machines. With the beginning of the multi-core era around 2004, we have seen renewed interest in the design and implementation of scalable operating systems, data structures, and algorithms to avoid contention in the presence of shared resources. Since then, hardware vendors have continued to scale out processor designs with more cores and larger caches. Hardware complexity has increased significantly due to the addition of new instructions, features, and highly specialized execution units. In addition, a variety of non-uniform system designs have been introduced by hardware vendors and the scientific community: asymmetric multi-core processors that trade off performance and energy characteristics in servers as well as smartphones, or reconfigurable processors whereby independent cores can be morphed into a larger, more powerful CPU and split again dynamically.

This trend is a result of the failure of Dennard scaling, which refers to the observation that transistor and voltage scaling are no longer in line with each other. As a result, we see sharp increases in power densities that prevent powering on all transistors simultaneously in a given area. This has several implications for system software: future operating systems can no longer assume that all cores are of the same type, and they need mechanisms to handle platform-specific code in the OS for different cores. The OS can no longer rely on cores or other devices being static entities that remain in the same configuration or quantity all the time. Proposed techniques such as core fusion require support in operating systems to adjust dynamically to changes in CPU arrangements. Such re-adjustments may happen in quick succession as power budgets or workloads change. Traditional operating systems are not designed with such hardware in mind and optimize for a static environment with rare changes in the compute resources. In contrast, Chapter 2 explores the design, implementation, and implications of an operating system that treats the underlying cores as dynamic devices that can appear and disappear quickly, and that can adapt to such changes with minimal disruption to running programs and services.

Furthermore, this dissertation argues for making operating systems more adaptive to better adjust to application requirements. Over the past few years, the popularity of general data-analytics applications, using machine learning, graph processing, SQL and NoSQL databases, just to name a few, has led to the introduction of a mix of systems with very different workload characteristics. Large-scale servers and rack-scale computers are intriguing platforms to execute these workloads on. First, these systems offer plenty of volatile memory to keep data access latency low. Second, they contain enough compute resources to consolidate multiple data-processing systems that provide different services to clients. While sharing leads to gains in efficiency and cost savings, it also brings several challenges for system software. Today’s sophisticated cache hierarchies, bus topologies, and CPU architectures lead to surprising interactions that negatively impact the performance of individual applications. The scheduling and resource management in operating systems needs to take this into account to maximize processing performance. However, many data-processing systems rely on intrinsic runtime systems or integrate existing runtimes for parallel processing. Every runtime makes its own, independent scheduling decisions, and each runtime’s view is limited by what little information an operating system can provide. One cause for this is that the traditional APIs and execution policies offered by default by operating systems are not expressive enough to coordinate and schedule large data-processing workloads. As a result, many different data-processing systems re-implement and replicate much code and many policy decisions that should be handled by the OS.

Chapter 3 explores a new approach to customizing an operating system for particular applications and workloads. We present an adaptive OS architecture with the ability to specialize the kernel on individual cores of a machine, all while guaranteeing seamless inter-operation with traditional OS services. The resulting system provides services tailored to specific application classes. We show one such instance by implementing an OS-based, task-parallel runtime service. The approach offers an efficient and balanced runtime environment that effectively coordinates parallel data-processing systems while simplifying the application logic of the systems themselves.


Finally, this dissertation argues for better OS mechanisms to aid with memory management and data processing, given the overall increase in the volume of data produced daily, with no foreseeable end in sight. Current advances in memory technology indicate that by 2020 we can realistically expect massive pools (petabytes) of non-volatile memory (NVM) at rack scale, which will be byte-addressable and accessible by a large number of compute nodes over low-latency networks [FKMM15]. System software needs to ensure that applications can continue to access this vast amount of available data with low overheads. Traditional block-based storage will lose importance in favor of persistent memory that is byte-addressable directly by CPUs. While such technology has the potential to simplify the traditional storage layer considerably, there are several challenges. First, main memory capacity is starting to exceed the limits of the virtual and physical address bits of today’s CPUs; processes that want to address very large amounts of memory run into the limitation that they are unable to access everything. Second, the data format inside a process typically differs from the data format used when persisting the data, or when communicating the information among different processes. In such cases, data is normally serialized and stored as byte blobs on block-based storage systems, or exchanged as byte streams over communication endpoints; the reader or receiving endpoint then de-serializes the data back into pointer-based data structures within its address space. This impedance mismatch is the cause of massive performance overheads. However, memory-centric architectures have the potential to avoid much of this overhead if communication happens between nodes that are connected to the same storage pools. Chapter 4 explores the limitations of existing OS memory APIs and how they can be avoided to fully leverage the potential of new, memory-centric architectures.

1.1 Problem statement

The work in this dissertation addresses the following problems and research questions in the context of system software.

1. Future hardware with radically new architectures poses a challenge for traditional OS designs. In order to leverage novel architecture proposals, future operating systems must adapt to a dynamic set of physical cores with regards to number, micro-architecture, or offered intrinsic properties and abilities. We explore the research question of how an OS should be structured to dynamically adjust to hardware changes with minimal effect on applications.

2. Operating system policies and APIs are traditionally designed and implemented by trying to find and satisfy the common requirements across many applications. This often proves to be a poor match for heavily specialized or resource-intensive applications. Standard operating systems are often inflexible and allow for little customization or cooperation with applications and runtime systems. In this dissertation, we take a fresh look at enabling the customization of systems code in the context of the multikernel architecture: we explore the question of how the OS can be specialized on parts of the machine using a multikernel, and investigate how this benefits applications.

3. The continued growth in main-memory capacities is pushing the limits of traditional memory systems in hardware and application software. This problem is expected to get worse with new memory technologies and memory-centric hardware architectures. This dissertation explores operating system mechanisms that benefit the programming and performance of data-processing applications in light of such new hardware.

1.2 Contributions

This dissertation makes three principal contributions:

1. It presents an operating system architecture based on the principle that all cores are fully dynamic. This is achieved by two innovations: (a) leveraging the multikernel design to separate all OS and application state from the underlying hardware cores and operating system implementation, and (b) introducing boot drivers to abstract cores as regular devices. The result is a system where native kernel code on any core can be quickly replaced, kernel state can be moved between cores, and cores can be added to and removed from the system, transparently to applications and OS services.

2. It describes the necessary additions to the architecture proposed in (1) to customize the operating system on parts of the machine while allowing seamless integration with the rest of the system. Namely, we introduce a kernel-based runtime system that supports coordinated execution of disjoint data-processing systems and can act as a drop-in replacement for their respective user-space parallel runtime systems. We show how this maximizes overall system throughput by coordinating the scheduling of different data-processing systems while minimizing hardware resource contention.

3. It designs and evaluates OS mechanisms for processes to create, structure, compose, and access multiple virtual address spaces. The main novelty of the proposed design is that it allows a process to arbitrarily switch its context between multiple address spaces in a safe and controlled way. We show that dynamically switching address spaces can be leveraged by applications in more than one way: for example, to address memory sizes larger than what is supported by the system, or to avoid communication and serialization overheads.

1.3 Background: Barrelfish

We implemented the systems proposed in this dissertation using the Barrelfish research OS. Barrelfish supports multiple hardware platforms and can run a variety of common applications such as databases and web servers. The operating system is structured as a multikernel: a distributed system of cores communicating solely via asynchronous messages. The multikernel model, originally proposed by Baumann et al. [BBD+09], is based on three design principles:

• Explicit inter-core communication: In a multikernel, cross-core communication and synchronization are performed using explicit messages. As a result, the OS does not rely on shared memory between two cores.

• Hardware-neutral OS structure: The multikernel factors the OS into hardware-neutral and hardware-dependent code. By default, code is written to be hardware-neutral, which means that adapting the OS to support a new architecture does not require extensive changes. Architecture-specific code is used in parts where performance is critical (e.g., the message-passing implementation).

• Replicated instead of shared state: Operating systems typically share data structures such as page tables, memory pools, or scheduling queues across cores, and therefore rely on locking schemes to guarantee the integrity of OS state. Instead, a multikernel copies its state across the cores and uses replication techniques to keep it consistent.
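To make the third principle concrete, the following minimal C sketch shows the replicated-state pattern: each core updates its private replica and broadcasts explicit messages rather than locking a shared structure. The inbox-based channel and all names here (chan_send, table_set, poll_inbox) are hypothetical stand-ins, not Barrelfish's actual message-passing interface.

```c
#include <stdint.h>

#define NCORES 4
#define NKEYS  1024   /* callers must pass key < NKEYS */
#define QLEN   128

struct update_msg { uint32_t key; uint64_t value; };

/* Per-core state: a private replica plus an inbox of pending updates.
 * In a real multikernel the inbox would be a message channel; no two
 * cores ever write to the same replica. */
struct core {
    uint64_t replica[NKEYS];
    struct update_msg inbox[QLEN];
    unsigned head, tail;
};

static struct core cores[NCORES];

static void chan_send(int target, struct update_msg m)
{
    struct core *c = &cores[target];
    c->inbox[c->tail++ % QLEN] = m;
}

/* Write through the local replica, then broadcast explicit messages
 * instead of updating a shared, locked structure. */
void table_set(int my_core, uint32_t key, uint64_t value)
{
    cores[my_core].replica[key] = value;
    struct update_msg m = { key, value };
    for (int t = 0; t < NCORES; t++)
        if (t != my_core)
            chan_send(t, m);
}

/* Each core drains its inbox from its own event loop, applying remote
 * updates to its local replica. */
void poll_inbox(int my_core)
{
    struct core *c = &cores[my_core];
    while (c->head != c->tail) {
        struct update_msg m = c->inbox[c->head++ % QLEN];
        c->replica[m.key] = m.value;
    }
}
```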


Although a multikernel occupies a very different point in the design space compared to traditional, monolithic systems, we note that the concepts and designs proposed in this dissertation have more general applicability and are feasible for other operating systems as well (as shown, for example, in Chapter 4, where a reference implementation in both a BSD-variant OS and Barrelfish is presented). In this section, we briefly give an overview of the relevant concepts in Barrelfish that affected and influenced the chapters presented in this thesis.

1.3.1 CPU Driver and Monitor

The kernel – also called a CPU driver in Barrelfish – is responsible for managing an individual core. It implements features such as context switching and interrupt handling, and provides a system call interface that allows applications to interact with various on-core devices (the serial device, interrupt controller, etc.). The CPU driver adopts the design philosophy of the exokernel [EKO95]: it tries to minimize abstractions and instead mediates direct hardware access securely. The security model in Barrelfish is based on a partitioned capability system (Section 1.3.2). Barrelfish represents all resources (physical memory, I/O space, interrupts, etc.) as capabilities. The CPU driver ensures – with the help of the MMU – that memory used to store capabilities cannot be modified directly by a user process. Instead, a program creates, modifies, and copies capabilities safely with system calls, by providing a reference to a capability. One advantage of the capability system for the CPU driver is that it can eliminate memory allocation from its code entirely: the CPU driver is designed such that all kernel objects are allocated by user programs on its behalf. Furthermore, the CPU driver is built as a completely event-driven and non-preemptable program and does not have any built-in concurrency. These design decisions greatly reduce the complexity of the kernel and prevent deadlocks, race conditions, and memory leaks, as well as the need to worry about running out of memory in kernel code. Anything that requires system-wide coordination is built into a trusted user-space service called the monitor. Like the CPU driver, every core runs a separate instance of the monitor program. The monitor carries out the inter-core messaging needed to initiate low-level operations that must be coordinated among many CPU drivers. Examples of such operations include TLB shootdowns, initializing cross-core communication channels, sending capabilities across cores, and setting up shared memory regions.


The Barrelfish OS, with its CPU drivers and monitors, has a structure similar to a microkernel: the functionality in the kernel is kept to a minimum, and most OS services, including device drivers and a large part of the OS code that does not require privileged hardware access, run as user-space services.

1.3.2 Capabilities

Capabilities are used in Barrelfish to enforce protection and authorization and to manipulate OS state. Barrelfish uses a capability system inspired by seL4 [EDE08, Tea06]: capabilities are implemented using a partitioned capability system [CJ75, DdBF+94], meaning only the kernel itself can access and manipulate the memory where an application’s capabilities are stored. All system resources, including physical memory, threads, communication endpoints, etc., are represented as capabilities in the system.

Capabilities are typed and can be retyped by users. However, the retype operation is privileged, and the kernel ensures that retyping capabilities adheres to the specific rules and is within the rights of the program requesting the retype operation. A capability can also be split into two or more parts: for example, one can split a capability representing a big region of memory into multiple smaller chunks. The two operations together, retyping and splitting, provide a powerful mechanism to manage resources in user-space. For example, in Barrelfish every user-space program manages its address space by retyping physical memory regions to page tables and frames (mappable regions of memory). For security reasons, the system prevents a region of type page table from being mapped writable into a process’s address space. Instead, the program can use invocations to change the page-table state and safely install or remove mappings. Invocations are regular system calls and can be thought of as calling a method on a capability object. For example, to insert a physical memory mapping into a last-level page table, a program would invoke the map operation on the page-table capability, passing it a frame capability and slot number as arguments.

To store and find capabilities, every Barrelfish process has a CSpace, which can be thought of as a two-level page table. An application constructs a CSpace by allocating capabilities of type CNode. A CNode is a region of memory that is only accessible by the CPU driver and contains the capability metadata. Since a user-space program cannot manipulate capabilities directly, it refers to them using capability references. A capability reference in Barrelfish is a 32-bit address where the top bits select the slot in the top-level CNode and the remaining bits select the capability in the second-level CNode. On every invocation, the CPU driver performs a lookup, similar to a page-table walk, to find the capability data based on the provided capability reference.

One complication of the capability model in Barrelfish over seL4 is that the multikernel does not allow shared state between cores. An application running on multiple cores needs to maintain many different CSpaces, one for every core. The monitor is responsible for keeping the capability system in sync and ensures that operations like copying or revoking capabilities across cores are executed correctly [Nev12, SKN13].
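The following C sketch illustrates this two-level lookup and a typed invocation in the style of the map operation described above. The field widths, type names, and function signatures are illustrative assumptions for this sketch; Barrelfish's actual encoding and invocation path differ in detail.

```c
#include <stdint.h>
#include <stddef.h>

/* A capability reference: 32 bits, split (in this sketch) into a
 * top-level CNode slot and a second-level slot; low-order bits are
 * ignored here. The widths are assumptions, not the real encoding. */
typedef uint32_t capref_t;
#define TOP_BITS 8
#define SUB_BITS 8

enum cap_type { CAP_NULL, CAP_CNODE, CAP_RAM, CAP_PTABLE, CAP_FRAME };

/* Capability metadata, held in CNode memory that only the CPU driver
 * may write. */
struct capability {
    enum cap_type type;
    uint64_t base, bytes;   /* physical resource the capability names */
    void *obj;              /* kernel object, e.g. a child CNode */
};

struct cnode { struct capability slots[1u << SUB_BITS]; };

/* Kernel-side lookup on every invocation: a two-level walk, analogous
 * to a page-table walk, from user-supplied reference to metadata. */
static struct capability *caps_lookup(struct cnode *root, capref_t ref)
{
    uint32_t top = (ref >> (32 - TOP_BITS)) & ((1u << TOP_BITS) - 1);
    uint32_t sub = (ref >> (32 - TOP_BITS - SUB_BITS)) & ((1u << SUB_BITS) - 1);

    struct capability *l1 = &root->slots[top];
    if (l1->type != CAP_CNODE)
        return NULL;
    return &((struct cnode *)l1->obj)->slots[sub];
}

/* An invocation is an ordinary system call naming capabilities, e.g.
 * "map this frame into this page table at this slot". */
int ptable_map(struct cnode *root, capref_t pt_ref, capref_t fr_ref, int slot)
{
    struct capability *pt = caps_lookup(root, pt_ref);
    struct capability *fr = caps_lookup(root, fr_ref);
    if (pt == NULL || fr == NULL ||
        pt->type != CAP_PTABLE || fr->type != CAP_FRAME)
        return -1;   /* the CPU driver rejects ill-typed invocations */
    /* ... write a page-table entry for fr->base at index `slot` ... */
    (void)slot;
    return 0;
}
```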

1.3.3 Scheduling

In the kernel, Barrelfish schedules a set of dispatchers. A dispatcher is the kernel object that is the nearest equivalent to a process or kernel thread in Unix. Dispatchers typically contain a set of metadata, including the physical address of the root page-table or message endpoints. They are also represented as capabilities and thus are allocated by user-space programs. The CPU driver uses Rate-Based Earliest Deadline First (RBED) [BBLB03] as the default scheduling strategy for the OS. The scheduling algorithm allocates resources to dispatchers as a percentage of CPU time such that the total is always less than or equal to 100%. Hard and soft real-time tasks declare their worst-case execution time and period. The algorithm rejects a hard real-time task if it cannot guarantee the task’s execution within the proposed constraints and the current system state. Scheduling decisions in Barrelfish are made on a per-core basis: every CPU driver executes an independent RBED scheduler. Peter et al. [Pet12] later extended Barrelfish with gang-scheduling, which made it possible for an application to request from the OS that a set of dispatchers running on multiple cores execute simultaneously. The Barrelfish library OS implements user-level threads, allowing every program to spawn multiple threads per dispatcher. The user-level thread scheduler can be customized by the application, but by default it schedules threads in a simple round-robin fashion. To react to events and notifications from the hardware (e.g., interrupts or timers), the CPU driver uses an up-call mechanism [ABLL91] to notify the user-level thread scheduler.
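A minimal sketch of the admission test this implies is shown below, assuming utilization is tracked in parts-per-million; the names and bookkeeping are illustrative, not Barrelfish's actual RBED implementation.

```c
#include <stdint.h>

/* Hypothetical admission test in the spirit of RBED: a hard real-time
 * dispatcher declares its worst-case execution time (WCET) and period,
 * and is admitted only while total utilization stays at or below 100%. */

struct rt_params {
    uint64_t wcet_us;     /* worst-case execution time per period */
    uint64_t period_us;
};

static uint64_t utilization_ppm;   /* sum of wcet/period over admitted tasks */

int admit_hard_rt(struct rt_params p)
{
    uint64_t u = p.wcet_us * 1000000ULL / p.period_us;
    if (utilization_ppm + u > 1000000ULL)
        return -1;                 /* cannot guarantee constraints: reject */
    utilization_ppm += u;
    return 0;
}
```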


1.3.4 Device management

Driver domains are responsible for controlling devices in Barrelfish. They run in user-space and can contain several driver instances, each responsible for managing a single device. A driver instance is typically written as a C library and contains the necessary code to interface with the device. An instance exports two communication interfaces: first, a generic interface through which the device manager interacts with the driver instance (e.g., to stop or restart the driver); second, a class-specific interface for a given device type through which applications interact with the device. Holistically managing the devices of an entire machine can be quite involved: each device has to be found at runtime and can disappear and reappear at any point in time. Barrelfish uses a publish–subscribe system, called Octopus [ZSR12], to notify the system about the appearance and disappearance of devices. The device manager, called Kaluga, uses Octopus to subscribe to all events that are published due to hardware changes. When an event signals the appearance or disappearance of a device, Kaluga reacts by spawning a new driver domain and instance if necessary, and then directs the instance assigned to the device to either attach to or detach from it. On attaching, the driver instance assumes control over the device by setting it into a well-known state, and starts to accept requests over the exported device interface.
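The sketch below outlines this reaction loop in C. The event names, the driver_instance layout, and the spawn_driver_for stub are hypothetical placeholders rather than the real Octopus or Kaluga interfaces, and it simplifies to a single device.

```c
#include <stdio.h>

enum dev_event { DEV_APPEARED, DEV_DISAPPEARED };

struct device { const char *class_name; int id; };

/* Generic interface each driver instance exports to the device manager;
 * the class-specific interface for applications would sit alongside it. */
struct driver_instance {
    int (*attach)(struct device *d);  /* take control; reset to known state */
    int (*detach)(struct device *d);  /* release the device */
};

static int dummy_attach(struct device *d)
{
    printf("driver: attaching to %s%d\n", d->class_name, d->id);
    return 0;
}

static int dummy_detach(struct device *d)
{
    printf("driver: detaching from %s%d\n", d->class_name, d->id);
    return 0;
}

/* Stand-in for spawning a driver domain and instance for a device class. */
static struct driver_instance *spawn_driver_for(const char *class_name)
{
    static struct driver_instance dummy = { dummy_attach, dummy_detach };
    (void)class_name;
    return &dummy;
}

/* Handler subscribed to hardware-change events, in the style of Kaluga. */
void on_device_event(enum dev_event ev, struct device *d)
{
    static struct driver_instance *inst;   /* simplification: one device */
    if (ev == DEV_APPEARED) {
        if (inst == NULL)
            inst = spawn_driver_for(d->class_name);
        inst->attach(d);
    } else if (ev == DEV_DISAPPEARED && inst != NULL) {
        inst->detach(d);
    }
}
```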

1.4 Evaluation methodology

This dissertation proposes radical changes to the design of today’s operating systems. These designs are driven by how we anticipate hardware to change in the future. It is inherently difficult to evaluate such work, since no generally accepted methodology or benchmarks exist to reason about the performance of operating systems. Furthermore, to fully leverage the potential of the OS, applications would often have to be redesigned or written from scratch. The approach we followed in this dissertation is to design experiments specifically to exercise the new parts of the system, while using relevant, well-established applications that are in use today. These applications are executed either unmodified or with minimal modifications to the code base. This allows us to compare the performance achieved on our system with the performance of the application under test running on its native host OS.


1.5 Overview

Chapter 2: introduces Barrelfish/DC, an extension of the Barrelfish operating system which treats cores as fully dynamic devices. We describe in detail the new mechanisms and changes added to the OS, and evaluate the system’s ability to adapt quickly to changes in the underlying cores and its potential to specialize the kernel on certain cores for specific workloads. This work is the result of a collaboration with Simon Gerber, Kornilios Kourtis, and Timothy Roscoe. Parts of this chapter were published at the 11th Symposium on Operating Systems Design and Implementation (OSDI) in 2014 [ZGKR14].

Chapter 3: presents Badis, a framework to adapt and customize an OS on certain parts of the machine. By leveraging the multikernel design alongside the techniques presented in Chapter 2, we show how partitioning a machine into a control and a compute plane can improve the performance of co-existing parallel runtimes. This is joint work with Jana Giceva, Timothy Roscoe, and Gustavo Alonso. Parts of the chapter were published at the 12th International Workshop on Data Management on New Hardware (DaMoN) in 2016 [GZAR16].

Chapter 4: describes SpaceJMP, an OS extension that gives applications the ability to efficiently create, modify, and share entire address spaces, and to control switching among them. We show that such an approach has several benefits, reducing serialization and communication costs in existing applications and benchmarks. This work is the result of a collaboration with researchers from industry and academia: Izzat El Hajj, Alexander Merritt, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe and Karsten Schwan. Parts of this chapter were published at the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) in 2016 [EHMZ+16]. Some of the techniques discussed in this chapter are the subject of pending patent applications [EHMZM17b, EHMZM17a, EHMZM17c].

Chapter 5: concludes this dissertation and lays out possible directions towards better operating systems in light of future developments in hardware and software.

2 Decoupling Cores, Kernels and Operating Systems

The hardware landscape is increasingly dynamic. Future machines will contain large numbers of heterogeneous cores which will be powered on and off individually in response to workload changes. Cores themselves will have porous boundaries: some may be dynamically fused or split to provide more energy-efficient computation. Existing OS designs like Linux and Windows assume a static number of homogeneous cores, with recent extensions to allow core hotplugging.

This chapter introduces Barrelfish/DC, an OS design based on the principle that all cores are fully dynamic. Based on the Barrelfish research OS [Wika], it exploits the “multikernel” architecture to separate the OS state for each core. Barrelfish/DC handles dynamic cores more flexibly and with far less overhead than Linux, while bringing additional benefits in functionality, such as hotplug and on-demand runtime specialization of the OS kernel.

A key challenge with dynamic cores is safely disposing of per-core OS state when removing a core from the system: this process takes time and dominates the hardware latency of powering the core down, reducing any benefit in energy consumption. This chapter presents a technique that externalizes all the per-core OS and application state of a system into objects called OSnodes, which can be executed lazily on another core. While transparent to applications, this new design choice implies additional benefits not seen in prior systems: Barrelfish/DC can completely replace the OS kernel code running on any single core or subset of cores in the system at runtime, without disruption to any other OS or application code, including that running on the core. Kernels can be updated and bugs fixed without downtime, or replaced temporarily. Furthermore, per-core OS state can be moved between slow, low-power cores and fast, energy-hungry cores. Multiple cores’ state can be temporarily aggregated onto a single core to further trade off performance and power, or to dedicate an entire package to running a single job for a limited period. Parts of Barrelfish/DC can be moved onto and off cores optimized for particular workloads. Cores can be fused [IKKM07] transparently, and SMT threads [MDH+02, KAO05] or cores sharing functional units [BBSG11] can be selectively used for application threads or OS accelerators.

In this chapter, we present several innovations which together form Barrelfish/DC. Barrelfish/DC treats a CPU core as a special case of a peripheral device, and introduces the concept of a boot driver, which can start, stop, and restart a core while running elsewhere. We use a partitioned capability system for memory management, which allows us to completely externalize all OS state for a core. This in turn permits a kernel to be essentially stateless, and easily replaced while Barrelfish/DC continues to run. We factor the OS into per-core kernels (Barrelfish uses the term CPU driver for the kernel-mode code running on a core; in this thesis we use the term “kernel” instead, to avoid confusion with boot drivers) and OSnodes, and a Kernel Control Block provides a kernel-readable handle on the total state of an OSnode.

2.1 Motivation

Barrelfish/DC fully decouples cores from kernels (supervisory programs running in kernel mode), and moreover both of them from the per-core state of the OS as a whole and its associated applications (threads, address spaces, communication channels, etc.). This goes considerably beyond the core hotplug or dynamic core support in today’s OSes. Figure 2.1 shows the range of primitive kernel operations that Barrelfish/DC supports transparently to applications and without downtime as the system executes:

• A kernel on a core can be rebooted or replaced.

• The per-core OS state can be moved between cores.



Figure 2.1: The operations supported by a decoupled OS. Update: the entire kernel dispatching OSnode α is replaced at runtime. Move: OSnode α, containing all per-core state including applications, is migrated to another core and kernel. Park: OSnode α is moved to a core whose kernel temporarily dispatches two OSnodes. Unpark: OSnode α is transferred back to its previous core.

• Multiple per-core OS components can be relocated to temporarily “share” a core.
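A hypothetical, handle-based C interface for these four operations might look as follows; the signatures sketch the mechanisms described in this chapter and are not Barrelfish/DC's literal API.

```c
typedef int coreid_t;

/* Kernel control block: the kernel-readable handle on one OSnode. */
struct kcb;

/* Update: reboot core c into a new kernel binary; the OSnode currently
 * dispatched there survives the swap. */
int core_update_kernel(coreid_t c, const char *kernel_binary);

/* Move: relocate an OSnode -- all per-core OS and application state --
 * to the kernel running on another core. */
int osnode_move(struct kcb *n, coreid_t to);

/* Park: temporarily multiplex an OSnode onto a host core alongside the
 * OSnode already running there. */
int osnode_park(struct kcb *n, coreid_t host);

/* Unpark: give the OSnode a core (and kernel) of its own again. */
int osnode_unpark(struct kcb *n, coreid_t to);
```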

In this section, we argue why such functionality will become important in the future, based on recent trends in hardware and software.

2.1.1 Hardware

In recent years we have seen a continuous rise in core counts, both on a single chip and in a complete system, with a corresponding increase in the complexity of the memory system – non-uniform memory access and multiple levels of cache sharing. Systems software, and in particular the OS, must tackle the complex problem of scheduling both OS tasks and those of applications across a number of processors based on memory locality.

At the same time, cores themselves are becoming non-uniform: asymmetric multicore processors (AMP) [KFJ+03] mix cores of different microarchitectures (and therefore performance and energy characteristics) on a single processor. A key motivation for this is power reduction in embedded systems like smartphones: under high CPU load, complex, high-performance cores can complete tasks more quickly, resulting in power reduction in other areas of the system. Under light CPU load, however, it is more efficient to run tasks on simple, low-power cores. While migration between cores can be transparent to the OS (as is possible with, e.g., ARM’s “big.LITTLE” AMP architecture), a better solution is for the OS to manage a heterogeneous collection of cores itself, powering individual cores on and off reactively. Alternatively, Intel’s Turbo Boost feature, which increases the frequency and voltage of a core when others on the same die are sufficiently idle to keep the chip within its thermal envelope, is arguably a dynamic form of AMP [CJS+09].

At the same time, hotplug of processors, once the province of specialized machines like the Tandem NonStop systems [Bar81], is becoming more mainstream. More radical proposals for reconfiguring physical processors include Core Fusion [IKKM07], whereby multiple independent cores can be morphed into a larger CPU, pooling caches and functional units to improve the performance of sequential programs.

Ultimately, the age of “dark silicon” [EBSA+11] may well lead to increased core counts, but with a hard limit on the number that may be powered on at any given time. Performance advances and energy savings will subsequently have to derive from specialized hardware for particular workloads or operations [VSG+10]. The implications for a future OS are that it must manage a dynamic set of physical cores, and be able to adjust to changes in the number, configuration, and microarchitecture of cores available at runtime, while maintaining a stable execution environment for applications.

2.1.2 Software

Alongside hardware trends, there is increasing interest in modifying, upgrading, patching, or replacing OS kernels at runtime. Baumann et al. [BHA+05] implement dynamic kernel updates in K42, leveraging the object-oriented design of the OS, and later extend this to interface changes using object adapters and lazy update [BAW+07]. More recently, Ksplice [AK09] allows binary patching of Linux kernels without reboot, and works by comparing generated object code and replacing entire functions. Dynamic instrumentation systems like DTrace [CSL04] provide mechanisms that modify the kernel at runtime to analyze program behavior.


All these systems show that the key challenges in updating an OS online are to maintain critical invariants across the update and to do so with minimal interruption of service (the system should pause, if at all, for a minimal period). This is particularly hard in a multiprocessor kernel with shared state. With Barrelfish/DC, we argue for a system that addresses all these challenges in a single framework for core and kernel management in the OS.

2.2 Related work

Our work combines several directions in OS design and implementation: core hotplugging, kernel updates and replacement, and multikernel architectures. In this section, we discuss similar systems, along with tangential solutions such as virtualization and library operating systems.

2.2.1 CPU Hotplug

Most modern OS designs support some form of core hotplug. Since the overriding motivation is reliability, unplugging or plugging a core is considered a rare event, and the OS optimizes the common case where cores are not being hotplugged. For example, Linux CPU hotplug uses the __stop_machine() kernel call, which halts application execution on all online CPUs for typically hundreds of milliseconds [GMG12], overhead that increases further when the system is under CPU load [Lina]. We show further evidence of this cost in Section 2.4.1, where we compare Linux’s CPU hotplug with Barrelfish/DC’s core update operations.

Recognizing that processors will be configured much more frequently in the future for reasons of energy usage and performance optimization, Chameleon [PS12] identifies several bottlenecks in the existing Linux hotplug implementation due to global locks, and argues that current OSes are ill-equipped for processor sets that can be reconfigured at runtime. Chameleon extends Linux to provide support for changing the set of processors efficiently at runtime, and a scheduling framework for exploiting this new functionality. Chameleon can perform processor reconfiguration up to 100,000 times faster than Linux (version 2.6).

Barrelfish/DC is inspired in part by this work, but adopts a very different approach. Where Chameleon targets a single, monolithic shared kernel, Barrelfish/DC adopts a multikernel model and uses the ability to reboot individual kernels one by one to support CPU reconfiguration. Chameleon abstracts hardware processors behind processor proxies and execution objects, in part to handle the problem of per-core state (primarily interrupt handlers) on an offline or deconfigured processor. However, proxies can only be used for short durations, as they do not reschedule threads from an offline CPU. In contrast, Barrelfish/DC abstracts the per-core state (typically much larger in a shared-nothing multikernel than in a shared-memory kernel) behind the OSnode and kernel control block abstractions. The parking mechanism introduced in Barrelfish/DC gives the OS the ability to continue executing a parked core’s threads.

In a very different approach, Kozuch et al. [KKR09] show how commodity OS hibernation and hotplug facilities can be used to migrate a complete OS between different machines (with different hardware configurations) without virtualization.

Hypervisors are typically capable of simulating hotplugging of CPUs within a virtual machine. Barrelfish/DC can be deployed as a guest OS to manage a variable set of virtual CPUs allocated by the hypervisor. Indeed, Barrelfish/DC addresses a long-standing issue in virtualization: it is hard to fully virtualize the microarchitecture of a processor when VMs might migrate between asymmetric cores or between physical machines with different processors. As a guest, Barrelfish/DC can natively handle such heterogeneity and change without disrupting operation.

2.2.2 Kernel updates

Most modern mainstream OSes support dynamic loading and unloading of kernel modules, which can be used to update or specialize limited parts of the OS. The problem of patching system software without downtime of critical services has been a research area for some time, and several operating systems have implemented some form of dynamic updates in the past.

Solaris introduced a live upgrade mechanism [Sun01] which substantially reduced the service outage of an OS update by completely duplicating the running environment at runtime. While the original environment continued to run unhindered, the new one could be upgraded and tested. However, activation of the new system still required a full system reboot.


K42 explored updates of a running kernel, exploiting the system’s heavily object-oriented design. Baumann et al. [BHA+05] showed that by hot-swapping the object factory responsible for creating a certain type of object, together with a well-defined API for converting existing objects to a new version, it is possible to achieve dynamic updates at the granularity of an individual class. A quiescent state for applying the update is found by having only short-lived kernel threads, which are grouped into epochs: by tracking live threads in every generation, it is possible to determine when all threads using a specific object instance have terminated. The original proposal transformed every object at the time an update was loaded; later, the technique was extended to allow changes to an object’s interface and to apply updates lazily, the next time the object is referenced [BAW+07].

Linux has limited support for replacing the entire kernel using the kexec facility. However, this overwrites existing, non-persistent state and can be viewed as essentially a fast reboot of the machine. Ksplice [AK09] can patch a running Linux kernel without the need for a reboot by replacing code in the kernel at the granularity of complete functions. It uses the Linux stop_machine() call to ensure that no CPU is currently executing a function to be replaced, and places a branch instruction at the start of the obsolete function to direct execution to the replacement code. Systems like Ksplice replace individual functions across all cores at the same time. In contrast, Barrelfish/DC replaces entire kernels, but on a subset of cores at a time. Ksplice makes sense for an OS where all cores must execute in the same, shared-memory kernel and the overhead incurred by quiescing the entire machine is unavoidable.

Otherworld [DS10] also enables kernel updates without disrupting applications, with a focus on recovering from system crashes. Otherworld can microreboot the system kernel after a critical error without clobbering running applications’ state, and then attempt to restore applications that were running at the time of the crash by recreating application memory spaces, open files, and other resources.

Proteos [GKT13] uses a similar approach to Barrelfish/DC by replacing applications in their entirety instead of applying patches to existing code. In contrast to Ksplice, Proteos automatically applies state updates while preserving pointer integrity in many cases, which eases the burden on programmers to write complicated state-transformation functions. In contrast to Barrelfish/DC, Proteos does not upgrade kernel-mode code but focuses on updates for OS processes running in user-space, in a micro-kernel environment. However, much of the OS functionality in Barrelfish/DC resides in user-space as well, and the proposed techniques to automate state migration between old and new versions would also be applicable to kernel upgrades in Barrelfish/DC.

Rather than relying on a single, system-wide kernel, Barrelfish/DC exploits the multikernel environment to offer both greater flexibility and better performance: kernels and cores can be updated dynamically with (as we show in Section 2.4) negligible disruption to the rest of the OS.

2.2.3 Multikernels

Multikernels such as Akaros [RKZB11], Barrelfish, fos [WGB+10], Hive [CRD+95], and Tessellation [LKB+09] are based on the observation that modern hardware is a networked system, and so it is advantageous to model the OS as a distributed system. For example, Barrelfish runs a small kernel on each core in the system, and the OS is built as a set of cooperating processes, each running on one of these kernels, sharing no memory, and communicating via message passing. Multikernels are motivated both by the scalability advantages of sharing no cache lines between cores and by the goal of supporting future hardware with heterogeneous processors and little or no cache-coherent or shared physical memory.

Barrelfish/DC exploits the multikernel design for a new reason: dynamic and flexible management of the cores and kernels of the system. A multikernel can naturally run different versions of kernels on different cores. These versions can be tailored to the hardware, or specialized for different workloads. Furthermore, since (unlike in monolithic kernels) the state on each core is relatively decoupled from the rest of the system, multikernels are a good match for systems where cores come and go, and intuitively should support reconfiguration of part of the hardware without undue disruption to software running elsewhere on the machine. Finally, the shared-nothing multikernel architecture allows us to wrap kernel state and move it between different kernels without worrying about potentially harmful concurrent accesses.

Multikernels have been combined with traditional OS designs such as Linux [Jos10, NSN+11] so as to run multiple Linux kernels on different cores of the same machine using different partitions of physical memory, in order to provide performance isolation between applications. Popcorn Linux [She13] boots a modified Linux kernel in this fashion, and supports kernel- and user-space communication channels between kernels [SBRQ13], as well as process migration between kernels. In principle, Popcorn extended with the ideas in Barrelfish/DC could be combined with Chameleon in a two-level approach to dynamic processor support.

While their goals of security and availability differ somewhat from those of Barrelfish/DC, KeyKOS [Har85] and EROS [SSF99] use partitioned capabilities to provide an essentially stateless kernel. Memory in KeyKOS is persistent, and KeyKOS allows updates of the OS while running, achieving continuity by restoring from disk-based checkpoints of the entire capability state. Barrelfish/DC, by contrast, achieves continuity by distributing the capability system, restarting only some of the kernels at a time, and preserving each kernel’s portion of the capability system across the restart.

2.2.4 Virtualization

Virtualization techniques encapsulate a complete OS and its applications in a virtual machine (VM). Multiple VMs are then multiplexed over a set of cores by the virtual machine monitor. Nothing in our work precludes or requires virtualization techniques. Nevertheless, virtualization touches our work in two ways. First, our techniques might be used inside the hypervisor to manage the physical processors in the system. Second, the ability to specialize the kernel on certain cores achieves a similar effect to virtualization by running different operating system implementations side by side.

Library operating system approaches like Drawbridge [PBWH+11] and Bascule [BLF+13] encapsulate a single application’s state into a container which offers high-level abstractions (relative to hardware) like threads, sockets, and virtual address spaces, and communicates with the OS over a small, well-defined API. Drawbridge- and Bascule-encapsulated programs have been shown to run on Barrelfish, but the underlying OS (whether Barrelfish or Windows) views a library OS as no different from any other application, and so Drawbridge operates at a very different level of abstraction from our techniques.

2.3 Design and Implementation

We now describe how Barrelfish/DC decouples cores, kernels, and the rest of the OS. We focus entirely on mechanism, and so do not address scheduling and policies for kernel replacement, core power management, or application migration. Note also that our main motivation in Barrelfish/DC is adapting the OS for performance and flexibility, and so we do not consider fault tolerance and isolation for now. We first describe how Barrelfish/DC boots a new core, and then present in stages the problem of per-core state when removing a core, discussing the Barrelfish/DC capability system and kernel control block. We then discuss the challenges of time and interrupts, and finish with a discussion of the wider implications of the design.

2.3.1 Booting a new core

Current CPU hotplug approaches assume a single, shared kernel and a homogeneous (albeit NUMA) machine with a variable number of active cores up to a fixed limit, and so a static in-kernel table of cores (whether active or inactive) suffices to represent the current hardware state. Bringing a core online is a question of turning it on, updating this table, and creating per-core state when needed. Previous versions of Barrelfish also adopted this approach, and booted all cores during system initialization, though there has been experimental work on dynamic booting of heterogeneous cores [Men11].

Barrelfish/DC targets a broader hardware landscape, with complex machines comprising potentially heterogeneous cores. Furthermore, since Barrelfish/DC runs a different kernel instance on each core, there is no reason why the same kernel code should run everywhere. In fact, we evaluate some benefits of not doing so more closely in Chapter 3. For this reason, we argue for an OS representation of a core on the machine which abstracts the hardware-dependent mechanisms for bringing that core up (with some kernel) and down. Barrelfish/DC therefore introduces the concept of a boot driver: a piece of code that runs on a “home core”, manages a “target core”, and encapsulates the hardware functionality to boot, suspend, resume, and power down the latter. Currently, boot drivers run as processes, but they closely resemble device drivers and could equally run as software objects within another process. A new core is brought online as follows:

1. The new core is detected by some platform-specific mechanism (e.g., ACPI) and its appearance registered with the device management subsystem.

2. Barrelfish/DC selects and starts an appropriate boot driver for the new core.

3. Barrelfish/DC selects a kernel binary and arguments for the new core, and directs the boot driver to boot the kernel on the core.

4. The boot driver loads and relocates the kernel, and executes the hardware protocol to start the new core.

5. The new kernel initializes and uses existing Barrelfish protocols for integrating into the running OS.
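To make the division of labor concrete, the following is a minimal C sketch of a boot driver’s bring-up path. Every function and type name here is a hypothetical illustration of the steps above, not the actual Barrelfish/DC interface:

    /* Hypothetical sketch of a boot driver bringing a target core online.
     * All names below are illustrative; the real Barrelfish/DC code differs. */
    #include <stddef.h>

    typedef unsigned int coreid_t;
    struct elf_image { size_t size; /* ... */ };
    struct kcb;                      /* kernel control block, Section 2.3.4 */

    struct elf_image *load_kernel_binary(const char *path);
    void             *alloc_kernel_memory(size_t bytes);
    void             *relocate_kernel(struct elf_image *img, void *base);
    struct kcb       *alloc_kcb(void);
    int               arch_start_core(coreid_t target, void *entry,
                                      struct kcb *kcb, const char *args);

    int boot_core(coreid_t target, const char *kernel_path, const char *args)
    {
        /* Steps 3 and 4: load and relocate the selected kernel image. */
        struct elf_image *img = load_kernel_binary(kernel_path);
        if (img == NULL)
            return -1;
        void *entry = relocate_kernel(img, alloc_kernel_memory(img->size));

        /* A fresh KCB for the new kernel instance. */
        struct kcb *kcb = alloc_kcb();

        /* Step 4, continued: execute the hardware protocol to start the
         * core (e.g., INIT/SIPI on x86), handing over the entry point,
         * the KCB address, and the kernel command line. */
        return arch_start_core(target, entry, kcb, args);
    }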

The boot driver abstraction treats CPU cores much like peripheral devices, and allows us to reuse the OS’s existing device and hotplug management infrastructure [ZSR12] to handle new cores and to select drivers and kernels for them. It also separates the hardware-specific mechanism for booting a core from the policy question of which kernel binary to boot the core with. Boot drivers remove most of the core boot process from the kernel: in Barrelfish/DC we have entirely replaced the existing multiprocessor booting code for multiple architectures (which was spread throughout the system) with boot drivers, resulting in a much simpler system structure and reduced code in the kernels themselves.

Booting a core (and, indeed, shutting it down) in Barrelfish/DC involves only two processes: the boot driver on the home core, and the kernel on the target core. For this reason, we require no global locks or other synchronization in the system, and the performance of these operations is not impacted by load on other cores. We demonstrate these benefits experimentally in Section 2.4.1. Since a boot driver for a core requires (as with a device driver) at least one existing core to execute, there is a potential dependency problem as cores come and go. For the PC platform we focus on here, this is straightforward since any core can run a boot driver for any other core, but we note that in general the problem is the same as that of allocating device drivers to cores. Boot drivers provide a convenient abstraction of hardware and are also used to shut down cores, but this is not the main challenge in removing a core from the system.

2.3.2 Per-core state

Taking a core out of service in a modern OS is a more involved process than booting one, since modern multicore OSes include varying amounts of per-core kernel state. If they did not, removing a core would simply require migrating any running threads elsewhere, updating the scheduler, and halting the core.


The challenge is best understood by drawing a distinction between the global state in an OS kernel (i.e., the state which is shared between all running cores in the system) and the per-core state, which is only accessed by a single core. The kernel state of any OS is composed of these two categories. In older versions of Unix, for example, all kernel state was global and protected by locks. In practice, however, a modern OS keeps per-core state for scalability of scheduling, memory allocation, virtual memory, etc. Per-core data structures reduce write sharing of cache lines, which in turn reduces interconnect traffic and the cache miss rate due to coherency misses. For example, Linux and Windows use per-core scheduling queues and distributed memory allocators. Corey [BWCC+08] allowed configurable sharing of page tables between cores, and many Linux scaling enhancements (e.g., [BWCM+10]) have been of this form. K42 [ADK+07] adopted reduced sharing as a central design principle, and introduced the abstraction of clustered objects, essentially global proxies for pervasive per-core state. Multikernels like Barrelfish [BBD+09] push this idea to its logical conclusion, sharing no data (other than message channels) between cores. Multikernels are an extreme point in the design space, but are useful for precisely this reason: they highlight the problem of consistent per-core state in modern hardware. As core counts increase, we can expect the percentage of OS state that is distributed in more conventional OSes to increase.

Shutting down a core therefore entails disposing of this state without losing information or violating system-wide consistency invariants. This may impose significant overhead. For example, Chameleon [PS12] devotes considerable effort to ensuring that per-core interrupt handling state is consistent across CPU reconfiguration. As more state becomes distributed, this overhead will increase. Worse, how to dispose of this state depends on what it is: removing a per-core scheduling queue means migrating threads to other cores, whereas removing a per-core memory allocator requires merging its memory pool with another allocator elsewhere. Rather than implementing a succession of piecemeal solutions to this problem, in Barrelfish/DC we adopt the radical approach of lifting all the per-core OS state out of the kernel, so that it can be reclaimed lazily without delaying the rest of the OS. This design provides the means to completely decouple per-core state from both the underlying kernel implementation and the core hardware.


[Figure 2.2: State in the Barrelfish/DC OSnode. The OSnode (§ 2.3.2) is captured by the partitioned capability system (§ 2.3.3): the KCB (§ 2.3.4) roots the scheduler state, the capability derivation tree, the timer offset, and the IRQ state, which in turn reference the frames, CNodes, and PCBs that make up the kernel state.]

We find it helpful to use the term OSnode to denote the total state of an OS kernel local to a particular core. In Linux the OSnode changes with different versions of the kernel; Chameleon identifies this state by manual annotation of the kernel source code. In Barrelfish, the OSnode is all the state – there is no shared global data.

2.3.3 Capabilities in Barrelfish/DC

Barrelfish/DC captures the OSnode using its capability system: all memory and other resources maintained by the core (including interrupts and communication end-points) are represented by capabilities, and thus the OSnode is represented by the capability set of the core. The per-core state of Barrelfish/DC is shown schematically in Figure 2.2.

Barrelfish/DC’s capability system, an extension of that in Barrelfish [SKN13], is derived from the partitioned capability scheme used in seL4 [EDE08, KEH+09, Tea06]. In seL4 (and Barrelfish), all regions of memory are referred to by capabilities, and capabilities are typed to reflect what the memory is used for. For example, a “frame” capability refers to memory that the holder can map into their address space, while a “c-node” capability refers to memory that is used to store the bit representations of capabilities themselves. The security of the system as a whole derives from the fact that only a small, trusted computing base (the kernel) holds both a frame capability and a c-node capability to the same memory, and can therefore fabricate capabilities. A capability for a region can be split into two smaller regions, and also retyped according to a set of system rules that preserve integrity. Initially, memory regions are of type “untyped”, and must be explicitly retyped to “frame”, “c-node”, or some other type. This approach has the useful property that a process can allocate memory without being able to access its contents.
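The allocate-without-access property can be illustrated with a short sketch; the call names below are ours, loosely in the style of partitioned capability systems, and not the actual Barrelfish or seL4 API:

    /* Illustrative retype pattern in a partitioned capability system.
     * All names are hypothetical. */
    typedef unsigned long cap_t;
    enum cap_type { CAP_UNTYPED, CAP_FRAME, CAP_CNODE };

    cap_t request_untyped(unsigned long bytes);          /* from the memory server */
    void  cap_split(cap_t region, cap_t out[2]);         /* two smaller regions */
    cap_t cap_retype(cap_t region, enum cap_type type);  /* checked by system rules */

    void allocate_objects(void)
    {
        cap_t untyped = request_untyped(2 * 4096);
        cap_t half[2];
        cap_split(untyped, half);

        /* A frame the holder may later map into its address space... */
        cap_t frame = cap_retype(half[0], CAP_FRAME);

        /* ...and a c-node whose memory stores capability representations.
         * The process holds this capability but can never map or touch the
         * underlying memory: allocation without access. */
        cap_t cnode = cap_retype(half[1], CAP_CNODE);

        (void)frame; (void)cnode;
    }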

This property is used in seL4 to remove any dynamic memory allocation from the kernel, greatly simplifying both the formal specification of the kernel and its subsequent proof [EDE08]. All kernel objects (such as process control blocks or page tables) are allocated by user-level processes which cannot themselves access them directly. A key insight of Barrelfish/DC is that this approach can externalize the kernel state entirely, as follows.

2.3.4 Kernel Control Blocks

In developing Barrelfish/DC, we examined the Barrelfish kernel to identify all the data structures which were not direct (optimized) derivations of information already held in the capability tree (and which could therefore be reconstructed dynamically from the tree). We then eliminated from this set any state that did not need to persist across a kernel restart. For example, the runnable state and other scheduling parameters of a process [2] are held in the process’ control block, which is part of the capability system. However, the scheduler queues themselves do not need to persist across a change of kernel, since (a) any scheduler will need to recalculate them based on the current time, and (b) the new scheduler may have a completely different policy and associated data structures anyway. What remained was remarkably small; it consists of:

• The minimal scheduling state: the head of a linked list of process control blocks.

• Interrupt state. We discuss interrupts in Section 2.3.8.

• The root of the capability derivation tree, from which all the per-core capabilities can be reached.

• The timer offset, discussed in Section 2.3.7.

[2] Technically, it is a Barrelfish “dispatcher”, the core-local representation of a process. A process usually consists of a set of distinct “dispatchers”, one in each OSnode.

In Barrelfish/DC, we introduce a new memory object, the Kernel Control Block (KCB), and an associated capability type, holding this data in a standard format. The KCB is small: for 64-bit x86 it is about 28 KiB in size, almost all of which is used by communication endpoints for interrupts.
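Schematically, the KCB can be pictured as a small C structure along the following lines. This layout is our illustration only; the actual Barrelfish/DC definition differs, and the interrupt endpoint array accounts for almost all of the roughly 28 KiB:

    /* Illustrative layout of a Kernel Control Block; names and sizes are ours. */
    #include <stdint.h>

    #define MAX_IRQS 256

    struct dcb;                                /* dispatcher (process) control block */
    struct cte;                                /* capability table entry */
    struct endpoint { uint64_t words[12]; };   /* placeholder message binding */

    struct kcb {
        struct dcb     *sched_head;            /* minimal scheduling state: head of
                                                  the list of dispatcher control blocks */
        struct cte     *cap_root;              /* root of the capability derivation tree */
        uint64_t        timer_offset;          /* Section 2.3.7 */
        uint64_t        shutdown_time;         /* local time recorded at kernel halt */
        struct endpoint irq_endpoints[MAX_IRQS]; /* Section 2.3.8; dominates the size */
    };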

2.3.5 Replacing a kernel

The KCB effectively decouples the per-core OS state from the kernel. This allows Barrelfish/DC to shut down a kernel on a core (under the control of the boot driver running on another core) and replace it with a new one. The currently running kernel saves a small amount of persistent state in the KCB, and halts the core. The boot driver then loads a new kernel with an argument supplying the address of the KCB. It then restarts the core (using an IPI on x86 machines), causing the new kernel to boot. This new kernel then initializes any internal data structures it needs from the KCB and the OSnode capability database.

The described technique allows for arbitrary updates of kernel-mode code. By design, the kernel does not access state in the OSnode concurrently; therefore, the OSnode is guaranteed to be in a quiescent state before we shut down a core. The simplest case for updates requires no changes in any data structures reachable by the KCB and can be performed as described, by simply replacing the kernel code. Updates that require a transformation of the data structures may require a one-time adaptation function to execute during initialization, whose overhead depends on the complexity of the function and the size of the OSnode. The worst-case scenario is one that requires additional memory, since the kernel by design delegates dynamic memory allocation to userspace. As we show in Section 2.4, replacing a kernel can be done with little performance impact on processes running on the core, even device drivers.
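Putting the pieces together, the boot-driver side of a kernel replacement might look like the following hedged outline. The names are again hypothetical; prepare_kernel stands for the load-and-relocate steps of the earlier boot sketch:

    /* Hypothetical outline of replacing the kernel under a live OSnode. */
    typedef unsigned int coreid_t;
    struct kcb;

    void *prepare_kernel(const char *path);   /* load + relocate, as in boot_core() */
    void  halt_core(coreid_t target);         /* old kernel saves state to the KCB */
    int   arch_start_core(coreid_t target, void *entry,
                          struct kcb *kcb, const char *args);

    int replace_kernel(coreid_t target, struct kcb *kcb, const char *new_kernel)
    {
        /* Prepare first to minimize downtime: the old kernel keeps running
         * while the new image is loaded and relocated. */
        void *entry = prepare_kernel(new_kernel);
        if (entry == NULL)
            return -1;

        /* The running kernel persists its small persistent state (timer
         * offset, ...) into the KCB and halts the core. */
        halt_core(target);

        /* Restart the core (an IPI on x86) into the new kernel, passing the
         * address of the existing KCB; the new kernel rebuilds its internal
         * structures from the KCB and the OSnode capability database. */
        return arch_start_core(target, entry, kcb, NULL);
    }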


2.3.6 Kernel sharing and core shutdown

As we mentioned above, taking a core completely out of service involves not simply shutting down the kernel, but also disposing of or migrating all the per-core state on the core, and this can take time. Like Chameleon, Barrelfish/DC addresses this problem by deferring it: we immediately take the core down, but keep the OSnode running in order to be able to dismantle it lazily. To facilitate this, we created a new kernel which is capable of multiplexing several KCBs (using a simple extension to the existing scheduler). Performance of two active OSnodes sharing a core is strictly best-effort, and is not intended to be used for any case where application performance matters. Rather, it provides a way for an OSnode to be taken out of service in the background, after the core has been shut down. Note that there is no need for all cores in Barrelfish/DC to run this multiplexing kernel, or, indeed, for any cores to run it when it is not being used: it can simply replace an existing kernel on demand. In practice, we find that there is no performance loss when running a single KCB above a multiplexing kernel.

Decoupling kernel state allows attaching and detaching KCBs from a running kernel. The entry point for kernel code takes a KCB as an argument. When a new kernel is started, a fresh KCB is provided to the kernel code. To restart a kernel, the KCB is detached from the running kernel code, the core is shut down, and the KCB is provided to the newly booted kernel code. We rely on shared physical memory when moving OSnodes between cores. This goes against the original multikernel premise that assumes no shared memory between cores. However, an OSnode is still always in use by strictly one core at a time; therefore, the benefits of avoiding concurrent access to OSnode state remain. The combination of state externalization via the KCB and kernel sharing on a single core has a number of further applications, which we describe in Section 2.3.10.
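The scheduler extension for multiplexing KCBs is conceptually tiny. The following is our hedged sketch (names and the ring structure are illustrative; the 20 ms quantum matches the configuration used later in Section 2.4.2):

    /* Sketch: round-robin time-multiplexing of attached KCBs. */
    struct dcb;
    struct kcb {
        struct kcb *next;          /* attached KCBs form a ring */
        struct dcb *sched_head;    /* per-OSnode scheduling state */
        /* ... */
    };

    #define KCB_QUANTUM_MS 20      /* two scheduler time slices, Section 2.4.2 */

    static struct kcb *current_kcb;

    void program_timer(int ms);
    void dispatch(struct dcb *runqueue);

    void kcb_quantum_expired(void)
    {
        current_kcb = current_kcb->next;     /* switch to the next OSnode */
        program_timer(KCB_QUANTUM_MS);
        dispatch(current_kcb->sched_head);   /* schedule from its runqueue */
    }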

2.3.7 Dealing with time

One of the complicating factors in starting the OSnode with a new kernel is the passage of time. Each kernel maintains a per-core internal clock (based on a free-running timer, such as the local APIC), and expects this to increase monotonically. The clock is used for per-core scheduling and other time-sensitive tasks, and is also available to application threads running on the core via a system call. Unfortunately, the hardware timers used are rarely synchronized between cores. Some hardware platforms (for example, modern PCs) define these timers to run at the same rate on every core (regardless of power management), but they may still be offset from each other. On other hardware platforms, these clocks may simply run at different rates between cores.

In Barrelfish/DC we address this problem with two fields in the KCB. The first holds a constant offset from the local hardware clock; the OS applies this offset whenever the current time value is read. The second field is set to the current local time when the kernel is shut down. When a new kernel starts with an existing KCB, the offset field is reinitialized to the difference between this old time value and the current hardware clock, ensuring that local time for the OSnode proceeds monotonically.
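In code, the two-field scheme reduces to a few lines. This is our condensed sketch, not the Barrelfish/DC source:

    /* Sketch: keeping the per-OSnode clock monotonic across kernel restarts. */
    #include <stdint.h>

    uint64_t read_hw_timer(void);       /* free-running timer, e.g. local APIC */

    static struct {
        uint64_t timer_offset;          /* applied on every clock read */
        uint64_t shutdown_time;         /* local time saved at shutdown */
    } kcb_time;

    uint64_t current_time(void)
    {
        return read_hw_timer() + kcb_time.timer_offset;
    }

    void on_kernel_shutdown(void)
    {
        kcb_time.shutdown_time = current_time();   /* persist into the KCB */
    }

    void on_kernel_start_with_existing_kcb(void)
    {
        /* Local time resumes exactly where it stopped, so it proceeds
         * monotonically even if the new core's timer differs. */
        kcb_time.timer_offset = kcb_time.shutdown_time - read_hw_timer();
    }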

2.3.8 Dealing with interrupts

Interrupts pose an additional challenge when moving an OSnode between cores. It is important that interrupts from hardware devices are always routed to the correct kernel. In Barrelfish, interrupts are mapped to messages delivered to processes running on the target core. Some interrupts (such as those from network cards) should “follow” the OSnode to its new core, whereas others should not. We identify three categories of interrupt.

1. Interrupts which are used exclusively by the kernel, for example a local timer interrupt used to implement preemptive scheduling. Handling these interrupts is internal to the kernel, and their sources are typically per-core hardware devices like APICs or performance counters. In this case, there is no need to take additional actions when reassigning KCBs between cores.

2. Inter-processor interrupts (IPIs), typically used for asynchronous communication between cores. Barrelfish/DC uses an indirection table that maps OSnode identifiers to the physical core running the corresponding kernel (see the sketch after this list). When one kernel sends an IPI to another, it uses this table to obtain the hardware destination address for the interrupt. When detaching a KCB from a core, its entry is updated to indicate that its kernel is unavailable. Similarly, attaching a KCB to a core updates the location to the new core identifier.

3. Device interrupts, which should be forwarded to a specific core (e.g., via IOAPICs and PCIe bridges) running the handler for the device’s driver.
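For the IPI case, the indirection table from category 2 can be sketched as follows (names are ours, not the Barrelfish/DC source):

    /* Sketch of the OSnode-to-core indirection table used when sending IPIs. */
    typedef unsigned int coreid_t;
    typedef unsigned int osnode_id_t;

    #define MAX_OSNODES   64
    #define KCB_DETACHED  ((coreid_t)-1)

    static coreid_t osnode_location[MAX_OSNODES];

    void arch_send_ipi(coreid_t core, int vector);

    void send_ipi_to_osnode(osnode_id_t dst, int vector)
    {
        coreid_t core = osnode_location[dst];  /* resolved at send time */
        if (core == KCB_DETACHED)
            return;                            /* kernel currently unavailable */
        arch_send_ipi(core, vector);
    }

    /* Updated whenever a KCB is detached from, or attached to, a core. */
    void osnode_detached(osnode_id_t id)             { osnode_location[id] = KCB_DETACHED; }
    void osnode_attached(osnode_id_t id, coreid_t c) { osnode_location[id] = c; }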

When Barrelfish/DC device drivers start up, they request forwarding of device interrupts by providing two capability arguments to their local kernel: an opaque interrupt descriptor (which conveys authorization to receive the interrupt) and a message binding. The interrupt descriptor contains all the architecture-specific information about the interrupt source needed to route the interrupt to the right core. The kernel associates the message binding with the architectural interrupt and subsequently forwards interrupts to the message channel.

For the device and the driver to continue normal operation after a move, the interrupt needs to be re-routed to the new core, and a new mapping set up for the (existing) driver process. This could be done either transparently by the kernel or explicitly by the device driver. We choose the latter approach to simplify the kernel. When a Barrelfish/DC kernel shuts down, it disables all interrupts. When a new kernel subsequently resumes an OSnode, it sends a message (via a scheduler upcall) to every process which had an interrupt registered. Each driver process responds to this message by re-registering its interrupt, and then checking with the device directly to see if any events have been missed in the meantime (ensuring any race condition is benign). In Section 2.4.2.1 we show the overhead of this process.
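From a driver’s perspective, the resume protocol is small. A hedged sketch with hypothetical names:

    /* Hypothetical driver-side handling of the re-registration upcall a new
     * kernel issues when it resumes an OSnode. */
    typedef unsigned long cap_t;

    extern cap_t irq_descriptor;     /* authorization to receive the interrupt */
    extern cap_t message_binding;    /* channel the kernel forwards IRQs to */

    void irq_register(cap_t desc, cap_t binding);   /* kernel call */
    int  device_has_pending_events(void);           /* read device registers */
    void handle_device_events(void);

    void on_reregister_upcall(void)
    {
        /* Re-install interrupt forwarding on the new core. */
        irq_register(irq_descriptor, message_binding);

        /* The device may have raised events while the OSnode was offline;
         * poll it directly so any lost interrupt stays a benign race. */
        if (device_has_pending_events())
            handle_device_events();
    }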

2.3.9 Application support

From the perspective of applications which are oblivious to the allocation of physical cores (and which deal solely with threads), the additional functionality of Barrelfish/DC is completely transparent. However, many applications such as language runtimes and database systems deal directly with physical cores, and tailor their scheduling of user-level threads accordingly. For these applications, Barrelfish/DC can use the existing scheduler activation [ABLL91] mechanism for process dispatch in Barrelfish to notify userspace of changes in the number of online processors, much as it can already convey the allocation of physical cores to applications.


2.3.10 Discussion

From a broad perspective, the combination of boot drivers and replaceable kernels is a radically different view of how an OS should manage processors on a machine. Modern general-purpose kernels such as Linux try to support a broad set of requirements by implementing different behaviors based on build-time and run-time configuration. Barrelfish/DC offers an alternative: instead of building complicated kernels that try to do many things, build simple kernels that do one thing well. While Linux selects a single kernel at boot time for all cores, Barrelfish/DC not only allows selecting per-core kernels, but also changing this selection on-the-fly.

When replacing kernels, Barrelfish/DC assumes that the OSnode format (in particular, the capability system) remains unchanged. If the in-memory format of the capability database changes, then the new kernel must perform a one-time format conversion when it boots. It is unclear how much of a limitation this is in practice, since the capability system of Barrelfish has changed relatively little since its inception, but one way to mitigate the burden of writing such a conversion function is to exploit the fact that the format is already specified in a domain-specific, high-level language called Hamlet [DBR09], and to derive the conversion function automatically. By design, the Barrelfish kernel does not support dynamic memory allocation; therefore, a transformation that requires more memory to represent the new OSnode format would be difficult to support. The kernel would have to rely on the applications to allocate memory on its behalf during the conversion.

The ability to exchange the entire kernel code quickly is of course limited by the fact that the applications already running as part of an OSnode expect the same ABI as before the update. The ABI includes the existing capability types, among other things. However, it is fairly trivial to add new capability types for future applications, or to fix bugs in the existing implementation. Ultimately, Barrelfish/DC could be coupled with user-space update mechanisms such as Proteos [GKT13] to make the system more versatile, for example to also apply updates to the monitor service.

Barrelfish/DC currently assumes cache-coherent cores, where the OS state (i.e., the OSnode) can easily be migrated between cores by passing physical addresses. The lack of cache-coherency can be handled with cache flushes during OSnode migrations, but on hardware platforms without shared memory, or with different physical address spaces on different cores, the OSnode might require considerable transformation to migrate between cores. The Barrelfish/DC capability system does contain all the information necessary to correctly swizzle pointers when copying the OSnode between nodes, but the copy or serialization is likely to be expensive, and dealing with shared-memory application state (which Barrelfish fully supports outside the OS) is a significant challenge. Similar problems arise when the OSnode is migrated between heterogeneous processors (e.g., with different endianness). The current OSnode format is underspecified in this regard. However, the OSnode data structures, including many accessors, are already generated C code based on the Hamlet specification, which means such constraints can in principle be handled by expanding the DSL compiler. There is no requirement for the boot driver to share memory with its target core, as long as it has a mechanism for loading a kernel binary into the latter's address space (e.g., via DMA) and controlling the core itself.

In case an OSnode is to be disposed of entirely, the system needs to dismantle it. This means revoking all resources belonging to the OSnode, which also implies stopping the execution of all dispatchers. Barrelfish/DC relies on an up-call mechanism to notify applications about an impending OSnode disappearance. The application can then take the necessary steps to migrate away from the affected OSnode and inform its counterparts on other cores. In practice, this can bring significant complexity for certain applications, but can also be a no-op depending on the software design. It is possible to adopt solutions similar to process migration in distributed systems or virtual machines, a problem that has received a lot of attention in the past [MDP+00]. In addition, such an event should be rare in Barrelfish/DC, as an OSnode can be parked temporarily with minimal overhead.

While Barrelfish/DC applications are notified when the core set they are running on changes (via the scheduler activations mechanism), they are currently insulated from knowledge about hardware core reconfigurations. However, there is no reason why this must always be the case. There may be applications (such as databases, or language runtimes) which can benefit from being notified about such changes to the running system, and we see no reason to hide this information from applications which can exploit it.

2.4 Evaluation

We present here a performance evaluation of Barrelfish/DC. First (Section 2.4.1), we measure the performance of starting and stopping cores in Barrelfish/DC and in Linux.


Name             Memory    Processors              Freq.
2×2 Santa-Rosa   8 GiB     2x2c Opteron 2200       2.8 GHz
4×4 Shanghai     16 GiB    4x4c Opteron 8380       2.5 GHz
2×10 IvyBridge   256 GiB   2x10c Xeon E5-2670 v2   2.5 GHz
1×4 Haswell      32 GiB    1x4c Xeon E3-1245 v3    3.4 GHz

Table 2.1: Architectural details of different systems we use in our evaluation.

Second (Section 2.4.2), we investigate the behavior of applications when we restart kernels and when we park OSnodes. We perform experiments on the set of x86 machines shown in Table 2.1. Hyperthreading, TurboBoost, and SpeedStep technologies are disabled on machines that support them, as they complicate cycle counter measurements. TurboBoost and SpeedStep can change the processor frequency in unpredictable ways, leading to high fluctuation in repeated experiments. The same is true for Hyperthreading, due to the sharing of hardware logic between logical cores. However, TurboBoost and Hyperthreading are both relevant for this work as discussed in Section 2.

2.4.1 Core management operations

In this section, we evaluate the performance of managing cores in Barrelfish/DC, and also in Linux using the CPU Hotplug facility [Ash]. We consider two operations: shutting down a core (down) and bringing it back up again (up). Bringing up a core in Linux is different from bringing up a core in Barrelfish/DC. In Barrelfish/DC, each core executes a different kernel which needs to be loaded by the boot driver, while in Linux all cores share the same code. Furthermore, because cores share state in Linux, core management operations require global synchronization, resulting in stopping application execution on all cores for an extended period of time [GMG12]. Stopping cores is also different between Linux and Barrelfish/DC. In Linux, applications executing on the halting core need to be migrated to other online cores before the shutdown can proceed, while in Barrelfish/DC we typically would move a complete OSnode after the shutdown and not individual applications. In Barrelfish/DC, the down time is the time it takes the boot driver to send an appropriate IPI to the core to be halted, plus the propagation time of the IPI and the cost of the IPI handler on the receiving core. For the up operation we take two measurements: the boot driver cost to prepare a new kernel up until (and including) the point where it sends an IPI to the starting core (driver), and the cost in the booted core from the point it wakes up until the kernel is fully online (core).

Barrelfish/DC:
                      idle                           load
                  down         up (ms)           down         up (ms)
                  (µs)      driver   core        (µs)       driver    core
2×2 Santa-Rosa   2.7 / —[a]    29     1.2      2.7 / —      34 ± 17    1.2
4×4 Shanghai     2.3 / 2.6     24     1.0      2.3 / 2.7    46 ± 76    1.0
2×10 IvyBridge   3.5 / 3.7     10     0.8      3.6 / 3.7    23 ± 52    0.8
1×4 Haswell      0.8 / —[a]     7     0.5      0.8 / —       7 ± 0.1   0.5

Linux:
                      idle                      load
                  down (ms)   up (ms)       down (ms)      up (ms)
2×2 Santa-Rosa   131 ± 25     20 ± 1       5049 ± 2052     26 ± 5
4×4 Shanghai     104 ± 50     18 ± 3       3268 ± 980      18 ± 3
2×10 IvyBridge    62 ± 46     21 ± 7       2265 ± 1656     23 ± 5
1×4 Haswell       46 ± 40     14 ± 1       2543 ± 1710     20 ± 5

The same results in cycles (down columns ×10³; all other columns ×10⁶):

Barrelfish/DC:
2×2 Santa-Rosa   8 / —         85     3.4      8 / —        97 ± 49    3.5
4×4 Shanghai     6 / 6         63     2.6      6 / 7       115 ± 192   2.6
2×10 IvyBridge   9 / 10        27     2.1      9 / 10       59 ± 133   2.1
1×4 Haswell      3 / —         26     1.9      2.9 / —      26 ± 0.40  2.0

Linux:
2×2 Santa-Rosa   367 ± 41      56 ± 2.0    14139 ± 5700    74 ± 21
4×4 Shanghai     261 ± 127     44 ± 2.0     8170 ± 2452    46 ± 8
2×10 IvyBridge   155 ± 116     53 ± 2.0     5663 ± 4141    57 ± 12
1×4 Haswell      156 ± 137     50 ± 0.5     8647 ± 5816    69 ± 16

Table 2.2: Performance of core management operations for Barrelfish/DC and Linux (3.13) when the system is idle and when the system is under load. For the Barrelfish/DC down columns, the value after the slash shows the cost of stopping a core on another socket with regard to the boot driver. [a] We do not include this number for Santa-Rosa because it lacks synchronized timestamp counters, nor for Haswell because it has only a single package.

In Linux, we measure the latency of starting or stopping a core using the log entries of the smpboot module and a sentinel line echoed to /dev/kmsg. For core shutdown, smpboot reports when the core becomes offline, and we insert the sentinel right before the operation is initiated. For core boot, smpboot reports when the operation starts, so we insert the sentinel line right after the operation completes. For both Barrelfish/DC and Linux we consider two cases: an idle system (idle), and a system with all cores under load (load). In Linux, we use the stress tool [str] to spawn a number of workers equal to the number of cores that continuously execute the sync system call. In Barrelfish/DC, since the file-system is implemented as a user-space service, we spawn an application that continuously performs memory management system calls on each core of the system.
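For reference, a sentinel of this kind can be injected with a few lines of C (a sketch; in practice a shell echo to /dev/kmsg works equally well, the tag string is arbitrary, and the program must have permission to write /dev/kmsg):

    /* Write a sentinel into the kernel log so its timestamp can be
       correlated with the messages emitted by smpboot. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *msg = "hotplug-bench: sentinel\n";   /* arbitrary tag */
        int fd = open("/dev/kmsg", O_WRONLY);
        if (fd < 0)
            return 1;
        write(fd, msg, strlen(msg));
        close(fd);
        return 0;
    }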


Table 2.2 summarizes our results. We show both time (msecs and µsecs) and cycle counter units for convenience. All results are obtained by repeating the experiment 20 times and calculating the mean value. We include the standard deviation where it is non-negligible.

Stopping cores: The cost of stopping cores in Barrelfish/DC ranges from 0.8 µs (Haswell) to 3.5 µs (IvyBridge). Barrelfish/DC does not share state across cores, and as a result no synchronization between cores is needed to shut one down. Furthermore, Barrelfish/DC's shutdown operation consists of sending an IPI, which will cause the core to stop after a minimal operation on the KCB (saving the timer offset). In fact, the cost of stopping a core in Barrelfish/DC is small enough to observe the increased cost of sending an IPI across sockets, leading to an increase of 5% in stopping time on IvyBridge and 11% on Shanghai. These numbers are shown in Table 2.2, in the Barrelfish/DC down columns after the slash. As these measurements rely on timestamp counters being synchronized across packages, we are unable to present the cost increase of a cross-socket IPI on the Santa-Rosa machine, whose timestamp counters are only synchronized within a single package. In stark contrast, the cost of shutting down a core in Linux ranges from 46 ms to 131 ms. More importantly, the shutdown cost in Linux explodes when applying load, while it generally remains the same for Barrelfish/DC. For example, the average time to power down a core in Linux on Haswell increases by a factor of 55 when we apply load.

Starting cores: For Barrelfish/DC, the setup cost in the boot driver (driver) dominates the cost of starting a core (core). Figure 2.3 shows a breakdown of the costs for bringing up a core on all micro-architectures. Starting core corresponds to the core column in Table 2.2, while the rest corresponds to operations performed by the boot driver: loading the image from storage, allocating memory, ELF loading and relocation, etc. Loading the kernel from the file system is the most expensive operation. If multiple cores are booted with the same kernel, this image can be cached, significantly improving the time to start a core, as shown in the second bar in Figure 2.3. We note that the same costs will dominate the restart operation, since shutting down a core has negligible cost compared to bringing it up. Downtime can be minimized by first doing the necessary preparations in the boot driver and only then halting and restarting the core. Even though Barrelfish/DC has to prepare the kernel image, when idle, the cost of bringing up a core for Barrelfish/DC is similar to the Linux cost (Barrelfish/DC is faster on our Intel machines, while the opposite is true for our AMD machines). Bringing a core up can take from 7 ms (Barrelfish/DC/Haswell) to 29 ms (Barrelfish/DC/Santa-Rosa).


[Figure 2.3: Breakdown of the cost of bringing up a core for various machines: (a) 2×2 Santa-Rosa, (b) 4×4 Shanghai, (c) 2×10 IvyBridge, (d) 1×4 Haswell. For each machine, bars show the time (ms) with and without fetching the kernel image, broken down into VFS load, malloc (CPU driver), malloc (monitor), ELF loading, ELF relocation, sending IPIs, and starting the core.]


Load affects the cost of booting up a core to varying degrees. In Linux, such an effect is not observed on the Shanghai machine, while on the Haswell machine load increases the average start time by 33%. The effect of load when starting cores is generally stronger in Barrelfish/DC (e.g., on IvyBridge the cost is more than doubled) because the boot driver time-shares its core with the load generator.

Overall, Barrelfish/DC has minimal overhead stopping cores. For starting cores, results vary significantly across different machines, but the cost of bringing up cores in Barrelfish/DC is comparable to the respective Linux cost.

2.4.2 Applications

In this section, we evaluate the behavior of real applications under two core management operations: restarting, where we update the core's kernel as the application runs, and parking. For parking, we run the application on a core with a normal kernel and then move its OSnode onto a multi-KCB kernel running on a different core. While the application is parked, it shares that core with another OSnode. We use a naive multi-KCB kernel that runs each KCB for 20 ms, which is twice the scheduler time slice. Finally, we move the application back to its original core. In all experiments, the application starts by running alone on its core. All experiments were executed on the Haswell machine from Table 2.1.

2.4.2.1 Ethernet driver

Our first application is a Barrelfish NIC driver for the Intel 82574 chipset, which we modify for Barrelfish/DC to re-register its interrupts when instructed by the kernel (see Section 2.3.8). During the experiment we use ping from a client machine to send ICMP echo requests to the NIC. We run ping as root with the -A switch, where the inter-packet intervals adapt to the round-trip time. The ping manual states: “on networks with low rtt this mode is essentially equivalent to flood mode.”

Figure 2.4a shows the effect of the restart operation on the round-trip time latency experienced by the client. Initially, the ping latency is 0.042 ms on average with small variation. Restarting the kernel produces two outliers (packets 2307 and 2308, with an RTT of 11.1 ms and 1.07 ms, respectively). Note that 6.9 ms is the measured latency to bring up a core on the respective Haswell machine (Table 2.2).

We present latency results for the parking experiment in a timeline (Figure 2.4b) and in a cumulative distribution function (CDF) graph (Figure 2.4c). Measurements taken when the driver’s OSnode runs exclusively on a core are denoted Exclusive, while measurements where the OSnode shares the core are denoted Shared. When parking begins, we observe an initial latency spike (from 0.042 ms to 73.4 ms). The spike is caused by the parking operation, which involves sending a KCB capability reference from the boot driver to the multi-KCB kernel as a message. [3] After the initial coordination, outliers are only caused by KCB time-sharing (maximum: 20 ms, mean: 5.57 ms). After unparking the driver, latency returns to its initial levels. Unparking does not cause the same spike as parking because we do not use messages: we halt the multi-KCB kernel and directly pass the KCB reference to a newly booted kernel.

2.4.2.2 Web server

In this experiment we examine how a web server [4] that serves files over the network behaves when its core is restarted and when its OSnode is parked. We initiate a transfer on a client machine in the server’s LAN using wget, and plot the achieved bandwidth for each 50 KiB chunk when fetching a 1 GiB file. Figure 2.5a shows the results for the kernel restart experiment. The effect in this case is negligible on the client side: we were unable to pinpoint the exact location of the update from the data measured on the client, and the actual download times during kernel updates were indistinguishable from a normal download. As expected, parking leads to a number of outliers caused by KCB time-sharing (Figures 2.5b and 2.5c). The average bandwidth before parking is 113 MiB/s with a standard deviation of 9 MiB/s, whereas during parking the average bandwidth is slightly lower at 111 MiB/s with a higher standard deviation of 19 MiB/s.

2.4.2.3 PostgreSQL

Next, we run a PostgreSQL [Wikc] database server on Barrelfish/DC, using TPC-H [Tra] data with a scaling factor of 0.01, stored in an in-memory file-system. We measure the latency of a repeated CPU-bound query (query 9 in TPC-H) issued by a client over a LAN. PostgreSQL was pinned to a single core during these experiments.

[3] We follow the Barrelfish approach, where kernel messages are handled by the monitor, a trusted OS component that runs in user-space.
[4] The Barrelfish native web server.


[Figure 2.4: Ethernet driver behavior when restarting kernels and parking OSnodes: (a) RTT per packet across a kernel restart, (b) RTT timeline while parking (Exclusive vs. Shared), (c) CDF of RTTs while parking.]


[Figure 2.5: Webserver behavior when restarting kernels and parking OSnodes: (a) bandwidth per 50 KiB chunk across a kernel restart, (b) bandwidth timeline while parking (Exclusive vs. Shared), (c) CDF of bandwidth while parking.]


Figure 2.6a shows how a restart affects client latency. Before rebooting, the average query latency is 36 ms. When a restart is performed, the first affected query has a latency of 51 ms. After a few perturbed queries, latency returns to its initial value. Figures 2.6b and 2.6c show the effect of parking the OSnode that contains the PostgreSQL server. As before, during normal operation the average latency is 36 ms. When the kernel is parked, we observe two sets of outliers: one (with more points) with a latency of about 76 ms, and one with a latency close to 56 ms. This happens because, depending on their timing, some queries wait for two KCB time slices (20 ms each) of the co-hosted kernel, while others wait only for one.

Overall, we argue that kernel restart incurs acceptable overhead for online use. Parking, as expected, causes a performance degradation, especially for latency-critical applications. This is, however, inherent in any form of resource time-sharing. Furthermore, with improved KCB-scheduling algorithms the performance degradation can be reduced or tuned (e.g., via KCB priorities).

2.5 Concluding remarks

Barrelfish/DC presents a radically different vision of how cores are exploited by an OS and the applications running above it, and implements this vision in a viable software stack: the notion that OS state, kernel code, and execution units should be decoupled and freely interchangeable. Barrelfish/DC is an OS whose design assumes that all cores are dynamic. As hardware becomes more dynamic, and scalability concerns increase the need to partition or replicate state across cores, system software will have to change its assumptions about the underlying platform and adapt to a new world with constantly shifting hardware. Barrelfish/DC offers one approach to meeting this challenge.


[Figure 2.6: PostgreSQL behavior when restarting kernels and parking OSnodes: (a) query latency across a kernel restart, (b) query latency timeline while parking (Exclusive vs. Shared), (c) CDF of query latencies while parking.]

3 A Framework for an Adaptive OS Architecture

This chapter presents Badis, an OS architecture which dynamically specializes the system software on a select subset of CPUs to run different flavors of operating systems. We will see how Badis gives users the means to have different lightweight OS stacks (including kernel, library OS and runtime) customized for particular classes of applications and workload requirements in a multi-kernel OS. The key problem that Badis addresses is simply that one size does not fit all when choosing appropriate OS resource management policies and mechanisms for a mix of diverse workloads and their often conflicting requirements. In this chapter, we will show how Badis can customize OS interfaces, mechanisms and policies to provide a more tailored service for given workload classes. While the key ideas have broad applicability, this chapter explores the ability of Badis to provide better application support through OS customization for two use-cases:

• Although rarely published, commercial database engines like Oracle and SQL Server have for decades either developed new or modified existing OS stacks with custom policies and mechanisms. The demand for specialization is becoming even more prevalent with the need to execute diverse workloads (graph processing, R, machine learning, etc.) on top of traditional query processing engines. Data appliances like SAP Hana [FCP+12] and BigDAWG [EDS+15], which concurrently serve a diverse set of workloads, must efficiently schedule heterogeneous workloads on modern multi-core machines which themselves have become increasingly complex [GARH14, LBKN14, PLTA14, LLF+16]. We show challenges in such systems in Section 3.3. Then, in Section 3.5, we use Badis to design and implement Basslet, a kernel-based runtime for parallel data processing that addresses the presented limitations.

• Despite years of development to provide real-time support in Linux and other general-purpose kernels [Cor], many users resort to specialized real-time OSes, or modified versions of general-purpose OSes [The]. As a second use-case for Badis, we demonstrate support for hard real-time applications in Section 3.6 by fully provisioning an entire core for a given program with a custom, real-time kernel.

Badis builds on the techniques of Barrelfish/DC presented in Chapter 2 and exploits its ability to dynamically replace the kernel on a core within hundreds of microseconds. This enables dynamic adaptation of the cores and the OS to changing workload requirements.

3.1 Motivation

3.1.1 Use-case: Coordinated parallel data processing

Database workloads are becoming more complex and heterogeneous. One trend is to consolidate OLTP and OLAP into a single engine [FCP+12, GAK12, KN11, EDS+15]. The results to date show that, in such systems, load interaction causes a significant performance loss [PWM+14]. This class of mixed workload data processing applications is one key motivation for Badis and where its design can give immediate benefits. Substantial work is done by researchers towards parallelization of key database op- erators [RSBJ14, KKL+09, BATO13, YRV11]. It is challenging to schedule such par- allel data-processing algorithms on modern machines due to a variety of effects ob- served in multicore: NUMA [LPM+13, LQF15], hardware islands [PLTA14], data movement across parallel operators [LBKN14, LDC+09], and spatio/temporal schedul- ing of complex query plans [GARH14]. Techniques to reduce interference typically rely on the adaption of runtimes [TMS11] or OS [ZBF10, LCX+12] or use hardware support [TLW+09, BYD+15].


Figure 3.1 illustrates how a naïve, default execution of a simple, concurrent workload on Linux can result in poor scaling, both for the clients (slower runtime) and for the data processing engine (lower throughput). The experiment measures the overall throughput obtained when concurrent clients each submit a sequence of pagerank (PR) OpenMP jobs over the LiveJournal dataset to a Linux server. The machine (4×12 Magny-Cours) has 8 NUMA nodes with 6 cores each. We add one socket (6 cores) with every additional client and, as is default practice, let OpenMP choose the level of parallelism for each job. Ideally, the throughput should increase linearly with the number of clients (the “Ideal” line). Instead, the throughput rises only slowly because the response time for each job increases (the “Linux/OpenMP” line), even though in principle each client has the same resources available as a single client (1 NUMA node, 6 cores).

Many factors contribute to this poor scaling: poor co-location of threads within a single parallel job which need to communicate, migration of threads by the OS, inappropriate selection of the degree of parallelism by the OpenMP runtime, cache pollution due to context switching on a single node between threads of different clients, straggler threads, and memory contention due to poor NUMA placement of data relative to threads. For instance, in the experiment, the OpenMP runtime assumes by default that it can use all the cores available (or the programmer sets this as a parameter). In both cases, this is done without knowing what else is happening in the machine and what resources are needed by the concurrent tasks. What looks like an optimal choice for a single job may turn out to be a sub-optimal choice for both that job and everybody else’s. From the experiment, we can see that even if over-subscription is avoided by limiting each client to six parallel threads (the “Fix parallelism” line), the achieved throughput still lags behind due to sub-optimal thread placement by the default Linux scheduler.

Similar problems exist with other mechanisms for performance isolation, like Linux containers. They rely on the system administrator or a higher-level workload manager to assign specific resources (e.g., core-pinning, disk, and network I/O provisioning), which requires a system-wide view. The OS could arbitrate in such situations, but modern conventional operating systems, like Linux, optimize for load balancing across cores [LLF+16] and do not account for how that affects the performance of the applications running atop.
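For illustration, the “Fix parallelism” configuration amounts to capping each client’s job at one NUMA node’s worth of threads, along the following lines. Here pagerank_step is a hypothetical stand-in for the real per-vertex kernel:

    /* Sketch: pinning a job's degree of parallelism to six threads (one NUMA
       node on the 4x12 Magny-Cours) instead of letting OpenMP claim all
       cores. This avoids over-subscription but, as the experiment shows,
       does not fix thread placement. */
    #include <omp.h>

    void pagerank_step(int v);   /* hypothetical per-vertex update */

    void run_pagerank(int iterations, int nvertices)
    {
        omp_set_num_threads(6);              /* cap the parallelism */
        for (int it = 0; it < iterations; it++) {
            #pragma omp parallel for schedule(static)
            for (int v = 0; v < nvertices; v++)
                pagerank_step(v);
        }
    }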


[Figure 3.1: System throughput for executing concurrent pagerank jobs: throughput (PR/min) versus number of clients for the Ideal, Basslet, Fix parallelism, and Linux/OpenMP configurations.]

3.1.2 Use-case: Eliminating OS noise

A critical issue in operating systems is guaranteeing consistent completion times for a given task. Any variability in the completion time of a task is referred to as jitter. The elimination of such jitter is critical for many different workloads. We discuss two of them in more detail:

• Workloads subject to real-time constraints must guarantee a response or reaction to an event within a specified time. Such applications are often critical in environments for controlling or monitoring physical equipment. In such scenarios, it is considered a failure if the processing is not completed within an agreed-upon deadline, regardless of system load.

• High performance computing (HPC) workloads run at very large scale, often involving hundreds to thousands of machines. In such systems, jitter can lead to severe performance penalties: delaying an individual task on a machine for much longer than the average completion time can unnecessarily delay the whole computation from proceeding. The effects and causes have been well studied in the HPC community [HSL10].

[Figure 3.2: A selfish detour benchmark to measure OS noise, plotting outlier duration (ns) against sample index (on the order of 10¹¹ samples). Each sample represents an outlier that was running 9x slower than expected.]

There are several sources in a system that cause jitter: the operating system, the hardware, or interactions with other applications. In this use-case, we focus specifically on the elimination of operating system noise, an important contributor to jitter in the workloads mentioned above. As an example of noise, we present a recording of a selfish detour benchmark performed with the netgauge [HMLR07] tool on a Linux machine (1×4 Haswell). Netgauge is designed to quantify operating system noise for HPC scenarios. In this benchmark, a tight loop is executed, and we measure the time for each iteration. If an iteration takes longer than the minimum iteration time times a particular threshold, then the timestamp (detour) is recorded. The graph in Figure 3.2 shows all outliers that took nine times longer to complete than the minimal recorded iteration of the loop. We can see that there is a non-negligible number of outliers due to noise. The OS causes these delays by interrupting the execution of the benchmark to perform some other work.
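The core of such a benchmark fits in a few lines. The following is our condensed sketch of a selfish-detour-style probe; netgauge’s actual implementation is more careful about calibration and recording:

    /* Condensed sketch of a selfish-detour loop: flag iterations that take
       much longer than the fastest observed iteration. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                     /* __rdtsc() */

    #define SAMPLES   (1ull << 24)
    #define THRESHOLD 9                        /* "9x slower than expected" */

    int main(void)
    {
        uint64_t min = UINT64_MAX;
        uint64_t prev = __rdtsc();
        for (uint64_t i = 0; i < SAMPLES; i++) {
            uint64_t now = __rdtsc();
            uint64_t delta = now - prev;
            prev = now;
            if (delta < min)
                min = delta;                   /* track the fastest iteration */
            else if (delta >= THRESHOLD * min) /* a detour: the OS intervened */
                printf("detour at %" PRIu64 ": %" PRIu64 " cycles\n", i, delta);
        }
        return 0;
    }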

Modern general-purpose kernels such as Linux try to support a broad set of requirements by implementing different behaviors based on build-time and run-time configuration. However, satisfying the stringent requirements of real-time or HPC systems remains difficult. For OS noise specifically, the issue has been well studied on Linux, and several operations that require global coordination (read-copy-update (RCU), TLB shootdowns, etc.) or interrupts have been identified as causes of noise. To date, it remains a challenge to completely eliminate noise in general-purpose operating systems. Thus, despite years of development of real-time support features in Linux and other general-purpose kernels [Cor], many users resort to specialized real-time OSes or modified versions of general-purpose OSes [The]. Similarly, super-computing systems initially proposed the idea of using lightweight kernels [RMG+15, GGIW10, KB05] to reduce noise and customize the OS for their needs.

3.2 Related work

3.2.1 Operating System customization

Although the motivation for kernel updates as discussed in Section 2.2.2 is usually to fix bugs with little downtime, quite similar techniques have often been proposed to specialize the system at runtime to better meet the requirements of running applications and workloads. VINO [ESG+94] and SPIN [BSP+95] let applications override the default policies of an operating system by injecting custom code into the kernel at runtime. While SPIN relied on the use of a safe language to guarantee that the injected code does not compromise the OS, VINO used techniques to isolate C and C++ code in a sandbox, with additional support from the compiler. The hot-swap and interposition mechanisms in K42 [SAH+03] can replace a component fully, or add additional components in front of an existing component by rerouting method invocations. The authors show how several applications improve their performance by using customized file-system code or page replacement algorithms.

An extreme point in this design space is the Unikernel [MMR+13], which was motivated by the rise of cloud computing. The ability to easily rent an entire fleet of servers and deploy custom operating systems on them resulted in the idea to eschew the protection and isolation features known from a traditional operating system and instead run only a single application per machine instance. A Unikernel compiles and links a single application directly with a minimal set of required operating system services. The resulting binary is then deployed either on a virtual or a physical machine. Application and OS code then execute in a single address space. Aside from specializing the OS to the application, the Unikernel has additional benefits, allowing link-time optimizations and significantly reducing the trusted computing base. A challenge with Unikernels is the significant domain expertise and low-level knowledge which is shifted to the application developer when deploying applications as part of a Unikernel. EbbRT [SCD+16] addresses this by splitting OS services into reusable components that applications rely on. The glue that holds application and OS services together consists of a light-weight, event-driven runtime.

In Badis, the unit that is exchanged for specialization purposes is always a complete per-core kernel. This is different from K42, SPIN, or VINO, which focused on more fine-grained modifications within an existing kernel. The presented mechanism would in principle also allow the user to run a (trusted) Unikernel on a specific core. However, while a Unikernel typically focuses on a single application, Badis aims to provide better-tailored services for running multiple applications together on a single system. As an example, we will see an application of Badis providing a task-based execution service on a subset of the machine.

3.2.2 High-performance computing

Super-computing systems initially proposed the idea of using customized lightweight kernels (LWKs) [RMG+15]. Generally, high-performance computing (HPC) systems are susceptible to OS “noise”. Such noise, aggravated by the scale at which these systems run, results in severe performance problems [HSL10], which warrant the case for a specialized OS to run their workloads. A common characteristic of HPC LWKs is that they are in charge of reserving resources which are then passed directly to the applications. Full-weight kernels (FWKs), on the other hand, tend to maintain ownership over resources and instead expose a high-level interface to abstract them.

Over time, several lightweight kernel implementations were designed and used as part of HPC systems. Catamount [KB05] tries to reduce OS overheads by giving control over memory, CPU, or network resources to applications. The system architecture divides a machine into a few service processors and multiple compute processors. The service processors run a regular Linux OS, whereas the compute processors use a minimal kernel. A runtime system is responsible for coordinating the distribution of jobs on many compute nodes, even across machines. Jobs executed on the compute nodes can communicate with the service nodes using an RPC library. CNK [GGIW10] aims to be lightweight while still providing a Linux-like operating environment for HPC applications. Compared to Catamount, it includes support for large portions of GNU/Linux code (e.g., the GNU C library and the NPTL threading library) in user-space. The kernel itself has support for memory management and features a simple, non-preemptive scheduler that supports a fixed number of threads per core. The CNK kernel does not support I/O and offloads these tasks to other nodes running Linux. For many HPC applications, CNK leads to a reduction of OS noise, fewer L1 cache misses, and fewer TLB misses compared to running with Linux. A different approach, followed by ZeptoOS [BIYC06], is to take an existing Linux image and make modifications to the OS (reserving memory in advance for applications, removing OS daemons and services as much as possible, etc.) to reduce any OS overhead.

3.2.3 Scheduling parallel workloads

Many systems have addressed the challenges of scheduling parallel applications on modern multicore systems. Tessellation [LKB+09] used space-time partitioning to factor the machine’s resources into separate units and virtualize them for user-level runtimes. In contrast to Badis, Tessellation does not specialize the OS kernels for either task- or thread-based workloads. Instead, it relies on user-level schedulers [CEH+13] for fine-grained thread and memory management. Such two-level scheduling has also been adopted in other operating systems [MSLM91, ABLL91, WA09]. The HPC community is actively exploring scheduling techniques such as gang-scheduling [Ous82] and flexible co-scheduling [FFPF05]. The scheduling policy for Basslet is to run tasks in a coordinated fashion with temporal locality. In contrast to gang-scheduling, such a system does not require complex scheduling logic across the whole machine [Pet12]. Fos [WA09] proposed a cooperative scheduling model for OS services, motivated by the expectation that the number of cores will soon be as large as the number of parallel tasks in a system.

Programming models and runtimes as provided by Cilk [FLR98], X10 [CGS+05], or OpenMP [Ope15] offer a convenient way to express parallel algorithms which is simple both for the programmer to use and for the runtime to parallelize. However, they typically have little knowledge about the overall system state (for example, libgomp in dynamic mode uses the average load over the last 15 minutes to estimate the amount of parallelism [Fre]) and contend for resources in the presence of multiple parallel runtimes. Several approaches have been proposed to deal with this issue. Lithe [PHA09] addresses the problem of composability of such parallel runtimes within a single application by building a hierarchical assignment of “harts” (1:1 abstractions of physical hardware threads) in order to avoid over-subscription. Windows uses fibers [Micc] as a light-weight alternative to threads for co-operative multitasking; in comparison to Badis, fibers are multiplexed on a set of OS threads. Callisto [HMM14] coordinates multiple runtimes by injecting a shared library into the existing runtime to coordinate their execution within a single machine and avoid resource contention. It relies, however, on the cooperation of all parallel applications with each other. Rossbach et al. [RCS+11] introduce a PTask abstraction for describing parallel work units on GPUs, and database systems use task-based or morsel-driven parallelism [LBKN14, PSM+15] in their implementations.

3.2.4 OS abstractions for parallel execution

Operating systems may provide support beyond threads for executing parallel programs on NUMA machines. Various features to query the topology of the machine or isolate applications exist. However, existing APIs like the NUMA policy library (libnuma) are often not detailed enough or too general to be useful [CGHT17, KAH+16]. Solaris [Ora09] introduced locality groups to describe NUMA machines and to set thread and memory affinities for an entire group. Badis uses NUMA-sized instances and, in addition to spatial scheduling, provides coordinated co-scheduling and run-to-completion tasks. Control groups (cgroups) [Heo] are used in Linux to divide tasks into hierarchical groups and perform fine-grained scheduling for every group. Industry solutions like Docker containers [Mer14] rely on cgroups to solve dependency conflicts and provide Linux distribution independence. In contrast to Badis, the resource provisioning of a Docker container is done manually by a system administrator; the solution provided by Badis is orthogonal and relies on customized kernels for the compute plane. Hypervisors like Xen [BDF+03] can run multiple operating systems and kernels on the same machine by relying on hardware virtualization and/or partially emulating non-virtualizable hardware. The idea of specializing the OS with multiple virtual machines has been proposed in the past [BDSK+08]. Badis does not need to rely on virtualization techniques because it leverages the multikernel architecture with a well-defined shared communication interface between different kernels. While nothing would prevent running the different kernels in a virtual environment, doing so is a trade-off between better isolation guarantees against misbehaving kernels and avoiding the overheads of virtualization.

3.3 Customization Goals

Next, we discuss in more detail the factors that impact performance for data processing or cause noise during execution of critical tasks. Then we will see how the Badis OS architecture helps to solve such issues.

3.3.1 Run-to-completion execution

Most conventional operating systems today use a preemptive scheduling model: they divide time into small slices and alternate the execution of all runnable processes by switching the context to a new process at the beginning of each time-slice. While this model has proven successful over time, there are two cases where a context switch can significantly impact the execution time or cause outliers: First, synchronization-heavy applications running with preemptive scheduling can suffer from the well-known convoy effect [Bla79], especially when a thread is context-switched while holding a lock or when reaching a critical barrier [HMM14, OWZS13]. Second, data processing applications are cache-sensitive, and executing two data-intensive algorithms at the same time can result in cache pollution. To evaluate the worst-case impact of context switching on a data-sensitive application, we show results from the micro-benchmark provided by Liu et al. [LDS07]. The benchmark uses a synthetic program that performs accesses on an array of floating-point numbers to mimic random accesses. Two identical instances of the program are executed, and a context switch is made from one instance to the other after the array has been fully read. The time for the context switch is then calculated by subtracting the time it took for one of the instances to read the array without being preempted. Figure 3.3 gives the measured results from running this benchmark on three machines (4×12 Magny-Cours, 2×10 IvyBridge, and 1×4 Haswell) with different last-level cache (LLC) sizes (between 5 and 25 MiB).
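In essence, the benchmark works as sketched below (our reconstruction of the setup, with both processes assumed to be pinned to the same core; the timing and pinning code is omitted, and all names are ours rather than taken from [LDS07]):

#include <stdlib.h>
#include <unistd.h>

/* One pass over the working set; pollutes the caches. */
static double traverse(volatile float *a, size_t n) {
    double sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int main(void) {
    size_t n = (16u << 20) / sizeof(float);          /* e.g., 16 MiB working set */
    volatile float *a = calloc(n, sizeof(float));
    int p1[2], p2[2]; char tok = 'x';
    pipe(p1); pipe(p2);
    if (fork() == 0) {                               /* second, identical instance */
        for (;;) { read(p1[0], &tok, 1); traverse(a, n); write(p2[1], &tok, 1); }
    }
    for (int r = 0; r < 100; r++) {                  /* timed passes (timer omitted) */
        traverse(a, n);
        write(p1[1], &tok, 1); read(p2[0], &tok, 1); /* force a context switch */
    }
    return 0;
}

Subtracting the duration of an unpreempted pass from that of a pass including the hand-off yields the context-switch cost, including the cache-pollution penalty.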


[Figure 3.3 plot: cost of a context switch (ms, 0–7) vs. working set size (MB, 0–35) for MagnyCours (5 MB LLC), Haswell (8 MB LLC), and IvyBridge (25 MB LLC).]

Figure 3.3: The time of a context switch, including time lost due to cache-pollution with increasing working set sizes.

The results show that the cost of a context switch can increase from 0.7 microseconds to more than 6 milliseconds. For comparison, depending on the machine type and the working set size of the application, a context switch may thus take longer than a standard scheduling time-slice of modern Linux schedulers (about 6 milliseconds [Linb]). We further note that the cost of a context switch is highest when the working set is roughly equal to the size of the LLC. In order to optimize their execution on multicore architectures, modern data processing operators are highly tuned to use working sets of exactly the cache size [BATO13, CR07]. Thus, each preemption has the potential to destroy the carefully arranged data locality. User-space runtime libraries capture more information about the program’s parallelism, but they lack the necessary global view and runtime information from the OS. The example in Section 3.1.1 is one such instance, where the OpenMP programs assume they have all the cores to themselves and do not solve the initial problem of dividing compute resources. The OS is then forced to preempt and temporally schedule the clients.


Therefore, we argue that short-running, latency- or time-critical jobs should be executed without preemption. Badis can provide a run-to-completion model for application jobs. This is best achieved by implementing a customized kernel on certain cores, as doing so also fully removes the possibility of OS tasks interfering with the execution.

3.3.2 Co-scheduling

Unfortunately, running a single thread to completion is not enough for the optimal execution of parallel programs. A group of threads working on the same operation should ideally run simultaneously, a requirement also called strict co-scheduling. Otherwise, a delay in individual tasks can slow down the entire parallel job. This phenomenon, also known as stragglers, often becomes apparent in combination with barriers that programs install to wait for a specific computation to end. Prominent use-cases for co-scheduling are workloads from the HPC community [HSL10], but increasingly also data-processing workloads. The main challenge in both cases is to avoid stragglers in synchronization steps, which can impact performance [OWZS13, HMM14, GSS+13]. Another well-known scheduling strategy is gang-scheduling, which requires that all threads of a process execute simultaneously. In large data processing systems, gang-scheduling is typically not required and is counter-productive, as the number of parallel jobs tends to be much larger than the number of available cores. Badis gives applications the means to benefit from custom, variable strategies in parts of the system by combining radical approaches such as gang- or co-scheduling with regular preemptive scheduling. We explore a combination of co-scheduling and run-to-completion scheduling for parallel jobs as part of the Basslet runtime described later in Section 3.5.

3.3.3 Spatial isolation of tasks and threads

Modern multicore machines offer many opportunities for resource sharing when executing parallel programs, and hence for load interference. For example, Figure 3.4 shows all the places where hardware interference may occur on an Intel Sandy Bridge machine used in some of the experiments presented in Section 3.7.

[Figure 3.4 diagram: processor layout and multi-threaded core of an Intel Sandy Bridge machine with the five sharing points labeled.]

Figure 3.4: An Intel Sandy Bridge processor with labeled opportunities for resource sharing: (1) last-level cache, (2) local DRAM controller, (3) bus interconnect, (4) L1 and L2 caches, and (5) simultaneous multithreading.

The main factors for interference in multicore machines are cache contention, but increasingly also the memory subsystem (due to sharing of DRAM bandwidth, the memory controller, and memory access scheduling) [SSG+15]. Thus, we conclude that there is more to allocating cores than counting them: other resources such as shared caches and DRAM bandwidth must be accounted for as well. In some cases, like the example above, sharing them can lead to performance interference. In other cases, threads belonging to the same parallel job can benefit from constructive resource sharing to lower the cost of synchronization and communication. Given the properties of modern multicore machines, one innate hardware island is a NUMA node. Having its own cache hierarchy, several DRAM channels, and a memory bus, it is suitable for sharing among communicating [DGT13] and data-sharing tasks [PSM+15, LBKN14]. Moreover, the same properties make it ideal for achieving performance isolation: it can restrict destructive resource sharing resulting from cache pollution [LDC+09] and bandwidth contention [TMV+11]. Arguably, in contexts like data processing, the unit of allocation should be entire NUMA regions rather than individual cores.

3.3.4 OS interfaces

Traditional OS interfaces offer little to no opportunity for parallel applications to express the properties of their algorithms. For example, the OS does not distinguish between threads working on one concurrent task over another. In Linux, system developers have for many years, in the absence of more expressive OS interfaces, abused the cgroups mechanism to group threads (as opposed to processes) in order to let the OS know that they ought to be scheduled differently [Jon]. We argue that the operating system should provide parallel applications with an API that allows such information (a group of threads working on the same parallel job) to be passed down to the OS, together with hints on how to manage the allocated resources.

3.3.5 Data aware task placement

Finally, data processing applications are sensitive to the location of the data they access, in particular when executing on large NUMA systems. A common algorithmic pattern in data processing is to access the data in multiple stages, thereby creating a data-flow path between parallel sub-jobs [RTG14, LBKN14]. In such cases, it is important to preserve data locality and reduce the traffic across the machine’s interconnect between the communicating jobs. Prior studies have shown the effect that data and thread placement can have on performance [PSM+15]. Thus, we need to enable such applications to specify a preference for the node on which a job should execute (moving the computation to the data), and to declare that a sequence of parallel operations forms a pipeline and should therefore execute on the same set of resources. That way, the system can avoid unwanted migrations or inappropriate placement decisions by the OS. The policy decisions of how the Basslet runtime allocates and shares its compute resources are based on the needs and requirements discussed above and are explained in more detail in Section 3.5.2.

3.4 Design and Implementation

Badis is primarily an OS-based approach to compute resource management. It is implemented as an extension of the Barrelfish OS, with a key difference: in Badis, at any point in time the cores of a multicore machine are divided into a control plane and a compute plane, as shown in Figure 3.5. The control plane runs the regular Barrelfish OS, whereas the compute plane executes customized kernels on its cores. The ability of Barrelfish/DC to quickly exchange or update kernels on a set of cores provides the mechanism for this customization.

[Figure 3.5 diagram: control plane cores running Barrelfish (BF) and compute plane cores running light-weight kernels (LWK A, LWK B) on top of two NUMA nodes with DRAM and an accelerator with GDDRAM.]

Figure 3.5: The system architecture of Badis. Its key idea is that at any point in time, the cores in a multicore machine are partitioned into a control and a compute plane.

For many complex applications, the need for compliance with existing conventional OS services remains, and it is therefore critical that both types of OS kernels are available. The multikernel design allows this, and dynamically managing the customized kernels on the compute plane cores enables adaptability based on workload requirements. The rest of this section describes the division into control and compute planes as well as the interaction between the two.

3.4.1 Control plane

The control plane runs the regular Barrelfish kernel, including various user-space OS services and drivers, which form the core of the existing operating system. Applications typically start their execution as threads on the control plane, regardless of whether they execute all of their code on the control plane’s OS stack or offload tasks to the compute plane. Apart from executing applications’ threads, the control plane also hosts the OS service responsible for setting up the compute plane. In essence, it has to manage the compute plane and expose an interface so that applications can interact with compute plane kernels. We explain these parts next in more detail. The initial partitioning of the resources allocated to the two planes happens at boot time through the Badis management service. The service is directly integrated with the OS device manager and takes care of the set-up. At boot time, it reserves a certain number of cores for compute plane kernels and launches specific kernels on each of these cores. The initial set-up is statically configured, but the system can in principle also change the configuration dynamically at runtime. This makes sense when the OS has a suitable policy for making reconfiguration decisions, or when a system administrator requests it. The Badis management service uses the boot drivers presented in Chapter 2 to spawn cores with new, customized kernels. The compute plane cores are handled as a special case in the system and are not announced as regular cores; for traditional APIs used on the control plane (e.g., get_nprocs), the compute plane cores remain invisible. For the communication between kernels on the control and compute plane, Badis uses task queues, which are serviced by one or more compute plane kernels. They are implemented as capabilities in the system and thus have specific properties and invocations associated with them. The Badis management service takes care of the creation and distribution of queues to kernels. Currently, this happens only when a kernel is started; which queues to assign to which kernels is determined by reading the provided system configuration. Queues are created by retyping them from frame capabilities and are not mappable in user-space. However, they can be modified in kernel space, and indirectly by applications through invocations (e.g., to enqueue a job). From an application’s perspective, interacting with Badis’ compute plane resembles interacting with task-parallel runtimes or batch systems: applications enqueue jobs in the available queues (Figure 3.6). Badis provides two main abstractions for jobs: tasks and parallel tasks (ptasks). A task is the equivalent of a function call with additional arguments (i.e., it is implemented as a register context plus a reference to the process’ address space), whereas a ptask is a combination of tasks that logically belong together. Applications use invocations on queue capabilities to enqueue tasks and ptasks. During an enqueue operation, the Barrelfish kernel copies the job (provided as a ptask) to the queue and signals the compute plane kernel instance(s) servicing the queue that work is available. In user-space, applications using Badis can rely on a library with a set of high-level functions (Figure 3.7) for use on the control plane (to create, enqueue, and wait for tasks to complete). Badis also supports aborting ptasks, either by removing them while they are pending in a queue or by stopping the corresponding computation on the compute plane. The additional kernel ABI includes system calls for aborting a ptask, querying the current status of a task, and the enqueue capability invocation for adding a ptask to a queue.

[Figure 3.6 diagram: an application’s threads (1) enqueue a parallel task, which is (2) distributed to the Badis compute plane, where (3) tasks are distributed within the instance and (4) dispatched to its cores.]

Figure 3.6: The Badis architecture: Applications submit parallel tasks to the ptask queue(s). The Badis compute plane dequeues ptasks and dispatches the individual tasks to the hardware contexts owned by the compute plane.

In addition, applications can use the messaging facility in the Barrelfish library OS for communication between threads on the control plane and tasks on the compute plane. The implementation is based on the algorithm proposed by Bershad et al. [BALL91] and optimizes for low-latency communication on multicore systems. We will see in Section 3.5.3 how these facilities are used by systems building on top of Badis to enable system call forwarding to the control plane.

3.4.2 Compute plane

The compute plane consists of several, potentially different, customized OS stacks: specialized light-weight kernels with corresponding user-space libraries. Compute plane cores either have a separate kernel per core, or many cores form an instance running the same kernel. This flexibility allows Badis to specialize in various ways: to meet specific requirements of applications, to strictly partition resources (e.g., caches, memory controllers), or to simplify many aspects of the OS for heterogeneous hardware resources. The exact behavior of a compute plane kernel is implementation specific.

Badis Data Structures
struct task  { task_fn fun; void* arg; ... }
struct ptask { struct task* tasks; size_t count; ... }

Kernel ABI additions
badis_rt_enqueue(queue, ptask) → ptid
badis_rt_abort(ptid) → error_code
badis_rt_status(ptid) → error_code
badis_is_compute_plane() → false

Application API
badis_ptask_create(tasks) → struct ptask*
badis_ptask_enqueue(queue, ptask, ...)
badis_ptask_free(ptask)
badis_ptask_wait(ptask)
badis_ptask_abort(ptask)

Figure 3.7: Badis APIs as provided to applications running on the control plane for interaction with compute plane kernels.

However, at some point after start-up and core initialization, a compute plane kernel will usually check the received task queues for work and start to execute any enqueued tasks. Dequeuing a job from the queue(s) is essentially the same as starting a thread: in case the task is dispatched for the first time, the kernel switches to the address space of the owning process; otherwise it initializes the register set with the provided register values (including stack and instruction pointer) and switches to user-mode. While not a strict requirement, the compute plane kernels we present avoid taking part in any globally coordinated OS routines, to get rid of any OS interference. They also do not need a KCB. Such simplicity comes at a price, because certain operations are no longer possible. For example, while a job submitted for execution on the compute plane inherits the address space of the corresponding process running on the control plane, the compute plane is unable to manipulate it. On the other hand, compute plane kernels can be constructed without any OS state, which makes them easy to dispose of. We discuss these shortcomings in Section 3.5.3 and describe ways around them. The minimal system call ABI that a compute plane kernel needs to implement currently contains just four calls (Figure 3.8), for exiting a task, printing, and retrieving information about where a task is running in the system.


Badis Compute Plane Kernel ABI
badis_print(buf, length)
badis_exit(error, retval)
badis_get_info() → (instance_id, core_id, task_id)

Figure 3.8: The minimally required Badis compute plane kernel ABI.
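To illustrate, a task body running on a compute plane kernel could use this ABI roughly as follows (a minimal sketch; we assume the info tuple is returned through out-parameters, and the prototypes below are our own rendering of Figure 3.8):

#include <stddef.h>
#include <stdio.h>   /* snprintf */

extern void badis_print(const char *buf, size_t length);
extern void badis_exit(int error, long retval);
extern void badis_get_info(int *instance_id, int *core_id, int *task_id);

/* A task entry point executing on a compute plane core. */
void hello_task(void *arg) {
    int instance, core, task;
    badis_get_info(&instance, &core, &task);   /* where is this task running? */

    char buf[64];
    int len = snprintf(buf, sizeof(buf), "task %d on core %d (instance %d)\n",
                       task, core, instance);
    badis_print(buf, len);                     /* output relayed by the kernel */

    badis_exit(0, 0);                          /* hand the core back */
}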

3.4.3 Discussion

The flexible design of the compute plane allows kernel customization. For example, one can specialize the kernel scheduling policy and mechanism for specific workload requirements: transactional queries, which are typically short-running, can be executed by a pool of threads scheduled at very fine time-intervals [TTH16] and spatially placed to benefit from constructive LLC sharing [PPB+12]; HPC-like workloads can be scheduled using gang-scheduling [Ous82]; and a mix of synchronization-heavy algorithms can be executed using Callisto’s scheduling policies [HMM14]. Further customizations of the OS mechanisms, policies, and interfaces can also be made for other resources: memory management, interaction with I/O devices, etc. A light-weight OS kernel can also be specialized for the heterogeneous computational resources present in modern and future multicore machines. Such architectures should not need to be engaged in system-wide decision making or execute heavy OS services. Instead, they should run a thin layer of the operating system, optimized for job execution, that provides the necessary system services. There are many applications for specialized kernels, including kernels tailored for running databases or language run-times, for debugging or profiling, or for directly executing verified user code as in Google’s Native Client [YSD+09]. In the next two sections, we discuss two concrete instantiations of compute plane kernels for Badis and their interaction with the rest of the system in more detail.

3.5 Basslet: A kernel based, task-parallel runtime system

In this section, we describe Basslet, a task-based runtime system that is directly integrated with the OS. Basslet is built on top of Badis and has three main components, which we describe more closely:


• A customized kernel for the Badis compute plane that allows co-scheduling and run-to-completion execution of data processing jobs.

• A deployment configuration for Badis that sets up the compute plane for data processing systems by minimizing interference and maximizing overall system throughput.

• A Basslet library that provides API-compatible interfaces for the POSIX threading and OpenMP libraries. This makes it possible to run unmodified OpenMP or multi-threaded applications with Basslet.

3.5.1 Task-parallel compute plane kernel

We implemented the compute plane kernel by forking the existing Barrelfish kernel. A Barrelfish kernel in its basic form supports multi-tasking, implements a capability system, and has a set of drivers for core-local devices such as interrupt controllers, timers, or MMUs. The Basslet kernel differs from an original Barrelfish CPU driver in the following key aspects:

• Barrelfish CPU drivers typically do not share any memory across cores but rather rely on message passing in user-space services for coordination. In Basslet, all cores on the compute plane are grouped into instances. The Basslet kernel uses all cores within an instance to distribute the individual tasks of a parallel task and execute them together. To coordinate this effort, the kernels within an instance do share internal task queues and use inter-processor interrupts as well as compare-and-exchange instructions for coordination and task distribution.

• The conventional Barrelfish CPU driver adopts a preemptive model for scheduling processes. The Basslet kernel instead implements a task-based scheduling model that uses strict co-scheduling for the tasks within a ptask and always runs ptasks to completion. This is achieved by executing the Basslet kernels of an instance in one of two modes, master or worker (a sketch of the resulting dispatch loop follows this list). The master is notified by the control plane once a new ptask is available. It then tries to acquire a lock on the queue and fetch the ptask. If successful, it notifies the workers in the instance (i.e., the other cores) that a new ptask is available. Then, it runs the task scheduler, which dispatches tasks until all the tasks of the ptask have finished execution. The workers immediately start dispatching tasks after they are woken up by the master and go back to sleep once the ptask is fully executed.

• Finally, the Basslet kernel does not take part in any distributed OS operations and therefore does not run any user-space OS services, contains no capability database, and disables most interrupts, except those explicitly used by Basslet for cross-core communication.
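The master/worker protocol can be summarized by the following sketch (a simplified reconstruction; all helper names are assumed, not taken from the actual kernel sources):

#include <stdbool.h>
#include <stddef.h>

struct task; struct ptask;

extern void          wait_for_enqueue_ipi(void);        /* signal from control plane */
extern bool          try_lock_queue(void);
extern void          unlock_queue(void);
extern struct ptask *dequeue_ptask(void);
extern void          wake_workers(struct ptask *pt);    /* IPI the other cores */
extern struct ptask *wait_for_master(void);             /* sleep until woken */
extern struct task  *fetch_next_task(struct ptask *pt); /* compare-and-exchange based */
extern void          run_to_completion(struct task *t);

void basslet_dispatch_loop(bool is_master) {
    for (;;) {
        struct ptask *pt;
        if (is_master) {
            wait_for_enqueue_ipi();
            if (!try_lock_queue())
                continue;
            pt = dequeue_ptask();
            unlock_queue();
            wake_workers(pt);
        } else {
            pt = wait_for_master();
        }
        struct task *t;
        while ((t = fetch_next_task(pt)) != NULL)
            run_to_completion(t);        /* tasks are never preempted */
        /* workers go back to sleep; the master waits for the next ptask */
    }
}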

With a strict run-to-completion model, there is obviously always a danger of one of the tasks misbehaving or blocking indefinitely due to program bugs. Such a scenario could essentially stall an entire compute plane instance. As a solution, we use watchdogs: the local APIC timer is programmed to interrupt the task in case its execution takes too long. If the timeout is reached, the entire ptask is aborted and a failure is reported back to the application, which can then react accordingly (for example, by restarting the ptask). We note that this is an extreme point in the design space; other approaches, such as migrating back to a preemptive model for long-running tasks or finishing them on the control plane after a certain time, would be viable alternatives as well.

3.5.2 Compute plane configuration

The default Basslet deployment configuration for Badis aims to minimize negative performance impact through resource interference. Therefore, it partitions the Basslet kernel instances such that each spans an entire NUMA node for spatial task isolation. On current systems, this setup isolates not just CPUs but also the last-level and every other shared cache on a socket, as well as the memory controllers. We evaluate the benefit of such a deployment for parallel data processing workloads more closely in Section 3.7.1. For queues, Badis assigns a single, private queue to every Basslet instance (e.g., NUMA node) in the system, as well as a global queue that is served by all instances. Programs can therefore control job execution by choosing where a given job should be executed. While this fairly simple setup has proven sufficient for our workloads, more sophisticated configurations would also be possible. For example, instances interleaved across sockets can maximize performance for memory-bound workloads where data is interleaved across the machine, at the potential cost of interfering with and slowing down other parallel jobs in the system. Instances can also be spawned at smaller or larger scales, e.g., on core-groups in the SPARC M7 [AJK+15]. The actual setup therefore highly depends on the underlying hardware and the objective.
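In code, choosing between a private, per-node queue and the global queue might look as follows (a sketch using the API from Figure 3.7; node_queue, global_queue, and fill_tasks are assumed names, with the queue handles obtained from the Badis management service):

struct ptask *pt = badis_ptask_create(ntasks);
fill_tasks(pt);   /* populate pt->tasks[i].task / .data */

if (target_node >= 0)
    /* Data-aware placement: only the chosen NUMA node's instance serves this queue. */
    badis_ptask_enqueue(node_queue[target_node], pt, &ptid);
else
    /* No preference: any instance may pick the ptask off the global queue. */
    badis_ptask_enqueue(global_queue, pt, &ptid);

badis_ptask_wait(ptid);
badis_ptask_free(pt);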


Basslet Application API (synchronization)
bas_mutex_init(bas_mutex)
bas_mutex_lock(bas_mutex)
bas_mutex_unlock(bas_mutex)
bas_mutex_release(bas_mutex)
bas_cond_init(bas_cond)
bas_cond_wait(bas_cond, bas_mutex)
bas_cond_signal(bas_cond)
bas_cond_broadcast(bas_cond)
bas_cond_release(bas_cond)
bas_barrier_wait(bas_barrier)

Basslet Application API (memory)
malloc(size)
free(ptr)

Basslet Message API
bas_send(dest, msg, arg1, ...)
bas_wait_for(msg) → [data]

Figure 3.9: Basslet runtime API

3.5.3 Basslet runtime libraries

Tasks running on a compute-plane kernel can use the API provided by the Basslet library (Figure 3.9). It currently supports a set of synchronization functions (compatible with pthreads) and dynamic memory allocation using malloc and free. The synchronization APIs are implemented using x86 atomic instructions with exponential back-off. In practice, and as others have reported [DGT13], we found that this works well for our set-up, as tasks are never preempted and share a last-level cache. A Basslet kernel cannot allocate or map new memory into an address space, because it does not hold the state of the capability database and shares the address space with the executing task’s parent thread. Therefore, if malloc ever runs out of internal buffer space while called from within a task, it forwards the allocation request to the control plane using the message passing APIs. Such system call forwarding from the compute to the control plane is similar to the mechanism proposed in FlexSC [SS10]. Basslet currently supports parallel algorithms using OpenMP as well as pthreads. We briefly explain how a program can be adapted from using pthreads to Badis tasks, and how an OpenMP runtime can be implemented on top of Basslet.
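A sketch of this forwarding path inside the allocator, using the Basslet message API from Figure 3.9 (local_pool_alloc, CONTROL_PLANE, and the message tags are assumed names, not the library’s actual internals):

void *basslet_malloc(size_t size) {
    void *p = local_pool_alloc(size);      /* task-local buffer, no kernel involved */
    if (p != NULL)
        return p;

    /* Pool exhausted: the compute plane kernel cannot grow the address space,
       so ask the control plane to perform the allocation on our behalf. */
    bas_send(CONTROL_PLANE, MSG_MALLOC, size);
    return (void *)bas_wait_for(MSG_MALLOC_REPLY);
}

Because the round-trip stalls the calling task, the local pool should be sized so that this path is taken only in exceptional cases.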


3.5.3.1 Porting pthreads to Basslet

A straight-forward conversion from pthreads to Badis tasks can be fairly minimal: pthread_create becomes a combination of ptask_create and ptask_enqueue, whereas ptask_wait can be used in place of joining the individual threads. Listing 3.1 shows a simple example of a pthread program ported to Basslet. With the Basslet runtime library (Figure 3.9), it is possible to allocate memory and use various synchronization mechanisms that are compatible with the pthread synchronization API. While such a restricted set is relatively minimal, we found it sufficient for porting a set of widely used data-processing operators running inside a database. In case other OS operations are required, the programmer can always use the messaging API to route requests back to the control plane. This can incur additional latency and may stall tasks unnecessarily; therefore, appropriate measures (buffering, caching, etc.) should be taken to make sure this happens only in exceptional cases.

Listing 3.1: Example of porting POSIX thread creation using the Badis and Basslet APIs

struct ptask *pt = badis_ptask_create(nthreads);
for (i = 0; i < nthreads; i++) {
    pt->tasks[i].task = fn;
    pt->tasks[i].data = &args[i];
    // pthread_create(&tid[i], NULL, fn, &args[i]);
}
badis_ptask_enqueue(q, pt, &ptid);
badis_ptask_wait(ptid);
// for (i = 0; i < nthreads; i++) pthread_join(tid[i], NULL);

3.5.3.2 Porting OpenMP to Basslet

We implemented an OpenMP [Ope15] runtime library using Basslet. OpenMP programs consist of C/C++ code with annotations, added by the programmer or a DSL compiler, to automatically parallelize loops and other constructs. The annotations are parsed by the OpenMP compiler, which generates intermediate functions that compartmentalize loops into chunks that can be computed in parallel. The process leads to the generation of a series of calls to the OpenMP runtime library, which transforms the annotations into calls that distribute the loop execution across many tasks. The runtime library is also in charge of balancing the work across different cores. As an example, a simple implementation in Badis for the #pragma omp parallel annotation is given in Listing 3.2. The final implementation in Basslet is slightly more complicated: (a) OpenMP typically assumes that thread 0, which initially called the GOMP_parallel_start function, also takes part in the execution of the parallel section. In Badis’ case, this thread would usually run on the control plane and therefore does not participate in task execution. In our implementation, we deliberately let only tasks execute the parallel parts; (b) often, a loop encloses the #pragma omp parallel statements to iterate on some data until convergence is reached. In that case, it is desirable to run the consecutive #pragma omp parallel constructs on the same instance and in sequence to benefit from cache reuse. There are several ways to implement this in Badis: the simplest is to use a queue dedicated to a particular Basslet instance and enqueue the ptasks there. Alternatively, a single ptask can be spawned, and the Basslet runtime can then use the message passing interface to hand off work units to the individual tasks. Basslet currently supports the parallel pragma as well as dynamic and static for loops in OpenMP. Features like OpenMP tasks or teams are at the moment not supported, as the data-processing algorithms we used typically did not rely on them.

Listing 3.2: Example of OMP runtime using Badis and Basslet APIs for #pragma omp parallel

GOMP_parallel_start(void *fn, void *data, int nthreads) {
    struct ptask *pt = badis_ptask_create(nthreads - 1);
    for (int i = 0; i < nthreads - 1; i++) {
        pt->tasks[i].task = fn;
        pt->tasks[i].data = data;
    }
    badis_ptask_enqueue(q, pt, &ptid);
}

GOMP_parallel_end(void) {
    badis_ptask_wait(ptid);
    badis_ptask_free(pt);
}
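For reference, the user-level code that exercises this path looks like ordinary OpenMP; a fragment such as the following is lowered by the compiler into the GOMP calls above (process_chunk and data are placeholders, and the outlined function name is whatever the compiler generates):

/* User code: */
#pragma omp parallel
{
    process_chunk(data);
}

/* ... is lowered by the OpenMP compiler into, roughly:
 *   GOMP_parallel_start(outlined_fn, data, nthreads);
 *   outlined_fn(data);      // stock libgomp: thread 0 participates here;
 *   GOMP_parallel_end();    // in the Basslet port, only tasks execute it
 */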

3.5.4 Basslet code size

The Badis user-space library is currently about 1.5k lines of code. However, it relies on several libraries from the existing Barrelfish source tree (for memory allocation, data-structures, and messaging). The Basslet user-space library adds another 1.5k lines (with the OpenMP runtime implementation accounting for about 1k lines of code). The changes to the existing Barrelfish kernel to adapt it into a Basslet kernel were relatively small: overall, we changed 20 files in the tree, added 844 lines of code, and removed 122 lines. This includes some changes to the build scripts, which exclude some files entirely (such as the code to manipulate the capability data-structure, which is not required on the compute plane). The Basslet kernel itself currently supports the x86-64 architecture. One important hardware requirement is a fast notification mechanism between the compute and control plane, which can be either interrupt- or memory-based (for example, using something similar to Intel’s monitor and mwait instructions).

3.6 bfrt: A real-time OS kernel

Badis can offer hard real-time support by dedicating an entire compute plane core exclusively to an application and using a specialized kernel to eliminate OS jitter. To evaluate such a scenario, we built bfrt, a kernel which has no scheduler (since it targets a single application) and takes no interrupts. Its sole purpose is to measure the reduction of noise in the system by eliminating any operating system interference on a dedicated core, while still allowing application code to execute on it safely.

3.7 Evaluation

We perform experiments for Badis on a set of x86 machines shown in Table 3.1. For Basslet, our experiments focus mainly on an AMD Magny-Cours machine with the following properties: a Dell 06JC9T board with four AMD Opteron 6174 processors. Each processor has two 6-core dies, each forming a NUMA node with 16 GiB of memory and a 5 MiB LLC. The main advantage of this machine is that it has more NUMA nodes (eight in total), allowing us to scale the compute plane up to seven instances.

Name              Memory   Sockets × Cores                Freq.
1×4 Haswell       32 GiB   1×4c Xeon E3-1245 v3           3.4 GHz
2×10 IvyBridge    256 GiB  2×10c Xeon E5-2670 v2          2.5 GHz
4×8 SandyBridge   512 GiB  4×8c Xeon E5-4640              2.4 GHz
4×8 Bulldozer     64 GiB   4×8c AMD Opteron 6378          2.4 GHz
4×12 Magny-Cours  128 GiB  4×12c AMD Magny-Cours (HY-D1)  2.2 GHz

Table 3.1: Architectural details of the systems used in our evaluation.

3.7.1 Basslet runtime

The first set of experiments evaluates the efficiency of the Basslet runtime, together with Badis, when scheduling concurrent parallel workloads. It also compares against the performance of the same workloads when executed using the default OpenMP and Linux schedulers. Our evaluation focuses on parallel data processing applications. From the GreenMarl [HCSO12] graph application suite (git revision 4c0d62e) we execute the following algorithms: (1) PageRank (PR), (2) Single-Source Shortest Path (SSSP), and (3) HopDistance (HD). We evaluate the performance of the three algorithms on the LiveJournal graph, the largest available social graph in the SNAP dataset [LK14]. It has 4.8 M 32-bit nodes and 68 M 32-bit edges (about 300 MiB of raw binary data). The presented measurements do not include the time for loading the graph into memory. For a relational DB workload we use one of the most common operators, the hashjoin (HJ), whose implementation has been tuned for modern hardware [TABO13] and is open-source¹.

3.7.1.1 Interference between a pair of parallel jobs

The first experiment measures the effects of interference when different pairs of parallel jobs execute concurrently.

¹ http://www.systems.ethz.ch/sites/default/files/multicore-hashjoins-0_1_tar.gz


(a) Linux + OpenMP 12/24
        PR     HD     SSSP
SSSP    2.70   2.19   2.47
HD      2.50   2.13   2.07
PR      2.14   2.15   1.88

(b) Basslet + Badis 12/24
        PR     HD     SSSP
SSSP    1.01   1.00   1.00
HD      1.07   1.13   1.04
PR      1.03   1.01   1.02

(c) Linux + OpenMP 6/12
        PR     HD     SSSP
SSSP    1.72   1.60   1.77
HD      1.68   1.58   1.44
PR      1.23   1.25   1.17

(d) Basslet + Badis 6/12
        PR     HD     SSSP
SSSP    1.07   1.00   1.01
HD      1.03   1.04   1.05
PR      1.00   1.03   0.90

Figure 3.10: Numbers report slow-down of SSSP, HD and PR algorithms (for algorithm in row) when co-executed with a partner algorithm (column) vs. running the algorithm alone. Graphs contrast Linux + OpenMP scheduling vs. Basslet + Badis scheduling.

We execute the three GreenMarl graph algorithms on separate OpenMP runtime instances. We run them concurrently either using (1) the default Linux scheduler, with OpenMP choosing the degree of parallelism, or using (2) the Basslet compute kernel runtime. The reported numbers are normalized to a baseline experiment, which measured each algorithm’s runtime when run in isolation on 6 and 12 cores (1 and 2 NUMA nodes, respectively). When executing a pair of algorithms concurrently, for both setups we doubled the allocated resources by assigning 12 and 24 cores (2 and 4 NUMA nodes, respectively). Ideally, the normalized time should be 1, as for twice the load there are twice the resources. The results from the run on Linux are presented in Figure 3.10a and Figure 3.10c. The heatmaps show the calculated slowdown of the noisy execution relative to the runtime in isolation. The first observation is that there is significant performance degradation for all combinations of algorithm pairs. In some cases, it can reach up to a 2.7x slowdown, despite there being enough resources for both to execute well. The second observation is that the degradation and interference get worse with a higher degree of parallelism and with the number of NUMA nodes used. This is an important insight, especially because NUMA nodes on more recent machines are becoming bigger (with more cores, larger caches, etc.), and the effects of internal resource sharing are going to be exacerbated. In contrast, when the same combinations of algorithm pairs are executed on Basslet+Badis, the normalized runtimes are as expected (Figure 3.10b, Figure 3.10d), which means the jobs’ runtimes are almost unaffected compared to their runs in isolation. These results confirm the benefits of Basslet’s runtime design decisions, especially (1) the non-preemptive co-scheduling of the tasks belonging to the same ptask; (2) the spatial isolation of ptasks on complete NUMA nodes; and (3) the data-aware task placement, which was particularly important for the tight pipeline of ptasks within a loop, as generated by the GreenMarl compiler. As a result, Basslet’s scheduler delivers the desired performance isolation, even in noisy environments.

3.7.1.2 System throughput scale-out

We revisit the problem statement experiment presented in Section 3.1, measuring how well different scheduling approaches do when increasing the number of clients in the system as well as the resources allocated. As a first step, we perform the same experiment on several different machines, to verify that the effects we observed on the 4×12 Magny-Cours machine are not a special case. We use three additional machines: the 4×8 SandyBridge, the 2×10 IvyBridge with four NUMA nodes, and the 4×8 Bulldozer, which has an architecture similar to the 4×12 Magny-Cours machine with 2 NUMA nodes per socket. Hyper-threading is disabled on the Intel machines. Again, for this experiment we run the PageRank algorithm. The baseline run uses the cores on the first NUMA node (the degree of parallelism used is, thus, machine-dependent). The results are presented in Figure 3.11 and confirm that the problem with per-client throughput scaling in concurrent workloads is exhibited on all four multicore machines.

[Figure 3.11 plot: per-client throughput (PR/min/client) vs. number of clients (1–8) on the IvyBridge, SandyBridge, MagnyCours, and Bulldozer machines.]

Figure 3.11: Expanding the problem statement experiment (recall Figure 3.1) on four different machines.

As a second step, we return the focus to the AMD Magny-Cours machine and measure the performance of three different scheduling approaches: (1) concurrency as supported internally within OpenMP when running on top of the default Linux scheduler, (2) concurrency as supported by Linux when executing multiple OpenMP runtimes, one for each client, and (3) throughput as achieved by Basslet. Note that in the previous experiment we used setup (2). As a client, we again use the PageRank algorithm running on the LiveJournal social graph, using 6 cores. For every subsequent client (another instance of the PageRank algorithm) we allow the system to use 6 additional cores; the rest of the cores are disabled. The reported system throughput is the inverse of the total time needed to execute all PageRanks (i.e., the throughput as perceived per client). The results are presented in Figure 3.12. They show that the performance interference among multiple clients, when either Linux or OpenMP+Linux schedules the resources, increases as we add more clients, despite sufficient resources being available. The machine has forty-eight cores, so there are enough resources to execute eight concurrent PageRanks (within one OpenMP runtime), or eight OpenMP runtimes on top of Linux. Basslet achieves almost perfect per-client throughput scale-out up to seven clients. The final six cores, belonging to the first NUMA node, are dedicated to the control plane, which limits the scalability to seven Basslet instances.

[Figure 3.12 plot: per-client throughput (PR/min/client) vs. number of clients (1–8) for Ideal, Basslet, Linux+OpenMP, and OpenMP.]

Figure 3.12: Throughput scale-out when executing multiple PRs using a default Linux+OpenMP scheduler versus Basslet.

3.7.1.3 Standalone runtime comparison

The goal of this experiment is to compare the absolute runtimes of the algorithms when executed on Basslet+Badis versus Linux. All algorithms run in isolation with a degree of parallelism equal to the number of cores in a single NUMA node: six. For both systems, the algorithms were executed on cores belonging to the same NUMA node. The results, shown in Table 3.2, indicate that the algorithms’ runtimes on Basslet are comparable to those measured on Linux. We also point out that the compute plane could be further customized to improve the performance of such workloads.


                              Execution time (ms)
Algorithm (input data)        Linux    Basslet
Hash join (128M x 128M)       4787     3316
PageRank (LiveJournal)        6712     6509
Hop-Dist (LiveJournal)        515      542
SSSP (LiveJournal)            3390     3491

Table 3.2: Runtime of parallel algorithms executing on Linux versus Basslet.

3.7.2 Performance isolation with bfrt

We illustrate the benefits of Badis by executing a specialized kernel for running hard real-time applications where eliminating OS jitter is required. To ensure that the application runs uninterrupted, we assign it a dedicated core running the bfrt kernel described in Section 3.6. We evaluate the performance isolation that can be achieved with our specialized kernel compared to the isolation provided by:

• an unmodified Barrelfish kernel

• a Linux 3.13 kernel where we set the application to run with real-time priority

We run our experiments on the 1×4 Haswell machine, ensuring that no other applications or services are running on the same core. To measure OS jitter we use a synthetic benchmark that only performs memory stores to a single location. Our benchmark is intentionally simple to minimize performance variance caused by architectural effects. We sample the timestamp counter every 10^3 iterations, for a total of 10^6 samples. Figure 3.13a shows a histogram of the sampled cycles, where for all systems most of the values fall into the 6–7 kcycle range (i.e., 6–7 cycles of latency per iteration). Figure 3.13b presents the CDF for the 6–7 kcycle range, showing that there are no significant differences between the three systems in this range. Contrary to the bfrt dedicated kernel, where all of the samples are in the 6–7 kcycle range, in Linux and Barrelfish we observe significant outliers that fall outside this range. Since we run the experiment on the same hardware, under the same configuration, we attribute the outliers to OS jitter. In Barrelfish the outliers reach up to 68 kcycles (excluded from the graph). Linux performs better than Barrelfish, but its outliers still reach 27–28 kcycles. We ascribe the worse behavior of Barrelfish compared to Linux to the OS services Barrelfish runs in user-space.

[Figure 3.13 plots: (a) histograms of all samples (count vs. kilocycles, 6–33) for Barrelfish, Linux 3.13, and the bfrt dedicated kernel; (b) CDF for the samples in the range of 6–7k cycles.]

Figure 3.13: Number of cycles measured for 10^3 iterations of a synthetic benchmark for bfrt, Barrelfish, and Linux using real-time priorities.

                  1 Client            50 Clients
System / Op.      GET       SET       GET       SET
Barrelfish        89405     75711     255251    173979
Badis w/ load     88362     76173     254123    184559

Table 3.3: Redis throughput (operations/sec) running on Barrelfish versus Badis.
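For concreteness, the synthetic benchmark can be reconstructed roughly as follows (a sketch assuming x86-64 with GCC or Clang, where __rdtsc is provided by x86intrin.h; the constants mirror the sampling described above):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>              /* __rdtsc() */

#define NSAMPLES 1000000            /* 10^6 samples           */
#define NITERS   1000               /* 10^3 stores per sample */

static volatile uint64_t sink;      /* the single store target */
static uint64_t samples[NSAMPLES];

int main(void) {
    for (size_t s = 0; s < NSAMPLES; s++) {
        uint64_t start = __rdtsc();
        for (int i = 0; i < NITERS; i++)
            sink = i;               /* memory store to one location */
        samples[s] = __rdtsc() - start;
    }
    /* On a jitter-free core, every sample lands in the 6-7 kcycle band;
       outliers betray OS interference (cf. Figure 3.13). */
    for (size_t s = 0; s < NSAMPLES; s++)
        printf("%llu\n", (unsigned long long)samples[s]);
    return 0;
}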

3.7.3 Badis OS architecture

We have confirmed the benefits of the customized kernels and runtimes for parallel and real-time workloads. We now evaluate the impact of Badis and the control and compute plane partitioning on the rest of the OS.

3.7.3.1 Control plane applications

Executing a program that relies on the traditional thread-based Barrelfish scheduler on the control plane is unaffected by the workload running on the compute plane. To demonstrate this, we execute the PageRank workload on the Badis compute plane concurrently with a Redis key-value store [Wikd] running on the control plane. The Redis engine serves 1 and 50 clients issuing a series of GET/SET requests, and we compare the throughput (operations/second) when executing the same workload on the original Barrelfish OS (on the same set of resources) to running it on the Badis control plane alongside the PageRanks. The results (Table 3.3) show that for both workloads the observable difference in performance is small (within 10%) and can primarily be attributed to noise across experiment runs. This experiment confirms that (parts of) applications that need to be executed on a standard thread-based scheduler are unaffected by the control-/compute-plane separation. They may, however, benefit from additional performance isolation, as bandwidth-intensive jobs can be offloaded to the compute plane.


[Figure 3.14 plot: latency (cycles/tuple/client, 0–140) vs. number of clients (0–12), showing the total time (incl. queuing) and the runtime within a Basslet instance.]

Figure 3.14: Measuring the overhead of an enqueue syscall and the queuing effects in Badis using HJ.

3.7.3.2 Overhead of Badis enqueuing

This experiment measures the overhead (additional time) of enqueuing parallel jobs for execution on the compute plane. It measures the time of issuing the enqueue system call in two situations: (1) when the compute plane still has enough resources to execute the new job concurrently, and (2) when it is saturated and the new job needs to queue. For this experiment, we use the parallel hashjoin operator as the parallel job, and the compute plane runs Basslet instances. We measure the execution time of the hashjoin (HJ) before invoking ptask_enqueue and after returning from ptask_wait, as well as the execution time of the algorithm within the Basslet instance. In order to generate enough load to saturate the compute plane (i.e., to spawn more hashjoin clients), we dedicated the cores on two NUMA nodes to the control plane and used the cores on the remaining six NUMA nodes for the compute plane, running six Basslet instances. The performance of the hashjoin is typically evaluated in number of cycles per output tuple [TABO13]. The join in this experiment is executed on input relations with 32 million 64-bit tuples. The results are shown in Figure 3.14 and indicate that the cost of enqueuing the ptask is quite low. They also show that as soon as there are not enough Basslet instances in the compute plane to take over the enqueued ptasks, the wait time increases. We further note that the runtime within a Basslet instance remains stable for all jobs, despite the noise in the system.

3.8 Concluding remarks

This chapter presented Badis, an OS architecture that allows the OS kernel and services to be dynamically adjusted based on workload requirements. Badis enables complex applications to execute efficiently on modern machines by providing an adaptive and customizable OS stack. Badis currently uses quite simple (though effective) policies over the basic mechanism, and this chapter has focused strongly on the latter. The policy design space, however, is large. For instance, control over the compute plane could be exercised by a query optimizer in a database engine, which knows where data resides and the cost functions of data-processing operators, and can decide which operators to prioritize and how to parallelize each of them. The optimizer would run in the control plane. The control plane could also implement policies for relinquishing resources early based on utilization: inspecting the Badis queues provides a clear indication of idle cores and allows them to be reallocated before the application releases the resources itself. By building the Basslet runtime, we showed how to leverage the customized kernels with a custom, integrated runtime scheduler: we can achieve almost linear throughput scale-out and predictable runtimes for heavy analytical workload mixes. With the bfrt kernel we showed how an architecture like Badis can successfully eliminate most of the noise introduced by the operating system. While the Badis prototype is implemented over a multikernel, it is reasonable to ask whether similar benefits could be obtained by modifying a monolithic kernel like Linux. A radically new scheduler, leveraging recent proposals for fast core reconfigurability in Linux [PSK15], might be able to achieve similar results for certain workloads but may introduce a penalty for others. Currently, the Basslet compute plane kernel is not a good fit for I/O due to the strict run-to-completion policy, but the Badis architecture makes it easy to integrate the control/data plane design proposed by systems like Arrakis [PLZ+14]. Even though the current prototype runs on homogeneous 64-bit x86 machines, it is natural to extend the compute plane to support heterogeneous accelerators such as GPGPUs, co-processors like the Xeon Phi, or to control and support near-data processing accelerators on a memory controller.

4 Using Multiple Address Spaces in Applications

The volume of data processed by applications is increasing dramatically, and the amount of physical memory in machines is growing to meet this demand. However, effectively using this memory poses challenges for programmers. The main challenges tackled in this chapter are addressing more physical memory than the size of a virtual address space, maintaining pointer-based data structures across process lifetimes, and sharing very large memory objects between processes. It is plausible to anticipate that main memory capacity will exceed the virtual address space size supported by CPUs today (currently 256 TiB, with 48 virtual address bits). This is made particularly likely by non-volatile memory (NVM) devices – with larger capacity than DRAM – expected to appear in memory systems by 2020. While some vendors have already increased the number of virtual address (VA) bits in their processors, adding VA bits has implications on performance, power, production, and estate cost that make it undesirable for low-end and low-power processors. Adding VA bits also implies longer TLB miss latency, which is already a bottleneck in current systems (up to 50% overhead in scientific applications [MCV08]). Processors supporting virtual address spaces smaller than the available physical memory force in-memory data structures to be partitioned across multiple processes or to be mapped in and out of virtual memory, using techniques similar to how we manage I/O today.

Preserving pointer-based data structures beyond process boundaries requires data conversion via serialization techniques, which incurs a large performance overhead. Sharing and storing pointers in their original form across process lifetimes requires guaranteed acquisition of specific VA locations. Providing this guarantee is not always feasible, and even when it is, it may necessitate mapping datasets residing at conflicting VA locations in and out of the virtual address space. Languages can hide some of this nuisance from programmers by relying on special pointer representations; however, this results in non-standard programming techniques. Sharing data among processes requires communication protocols with a server process (using sockets, pipes, etc.), impacting programmability and incurring communication channel overheads. Sharing data via traditional shared memory requires tedious communication and synchronization between all client processes for growing the shared region or guaranteeing consistency on writes. To address these challenges, we present SpaceJMP, a set of APIs and OS mechanisms to manage memory. SpaceJMP applications create their own virtual address spaces (VASes) as first-class objects, independent of processes. Detaching VASes from processes enables a single process to simultaneously execute in multiple VASes, such that threads of that process can switch between these VASes in a lightweight manner. It also enables a single VAS to be shared by multiple processes or to exist in the system on its own, in a self-contained manner, as temporary, persistent storage. SpaceJMP allows a process to use an arbitrary amount of VA space by placing data in multiple address spaces. If a process runs out of virtual memory, it does not need to modify data mappings or create new processes; it simply creates more address spaces and switches among them. In SpaceJMP, a process can rely on the guaranteed availability of VA locations, which means it can use pointer-based data structures without relying on special pointers to circumvent address conflicts. Clients that live in a shared environment with a server do not need to communicate with a server process but can synchronize with each other over shared-region management. SpaceJMP builds on a wealth of historical techniques in memory management and virtualization but revisits them for modern data-centric applications in the light of expected, new memory-centric architectures. We will see in the evaluation, however, that the presented methods are also applicable on today’s hardware, delivering more flexibility and performance for large data-centric applications compared with current OS facilities.
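As a foretaste of the programming model, first-class address-space operations have roughly the following flavor (the names, types, and signatures here are illustrative sketches, not the exact SpaceJMP interface presented later):

/* Illustrative (assumed) prototypes for first-class VAS operations. */
typedef int vas_t;      /* a VAS object, independent of any process */
typedef int vhandle_t;  /* a process's handle onto an attached VAS  */

vas_t     vas_create(const char *name, int perms);
vhandle_t vas_attach(vas_t vas);
void      vas_switch(vhandle_t h);

void example(void) {
    vas_t v = vas_create("dataset", 0660);  /* VAS outlives the creating process */
    vhandle_t h = vas_attach(v);            /* attach the calling process to it  */
    vas_switch(h);                          /* threads switch in cheaply...      */
    /* ... dereference pointers stored in the VAS at their original addresses,
       with no serialization or pointer translation, then switch back out. */
}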


4.1 Motivation

The primary objectives of this work are as follows:

• Addressing the insufficient VA bits for processes in today’s hardware

• Methods for effectively preserving pointer-based data structures with little overhead and without special programming techniques

• The efficient sharing of large amounts of memory between processes

We discuss these challenges in Section 4.1.1, Section 4.1.2, and Section 4.1.3, respectively.

4.1.1 Memory technology

Non-Volatile Memory (NVM) technologies promise persistence with low latency, better scaling, and lower power consumption. This will enable massive pools of densely packed memory with much larger capacity than today's DRAM-based systems. Combined with high-radix optical switches [VSM+08], these future memory systems will appear to the processing elements as single, petabyte-scale "load-store domains" [FKMM15]. One implication of this trend is that the physical memory accessible from a CPU will exceed the size that its VA bits can address. While almost all modern processor architectures support 64-bit addressing, CPU implementations pass fewer bits to the virtual memory translation unit because of power and performance implications. When designing a CPU core, vendors have to strike a balance between optimizing the implementation for their highest-volume market (e.g., computing platforms with many GiBs of main memory) and serving the needs of high-end machines requiring much larger address spaces. Most CPUs today are limited to 48 virtual address bits (i.e., 256 TiB) and 44-46 physical address bits (16-64 TiB). On such hardware, the challenge is to support applications that want to address large physical memories without paying the cost of increasing processor address bits across the board. One solution is partitioning physical memory across multiple processes, which incurs unnecessary inter-process communication overheads and is tedious to program. Another is mapping memory partitions in and out of the VAS, which has the overheads discussed in Section 4.1.4.


4.1.2 Preserving pointer-based data structures

Maintaining pointer-based data structures beyond process lifetimes without paying serialization overhead has motivated persistent-memory programming models to adopt region-based programming paradigms [CCA+11, CBB14, VTS11]. Such approaches provide a more natural means for applications to interact with data. A challenge, however, is the representation of a pointer across process lifetimes: a region that contains pointers may, when remapped later, end up at a different virtual address, invalidating all pointers into it. To solve this problem, such languages typically rely on special pointers (e.g., offset-based pointers, where the runtime address is computed from the region's start address and a stored offset). This limits programmability, portability, and ease of use for programmers. It also hinders the adoption of legacy code in such environments, as legacy code typically requires substantial changes to adopt a new pointer model. Another solution is to require memory regions to be mapped at fixed virtual addresses by all parties. This creates degenerate scenarios where memory is mapped in and out whenever memory regions overlap, the drawbacks of which are discussed in Section 4.1.4.
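To make the special-pointer workaround concrete, the following is a minimal sketch (our own illustration, not taken from any of the cited systems) of an offset-based pointer in C: instead of a raw address, the structure stores the offset of the target from the region base, and every dereference recomputes the absolute address from the region's current mapping.

#include <stddef.h>
#include <stdint.h>

/* A pointer stored as an offset from the start of its region.
 * Offset 0 encodes NULL, so the region's first byte cannot be a target. */
typedef struct { uint64_t off; } rel_ptr;

static inline void rel_set(rel_ptr *p, void *region_base, void *target)
{
    p->off = target ? (uint64_t)((char *)target - (char *)region_base) : 0;
}

static inline void *rel_get(const rel_ptr *p, void *region_base)
{
    return p->off ? (char *)region_base + p->off : NULL;
}

Every pointer field in the region must use this representation, and every dereference must know the region base; this is exactly the programmability cost described above.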

4.1.3 Large-scale sharing of memory

Scenarios where multiple processes share large amounts of data require effective mechanisms for processes to access and manage shared data efficiently and safely. An old and well-known approach is the client-server model, whereby client processes communicate with one or more servers (e.g., key-value stores [Wikd, Wik03, OAE+11] that manage and serve data). Such approaches typically rely on network interfaces that incur significant communication and serialization overhead. Future memory-centric systems, where a set of compute nodes is connected to a large, byte-addressable storage pool, present an opportunity to eliminate much of this cost by sharing access to the same memory between clients and server.

4.1.4 Problems with legacy methods

Traditional UNIX operating systems are based on the concept of using the file abstraction to describe all accessible data. As systems become memory-centric, these interfaces become increasingly inadequate: they expose an abstraction modeled on the traditional workings of a mechanical disk (e.g., seek, read, write). Forcing programmers to use such abstractions in the future will likely lead to performance bottlenecks and usability limitations; we present one such instance in Section 4.5.4.

[Plot: latency in msec (log scale, 10^-4 to 10^4) vs. region size (2^15 to 2^35 bytes) for map, map (cached), unmap, and unmap (cached).]

Figure 4.1: Page-table construction (mmap) and removal (munmap) costs in Linux, using 4 KiB pages. Does not include page-zeroing costs.

Changing memory maps in the critical path of an application has significant performance implications, and scaling to large memories further exacerbates the problem. Conventional methods such as mmap are typically slow and not scalable [CKZ13]. Figure 4.1 shows that constructing page tables for a 1 GiB region using 4 KiB pages takes about 5 ms; for 64 GiB, the cost is about 2 seconds. Such costs are prohibitive on the fast path. Mapping cached page tables, as prototyped in Barrelfish, minimizes these costs, and SpaceJMP provides a basis for caching page table entries.
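The measurement behind Figure 4.1 can be approximated with standard POSIX calls. The following sketch (our illustration, not the benchmark used for the figure) times mmap and munmap of anonymous regions, using Linux's MAP_POPULATE flag to force eager page-table construction; unlike Figure 4.1, this also includes the page-zeroing costs.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

static double ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    for (int shift = 20; shift <= 36; shift += 4) {
        size_t sz = 1UL << shift;
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* MAP_POPULATE builds the page tables eagerly (Linux-specific). */
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == MAP_FAILED) { perror("mmap"); continue; }
        munmap(p, sz);
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("2^%d bytes: map %.3f ms, unmap %.3f ms\n",
               shift, ms(t0, t1), ms(t1, t2));
    }
    return 0;
}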


In addition to the performance implications, the inflexibility of legacy interfaces is a major concern: memory-centric computing requires a careful organization of the virtual address space, and data access and the underlying mappings will need to be installed by the applications themselves. Interfaces such as mmap provide only limited control. In Linux, for example, mmap does not safely abort if a request is made to map a region of memory that overlaps an existing region; it simply maps over it. Moreover, memory sharing is tightly coupled to coarse-grained protection bits configured on the backing file via access control lists (ACLs). This forces an unavoidable translation between two different security models: the ACL-backed file system on one side, and on the other the much finer-grained page-based protection of conventional virtual memory systems, or the object-based protection of capability systems like CHERI [WNW+17]. All this, together with linkers that dynamically relocate libraries or randomize stack locations, prevents applications from controlling their memory layout effectively.

4.2 Related work

SpaceJMP is related to a wide range of work both in OS design and hardware support for memory management.

4.2.1 Operating systems

SpaceJMP is influenced by the design of Mach [RTY+87], in particular the concept of a memory object mappable by different processes. Such techniques were also adopted in the Opal [CBHLL92] and Nemesis [Han99] SASOSes to describe memory segments characterized by fixed virtual offsets. Lindstrom [LRD95] expands these notions with shareable containers that hold code segments and private memory; protection is enforced with a capability model. Application threads, called loci, enter containers to execute and manipulate data. Loci motivated our approach to the client-server model. Single-address-space operating systems (SASOSes) (e.g., Opal [CBHLL92, CLBhL92] and IVY [Li88]) eliminate duplicate page tables, but protection domains still need to be enforced by additional hardware (protection lookaside buffers) or page groups [KCE92]. Systems like Mondrix [WRA05] provide isolation by enforcing protection at fixed entry and exit points of cross-domain calls, much like traditional "call gates"


or the CAP computers’ “enter capabilities” [Wil79]. Such mechanisms complement SpaceJMP by providing protection capabilities at granularities other than a page within a VAS itself. Also, SASOSes assume large virtual address spaces, but we predict a limit on virtual address bits in the future. More recent work such as SMARTMAP [BPH08] evaluates data-sharing on high-performance machines. In the 90s, research focused on exporting functionality to , reducing OS complexity, and enabling applications to set memory management policies [EKO95, RKZB11, Han99]. Our work is supported by such ideas, as applications are exposed to an interface to compose virtual address spaces, with protection enforced by the OS, as with seL4 [EDE08]. The idea of using multiple address spaces has mainly been applied to achieve protection in a shared environment [CBHLL92, TKM99], and to address the mismatch of abstractions provided by the OS interfaces [AL91, RKZB11]. Some operating systems provide similar APIs compared to SpaceJMP in order to simplify kernel development by executing the kernel directly in user-space [Eco, Dik06]. Dune [BBM+12] uses virtualization hardware to sandbox threads of a process in their own VAS. SpaceJMP goes further by promoting a virtual address space to a first-class citizen in the OS, allowing applications to compose logically related memory regions for more efficient sharing.

4.2.2 Memory management

Memory management is one of the oldest areas of computer systems, and many different designs have been proposed. Most familiar OSes support address spaces for protection and isolation, with the threads of a process sharing a single virtual address space. The OS uses locks to ensure thread safety, and the resulting contention forces memory-management-intensive programs to be split into multiple processes [WA09]. RadixVM [CKZ13] maintains a radix tree from virtual addresses to metadata and allows concurrent operations on non-overlapping memory regions. Making the VAS abstraction a first-class citizen provides a more natural expression of sharing and isolation semantics, while extending the use of such scalable virtual memory optimizations. BSD operating system kernels are designed around the concept of virtual memory objects, enabling more efficient memory mappings, similar to UVM [CP99]. Earlier 32-bit editions of Windows gave applications only 2 GiB of virtual address space, but allowed raising this limit to 3 GiB for memory-intensive


applications [Mica]. Furthermore, Address Windowing Extensions [Micb] gave applications the means to allocate physical memory above 4 GiB and quickly place it inside reserved windows in the 32-bit address space, allowing applications to continue using regular 32-bit pointers. Distributed shared memory systems [KCDZ94] gained little adoption in the past, but have recently attracted renewed interest [NHM+15].

4.2.3 Communication and Sharing

Strong isolation limits the flexibility to share complex, pointer-rich data structures, and shared memory objects duplicate page-table structures in each process. OpenVMS [NK98] shares page tables for the subtree of a global section, and Munroe patented sharing data among multiple virtual address spaces [MPS04]. Anderson et al. [ALBL91] recognized that the increased use of RPC-based communication requires both sender and receiver to be scheduled to transfer data, involving a context switch and kernel support (LRPC [BALL90]) or shared memory between cores (URPC [BALL91]). In both cases, data serialization, cache misses, and buffering lead to overhead [KC94, CLR94, TLL94]. Compared to existing IPC mechanisms like Mach Local RPC [Dra90] or overlays [Lev00], SpaceJMP distinguishes itself by crossing VAS boundaries rather than task or process boundaries. Ideally, bulk data transfer between processes avoids copies [SR12, SWP01, CCC14, JC14], although there are cases where security precludes this [DP93, JC14]. Collaborative data processing with multiple processes often requires pointer-rich data structures to be serialized into a self-contained format and sent to other processes, which induces overhead in web services but also on single multi-core machines [HJR+03, CFKL99, SM12, JKH+04]. Cyclone [GMJ+02] has memory regions with varying lifetimes and checks the validity of pointers in the scopes where they are dereferenced. Systems like QuickStore [WD94] provide shared-memory implementations of object stores, leveraging the virtual memory system to make objects available via their virtual addresses. QuickStore, however, incurs the cost of in-place pointer swizzling for objects whose virtual address ranges overlap, as well as the cost of mapping memory itself. SpaceJMP elides swizzling by attaching alternate VASes, combined with VAS-aware pointer tracking for safe execution.


Mondrix [WRA05] aims for safety by enforcing the isolation of protection domains within the Linux kernel, based on a combination of hardware and software. Cross-domain calls involve permission checks by (modified) hardware and happen at fixed entry or exit points (switch gates); a thread needs to switch into a protection domain to execute its code. This is similar to enter capabilities [SSF99].

4.2.4 Hardware

We have already discussed the trade-offs involved in tagged TLBs and observed that SpaceJMP would benefit from a more advanced tagging architecture than that found in x86 processors. The HP PA-RISC [MLM+86] and Intel Itanium [Int, ZRMH00] architectures both divide the virtual address space into windows onto distinct address regions, with the mapping controlled by region registers holding the VAS identifier. SpaceJMP could also profit from the adoption of Direct Segments [BGC+13] in general-purpose hardware: SpaceJMP segments are currently backed by the underlying page-table structure, but integrating our segment API with Direct Segments would yield further benefits by reducing TLB miss and page-translation overheads. Alternatively, RMM [KGA+15] proposes hardware support that adds a redundant mapping for large ranges of contiguous pages. RMM, too, is a potentially good match for SpaceJMP segments, requiring fewer TLB entries than a traditional page-based system, and SpaceJMP could be extended straightforwardly to implement RMM's eager allocation strategy. Other work proposes hardware-level optimizations that enable finer granularities than a page [KGA+15, SPR+15]. The CODOMs [VBYN+14] architecture allows trusted and untrusted software components to reside in the same address space and interact with each other. Sharing memory locations is achieved using hardware capability registers to grant temporary memory access from untrusted domains, and the hardware allows different modules to communicate efficiently by simply invoking procedure calls. In dIPC [VJN+17], the authors extend CODOMs with inter-process communication by having threads of different applications reside in the same address space. Similarly, the SpaceJMP Redis example enables client threads to temporarily enter the server's address space; the hardware support in CODOMs would benefit SpaceJMP as well, by further isolating clients within a server address space. Bailey et al. [BCGL11] argue that the availability of huge and fast non-volatile, byte-addressable memory (NVM) will fundamentally influence OS structure. Persistent storage will


be managed by existing virtual memory mechanisms [MAK+13, CNF+09] and accessed directly by the CPU. Atlas [CBB14] examines durability when accessing persistent storage by using locks and explicit publishing to persistent memory. SpaceJMP provides a foundation for future designs of memory-centric operating system architectures.

4.3 Design

We now describe the design of SpaceJMP and define the terminology used in this chapter. SpaceJMP provides two key abstractions: lockable segments encapsulate sharing of in-memory data, and virtual address spaces represent sets of non-overlapping segments.

4.3.1 Lockable Segments

In SpaceJMP, all data and code used by the system exist within segments. The term "segment" has been used to refer to many different concepts over the years; in SpaceJMP, a segment should be thought of as an extension of the model used in Unix to hold code, data, stack, etc.: a single, contiguous area of virtual memory containing code and data, with a fixed virtual start address and size, together with meta-data describing how to access the content in memory. With every segment we store the backing physical frames, the mapping from its virtual addresses to physical frames, and the associated access rights. For safe access to regions of memory, all SpaceJMP segments can be made lockable. In order to switch into an address space, the OS must acquire a reader/writer lock on each lockable segment in that address space. Each lock acquisition is tied to the access permissions of its corresponding segment: if the segment is mapped read-only, the lock is acquired in shared mode, supporting multiple readers (i.e., multiple reading address spaces) and no writers. Conversely, if the segment is mapped writable, the lock is acquired exclusively, ensuring that only one client at a time is allowed in an address space with that segment mapped. Lockable segments are the unit of data sharing and protection in SpaceJMP. Together with address space switching, they provide a fast and secure way to guarantee safe, concurrent access to memory regions by many independent clients.
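This rule maps directly onto a reader/writer lock per segment. The following user-level sketch (ours, with hypothetical structures standing in for the kernel's) shows the lock step of a switch: shared acquisition for read-only mappings, exclusive acquisition for writable ones.

#include <stdbool.h>
#include <pthread.h>

/* Hypothetical representation of a lockable segment. */
struct segment {
    pthread_rwlock_t lock;  /* reader/writer lock guarding the data */
    bool writable;          /* how the VAS being entered maps it */
    bool lockable;
};

struct vas {
    struct segment **segs;
    int nsegs;
};

/* Acquire the locks of all lockable segments in the VAS being entered. */
static void vas_lock_segments(struct vas *v)
{
    for (int i = 0; i < v->nsegs; i++) {
        struct segment *s = v->segs[i];
        if (!s->lockable)
            continue;
        if (s->writable)
            pthread_rwlock_wrlock(&s->lock); /* exclusive: one writer */
        else
            pthread_rwlock_rdlock(&s->lock); /* shared: many readers */
    }
}

/* Release them when switching away. */
static void vas_unlock_segments(struct vas *v)
{
    for (int i = 0; i < v->nsegs; i++)
        if (v->segs[i]->lockable)
            pthread_rwlock_unlock(&v->segs[i]->lock);
}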


[Diagram: on the left, the traditional process abstraction with a single address space (text, data, heap, stack); on the right, the SpaceJMP process abstraction, where one process attaches to several address spaces (with AScurrent designating the active one) and text, data, and stack segments are shared across address spaces.]

Figure 4.2: Contrasting SpaceJMP and Unix.

4.3.2 Multiple Virtual Address Spaces

In most contemporary OSes, a process or kernel thread is associated with a single virtual address space (VAS), assigned when the execution context is created. In contrast, SpaceJMP virtual address spaces are first-class OS objects, created and manipulated independently of threads or processes. The contrast between SpaceJMP and traditional Unix-like (or other) OSes is shown in Figure 4.2. SpaceJMP can emulate regular Unix processes (and does, as we discuss in the next section), but provides a way to rapidly switch sections of mappings to make different sets of segments accessible. The SpaceJMP API is shown in Figure 4.3. A process in SpaceJMP can create, delete, and enumerate VASes. A VAS can then be attached by one or more processes (via vas_attach), and a process can switch its context between the VASes it is attached to at any point during its execution. A VAS can also continue to exist beyond the lifetime of its creating process (e.g., it may be attached to other processes). The average user only needs to worry about manipulating VASes; advanced users and library developers may directly manipulate the segments within a VAS. Heap allocators, for instance, must manage the segments that service the allocations (see Section 4.4). Segments can be attached to, or detached from, a VAS. In practice, some segments (such as global OS mappings, code segments, and thread stacks) are widely shared between the VASes attached to the same process. seg_attach and seg_detach allow installing segments either process-specifically, using the VAS handle (vh), or globally for all


VAS API – for applications.

vas_find(name) → vid
vas_create(name,perms) → vid
vas_clone(vid) → vid
vas_attach(vid) → vh
vas_detach(vh)
vas_switch(vh)
vas_ctl(cmd,vid[,arg])

Segment API – for library developers.

seg_find(name) → sid
seg_alloc(name,base,size,perms) → sid
seg_clone(sid) → sid
seg_attach(vid,sid)    seg_attach(vh,sid)
seg_detach(vid,sid)    seg_detach(vh,sid)
seg_ctl(sid,cmd[,arg])

Figure 4.3: SpaceJMP interface.

type *t; vasid_t vid; vhandle_t vh; segid_t sid;

// Example use of segment API.
va = 0xC0DE; sz = (1UL << 35);
vid = vas_create("v0", 660);
sid = seg_alloc("s0", va, sz, 660);
seg_attach(vid, sid);

// Example use of VAS API.
vid = vas_find("v0");
vh = vas_attach(vid);
vas_switch(vh);
t = malloc(...);
*t = 42;

Figure 4.4: Example SpaceJMP usage.

processes attached to the VAS, using the vid. Rather than reinventing the wheel, our permission model for segments and address spaces uses the existing security models available in the OS. In DragonFly BSD, for example, we rely on ACLs to restrict access to segments and address spaces for processes or process groups; in Barrelfish, we use the capability system provided by the OS (Section 4.4.1) to control access. To modify permissions, the user can clone a segment or VAS and use seg_ctl or vas_ctl to change the meta-data of the new object (e.g., its permissions) accordingly. A spawned process still receives its initial VAS from the OS; the exposed functionality, however, allows the user process to construct and use additional VASes. A typical call sequence to create a new VAS is shown in Figure 4.4.
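Figure 4.4 covers the single-process case. The following sketch (ours, assuming a spacejmp.h header that declares the Figure 4.3 API and the types from Figure 4.4) extends it to the sharing scenario of Section 4.1.3: a producer publishes a VAS with a globally attached data segment, and a consumer process attaches to it later.

#include "spacejmp.h"   /* assumed header for the Figure 4.3 API */

/* Producer: create a VAS with one globally visible data segment. */
void producer(void)
{
    vasid_t  vid = vas_create("shared", 660);
    segid_t  sid = seg_alloc("data", 0x600000000000UL, 1UL << 30, 660);
    seg_attach(vid, sid);            /* via vid: visible to all attachers */

    vhandle_t vh = vas_attach(vid);  /* library adds private code/stack */
    vas_switch(vh);
    /* ... build pointer-rich structures at guaranteed addresses ... */
}

/* Consumer: possibly a different process, possibly a later lifetime. */
void consumer(void)
{
    vasid_t   vid = vas_find("shared");
    vhandle_t vh  = vas_attach(vid);
    vas_switch(vh);
    /* ... pointers stored by the producer are valid verbatim ... */
}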


4.4 Implementation

We built 64-bit x86 implementations of SpaceJMP for the Barrelfish and DragonFly BSD operating systems, which occupy very different points in the design space for memory management and provide a broad perspective on the implementation trade-offs for SpaceJMP.

4.4.1 Barrelfish

We recall from Section 1.3 that, in contrast to traditional UNIX systems, Barrelfish prohibits dynamic memory allocation in the kernel. Barrelfish instead relies on one or more user-space memory servers which allocate physical memory on behalf of applications. The capability system allows an application to retype memory into different kinds of objects; retype operations are requested through system calls and checked by the kernel. The security model guarantees that a user-space process can safely allocate memory for kernel-level objects (page tables, dispatcher control blocks, etc.) and modify them safely through capability invocations. For example, in order to install a physical frame into a last-level page table, a user-space process passes the capability for the page table, the capability for the frame, and the slot number at which to install the frame in the table. During the map invocation, the kernel checks that both the frame and page-table capabilities are valid and writes the physical address into the correct slot (along with the specified access rights, etc.). The capability system in Barrelfish exposes a fine-grained access control scheme for all system objects to applications. It therefore allows processes great flexibility in implementing policies, while security-relevant actions are performed safely via explicit capability invocations. The result is that SpaceJMP is implemented almost entirely in user space, and no additional logic in the kernel was needed: all VAS management operations translate into series of explicit capability invocations that were already supported in Barrelfish. We use this functionality to explicitly share and modify page tables without kernel participation. In particular, we implemented a SpaceJMP service to track the VASes created in the system, together with their attached segments and attaching processes, similar to the kernel extensions employed in BSD. Processes interact with this user-level service via RPCs.
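In outline, the kernel side of such a map invocation behaves like the sketch below. All names here are hypothetical illustrations of the flow just described, not Barrelfish's actual interface; the two lookup and write helpers are declared but elided.

#include <stdint.h>

/* Hypothetical capability representation. */
enum cap_type { CAP_FRAME, CAP_PTABLE };

struct cap {
    enum cap_type type;
    uint64_t      paddr;   /* physical address of the object */
};

#define PTABLE_ENTRIES 512

/* Stubs standing in for kernel internals (elided). */
struct cap *cspace_lookup(unsigned ref);
void ptable_write_entry(uint64_t pt_paddr, unsigned slot,
                        uint64_t frame_paddr, unsigned flags);

/* Kernel side of a map invocation: validate both capabilities,
 * then install the frame into the requested slot. */
int sys_map(unsigned ptable_ref, unsigned frame_ref,
            unsigned slot, unsigned flags)
{
    struct cap *pt = cspace_lookup(ptable_ref);
    struct cap *fr = cspace_lookup(frame_ref);

    if (!pt || !fr || pt->type != CAP_PTABLE || fr->type != CAP_FRAME)
        return -1;              /* capability checks failed */
    if (slot >= PTABLE_ENTRIES)
        return -1;

    ptable_write_entry(pt->paddr, slot, fr->paddr, flags);
    return 0;
}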


Upon attaching to a VAS, a process obtains a new capability to a root page table to be filled in with mapping information. Switching to the VAS is a capability invocation that replaces the thread's root page table with the one of the VAS. This is guaranteed to be a safe operation, as the capability system enforces that only valid mappings exist. Initially, all page tables other than the root of a VAS are shared among the attached processes, allowing easy propagation of updated mappings. The Barrelfish kernel is therefore unaware of SpaceJMP objects. To enforce their proper reclamation, we can rely on the capability revocation mechanism: revoking the process' root page-table capability prevents the process from switching into the VAS. The result is a pure user-space implementation of SpaceJMP, enabling us to enforce custom policies, such as alignment constraints or the selection of alternative page sizes. We also implemented the ability to cache the page-table structures of a subtree and patch them into a VAS upon segment attachment, which results in fast attach and detach latencies for segments.

4.4.2 DragonFly BSD

DragonFly BSD [Wikb] is a scalable, minimalistic kernel originally derived from FreeBSD 4, supporting only the x86-64 architecture. Our SpaceJMP implementation includes modifications to the BSD memory subsystem and an address-space-aware implementation of the user-space malloc interfaces. Memory subsystem: DragonFly BSD has a nearly identical memory system to FreeBSD, both of which derive from the memory system in the Mach kernel [RTY+87]; the following descriptions therefore also apply to similar systems. The BSD memory subsystem is based on the concept of VM objects [RTY+87], which abstract storage (files, swap, raw memory, etc.). Each object contains a set of physical pages that hold the associated content. A SpaceJMP segment is a wrapper around such an object, backed only by physical memory and additionally containing global identifiers (e.g., a name) and protection state. Physical pages are reserved at the time a segment is created and are not swappable. Furthermore, a segment may contain a set of cached translations to accelerate attachment to an address space; these translations may be cached globally in the kernel if the segment is shared with other processes.


Address spaces in BSD, as in Linux, are represented by two layers: a high-level set of region descriptors (virtual offset, length, permissions) and a single instance of the architecture-specific translation structures used by the CPU. Each region descriptor references a single VM object, telling the page-fault handler where to ask for physical pages. A SpaceJMP segment can therefore be created as a VM object and added directly to an address space, with minimal modifications to the core OS kernel implementation. Sharing segments with other processes is straightforward to implement, as a segment can always be attached to an existing address space, provided it does not overlap a previously mapped region. Sharing an address space is slightly more complicated: as mentioned in Section 4.3, we only share segments that are also visible to other processes. Doing so allows applications to attach their own private segments (such as their code or their stack) into an address space before they switch into it; such private segments would otherwise conflict heavily, as every process tends to have its stack and program code at roughly the same locations. The underlying address space representation that BSD uses (the vmspace object) always represents a particular instance of an address space with concrete segments mapped; sharing a vmspace instance directly with other processes therefore does not work. Instead, when sharing an address space, the OS shares just the set of memory segments that comprise the VAS. A process then attaches to the VAS, creating a vmspace, which it can then switch into. A slight modification of the process context structure was necessary to hold references to more than one vmspace object, along with a pointer to the current address space. Knowing the active address space is required for the correct operation of the page-fault handler and of system calls which modify the address space, such as the segment attach operation. Inadvertent address collisions may arise between segments, such as those for stack, heap, and code. For example, attaching to a VAS may fail if a (global) segment within it conflicts with the attaching process' (private, fixed-address) code segments. Our current implementation in DragonFly BSD avoids this by ensuring that globally visible and process-private segments are created in disjoint address ranges. VAS switching: Attaching to a global VAS creates a new process-private instance of a vmspace object, in which the process code, stack, and global regions are mapped. A program or runtime initiates a switch via a system call; the kernel identifies the vmspace object specified by


the call, then simply overwrites CR3 (the control register indicating the currently active page table on an x86 CPU core) with the physical address of the page table of the new VAS. Other registers, such as the stack pointer, are not modified. The kernel may also initiate a switch transparently, for example after a page fault, or triggered by any other event.
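Conceptually, the process-context change and the switch path reduce to a few lines. The sketch below uses hypothetical names (proc_vas_state, sys_vas_switch, and the stub helpers), not the actual DragonFly BSD code.

#include <stdint.h>

struct vmspace;                  /* BSD's per-address-space object */

#define MAX_ATTACHED_VAS 16

/* Hypothetical extension of the process context: the attached
 * vmspaces plus the currently active one, consulted by the
 * page-fault handler and address-space system calls. */
struct proc_vas_state {
    struct vmspace *attached[MAX_ATTACHED_VAS];
    int             nattached;
    struct vmspace *current;
};

/* Stubs standing in for kernel internals (elided). */
extern struct vmspace *proc_lookup_vas(int vh);
extern void            proc_set_current_vas(struct vmspace *vm);
extern uint64_t        vmspace_root_pagetable(struct vmspace *vm);

static inline void load_cr3(uint64_t phys)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(phys) : "memory");
}

/* Kernel handler for the vas_switch system call: only CR3 changes;
 * the stack pointer and other registers are left untouched. */
int sys_vas_switch(int vh)
{
    struct vmspace *vm = proc_lookup_vas(vh);
    if (vm == NULL)
        return -1;
    proc_set_current_vas(vm);
    load_cr3(vmspace_root_pagetable(vm)); /* flushes untagged TLB entries */
    return 0;
}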

4.4.3 Runtime library

We developed a library to facilitate application development against the SpaceJMP kernel interface, hiding complexities such as locating the boundaries of process code, globals, and stack for use within an address space across switches. The library performs much of the bookkeeping involved in attaching an address space to a process: the private segments holding the process' program text, globals, and thread-specific stacks are attached to the process-local vmspace object using the VAS handle. Furthermore, the library provides allocation of heap space (malloc) within a specific segment while inside an address space. SpaceJMP complicates heap management, since programs need to allocate memory from different segments depending on their needs. These segments may not be attached to every address space of the process; moreover, a call to free memory can only be executed by a process if it is currently in an address space which has the corresponding segment attached. To manage this complexity, the SpaceJMP allocator is built over Doug Lea's dlmalloc [Lea], using its notion of a memory space (mspace). An mspace is the allocator's internal state, and it may be placed at an arbitrary location. Our library supports distinct mspaces for individual segments and provides wrapper functions for malloc and free which supply the correct mspace instance to dlmalloc, depending on the currently active address space and segment.
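The wrapper scheme can be sketched as follows (our simplification; it assumes dlmalloc built with MSPACES support, exposed here through an assumed dlmalloc.h header, and a hypothetical current_segment() helper that resolves the active VAS and segment).

#include <stddef.h>
#include "dlmalloc.h"   /* assumed: dlmalloc compiled with MSPACES=1 */

struct sj_segment {
    void   *base;       /* the segment's fixed virtual start address */
    size_t  size;
    mspace  msp;        /* allocator state, placed inside the segment */
};

/* Hypothetical: resolve the active segment of the active VAS. */
extern struct sj_segment *current_segment(void);

void sj_heap_init(struct sj_segment *s)
{
    /* The mspace lives inside the segment itself, so the allocator's
     * metadata is valid in every VAS that maps the segment. */
    s->msp = create_mspace_with_base(s->base, s->size, 1 /* locked */);
}

void *sj_malloc(size_t n)
{
    return mspace_malloc(current_segment()->msp, n);
}

void sj_free(void *p)
{
    /* Only legal while a VAS with the owning segment is active. */
    mspace_free(current_segment()->msp, p);
}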

4.4.4 Discussion

The current implementations of SpaceJMP are relatively ad hoc, and there are several directions for improvement. More effective caching of page-table structures would be beneficial for performance, but imposes additional constraints on virtual addresses to be efficient. For example, mapping an 8 KiB segment on the boundary of a PML4 slot requires 7 page tables: one PML4 table and two tables each of PDPT, PDT, and PT.


Requiring certain alignment constraints for the virtual addresses of segments would avoid crossing such high-level page-table boundaries, so that page-table structures can be cached more efficiently. Hardware support for SpaceJMP is also an interesting source of future work. Modern x86-64 CPUs support tagging of TLB entries with a compact (e.g., 12-bit) address space identifier to reduce the overhead incurred by a full TLB flush on every address space switch. Our current implementations reserve the tag value zero to always trigger a TLB flush on a context switch; by default, all address spaces use tag value zero, and the user can request that a tag be assigned to an address space using vas_ctl. The trade-off can be complex: the use of many tags can decrease overall TLB coverage (particularly since in SpaceJMP many address spaces share the same translations) and result in lower overall performance instead of faster context switches. Furthermore, this trade-off is hardware-specific. Only a single TLB tag can be current in an x86 MMU, whereas other hardware architectures (such as the "domain bits" of the ARMv7-A architecture [ARM14] or the protection attributes of PA-RISC [WS92]) offer more flexible specification of translations, which would allow TLB entries to be shared across VAS boundaries. SpaceJMP also increases the burden on the programmer significantly: the fact that any given object may or may not be accessible, depending on the address space a thread currently resides in, makes programs harder to write. The compiler can help with static analysis to prove when dereferences are safe, or insert checks before any pointer dereference that cannot be proven safe statically. A trivial solution would be to tag pointers with the ID of the address space they belong to and insert checks on the tags before any pointer dereference, as in the sketch below. While this introduces a runtime overhead, some architectures help make it more efficient; for example, ARM's AArch64 [ARM15] supports the notion of "tagged pointers", where eight bits of every pointer can be made available to software and are not passed to the MMU. Finally, SpaceJMP can serve as a basis for further applications: efficient copy-on-write implementations on top of SpaceJMP could expose further functionality such as fast snapshotting and versioning. It is also possible to achieve more fine-grained isolation by using different address spaces for different threads within a process; for example, light-weight contexts [LVOE+16] used similar techniques to protect cryptography libraries, private keys, and other sensitive data from other threads within the same process.
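A minimal version of such pointer tagging might look like this (our sketch; current_vas_id() is a hypothetical helper): the top byte of a pointer carries the VAS ID, which is checked and stripped before each dereference. AArch64's tagged-pointer support would let software keep the tag in place instead of stripping it.

#include <assert.h>
#include <stdint.h>

/* Hypothetical: ID of the VAS the thread currently executes in. */
extern uint8_t current_vas_id(void);

#define TAG_SHIFT 56   /* top byte of a 64-bit pointer */

static inline void *tag_ptr(void *p, uint8_t vas_id)
{
    return (void *)(((uintptr_t)p & ((1UL << TAG_SHIFT) - 1))
                    | ((uintptr_t)vas_id << TAG_SHIFT));
}

static inline void *check_and_strip(void *p)
{
    uint8_t tag = (uintptr_t)p >> TAG_SHIFT;
    assert(tag == current_vas_id() && "pointer used in wrong VAS");
    return (void *)((uintptr_t)p & ((1UL << TAG_SHIFT) - 1));
}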


Name    Memory    Processors              Freq.
M1      92 GiB    2x12c Xeon X5650        2.66 GHz
M2      256 GiB   2x10c Xeon E5-2670v2    2.50 GHz
M3      512 GiB   2x18c Xeon E5-2699v3    2.30 GHz

Table 4.1: Large-memory platforms used in our study.

Operation      DragonFly BSD    Barrelfish
CR3 load       130 / 224        130 / 224
system call    357 / –          130 / –
vas_switch     1127 / 807       664 / 462

Table 4.2: Breakdown of context-switch costs on M2, in cycles. The second number in each pair is with TLB tags enabled.

4.5 Evaluation

In this section, we evaluate potential benefits afforded through SpaceJMP using three applications across domains: (i) a single-threaded benchmark derived from the HPCC GUPS [PBV+06] code to demonstrate the advantages of fast switching and scalable

access to many address spaces; (ii) RedisJMP, an enhanced implementation of the data-center object-caching middleware Redis, designed to leverage the multi-address-space programming model and lockable segments to ensure consistency among clients; and (iii) a genomics tool, to demonstrate ease of switching with complex, pointer-rich data structures. Three hardware platforms, code-named M1, M2, and M3 (Table 4.1), support our measurements. All are dual-socket server systems, varying in total memory capacity, core count, and CPU micro-architecture; symmetric multi-threading and dynamic frequency scaling are disabled. Unless otherwise mentioned, the DragonFly BSD SpaceJMP implementation is used.

4.5.1 Microbenchmarks

This section explores the trade-offs of our approach to address space manipulation and evaluates its performance characteristics as an RPC mechanism (Figure 4.6). We begin with an evaluation of VAS modification overheads, followed by a breakdown of context-switching costs with and without the TLB-tagging optimization.


[Plot: page-touch latency in cycles vs. number of 4 KiB pages (0 to 2000), for "Switch (Tag Off)", "Switch (Tag On)", and "No context switch".]

Figure 4.5: Impact of TLB tagging (M3) on a random-access workload. Tagging retains translations, and can lower costs of VAS switching.

Recall from Figure 4.1 that page-table modification does not scale, even in optimized, mature OS implementations. The reason is that entire page-table subtrees must be created, at a cost directly proportional to the region size and inversely proportional to the page size. When restricted to a single per-process address space, changing the translations for a range of virtual addresses using mmap and munmap incurs these costs. Copy-on-write optimizations can ultimately only reduce these costs for large, sparsely accessed regions, and random-access workloads that stress large areas incur higher page-fault overheads with this technique. With SpaceJMP, these costs are removed from the critical path by switching translations instead of modifying them; we demonstrate this performance impact on the GUPS workload in Section 4.5.2. Given the ability to switch into other address spaces at arbitrary points, a process in SpaceJMP will context-switch at more frequent intervals than is typical, for example due to task rescheduling, potentially thousands of times per second. Table 4.2 breaks down the immediate costs imposed by an address space switch in our DragonFly BSD implementation; the system call imposes the largest cost. Subsequent costs are incurred as TLB misses trigger the page-walking hardware to fetch new translations.


[Plot: latency in cycles (log scale, 10^2 to 10^6) vs. transfer size in bytes (4 to 256k), for SpaceJMP, URPC L, and URPC X.]

Figure 4.6: Comparison of URPC and SpaceJMP on Barrelfish (M2) as an alternative solution for fast local RPC communication.

While the immediate costs can only be improved within the micro-architecture, the subsequent costs can be improved with TLB tagging. Notice in Table 4.2 that changing the CR3 register becomes more expensive with tagging enabled, as it invokes additional hardware circuitry that must consider extra TLB-resident state upon a write. Naturally, these are hardware-dependent costs. The overall cost of switching nevertheless improves, due to reduced TLB misses on shared OS entries. We measured the impact of tagging in the TLB directly, using a random page-walking benchmark we wrote: for a given set of pages, it loads one cache line from a randomly chosen page; a write to CR3 is then introduced between iterations, and the cost in cycles to access the cache line is measured, first without and then with tags enabled (Figure 4.5). With tags, the cost of accessing a cache line drops towards the cost incurred without CR3 writes; the benefit tails off, however, as the working set grows. As expected, the benefits gained using tags are limited by a combination of TLB capacity (for a given page size) and the sophistication of the TLB prefetchers: in our experiment, access latencies for larger working sets match the latencies seen when the TLB is flushed.


Finally, we compare the latency of switching address spaces in SpaceJMP with that of issuing a remote procedure call to another core. In this benchmark, an RPC client issues a request to a server process on a different core and waits for the acknowledgment. The exchange consists of two messages, each containing either a 64-bit key or a variable-sized payload. We compare this with the same semantics in SpaceJMP: switching into the server's VAS and accessing the data directly, copying it into the process-local address space. Figure 4.6 shows the comparison between SpaceJMP and different RPC backends. Our point of comparison is the (highly optimized) Barrelfish low-latency RPC, stripped of stub code to expose only the raw low-level mechanism. In the low-latency case, both client and server busy-wait, polling different circular buffers of cache-line-sized messages in a manner similar to FastForward [GMV08]. This is the best-case scenario for Barrelfish RPCs; a real-world case would add overhead for marshalling, polling multiple channels, etc. We see a slight difference between intra-socket (URPC L) and inter-socket (URPC X) performance. In all cases, the latency grows once the payload exceeds the buffer size. SpaceJMP is only out-performed by intra-socket URPC for small messages (due to system call and context switch overheads). Between sockets, the interconnect overhead for RPC dominates the cost of switching the VAS; in this case, using TLB tags further reduces latency.

4.5.2 GUPS: Addressing Large Memory

To deal with the limits of virtual memory addressability, applications adopt various solutions for accessing larger physical memories. In this section, we use the GUPS benchmark [PBV+06] to compare two approaches in use today with a design using SpaceJMP. We ask two key questions: (i) how can applications address large physical memory regions, and (ii) what are the limitations of these approaches? GUPS is appropriate for this: it measures the ability of a system to scale when applying random updates to a large in-memory array. The array is one large logical table of integers, partitioned into some number of windows. GUPS executes a tight loop that, for some number of updates per iteration, computes a random index within a given window for each update and then mutates the value at that index. After each set of updates is applied, a new window is chosen at random. Figure 4.7 illustrates the performance comparison between the three approaches, with performance reported as the rate of updates applied, in millions per second.


[Plot: MUPS per process vs. number of address spaces (1 GiB windows, 1 to 128), for SpaceJMP, MP, and MAP, each with update set sizes 16 and 64.]

Figure 4.7: Comparison of three designs to program large memories with GUPS (M3). Update set sizes 16 and 64.

The first approach (MAP) leverages address space remapping: a traditional process may logically extend the reach of its VAS by dynamically remapping portions of it to different regions of physical memory. Our implementation of this design uses the BSD mmap and munmap system calls (configured to attach to existing pages in the kernel's page cache) to open new windows for writing. The second traditional approach (MP) uses multiple processes: each process is assigned a distinct portion of the physical memory (a window). In our experiment, one process acts as master and the rest as slaves; the master process sends RPC messages using OpenMPI to the slave process holding the appropriate portion of physical memory, then blocks, waiting for the slave to apply the batch of updates before continuing, thereby simulating a single thread applying updates to a large global table. Each process is pinned to a core. We compare these techniques with VAS switching, modifying GUPS to take advantage of SpaceJMP. Unlike the first technique, we do not modify mappings when changing windows, but instead represent each window as a segment in its own VAS; pending updates are maintained in a shared heap segment mapped into all attached address spaces. Figure 4.8 illustrates the rate of VAS switching and TLB misses.
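Before turning to the results, the SpaceJMP variant of the update loop can be summarized in a short sketch (ours; window_vh, window_table, and window_entries are assumed to be set up during initialization, with the table segment at the same fixed address in every window's VAS).

#include <stdint.h>
#include <stdlib.h>
#include "spacejmp.h"                 /* assumed SpaceJMP API header */

#define NWINDOWS 64
#define UPDATES_PER_WINDOW 1024

extern vhandle_t window_vh[NWINDOWS]; /* one attached VAS per window */
extern uint64_t *window_table;        /* same fixed address in every VAS */
extern size_t    window_entries;

/* One GUPS iteration: pick a window, enter its VAS, apply updates. */
void gups_iteration(void)
{
    int w = rand() % NWINDOWS;
    vas_switch(window_vh[w]);         /* replaces remapping on the
                                       * critical path */
    for (int i = 0; i < UPDATES_PER_WINDOW; i++) {
        size_t idx = (size_t)rand() % window_entries;
        window_table[idx] ^= (uint64_t)rand();
    }
}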


[Plot: rate in thousands per second (log scale) vs. number of address spaces (1 GiB windows, 1 to 128), for VAS switches and TLB misses, each with update set sizes 16 and 64.]

Figure 4.8: Rate of VAS switching and TLB misses for GUPS executed with SpaceJMP, averaged across 16 iterations. TLB tagging is disabled. A larger window size would produce a greater TLB miss rate using one window.

For a single window (no remapping, RPC, or switching) all three designs perform equally well. With more windows, changing windows immediately becomes prohibitively expensive for the MAP design, as it requires modifying the address space on the critical path. For the MP and SpaceJMP designs, the amount of CPU cache used grows with each added window, due to cache-line fills from updates as well as the growth in the page-translation structures pulled in by the page walker. The SpaceJMP implementation performs at least as well as the multi-process implementation, despite frequent context switches and despite all data and multiple translation tables competing for the same set of CPU caches (with MP, only one translation table resides on each core). Above 36 cores on M3, the performance of MP drops due to the busy-wait characteristics of the OpenMPI implementation. The same trends are visible across a range of update set sizes (16 and 64 in the figure). Finally, a design leveraging SpaceJMP is more flexible, as a single process can independently address multiple address spaces without message passing. This experiment shows that SpaceJMP occupies a useful point in the design space


between multiple-process and page-remapping techniques – there is tangible benefit from switching between multiple address spaces rapidly on a single thread.

4.5.3 Redis with Multiple Address Spaces

In this section, we investigate the trade-off in adapting an existing application to use SpaceJMP and the potential benefits over traditional programming models. We use Redis (v3.0.2) [Wikd], a popular in-memory key-value store. Clients interact with Redis using UNIX domain or TCP/IP sockets by sending commands, such as SET and GET, to store and retrieve data. Our experiments use local clients and UNIX domain sockets for performance.

We compare a basic single-threaded Redis instance with RedisJMP, which exploits SpaceJMP by eliding the use of socket-based communication. RedisJMP avoids a server process entirely, retaining only the server data; clients access the server data by switching into its address space. RedisJMP is therefore implemented as a client-side library, and the server data is initialized lazily by its first client. This client creates a new lockable segment for the state, maps it into a newly created address space, switches into this address space, and runs the initialization code to set up the Redis data structures. We replaced the Redis memory allocator with one based on SpaceJMP and moved all Redis global variables to a statically allocated region inside the lockable segment. Each client creates either one or two address spaces with the lockable segment mapped read-only or read-write, and invokes commands by switching into the newly created address space, executing server code directly. Our RedisJMP implementation currently supports only basic operations on simple data types. Some Redis features, such as publish-subscribe, would be more challenging to support, but could be implemented in a dedicated notification service.

RedisJMP uses lockable segments to provide parallel read access to the Redis state, but two further modifications to the Redis code were needed. First, Redis creates heap objects even for GET (read-only) requests when parsing commands, which would require read-write access to the lockable segment for all commands; we therefore attach a small, per-client scratch heap to each client's server address space. Second, Redis grows and shrinks its hash tables asynchronously with respect to queries; we modified it to resize and rehash entries only when a client holds an exclusive lock on the address space.
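Schematically, a client-side read then proceeds as in the following sketch (ours; everything except the SpaceJMP calls is a hypothetical name): switch into the read-only server VAS, run the lookup against the server's data structures directly, copy the value into a process-local buffer, and switch back.

#include <stddef.h>
#include <string.h>
#include "spacejmp.h"   /* assumed SpaceJMP API header */

extern vhandle_t server_ro_vh; /* server VAS, segment mapped read-only */
extern vhandle_t home_vh;      /* the client's original VAS */

/* Hypothetical: the Redis dictionary lookup, run as server code. */
extern const char *redis_dict_get(const char *key, size_t *len);

/* 'out' must lie in a segment that is also mapped in the server VAS
 * (e.g., the client's stack, which the runtime library attaches). */
size_t redisjmp_get(const char *key, char *out, size_t outcap)
{
    vas_switch(server_ro_vh);  /* segment lock taken in shared mode */
    size_t len = 0;
    const char *v = redis_dict_get(key, &len);
    if (len > outcap)
        len = outcap;
    if (v != NULL)
        memcpy(out, v, len);
    vas_switch(home_vh);
    return v ? len : 0;
}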


[Plots: requests per second vs. number of clients for (a) GET requests (RedisJMP, RedisJMP with TLB tags, Redis, and Redis 6x) and (b) SET requests (RedisJMP, Redis); (c) maximum throughput for a varying mix of GET and SET requests (0-100% SETs) for RedisJMP and Redis.]

Figure 4.9: Performance comparison of Redis vs. a version of Redis using SpaceJMP.


We used the redis-benchmark tool from the Redis distribution to compare the throughput of RedisJMP with that of the original on a single machine (M1). The benchmark simulates multiple clients by opening multiple file descriptors and sending commands, asynchronously polling and waiting for a response before sending the next command. For SpaceJMP, we modified redis-benchmark to use our API, with individual processes as clients that all attach the same server segment.

Figure 4.9a and Figure 4.9b show the performance of the GET and SET commands (4-byte payload) for a regular Redis server and RedisJMP. With a single client (one thread), SpaceJMP outperforms a single server instance of Redis by a factor of 4x for GET and SET requests, by reducing communication overhead. The maximum read throughput of a single-threaded Redis server is naturally limited by the clock speed of a single core, whereas RedisJMP allows multiple readers to access the address space concurrently. We therefore also compare the throughput of a single RedisJMP instance with six independent Redis instances (Redis 6x), paired with six instances of redis-benchmark, running on the twelve-core machine. Even in this case, at full utilization RedisJMP still serves 36% more requests than the six regular Redis instances. We also compare SpaceJMP with and without TLB tagging and notice a slight performance improvement using tags, until synchronization overhead limits scalability. For TLB misses, we measured a rate of 8.9M misses per second with a single client and 3.6M per core per second at full utilization (using 12 clients); with TLB tagging, the miss rate was lower: 2.8M misses per second for a single client and 0.9M per core per second (using 12 clients). The total number of address space switches per second equals twice the request rate, for any number of clients.

For SET requests, we sustain a high request rate until too many clients contend on the segment lock. This is a fundamental SpaceJMP limit, but we anticipate that a more scalable lock design than our current implementation would yield further improvements. Figure 4.9c shows maximum system throughput as the percentage of SET requests increases. The write lock has a large impact on throughput even when only 10% of the requests are SETs, but RedisJMP still outperforms traditional file-based communication.

In summary, small changes suffice to adapt an existing application to SpaceJMP, and they can make a single-threaded implementation both scale better and sustain higher single-thread performance by reducing IPC overhead.


[Bar chart: normalized time for Flagstat, Qname Sort, Coordinate Sort, and Index, comparing BAM, SAM, and SpaceJMP.]

Figure 4.10: SAMTools vs. an implementation with SpaceJMP. BAM and SAM are alternative in-memory serialization methods; SpaceJMP has no serialization.

4.5.4 SAMTools: In-Memory Data Structures

Finally, we show the benefit of using SpaceJMP as a mechanism to keep data structures in memory, avoiding both regular file I/O and memory-mapped files. SAMTools [LHW+09] is a toolset for processing DNA sequence alignment information. It operates on multiple file formats that encode aligned sequences and performs various operations such as sorting, indexing, filtering, and collecting statistics and pileup data. Each of these operations parses file data, performs a computation, and may write output to another file. Much of the CPU time is spent converting between serialized and in-memory representations of data. We implemented a version of SAMTools that uses SpaceJMP to keep data in its in-memory representation, avoiding the frequent data conversions (such as the serialization of large data structures) that make up the majority of its execution time. Instead of storing the alignment information in a file according to a schema, we retain the data in a virtual address space and persist it between process executions. Each process operating on the data switches into the address space, performs its operation on the data structure, and keeps its results in the address space for the next process to use. Figure 4.10 compares the performance of the SpaceJMP version to the original SAMTools operations on Sequence Alignment/Map (SAM) and BGZF-compressed (BAM) files of sizes 3.1 GiB and 0.9 GiB, respectively. It is evident from the graph that keeping data in memory with SpaceJMP results in a significant speedup. The SAM and BAM files are stored on an in-memory file system, so the impact of disk access in the original tool is completely factored out; the performance improvement comes mainly from avoiding data conversion between pointer-rich and serialized formats.


[Bar chart: normalized time for Flagstat, Qname Sort, Coordinate Sort, and Index, comparing MMAP and SpaceJMP; absolute runtimes in seconds are printed above each bar.]

Figure 4.11: Use of mmap vs. SpaceJMP in SAMTools. Absolute runtime in seconds shown above each bar.

We also implemented a version of SAMTools that uses memory-mapped files to keep data structures in memory. Processes access data by calling mmap on a file, and region-based programming is used to build the data structures within the file. To maximize fairness, we make mmap as lightweight as possible: we use an in-memory file system for the files and pass flags to mmap to exclude the regions from the core file and to inform the pager not to gratuitously sync dirty pages back to disk. Moreover, we stop the timers before process exit, to exclude the implicit cost of unmapping the data. Figure 4.11 compares the performance of SpaceJMP to this counterpart using memory-mapped files. The two have comparable performance; the flexibility provided by SpaceJMP over memory-mapped files (i.e., not using special pointers or managing address conflicts) therefore comes at no cost. Moreover, with caching of translations enabled, we expect SpaceJMP's performance to improve further. It is noteworthy that flagstat shows a more significant improvement from SpaceJMP than the other operations in Figure 4.11: flagstat runs much more quickly than the others, so the time spent performing a VAS switch or mmap accounts for a larger fraction of the total time.

4.6 Concluding remarks

SpaceJMP exposes multiple address spaces and lockable segments to processes. The implementations of SpaceJMP in Barrelfish and BSD show promising benefits in micro-benchmarks and in real-world applications that program massive memories and pointer-rich data structures. SpaceJMP is an extension of existing OS mechanisms that


overcomes the limitation of virtual address bits and gives processes and applications more flexibility over memory management. As an OS mechanism, SpaceJMP can in the future also help to handle the growing complexity of memory. Future memory systems are likely to include a combination of several heterogeneous hardware modules with quite different characteristics: a volatile tier for performance, a persistent tier for capacity, combined with different levels of memory-side caching, as well as private memory not accessible by everyone, and so on. Applications today use cumbersome techniques involving explicit copying and DMA to operate in these environments. SpaceJMP is a basis for operating dynamically in a complex heterogeneous memory system and provides application support for efficient programming.


5 Conclusion

Over the last decade, commodity computer systems have diverged significantly from single-core architectures: first came the introduction of multicore, followed by the addition of more specialized, manycore hardware in the form of GPUs and other accelerators. Single machines were scaled up to encompass an entire rack. New storage technologies increased I/O performance and capacities significantly, and combinations of large storage layers with fast, optical interconnects were proposed to provide low-latency data access in such systems. All these efforts are driven by the demand to process growing data volumes faster and with less energy. The challenge for systems research is to find suitable OS designs for such hardware, along with adequate mechanisms that the OS exposes to applications. To tackle this challenge, in Chapter 2 we presented an OS design that treats all cores as fully dynamic. More concretely, we showed how to decouple all OS state from the underlying core and kernel itself. We achieve this by leveraging the partitioned capability system of Barrelfish/DC and a kernel control block that captures all the state of a core inside an OSnode. With that, the OS can quickly add or remove cores in a system, move per-core state arbitrarily between cores, or multiplex two OSnodes on a single core. Performance measurements of real applications and device drivers show that the approach is practical enough to be used for many purposes, such as online kernel upgrades, core hot-plugging, or multiplexing two cores on one. In Chapter 3, we revisited an old problem: the standard, inflexible models and policies that the OS provides to applications. With Badis, we showed how one can

customize the entire OS stack, including the kernel, library OS, and runtimes. The ability to customize makes it possible to maximize performance by tailoring parts of the system to specific application classes. We evaluated Badis with two use-cases: a kernel that successfully eliminates OS noise and provides hard real-time guarantees to applications, and a task-based runtime, integrated with the operating system, that coordinates multiple parallel runtimes within a machine. In Chapter 4, we demonstrated a system that addresses the problem of accessing memory larger than the size of a single address space, as well as data sharing and communication with low overheads. The presented system, SpaceJMP, is a set of OS extensions to manage memory and introduces two novel features: first, it gives applications the ability to create, modify, and switch address spaces directly; second, it implements lockable segments to coordinate access to address spaces shared across different processes. We showed that the approach is feasible and often performs better than known solutions for sharing pointer-rich data structures in memory, or for communication between client and server processes. In a broader context, this dissertation investigates several relevant aspects of a radically different vision of how operating systems should exploit hardware and how to present it to applications. After more than 50 years of miniaturization, we expect transistors to stop decreasing in size by 2020 due to physical limitations [VDC17]. The implications are that hardware can no longer gain more functionality by merely adding transistors to a chip, nor rely on improvements in power consumption or performance through shrinking transistors. In the past few years, we saw a clear trend to incorporate specialized hardware for important, compute-intensive applications to speed up processing. A clear indication of this is the rising popularity of accelerator cards such as general-purpose graphics processing units (GPGPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). These examples allow one to speculate about possible design aspects of hardware in the post-Moore era and how this dissertation is relevant in such a world. If we envision a possible future machine, having reached its maximum potential in the size and number of transistors, a logical conclusion would be to shift the development focus towards maximum adaptability and reconfigurability of its logic circuits, specializing the system without the use of FPGA boards or other technologies. Such a hypothetical machine could then go far beyond the capabilities of an FPGA by dynamically reconfiguring every part of the hardware: this would give the possibility to restructure

108 5.1. Specialized hardware

arbitrary components at runtime, for example by changing memory to trade off storage for compute space; by modifying the interconnect network, adding, removing, or changing links to increase capacities or reduce latencies; by resizing, adding, or removing caches of any kind; and eventually by instantiating and destroying customized compute cores at arbitrary locations, not just next to caches: on the interconnect, at the memory controller, and so on. It is easy to imagine that the system software for such hardware needs to be highly adaptive and dynamic to manage such a machine.

Barrelfish/DC shows that an OS that treats all cores and devices as fully dynamic is realistically achievable, with very little overhead for applications. Badis, together with Barrelfish/DC, further extends the OS, giving it the ability to launch various customized OS services on specialized cores and devices by defining a way to manage and interact with them. SpaceJMP offers applications the means to dynamically and quickly adapt and change their logical view of physical memory and storage by exchanging their virtual page tables.

In the remainder of this chapter, we discuss several areas of immediate interest in which this work can be extended and improved to build a full system software stack to control such future hardware.

5.1 Specialized hardware

Our current evaluations are based on Intel and AMD multicore machines. These multiprocessor architectures are homogeneous (i.e., all processors are identical). In the future, we expect systems to become increasingly heterogeneous, containing a combination of processors with different functionality, performance characteristics, and potentially various ISAs. The techniques presented in Barrelfish/DC (Chapter 2) also apply to heterogeneous systems: for example, Barrelfish has since been extended with a range of core boot drivers that support many different architectures, including ARM-based variants and the Intel Xeon Phi co-processor on x86-64 systems.

Today, a system typically contains one or more accelerators in the form of GPGPUs, FPGAs, or ASICs. As accelerators are maturing, the need for system software services on such devices arises [SFW14]. Their reduced functionality makes accelerators a prime target for installing custom, light-weight OS kernels. Badis provides a framework to do so and allows for seamless interaction with the control-plane OS. The advent of FPGAs makes it easier for OS designers to experiment directly with hardware by customizing it or adding functionality.
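
As a rough illustration of the boot-driver idea mentioned above, the sketch below hides per-architecture boot protocols behind a single dispatch function. The driver table, function names, and log messages are hypothetical and greatly simplified compared to the real implementation.

    /* Hypothetical boot-driver dispatch for heterogeneous cores. */
    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    struct boot_driver {
        const char *arch;                   /* e.g., "x86_64", "armv8" */
        int (*boot)(int coreid, const char *kernel_image);
    };

    static int boot_x86(int core, const char *img) {
        printf("x86_64: INIT/SIPI sequence for core %d, image %s\n", core, img);
        return 0;
    }

    static int boot_arm(int core, const char *img) {
        printf("armv8: PSCI CPU_ON for core %d, image %s\n", core, img);
        return 0;
    }

    static const struct boot_driver drivers[] = {
        { "x86_64", boot_x86 },
        { "armv8",  boot_arm },
    };

    /* Start a (possibly specialized) kernel on any core, regardless of ISA. */
    static int boot_core(const char *arch, int core, const char *img) {
        for (size_t i = 0; i < sizeof drivers / sizeof drivers[0]; i++)
            if (strcmp(drivers[i].arch, arch) == 0)
                return drivers[i].boot(core, img);
        return -1; /* no boot driver for this architecture */
    }

    int main(void) {
        boot_core("x86_64", 4, "/boot/realtime_kernel"); /* a Badis-style custom kernel */
        boot_core("armv8", 1, "/boot/cpu_driver");
        return 0;
    }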

FPGAs offer much potential for co-designing OS APIs with hardware support. For example, in Section 4.4.4 we discussed several trade-offs and ideas for better TLB tagging support in combination with SpaceJMP. Lockable segments could also benefit from transactional hardware support to make the proposed mechanism safer and more resilient. Intel's Transactional Synchronization Extensions (TSX) are an example of hardware that falls short for lockable segments in SpaceJMP: although transactional memory would be incredibly useful for a system like SpaceJMP, to avoid extensive locking overheads and to guarantee consistency to clients as part of the OS kernel, TSX always aborts in case of a ring transition (i.e., every time an address-space switch occurs). Therefore, the transaction cannot be initiated by the OS.
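
The limitation is easy to reproduce from user space. In the sketch below, an ordinary system call stands in for the ring transition that an address-space switch would cause; on TSX-capable hardware (compile with gcc -mrtm) the transaction always takes the abort path.

    /* Demonstrates why TSX cannot protect an address-space switch: any
     * ring transition aborts the hardware transaction. */
    #define _GNU_SOURCE
    #include <immintrin.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        long slot = 0;
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            syscall(SYS_getpid); /* ring transition: the transaction aborts here */
            slot = 42;
            _xend();
            printf("committed, slot=%ld\n", slot); /* never reached on TSX hardware */
        } else {
            /* The path lockable segments would always end up on: the OS has
             * to fall back to conventional locking. */
            printf("aborted (status 0x%x), slot=%ld\n", status, slot);
        }
        return 0;
    }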

5.2 Rack-scale systems

Rack-scale systems are systems with hundreds of compute nodes connected over optical interconnects or low-latency networks. Such systems promise the best of both worlds of scale-up and scale-out architectures: a vast number of cores paired with low data-access and communication latencies. From an OS perspective, these architectures introduce a variety of exciting design decisions and challenges. We already discussed the problem of spatial and temporal scheduling for applications on big machines as part of the Basslet runtime system in Section 3.5.

Another challenge is dealing with core or node failures in rack-scale systems. While Barrelfish/DC from Chapter 2 and the concept of dynamic cores do not directly address reliability issues or deal with core failures, the idea of an encapsulated state within a KCB and a per-core kernel leads to small, partitioned fault domains, which makes it easier to reason about, isolate, and handle failures. The available mechanisms provide a good starting point for tackling this problem within a multikernel.

Finally, rack-scale architectures that share a global load-store domain (e.g., HP's The Machine) have the potential to eschew large serialization overheads by relying on better message-passing infrastructures as part of the OS. SpaceJMP is one approach to communicate without serialization overheads in scenarios where clients and data providers have direct access to the data storage.
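
As a sketch of how such serialization-free communication could look, consider the fragment below. The vas_* calls are hypothetical placeholders for the SpaceJMP primitives of Chapter 4 (declared here but not implemented), and the fixed root address is an assumed convention between client and server.

    /* Sketch: a client walks the server's pointer-rich data structure in
     * place by switching into a shared virtual address space, so nothing
     * is serialized or copied. */
    #include <stddef.h>

    typedef int vas_t;
    struct node { int key; struct node *next; };

    extern vas_t vas_find(const char *name); /* look up a named, shared VAS */
    extern int   vas_attach(vas_t vas);      /* gain the right to enter it  */
    extern int   vas_switch(vas_t vas);      /* swap in its page tables     */

    /* Root of the server's list, at an address both sides agreed upon. */
    #define SHARED_ROOT ((struct node *)0x600000000000ULL)

    int client_sum(void) {
        vas_t shared = vas_find("server-heap");
        vas_attach(shared);
        vas_switch(shared); /* pointers in the shared heap remain valid */
        int sum = 0;
        for (struct node *n = SHARED_ROOT; n != NULL; n = n->next)
            sum += n->key;  /* read in place: nothing is marshalled */
        return sum;
    }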

5.3 Near-data processing

Near-data processing (NDP) loosely refers to the practice of placing processing capabilities as close as possible to the data. Researchers and companies are currently
investigating several architectural options to achieve this: some make use of specialized CPUs, FPGAs, or custom accelerators placed directly on the DRAM controller or the interconnect network, which reduces the distance between a regular CPU socket and data storage considerably. Others modify the electronic circuitry of the memory cells to provide simple operations like copy, addition, or subtraction on entire DRAM row buffers. The added hardware gives the system the means to offload modifications to memory at scale, with high energy efficiency and extremely low latency.

There are several challenges when designing system software for this use-case, and the topics presented in this dissertation are applicable to many of them. First, the use of non-standard or heterogeneous processing units at remote locations in the system raises the issue of how near-data compute hardware is launched, initialized, and controlled. The concept of boot drivers, introduced in Chapter 2, is applicable here, as it allows the OS to quickly boot a custom processing device and to replace its software or control the hardware during runtime.

Second, the system software that runs on such hardware, and the interface that it exposes, must be defined carefully. The use-case likely demands customized execution and scheduling policies, as well as APIs to send processing work to and interact with near-data units; such policies and APIs are typically neither found nor useful in general-purpose operating systems. Therefore, NDP calls for customized system software on a select set of hardware. Badis, as presented in Chapter 3, explores a similar use-case by specializing regular CPU sockets for data processing.

Finally, the ability to directly offload computation to compute units near memory raises the issue of preserving the protection provided by a conventional process and address space. The Barrelfish capability model allows the entire memory-protection state of a process to be conveniently represented as a set of capabilities. Together with a system like SpaceJMP (Chapter 4), this enables a compelling method for communicating and mirroring page-table state on NDP compute entities, all directly in user space. Furthermore, SpaceJMP can be extended to keep multiple views of one address space with different protection bits, and therefore to quickly change access permissions by switching between different page-table representations of the same mappings. Different protections for the same logical view can further increase security by compartmentalizing the various trusted and untrusted code that runs on NDP devices.
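
A minimal sketch of this proposed extension, assuming hypothetical vas_view_* calls that do not exist in SpaceJMP today: two page-table views map the same segments with different protection bits, so permissions change at the cost of an address-space switch rather than an mprotect() call over every mapping.

    /* Sketch: switching between two protection views of one address space. */
    typedef int vas_t;
    typedef int view_t;

    enum prot { PROT_R = 1, PROT_RW = 3 };

    /* Hypothetical calls: create a view of a VAS with the given protection
     * bits, and install a view by switching page tables. */
    extern view_t vas_view_create(vas_t vas, enum prot prot);
    extern int    vas_view_switch(view_t view);

    /* Run untrusted NDP code with read-only access, then restore writes. */
    void run_untrusted(vas_t vas, void (*ndp_kernel)(const void *), const void *arg) {
        view_t ro = vas_view_create(vas, PROT_R);
        view_t rw = vas_view_create(vas, PROT_RW);

        vas_view_switch(ro); /* same mappings, write permission stripped    */
        ndp_kernel(arg);     /* the untrusted code cannot modify the data   */
        vas_view_switch(rw); /* trusted code regains write access instantly */
    }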


List of Tables

2.1 Architectural details of different systems we use in our evaluation.
2.2 Performance of core management operations for Barrelfish/DC and Linux (3.13) when the system is idle and when the system is under load. For the Barrelfish/DC down column, the value after the slash shows the cost of stopping a core on another socket with regard to the boot driver. (a) We do not include this number for Santa-Rosa because it lacks synchronized timestamp counters, nor for Haswell because it only includes a single package.

3.1 Architectural details of different systems we use in our evaluation.
3.2 Runtime of parallel algorithms executing on Linux versus Basslet.
3.3 Redis throughput (operations/sec) running on Barrelfish versus Badis.

4.1 Large-memory platforms used in our study.
4.2 Breakdown of context switching. Measurements on M2 in cycles. Numbers in bold are with tags enabled.

List of Figures

2.1 Shows the supported operations of a decoupled OS. Update: the entire kernel, dispatching OSnode α, is replaced at runtime. Move: OSnode α, containing all per-core state including applications, is migrated to another core and kernel. Park: OSnode α is moved to a new core and kernel that temporarily dispatches two OSnodes. Unpark: OSnode α is transferred back to its previous core.
2.2 State in the Barrelfish/DC OSnode.
2.3 Breakdown of the cost of bringing up a core for various machines.
2.4 Ethernet driver behavior when restarting kernels and parking OSnodes.
2.5 Webserver behavior when restarting kernels and parking OSnodes.
2.6 PostgreSQL behavior when restarting kernels and parking OSnodes.

3.1 System throughput for executing concurrent pagerank jobs.
3.2 A selfish detour benchmark to measure OS noise. Each sample represents an outlier that was running 9x slower than expected.
3.3 The time of a context switch, including time lost due to cache pollution with increasing working set sizes.
3.4 An Intel Sandy Bridge processor with labeled opportunities for resource sharing: (1) last-level cache, (2) local DRAM controller, (3) bus interconnect, (4) L1 and L2 caches, and (5) simultaneous multithreading.
3.5 Shows the system architecture of Badis. Its key idea is that at any point in time, the cores in a multicore machine are partitioned into a control and a compute plane.

3.6 The Badis architecture: applications submit parallel tasks to the ptask queue(s). The Badis compute plane dequeues ptasks and dispatches the individual tasks to the hardware contexts owned by the compute plane.
3.7 Badis APIs as provided to applications running on the control plane for interaction with compute plane kernels.
3.8 Badis compute plane minimally required API.
3.9 Basslet runtime API.
3.10 Numbers report slow-down of SSSP, HD, and PR algorithms (for the algorithm in the row) when co-executed with a partner algorithm (column) vs. running the algorithm alone. Graphs contrast Linux + OpenMP scheduling vs. Basslet + Badis scheduling.
3.11 Expanding the problem statement experiment (recall Figure 3.1) on four different machines.
3.12 Throughput scale-out when executing multiple PRs using a default Linux + OpenMP scheduler versus Basslet.
3.13 Number of cycles measured for 10^3 iterations of a synthetic benchmark for bfrt, Barrelfish, and Linux using real-time priorities.
3.14 Measuring the overhead of an enqueue syscall and the queuing effects in Badis using HJ.

4.1 Page table construction (mmap) and removal (munmap) costs in Linux, using 4KiB pages. Does not include page zeroing costs.
4.2 Contrasting SpaceJMP and Unix.
4.3 SpaceJMP interface.
4.4 Example SpaceJMP usage.
4.5 Impact of TLB tagging (M3) on a random-access workload. Tagging retains translations, and can lower costs of VAS switching.
4.6 Comparison of URPC and SpaceJMP on Barrelfish (M2) as an alternative solution for fast local RPC communication.
4.7 Comparison of three designs to program large memories with GUPS (M3). Update set sizes 16 and 64.


4.8 Rate of VAS switching and TLB misses for GUPS executed with SpaceJMP, averaged across 16 iterations. TLB tagging is disabled. A larger window size would produce a greater TLB miss rate using one window.
4.9 Performance comparison of Redis vs. a version of Redis using SpaceJMP.
4.10 SAMTools vs. an implementation with SpaceJMP. BAM and SAM are alternative in-memory serialization methods; SpaceJMP has no serialization.
4.11 Use of mmap vs. SpaceJMP in SAMTools. Absolute runtime in seconds shown above each bar.


Bibliography

[ABLL91] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. “Scheduler Activations: Effective Kernel Support for the User-level Management of Parallelism.” In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pp. 95–109. 1991.

[ADK+07] J. Appavoo, D. Da Silva, O. Krieger, M. Auslander, M. Ostrowski, B. Rosenburg, A. Waterland, R. W. Wisniewski, J. Xenidis, M. Stumm, and L. Soares. “Experience distributing objects in an SMMP OS.” ACM Transactions on Computer Systems, vol. 25, no. 3, 2007.

[AJK+15] K. Aingaran, S. Jairath, G. Konstadinidis, S. Leung, P. Loewenstein, C. McAllister, S. Phillips, Z. Radovic, R. Sivaramakrishnan, D. Smentek, et al. “M7: Oracle’s next-generation SPARC processor.” IEEE Micro, vol. 35, no. 2, 36–45, 2015.

[AK09] J. Arnold and M. F. Kaashoek. “Ksplice: Automatic Rebootless Kernel Updates.” In Proceedings of the EuroSys Conference, pp. 187–198. 2009.

[AL91] A. W. Appel and K. Li. “Virtual Memory Primitives for User Programs.” In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 96–107. 1991.

[ALBL91] T. E. Anderson, H. M. Levy, B. N. Bershad, and E. D. Lazowska. “The Interaction of Architecture and Operating System Design.” In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 108–120. 1991.

[ARM14] ARM Ltd. ARM Architecture Reference Manual: ARMv7-A and ARMv7- R Edition, 2014. ARM DDI 0406C.c.

[ARM15] ARM Ltd. “ARMv8-A Architecture.”, 2015. http://www.arm.com/products/processors/armv8-architecture.php, accessed 2017-11-20.

[Ash] A. Raj. “CPU hotplug Support in the Linux Kernel.” https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html, accessed 2017-11-20.

[BALL90] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy. “Lightweight Remote Procedure Call.” ACM Transactions on Computer Systems, vol. 8, no. 1, 37–55, 1990.

[BALL91] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy. “User- level Interprocess Communication for Shared Memory Multiprocessors.” ACM Transactions on Computer Systems, vol. 9, no. 2, 175–198, 1991.

[Bar81] J. F. Bartlett. “A NonStop Kernel.” In Proceedings of the 8th ACM Symposium on Operating Systems Principles, pp. 22–29. 1981.

[BATO13] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. “Multi-core, Main-memory Joins: Sort vs. Hash Revisited.” Proceedings of the VLDB Endowment, vol. 7, no. 1, 85–96, 2013.

[BAW+07] A. Baumann, J. Appavoo, R. W. Wisniewski, D. D. Silva, O. Krieger, and G. Heiser. “Reboots Are for Hardware: Challenges and Solutions to Updating an Operating System on the Fly.” In Proceedings of the USENIX Annual Technical Conference, pp. 1–14. 2007.

[BBD+09] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. “The multikernel: a new OS architecture for scalable multicore systems.” In Proceedings of the 22nd ACM Symposium on Operating System Principles, pp. 29–44. 2009.

[BBLB03] S. A. Brandt, S. A. Banachowski, C. Lin, and T. Bisson. “Dynamic Integrated Scheduling of Hard Real-Time, Soft Real-Time and Non-Real-Time Processes.” In Proceedings of the 24th IEEE Real-Time Systems Symposium. 2003.


[BBM+12] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis. “Dune: Safe User-level Access to Privileged CPU Features.” In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, pp. 335–348. 2012.

[BBSG11] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas. “Bulldozer: An Approach to Multithreaded Compute Performance.” IEEE Micro, vol. 31, no. 2, 6–15, 2011.

[BCGL11] K. Bailey, L. Ceze, S. D. Gribble, and H. M. Levy. “Operating System Implications of Fast, Cheap, Non-volatile Memory.” In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, pp. 2–2. 2011.

[BDF+03] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. “Xen and the Art of Virtualization.” In Proceedings of the 19th ACM Symposium on Operating Systems Principles, SOSP ’03, pp. 164–177. 2003.

[BDSK+08] M. Butrico, D. Da Silva, O. Krieger, M. Ostrowski, B. Rosenburg, D. Tsafrir, E. Van Hensbergen, R. W. Wisniewski, and J. Xenidis. “Specialized Execution Environments.” SIGOPS Operating Systems Review, vol. 42, no. 1, 106–107, 2008.

[BGC+13] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift. “Efficient Virtual Memory for Big Memory Servers.” In Proceedings of the 40th Annual International Symposium on Computer Architecture, pp. 237–248. 2013.

[BHA+05] A. Baumann, G. Heiser, J. Appavoo, D. Da Silva, O. Krieger, R. W. Wisniewski, and J. Kerr. “Providing Dynamic Update in an Operating System.” In Proceedings of the USENIX Annual Technical Conference, pp. 279–291. 2005.

[BIYC06] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan. “Operating System Issues for Petascale Systems.” SIGOPS Operating Systems Review, pp. 29–33, 2006.


[Bla79] M. Blasgen, J. Gray, M. Mitoma, and T. Price. “The Convoy Phenomenon.” SIGOPS Operating Systems Review, vol. 13, no. 2, 20–25, 1979.

[BLF+13] A. Baumann, D. Lee, P. Fonseca, L. Glendenning, J. R. Lorch, B. Bond, R. Olinsky, and G. C. Hunt. “Composing OS Extensions Safely and Efficiently with Bascule.” In Proceedings of the EuroSys Conference, pp. 239–252. 2013.

[BPH08] R. Brightwell, K. Pedretti, and T. Hudson. “SMARTMAP: Operating System Support for Efficient Data Sharing Among Processes on a Multicore Processor.” In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 25:1–25:12. 2008.

[BSP+95] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. “Extensibility Safety and Performance in the SPIN Operating System.” In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pp. 267–283. 1995.

[BWCC+08] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. “Corey: An Operating System for Many Cores.” In Proceedings of the 8th Symposium on Operating Systems Design and Implementation, pp. 43–57. 2008.

[BWCM+10] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. “An Analysis of Linux Scalability to Many Cores.” In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 1–8. 2010.

[BYD+15] J. Brock, C. Ye, C. Ding, Y. Li, X. Wang, and Y. Luo. “Optimal Cache Partition-Sharing.” In Proceedings of the 44th International Conference on Parallel Processing, pp. 749–758. 2015.

[CBB14] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari. “Atlas: Leveraging Locks for Non-volatile Memory Consistency.” In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications, pp. 433–452. 2014.


[CBHLL92] J. Chase, M. Baker-Harvey, H. Levy, and E. Lazowska. “Opal: A Single Address Space System for 64-bit Architectures.” SIGOPS Operating Systems Review, vol. 26, no. 2, 1992.

[CCA+11] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. “NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories.” In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 105–118. 2011.

[CCC14] T.-H. Chien, C.-J. Chen, and R.-G. Chang. “An Adaptive Zero-Copy Strategy for Ubiquitous High Performance Computing.” In Proceedings of the 21st European MPI Users’ Group Meeting, pp. 139–144. 2014.

[CEH+13] J. A. Colmenares, G. Eads, S. Hofmeyr, S. Bird, M. Moretó, D. Chou, B. Gluzman, E. Roman, D. B. Bartolini, N. Mor, K. Asanović, and J. D. Kubiatowicz. “Tessellation: Refactoring the OS Around Explicit Resource Containers with Continuous Adaptation.” In Proceedings of the 50th Annual Design Automation Conference, pp. 1–10. 2013.

[CFKL99] B. Carpenter, G. Fox, S. H. Ko, and S. Lim. “Object Serialization for Marshalling Data in a Java Interface to MPI.” In Proceedings of the ACM 1999 Conference on Java Grande, pp. 66–71. 1999.

[CGHT17] G. Chatzopoulos, R. Guerraoui, T. Harris, and V. Trigonakis. “Abstracting Multi-Core Topologies with MCTOP.” In Proceedings of the 12th European Conference on Computer Systems, pp. 544–559. 2017.

[CGS+05] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. “X10: An Object-oriented Approach to Non-uniform Cluster Computing.” In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, pp. 519–538. 2005.

[CJ75] E. Cohen and D. Jefferson. “Protection in the Hydra Operating System.” In Proceedings of the 5th ACM Symposium on Operating Systems Principles, pp. 141–160. 1975.


[CJS+09] J. Charles, P. Jassi, A. N. S, A. Sadat, and A. Fedorova. “Evaluation of the Intel Core i7 Turbo Boost feature.” In Proceedings of the IEEE International Symposium on Workload Characterization. 2009.

[CKZ13] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. “RadixVM: Scalable Address Spaces for Multithreaded Applications.” In Proceedings of the 8th ACM European Conference on Computer Systems, pp. 211–224. 2013.

[CLBhL92] J. S. Chase, H. M. Levy, M. Baker-Harvey, and E. D. Lazowska. “How to Use a 64-Bit Virtual Address Space.” Tech. rep., Department of Computer Science and Engineering, University of Washington, 1992.

[CLR94] S. Chandra, J. R. Larus, and A. Rogers. “Where is Time Spent in Message-passing and Shared-memory Programs?” In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 61–73. 1994.

[CNF+09] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. “Better I/O Through Byte-addressable, Persistent Memory.” In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 133–146. 2009.

[Cor] J. Corbet. “Deadline scheduling for Linux.” https://lwn.net/Articles/356576/, accessed 2017-11-20.

[CP99] C. D. Cranor and G. M. Parulkar. “The UVM Virtual Memory System.” In Proceedings of the 1999 USENIX Annual Technical Conference. 1999.

[CR07] J. Cieslewicz and K. A. Ross. “Adaptive Aggregation on Chip Multiprocessors.” In Proceedings of the VLDB Endowment, pp. 339–350. 2007.

[CRD+95] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. “Hive: Fault Containment for Shared-memory Multiprocessors.” In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pp. 12–25. 1995.

[CSL04] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. “Dynamic Instrumentation of Production Systems.” In Proceedings of the USENIX Annual Technical Conference, pp. 15–28. 2004.


[DBR09] P.-E. Dagand, A. Baumann, and T. Roscoe. “Filet-o-Fish: practical and dependable domain-specific languages for OS development.” In Proceedings of the 5th Workshop on Programming Languages and Operating Systems. 2009.

[DdBF+94] A. Dearle, R. di Bona, J. Farrow, F. Henskens, A. Lindström, J. Rosenberg, and F. Vaughan. “Grasshopper: An Orthogonally Persistent Operating System.” Computing Systems, vol. 7, no. 3, 289–312, 1994.

[DGT13] T. David, R. Guerraoui, and V. Trigonakis. “Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pp. 33–48. 2013.

[Dik06] J. Dike. User Mode Linux. Pearson Education, 2006.

[DP93] P. Druschel and L. L. Peterson. “Fbufs: A High-bandwidth Cross-domain Transfer Facility.” In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pp. 189–202. 1993.

[Dra90] R. Draves. “A Revised IPC Interface.” In USENIX MACH Symposium, pp. 101–122. 1990.

[DS10] A. Depoutovitch and M. Stumm. “Otherworld: Giving Applications a Chance to Survive OS Kernel Crashes.” In Proceedings of the EuroSys Conference, pp. 181–194. 2010.

[EBSA+11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. “Dark Silicon and the End of Multicore Scaling.” In Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 365–376. 2011.

[Eco] A. Economopoulos. “A peek at the DragonFly Virtual Kernel (part 1).” https://lwn.net/Articles/228404/, accessed 2017-11-20.

[EDE08] D. Elkaduwe, P. Derrin, and K. Elphinstone. “Kernel design for isolation and assurance of physical memory.” In Proceedings of the 1st Workshop on Isolation and Integration in Embedded Systems, pp. 35–40. 2008.


[EDS+15] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, et al. “A Demonstration of the BigDAWG Polystore System.” Proceedings of the VLDB Endowment, vol. 8, no. 12, 2015.

[EHMZ+16] I. El Hajj, A. Merritt, G. Zellweger, D. Milojicic, R. Achermann, P. Faraboschi, W.-m. Hwu, T. Roscoe, and K. Schwan. “SpaceJMP: Programming with Multiple Virtual Address Spaces.” In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 353–368. 2016.

[EHMZM17a] I. El Hajj, A. Merritt, G. Zellweger, and D. Milojicic. “Persistent virtual address spaces.”, 2017. WO Patent App. PCT/US2016/015,661.

[EHMZM17b] I. El Hajj, A. Merritt, G. Zellweger, and D. Milojicic. “Switch process virtual address space.”, 2017. WO Patent App. PCT/US2015/049,726.

[EHMZM17c] I. El Hajj, A. Merritt, G. Zellweger, and D. Milojicic. “Versioning virtual address spaces.”, 2017. WO Patent App. PCT/US2016/015,814.

[EKO95] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. “Exokernel: an operating system architecture for application-level resource management.” In Proceedings of the 15th ACM Symposium on Operating Systems Principles, SOSP ’95, pp. 251–266. 1995.

[ESG+94] Y. Endo, M. Seltzer, J. Gwertzman, C. Small, K. A. Smith, and D. Tang. “VINO: The 1994 Fall Harvest.” Technical Report TR-34-94, Center for Research in Computing Technology, Harvard University, 1994.

[FCP+12] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. “SAP HANA Database: Data Management for Modern Business Applications.” SIGMOD Record, vol. 40, no. 4, 45–51, 2012.

[FFPF05] E. Frachtenberg, D. G. Feitelson, F. Petrini, and J. Fernandez. “Adaptive Parallel Job Scheduling with Flexible Coscheduling.” IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 11, 1066–1077, 2005.

[FKMM15] P. Faraboschi, K. Keeton, T. Marsland, and D. Milojicic. “Beyond Processor-centric Operating Systems.” In 15th Workshop on Hot Topics in Operating Systems. 2015.


[FLR98] M. Frigo, C. E. Leiserson, and K. H. Randall. “The Implementation of the Cilk-5 multithreaded language.” In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 1998.

[Fre] Free Software Foundation Inc. “libgomp: proc.c gomp_dynamic_max_threads().” https://github.com/gcc-mirror/gcc/blob/edd716b6b1caa1a5cb320a8cd7f626f30198e098/libgomp/config/posix/proc.c#L55, accessed 2017-11-20.

[GAK12] G. Giannikis, G. Alonso, and D. Kossmann. “SharedDB: killing one thousand queries with one stone.” Proceedings of the VLDB Endowment, vol. 5, no. 6, 526–537, 2012.

[GARH14] J. Giceva, G. Alonso, T. Roscoe, and T. Harris. “Deployment of Query Plans on Multicores.” Proceedings of the VLDB Endowment, vol. 8, no. 3, 233–244, 2014.

[GGIW10] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. “Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene’s CNK.” In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. 2010.

[GKT13] C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum. “Safe and Automatic Live Update for Operating Systems.” In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 279–292. 2013.

[GMG12] T. Gleixner, P. E. McKenney, and V. Guittot. “Cleaning Up Linux’s CPU Hotplug for Real Time and Energy Management.” SIGBED Review, vol. 9, no. 4, 49–52, 2012.

[GMJ+02] D. Grossman, G. Morrisett, T. Jim, M. Hicks, Y. Wang, and J. Cheney. “Region-based Memory Management in Cyclone.” In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 282–293. 2002.

[GMV08] J. Giacomoni, T. Moseley, and M. Vachharajani. “FastForward for Efficient Pipeline Parallelism: A Cache-optimized Concurrent Lock-free Queue.” In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 43–52. 2008.

[GSS+13] J. Giceva, T.-I. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe. “COD: Database/Operating System Co-Design.” In Proceedings of the Conference on Innovative Data Systems Research. 2013.

[GZAR16] J. Giceva, G. Zellweger, G. Alonso, and T. Roscoe. “Customized OS Support for Data-processing.” In Proceedings of the 12th International Workshop on Data Management on New Hardware, pp. 2:1–2:6. 2016.

[Han99] S. M. Hand. “Self-paging in the Nemesis Operating System.” In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, pp. 73–86. 1999.

[Har85] N. Hardy. “KeyKOS Architecture.” SIGOPS Operating Systems Review, vol. 19, no. 4, 8–25, 1985.

[HCSO12] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. “Green-Marl: A DSL for Easy and Efficient Graph Analysis.” In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 349–362. 2012.

[Heo] T. Heo. “Control Group v2.” https://www.kernel.org/doc/Documentation/cgroup-v2.txt, accessed 2017-11-20.

[HJR+03] M. Hericko, M. B. Juric, I. Rozman, S. Beloglavec, and A. Zivkovic. “Object Serialization Analysis and Comparison in Java and .NET.” ACM SIGPLAN Notices, vol. 38, no. 8, 44–54, 2003.

[HMLR07] T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. “Netgauge: A Network Performance Measurement Framework.” In Proceedings of High Performance Computing and Communications, vol. 4782, pp. 659–671. 2007.

[HMM14] T. Harris, M. Maas, and V. J. Marathe. “Callisto: Co-scheduling Parallel Runtime Systems.” In Proceedings of the 9th European Conference on Computer Systems, pp. 1–14. 2014.


[HSL10] T. Hoefler, T. Schneider, and A. Lumsdaine. “Characterizing the Influence of System Noise on Large-Scale Applications by Simulation.” In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11. 2010.

[IKKM07] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors.” In Proceedings of the 34th Annual International Symposium on Computer Architecture, pp. 186–197. 2007.

[Int] Intel Corporation. Intel Itanium Architecture Software Developer’s Manual. Document Number: 245315.

[JC14] B. Jeremia and F. Claudio. “Bulk Transfer over Shared Memory.” Technical report, ETH Zurich, 2014.

[JKH+04] M. B. Juric, B. Kezmah, M. Hericko, I. Rozman, and I. Vezocnik. “Java RMI, RMI Tunneling and Web Services Comparison and Performance Analysis.” ACM SIGPLAN Notices, vol. 39, no. 5, 58–65, 2004.

[Jon] J. Corbet. “Thread-level management in control groups.” https://lwn.net/Articles/656115/, accessed 2017-11-20.

[Jos10] A. Joshi. “Twin-Linux: Running independent Linux Kernels simultaneously on separate cores of a multicore system.” In Proceedings of the Linux Symposium, pp. 101–108. 2010.

[KAH+16] S. Kaestle, R. Achermann, R. Haecki, M. Hoffmann, S. Ramos, and T. Roscoe. “Machine-Aware Atomic Broadcast Trees for Multicores.” In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 33–48. USENIX Association, 2016.

[KAO05] P. Kongetira, K. Aingaran, and K. Olukotun. “Niagara: A 32-Way Multithreaded Sparc Processor.” IEEE Micro, vol. 25, no. 2, 21–29, 2005.

[KB05] S. M. Kelly and R. Brightwell. “Software architecture of the light weight kernel, Catamount.” In Cray User Group, pp. 16–19. 2005.


[KC94] V. Karamcheti and A. A. Chien. “Software Overhead in Messaging Layers: Where Does the Time Go?” In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 51–60. 1994.

[KCDZ94] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. “TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems.” In Proceedings of the USENIX Winter 1994 Technical Conference, pp. 10–10. 1994.

[KCE92] E. J. Koldinger, J. S. Chase, and S. J. Eggers. “Architecture Support for Single Address Space Operating Systems.” In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 175–186. 1992.

[KEH+09] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood. “seL4: Formal Verification of an OS Kernel.” In Proceedings of the 22nd ACM Symposium on Operating System Principles. 2009.

[KFJ+03] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction.” In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 81–92. 2003.

[KGA+15] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. Ünsal. “Redundant Memory Mappings for Fast Access to Large Memories.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 66–78. 2015.

[KKL+09] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. “Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs.” Proceedings of the VLDB Endowment, vol. 2, no. 2, 1378–1389, 2009.


[KKR09] M. A. Kozuch, M. Kaminsky, and M. P. Ryan. “Migration Without Virtualization.” In Proceedings of the 12th Workshop on Hot Topics in Operating Systems, pp. 10–15. 2009.

[KN11] A. Kemper and T. Neumann. “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots.” In Proceedings of the IEEE International Conference on Data Engineering, pp. 195–206. 2011.

[LBKN14] V. Leis, P. Boncz, A. Kemper, and T. Neumann. “Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age.” In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 743–754. 2014.

[LCX+12] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. “A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems.” In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 367–376. 2012.

[LDC+09] R. Lee, X. Ding, F. Chen, Q. Lu, and X. Zhang. “MCC-DB: minimizing cache conflicts in multi-core processors for databases.” Proceedings of the VLDB Endowment, vol. 2, no. 1, 373–384, 2009.

[LDS07] C. Li, C. Ding, and K. Shen. “Quantifying the Cost of Context Switch.” In Proceedings of the Workshop on Experimental Computer Science. 2007.

[Lea] D. Lea. “dlmalloc: A Memory Allocator.” http://g.oswego.edu/dl/html/malloc.html, accessed 2017-11-20.

[Lev00] J. Levine. Linkers and Loaders. Morgan Kaufmann, 2000.

[LHW+09] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and the 1000 Genome Project Data Processing Subgroup. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics, vol. 25, no. 16, 2078–2079, 2009.

[Li88] K. Li. “IVY: A Shared Virtual Memory System for Parallel Computing.” Proceedings of the International Conference on Parallel Processing, vol. 88, 94, 1988.


[Lina] Linaro Ltd. “CPU Hotplug.” https://wiki.linaro.org/WorkingGroups/PowerManagement/Archives/Hotplug, accessed 2017-11-20.

[Linb] Linux Developers. “Linux 4.6-rc7 Scheduler Source code.” https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?id=refs/tags/v4.6-rc7#n39, accessed 2017-11-20.

[LK14] J. Leskovec and A. Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection.”, 2014. http://snap.stanford.edu/data, accessed 2017-11-20.

[LKB+09] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanović, and J. Kubiatowicz. “Tessellation: Space-Time Partitioning in a Manycore Client OS.” In Proceedings of the 1st USENIX Workshop on Hot Topics in Parallelism. 2009.

[LLF+16] J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Quéma, and A. Fedorova. “The Linux Scheduler: A Decade of Wasted Cores.” In Proceedings of the 11th European Conference on Computer Systems, pp. 1–16. 2016.

[LPM+13] Y. Li, I. Pandis, R. Müller, V. Raman, and G. M. Lohman. “NUMA-aware algorithms: the case of data shuffling.” In Proceedings of the Conference on Innovative Data Systems Research. 2013.

[LQF15] B. Lepers, V. Quéma, and A. Fedorova. “Thread and Memory Placement on NUMA Systems: Asymmetry Matters.” In Proceedings of the USENIX Annual Technical Conference, pp. 277–289. 2015.

[LRD95] A. Lindstrom, J. Rosenberg, and A. Dearle. “The Grand Unified Theory of Address Spaces.” In Proceedings of the 5th Workshop on Hot Topics in Operating Systems, pp. 66–71. 1995.

[LVOE+16] J. Litton, A. Vahldiek-Oberwagner, E. Elnikety, D. Garg, B. Bhattacharjee, and P. Druschel. “Light-weight Contexts: An OS Abstraction for Safety and Performance.” In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 49–64. 2016.


[MAK+13] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, P. Ranganathan, and N. Binkert. “Consistent, Durable, and Safe Memory Management for Byte-addressable Non Volatile Main Memory.” In Proceedings of the 1st ACM SIGOPS Conference on Timely Results in Operating Systems, pp. 1:1–1:17. 2013.

[MCV08] C. McCurdy, A. L. Cox, and J. Vetter. “Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors.” In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp. 95–104. 2008.

[MDH+02] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. “Hyper-Threading Technology Architecture and Microarchitecture.” Intel Technology Journal, 2002.

[MDP+00] D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. “Process Migration.” ACM Computing Surveys, vol. 32, no. 3, 241–299, 2000.

[Men11] D. Menzi. “Support for heterogeneous cores for Barrelfish.” Master’s thesis, Department of Computer Science, ETH Zurich, 2011.

[Mer14] D. Merkel. “Docker: Lightweight Linux Containers for Consistent Development and Deployment.” Linux Journal, 2014.

[Mica] Microsoft Corp. “4-Gigabyte Tuning: BCDEdit and Boot.ini.” https://msdn.microsoft.com/en-us/library/windows/desktop/bb613473(v=vs.85).aspx, accessed 2017-11-20.

[Micb] Microsoft Corp. “Address Windowing Extensions.” https://msdn.microsoft.com/en-us/library/windows/desktop/aa366527(v=vs.85).aspx, accessed 2017-11-20.

[Micc] Microsoft Corp. “Fibers.” https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx, accessed 2017-11-20.

[MLM+86] M. J. Mahon, R. B.-L. Lee, T. C. Miller, J. C. Huck, and W. R. Bryg. “The Hewlett-Packard Precision Architecture: The Processor.” Hewlett-Packard Journal, vol. 37, no. 8, 16–22, 1986.


[MMR+13] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, and J. Crowcroft. “Unikernels: Library Operating Systems for the Cloud.” In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 461–472. 2013.

[MPS04] S. Munroe, S. Plaetzer, and J. Stopyro. “Computer system having shared address space among multiple virtual address spaces.”, 2004. US Patent 6,681,239.

[MSLM91] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. “First-class User-level Threads.” In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, SOSP ’91, pp. 110–121. 1991.

[Nev12] M. Nevill. “An Evaluation of Capabilities for a Multikernel.” Master’s thesis, ETH Zurich, 2012.

[NHM+15] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. “Latency-Tolerant Software Distributed Shared Memory.” In Proceedings of the USENIX Annual Technical Conference, pp. 291–305. 2015.

[NK98] K. L. Noel and N. Y. Karkhanis. “OpenVMS Alpha 64-bit Very Large Memory Design.” Digital Tech. Journal, vol. 9, no. 4, 33–48, 1998.

[NSN+11] Y. Nomura, R. Senzaki, D. Nakahara, H. Ushio, T. Kataoka, and H. Taniguchi. “Mint: Booting Multiple Linux Kernels on a Multicore Processor.” In Proceedings of the International Conference on Broadband and Wireless Computing, Communication and Applications, pp. 555–560. 2011.

[OAE+11] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. “The Case for RAMCloud.” Communications of the ACM, vol. 54, no. 7, 121–130, 2011.

[Ope15] OpenMP Architecture Review Board. “OpenMP Application Program Interface Version 4.5.”, 2015.

[Ora09] Oracle Inc. “Programming Interfaces Guide.”, 2009. https://docs.oracle.com/cd/E18752_01/pdf/817-4415.pdf, accessed 2017-11-20.


[Ous82] J. Ousterhout. “Scheduling Techniques for Concurrent Systems.” IEEE Distributed Computer Systems, 1982.

[OWZS13] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. “Sparrow: Distributed, Low Latency Scheduling.” In Proceedings of the 24th ACM Symposium on Operating Systems Principles, pp. 69–84. 2013.

[PBV+06] S. Plimpton, R. Brightwell, C. Vaughan, K. Underwood, and M. Davis. “A Simple Synchronous Distributed-Memory Algorithm for the HPCC RandomAccess Benchmark.” In Proceedings of the IEEE International Conference on Cluster Computing, pp. 1–7. 2006.

[PBWH+11] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt. “Rethinking the Library OS from the Top Down.” In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 291–304. 2011.

[Pet12] S. Peter. “Resource Management in a Multicore Operating System.” Ph.D. thesis, ETH Zurich, 2012.

[PHA09] H. Pan, B. Hindman, and K. Asanović. “Lithe: Enabling Efficient Composition of Parallel Libraries.” In Proceedings of the 1st USENIX Workshop on Hot Topics in Parallelism. 2009.

[PLTA14] D. Porobic, E. Liarou, P. Tözün, and A. Ailamaki. “ATraPos: Adaptive transaction processing on hardware Islands.” In Proceedings of the IEEE International Conference on Data Engineering, pp. 688–699. 2014.

[PLZ+14] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe. “Arrakis: The Operating System is the Control Plane.” In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pp. 1–16. 2014.

[PPB+12] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki. “OLTP on Hardware Islands.” Proceedings of the VLDB Endowment, vol. 5, no. 11, 1447–1458, 2012.

[PS12] S. Panneerselvam and M. M. Swift. “Chameleon: Operating System Support for Dynamic Processors.” In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 99–110. 2012.


[PSK15] S. Panneerselvam, M. Swift, and N. S. Kim. “Bolt: Faster Reconfiguration in Operating Systems.” In Proceedings of the USENIX Annual Technical Conference, pp. 511–516. 2015.

[PSM+15] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki. “Scaling Up Concurrent Main-memory Column-store Scans: Towards Adaptive NUMA-aware Data and Task Placement.” Proceedings of the VLDB Endowment, vol. 8, no. 12, 1442–1453, 2015.

[PWM+14] I. Psaroudakis, F. Wolf, N. May, T. Neumann, A. Böhm, A. Ailamaki, and K. Sattler. “Scaling Up Mixed Workloads: A Battle of Data Freshness, Flexibility, and Scheduling.” In Performance Characterization and Benchmarking. Traditional to Big Data, pp. 97–112. 2014.

[RCS+11] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. “PTask: Operating System Abstractions to Manage GPUs As Compute Devices.” In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, pp. 233–248. 2011.

[RKZB11] B. Rhoden, K. Klues, D. Zhu, and E. Brewer. “Improving Per-node Efficiency in the Datacenter with New OS Abstractions.” In Proceedings of the 2nd ACM Symposium on Cloud Computing, pp. 25:1–25:8. 2011.

[RMG+15] R. Riesen, A. B. Maccabe, B. Gerofi, D. N. Lombard, J. J. Lange, K. Pedretti, K. Ferreira, M. Lang, P. Keppel, R. W. Wisniewski, R. Brightwell, T. Inglett, Y. Park, and Y. Ishikawa. “What is a Lightweight Kernel?” In Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers. 2015.

[RSBJ14] S. Ray, B. Simion, A. D. Brown, and R. Johnson. “Skew-resistant Parallel In-memory Spatial Join.” In Proceedings of the 26th International Conference on Scientific and Statistical Database Management, pp. 1–12. 2014.

[RTG14] P. Roy, J. Teubner, and R. Gemulla. “Low-latency Handshake Join.” Proceedings of the VLDB Endowment, pp. 709–720, 2014.

[RTY+87] R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron, D. Black, W. Bolosky, and J. Chew. “Machine-independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures.” In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 31–39. 1987.

[SAH+03] C. A. N. Soules, J. Appavoo, K. Hui, R. W. Wisniewski, D. D. Silva, G. R. Ganger, O. Krieger, M. Stumm, M. A. Auslander, M. Ostrowski, B. S. Rosenburg, and J. Xenidis. “System Support for Online Reconfiguration.” In USENIX Annual Technical Conference, pp. 141–154. USENIX, 2003.

[SBRQ13] M. Sadini, A. Barbalace, B. Ravindran, and F. Quaglia. “A Page Coherency Protocol for Popcorn Replicated-kernel Operating System.” In Proceedings of the ManyCore Architecture Research Community Symposium. 2013.

[SCD+16] D. Schatzberg, J. Cadden, H. Dong, O. Krieger, and J. Appavoo. “EbbRT: A Framework for Building Per-Application Library Operating Systems.” In 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 671–688. 2016.

[SFW14] M. Silberstein, B. Ford, and E. Witchel. “GPUfs: The Case for Operating System Services on GPUs.” Communications of the ACM, vol. 57, no. 12, 68–79, 2014.

[She13] B. H. Sheldon. “Popcorn Linux: enabling efficient inter-core communication in a Linux-based multikernel operating system.” Master’s thesis, Virginia Polytechnic Institute and State University, 2013.

[SKN13] A. Singhania, I. Kuz, and M. Nevill. “Capability Management in Barrelfish.” Technical Note 013, Barrelfish Project, ETH Zurich, 2013.

[SM12] A. Sumaray and S. K. Makki. “A Comparison of Data Serialization Formats for Optimal Efficiency on a Mobile Platform.” In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, pp. 48:1–48:6. 2012.

[SPR+15] V. Seshadri, G. Pekhimenko, O. Ruwase, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry, and T. Chilimbi. “Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 79–91. 2015.


[SR12] B. P. Swenson and G. F. Riley. “A New Approach to Zero-Copy Message Passing with Reversible Memory Allocation in Multi-core Architectures.” In Proceedings of the ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation, pp. 44–52. 2012.

[SS10] L. Soares and M. Stumm. “FlexSC: Flexible System Call Scheduling with Exception-less System Calls.” In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 33–46. 2010.

[SSF99] J. S. Shapiro, J. M. Smith, and D. J. Farber. “EROS: A Fast Capability System.” In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pp. 170–185. 1999.

[SSG+15] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu. “The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-application Interference at Shared Caches and Main Memory.” In Proceedings of the 48th International Symposium on Microarchitecture, pp. 62–75. 2015.

[str] “Stress Load Generator.” http://people.seas.harvard.edu/~apw/ stress/, accessed 2017-11-20.

[Sun01] Sun Microsystems, Inc. Solaris Live Upgrade 2.0 Guide. Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303-4900, USA, 2001.

[SWP01] P. Shivam, P. Wyckoff, and D. Panda. “EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing.” In Proceedings of the ACM/IEEE Conference on Supercomputing, pp. 57–57. 2001.

[TABO13] J. Teubner, G. Alonso, C. Balkesen, and M. T. Ozsu. “Main-memory Hash Joins on Multi-core CPUs: Tuning to the Underlying Hardware.” In Proceedings of the IEEE International Conference on Data Engineering, pp. 362–373. 2013.

[Tea06] T. S. Team. seL4 Reference Manual Version 2.0.0. NICTA, 2006. https://sel4.systems/Info/Docs/seL4-manual-2.0.0.pdf.


[The] The Linux Foundation. “Real-Time Linux.” https://wiki.linuxfoundation.org/realtime/start, accessed 2017-11-20.

[TKM99] M. Takahashi, K. Kono, and T. Masuda. “Efficient Kernel Support of Fine-grained Protection Domains for Mobile Code.” In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, pp. 64–73. 1999.

[TLL94] C. A. Thekkath, H. M. Levy, and E. D. Lazowska. “Separating Data and Control Transfer in Distributed Operating Systems.” In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 2–11. 1994.

[TLW+09] M. Tiwari, X. Li, H. M. G. Wassel, F. T. Chong, and T. Sherwood. “Execution Leases: A Hardware-supported Mechanism for Enforcing Strong Non-interference.” In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 493–504. 2009.

[TMS11] L. Tang, J. Mars, and M. L. Soffa. “Contentiousness vs. Sensitivity: Improving Contention Aware Runtime Systems on Multicore Architectures.” In Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, pp. 12–21. 2011.

[TMV+11] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. “The Impact of Memory Subsystem Resource Sharing on Datacenter Applications.” In Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 283–294. 2011.

[Tra] Transaction Processing Performance Council. “TPC-H.” http://www.tpc.org/tpch/, accessed 2017-11-20.

[TTH16] B. Teabe, A. Tchana, and D. Hagimont. “Application-specific Quantum for Multi-core Platform Scheduler.” In Proceedings of the 11th European Conference on Computer Systems. 2016.

[VBYN+14] L. Vilanova, M. Ben-Yehuda, N. Navarro, Y. Etsion, and M. Valero. “CODOMs: Protecting Software with Code-centric Memory Domains.” In Proceedings of the 41st Annual International Symposium on Computer Architecture, pp. 469–480. 2014.


[VDC17] J. S. Vetter, E. P. DeBenedictis, and T. M. Conte. “Architectures for the Post-Moore Era.” IEEE Micro, vol. 37, no. 4, 6–8, 2017.

[VJN+17] L. Vilanova, M. Jordà, N. Navarro, Y. Etsion, and M. Valero. “Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC.” In Proceedings of the 12th European Conference on Computer Systems, pp. 16–31. 2017.

[VSG+10] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. “Conservation Cores: Reducing the Energy of Mature Computations.” In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 205–218. 2010.

[VSM+08] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. G. Beausoleil, and J. H. Ahn. “Corona: System implications of emerging nanophotonic technology.” In ACM SIGARCH Computer Architecture News, vol. 36, pp. 153–164. IEEE Computer Society, 2008.

[VTS11] H. Volos, A. J. Tack, and M. M. Swift. “Mnemosyne: Lightweight Persistent Memory.” In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 91–104. 2011.

[WA09] D. Wentzlaff and A. Agarwal. “Factored operating systems (fos): the case for a scalable operating system for multicores.” SIGOPS Operating Systems Review, vol. 43, no. 2, 76–85, 2009.

[WD94] S. J. White and D. J. DeWitt. “QuickStore: A High Performance Mapped Object Store.” In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 395–406. 1994.

[WGB+10] D. Wentzlaff, C. Gruenwald III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal. “An Operating System for Multicore and Clouds: Mechanisms and Implementation.” In ACM Symposium on Cloud Computing. 2010.

[Wika] Wikipedia. “Barrelfish — Wikipedia, The Free Encyclopedia.” https://en.wikipedia.org/wiki/Barrelfish, accessed 2017-11-20.


[Wikb] Wikipedia. “DragonFly BSD — Wikipedia, The Free Encyclopedia.” https://en.wikipedia.org/wiki/DragonFly_BSD, accessed 2017-11-20.

[Wikc] Wikipedia. “PostgreSQL — Wikipedia, The Free Encyclopedia.” https://en.wikipedia.org/wiki/PostgreSQL, accessed 2017-11-20.

[Wikd] Wikipedia. “Redis — Wikipedia, The Free Encyclopedia.” https://en.wikipedia.org/wiki/Redis, accessed 2017-11-20.

[Wik03] Wikipedia. “Memcached — Wikipedia, The Free Encyclopedia.”, 2003. https://en.wikipedia.org/wiki/Memcached, accessed 2017-11-20.

[Wil79] M. V. Wilkes. The Cambridge CAP Computer and Its Operating System (Operating and Programming Systems Series). North-Holland Publishing Co., 1979.

[WNW+17] R. N. M. Watson, P. G. Neumann, J. Woodruff, M. Roe, J. Anderson, J. Baldwin, D. Chisnall, B. Davis, A. Joannou, B. Laurie, S. W. Moore, S. J. Murdoch, R. Norton, S. Son, and H. Xia. “Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 6).” Technical Report UCAM-CL-TR-907, University of Cambridge, Computer Laboratory, 2017.

[WRA05] E. Witchel, J. Rhee, and K. Asanović. “Mondrix: Memory Isolation for Linux Using Mondriaan Memory Protection.” In Proceedings of the 20th ACM Symposium on Operating Systems Principles, pp. 31–44. 2005.

[WS92] J. Wilkes and B. Sears. “A comparison of Protection Lookaside Buffers and the PA-RISC Protection Architecture.” Technical Report HPL–92–55, Computer Systems Laboratory, Hewlett-Packard Laboratories, 1992.

[YRV11] Y. Ye, K. A. Ross, and N. Vesdapunt. “Scalable Aggregation on Multicore Processors.” In Proceedings of the 7th International Workshop on Data Management on New Hardware. 2011.

[YSD+09] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka, N. Narula, and N. Fullagar. “Native Client: A Sandbox for Portable, Untrusted x86 Native Code.” In Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 79–93. 2009.

[ZBF10] S. Zhuravlev, S. Blagodurov, and A. Fedorova. “Addressing Shared Resource Contention in Multicore Processors via Scheduling.” In Proceedings of the 15th Conference on Architectural Support for Programming Languages and Operating Systems, pp. 129–142. 2010.

[ZGKR14] G. Zellweger, S. Gerber, K. Kourtis, and T. Roscoe. “Decoupling Cores, Kernels, and Operating Systems.” In 11th USENIX Symposium on Operating Systems Design and Implementation, pp. 17–31. 2014.

[ZRMH00] R. Zahir, J. Ross, D. Morris, and D. Hess. “OS and Compiler Considerations in the Design of the IA-64 Architecture.” In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 212–221. 2000.

[ZSR12] G. Zellweger, A. Schuepbach, and T. Roscoe. “Unifying Synchronization and Events in a Multicore OS.” In Proceedings of the 3rd Asia-Pacific Workshop on Systems. 2012.

Curriculum Vitae

Gerd Zellweger

Education

2013 - 2017 Doctoral Candidate in Computer Science ETH Zurich, Switzerland

2009 - 2012 Master of Science (M. Sc.) in Computer Science ETH Zurich, Switzerland

2006 - 2010 Bachelor of Science (B. Sc.) in Computer Science ETH Zurich, Switzerland

Professional Experience

2013-2017 Research Assistant and Doctoral Candidate ETH Zurich, Switzerland - Advisor: Prof. Dr. Timothy Roscoe - Research projects: Barrelfish OS

Summer 2015 Research Associate HP Labs, Palo Alto, USA

Summer 2014 Research Intern Microsoft Research, Redmond, USA

2012 Software Engineer sc-n.ch, Zurich, Switzerland

Selected Publications

2016 So many performance events, so little time. Gerd Zellweger, Denny Lin, Timothy Roscoe. In APSys’16.

2016 Customized OS support for data processing. Jana Giceva, Gerd Zellweger, Gustavo Alonso, Timothy Roscoe. In DaMoN’16.

2016 SpaceJMP: Programming with Multiple Virtual Address Spaces. Izzat El Hajj, Alexander Merritt, Gerd Zellweger, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe, Karsten Schwan. In ASPLOS’16.

2014 Decoupling Cores, Kernels, and Operating Systems. Gerd Zellweger, Simon Gerber, Kornilios Kourtis, Timothy Roscoe. In OSDI’14.

2012 Unifying Synchronization and Events in a Multicore OS. Gerd Zellweger, Adrian Schuepbach, Timothy Roscoe. In APSys’12.

Teaching and Mentoring Experience

Fall 2017 Systems Programming & Computer Architecture

Spring 2017 Application oriented programming with MATLAB

Fall 2016 Systems Programming & Computer Architecture

Spring 2016 Programming & Problem solving

Fall 2015 Systems Programming & Computer Architecture

Spring 2015 Parallel Programming

Fall 2014 Systems Programming & Computer Architecture

Spring 2014 Parallel Programming

Fall 2013 Computer Science for Biology, Pharmaceutical Sciences, HST

Spring 2013 Computer Science for Biology, Pharmaceutical Sciences, HST
