Exploitable Hardware Features and Vulnerabilities Enhanced Side-Channel Attacks on SGX and Their Countermeasures

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Guoxing Chen, B.S., M.S.,

Graduate Program in Computer Science and Engineering

The Ohio State University

2019

Dissertation Committee:

Dr. Ten H. Lai, Advisor
Dr. Yinqian Zhang, Co-Advisor
Dr. Radu Teodorescu
Dr. Zhiqiang Lin

© Copyright by

Guoxing Chen

2019

Abstract

Intel Software Guard eXtensions (SGX) provides software applications shielded execution environments to run private code and operate sensitive data, where both the code and data are isolated from the rest of the software systems. Despite its security promises, today’s SGX design has been demonstrated to be vulnerable to various side-channel attacks, and countermeasures have been proposed to mitigate these attacks. However, the current understanding of the attack vectors and the corresponding countermeasures is insufficient. This dissertation explores new attacks in which the adversary exploits hardware features, such as Hyper-Threading and speculative execution, and aims to design comprehensive defense mechanisms that address existing threats. Specifically, we first demonstrate how to abuse Hyper-Threading to launch attacks that bypass existing AEX-based mitigations. Then, we introduce SGXPECTRE Attacks, the SGX variants of the recently disclosed Spectre attacks, which exploit speculative execution vulnerabilities to subvert the confidentiality of SGX enclaves. On the defense side, we first design and implement HYPERRACE, an LLVM-based tool for instrumenting SGX enclave programs to eradicate all side-channel threats due to Hyper-Threading. Then, to address the limitations of existing mitigations, we extend the idea of HYPERRACE and propose the concept of verifiable execution contracts, which request the privileged software to provide a benign execution environment for enclaves within which launching attacks becomes infeasible.

To my father, Yizai Chen, my mother, Linyan Yang, my sisters, Fangfang Chen and Xiaofang Chen, who love and support me unconditionally to pursue my dreams.

Acknowledgments

I would like to express my heartfelt gratitude to my advisors, Dr. Ten H. Lai and Dr. Yinqian Zhang, for their patient and careful supervision. Dr. Lai took me under his wing, offering me complete freedom in pursuing my own research interests and sharing with me his infectious optimism about research and life. Dr. Zhang led me to explore cutting-edge areas of research and patiently taught me to tackle research problems with his extensive knowledge and expertise. His incredible energy and passion for research inspired me a lot. I feel doubly lucky to have both of them as my advisors.

I also want to thank my collaborators and mentors. In particular, I would like to thank Dr.

Dong Xuan, who taught me a lot over the years. I did enjoy the moments when we worked

together to build various amazing systems. Beyond research, Dr. Xuan is also a great friend, who gave me many valuable suggestions when I encountered difficulties and unexpected

situations. I am also grateful to Dr. Michael Reiter, Dr. XiaoFeng Wang and Dr. Zhiqiang

Lin, for their extensive advice and dedication to our collaborative research projects. I feel so

honored to have worked with all of them.

Vita

May 14, 1988 ...... Born, Wenzhou, China.
2010 ...... B.S. Information Engineering, Shanghai Jiao Tong University, Shanghai, China.
2013 ...... M.S. Information and Communication Engineering, Shanghai Jiao Tong University, Shanghai, China.
2013-present ...... Ph.D. Candidate, Computer Science and Engineering, The Ohio State University, USA.

Publications

Research Publications

Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre Attacks: Stealing Intel Secrets from SGX Enclaves via Speculative Execution. In Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P), 2019.

Guoxing Chen*, Wenhao Wang* (*co-first authors), Tianyu Chen, Sanchuan Chen, Yinqian Zhang, XiaoFeng Wang, Ten H. Lai, Dongdai Lin. Racing in Hyperspace: Closing Hyper-Threading Side Channels on SGX with Contrived Data Races. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), 2018.

Guoxing Chen, Ten H. Lai, Michael Reiter, Yinqian Zhang. Differentially Private Access Patterns for Searchable Symmetric Encryption. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2018.

Wenhao Wang, Guoxing Chen, Xiaorui Pan, Yinqian Zhang, XiaoFeng Wang, Vincent Bindschaedler, Haixu Tang, Carl A. Gunter. Leaky Cauldron on the Dark Land: Understanding Memory Side-Channel Hazards in SGX. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2017.

Gang Li, Fan Yang, Guoxing Chen, Qiang Zhai, Xinfeng Li, Jin Teng, Junda Zhu, Dong Xuan, Biao Chen, Wei Zhao. EV-Matching: Bridging Large Visual Data and Electronic Data for Efficient Surveillance. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), 2017.

Fan Yang, Qiang Zhai, Guoxing Chen, Adam C. Champion, Junda Zhu, Dong Xuan. FlashLoc: Flashing Mobile Phones for Accurate Indoor Localization. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2016.

Jihun Hamm, Adam Champion, Guoxing Chen, Mikhail Belkin, Dong Xuan. Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), 2015.

Wenjie Lin, Guoxing Chen, Ten H. Lai, David Lee. Detecting the Vulnerability of Multi-Party Authorization Protocols to Name Matching Attacks. In Proceedings of the International Conference on Security and Management (SAM), 2014.

Guoxing Chen, Zhengzheng Xiang, Changqing Xu, Meixia Tao. On Degrees of Freedom of Cognitive Networks with User Cooperation. IEEE Wireless Communications Letters, 2012.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita ...... v

List of Tables ...... x

List of Figures ...... xi

1. Introduction ...... 1

1.1 Overview ...... 1
1.2 HT-SPM: Hyper-Threading Assisted Sneaky Page Monitoring Attacks ...... 3
1.3 SGXPECTRE: Speculative Execution Enabled Side-Channel Attacks ...... 5
1.4 HYPERRACE: Hyper-Threading Side-Channel Mitigation ...... 6
1.5 Securing TEEs with Verifiable Execution Contracts ...... 7

2. Background and Threat Model ...... 9

2.1 Intel SGX ...... 9
2.2 Intel Processor Internals ...... 12
2.2.1 Cache and Memory Hierarchy ...... 12
2.2.2 Hardware Extensions of Intel Processors ...... 13
2.2.3 Out-of-order and Speculative Execution ...... 14
2.3 Threat Model ...... 15
2.4 Existing Threats to SGX ...... 16
2.5 Effectiveness of Existing Defenses ...... 19

3. HT-SPM: Hyper-Threading Assisted Sneaky Page Monitoring Attacks ...... 21

3.1 Overview ...... 21
3.2 Design ...... 24
3.3 Evaluation ...... 25

4. SGXPECTRE: Speculative Execution Enabled Side-Channel Attacks ...... 28

4.1 SGXPECTRE Attacks ...... 30
4.1.1 A Simple Example ...... 30
4.1.2 Injecting Branch Targets into Enclaves ...... 32
4.1.3 Controlling Registers in Enclaves ...... 35
4.1.4 Leaking Secrets via Side Channels ...... 36
4.1.5 Winning a Race Condition ...... 38
4.2 Attack Gadgets Identification ...... 39
4.2.1 Types of Gadgets ...... 39
4.2.2 Symbolically Executing SGX Code ...... 41
4.2.3 Gadget Identification ...... 42
4.2.4 Experimental Results of Gadget Detection ...... 43
4.3 Stealing Enclave Secrets ...... 48
4.3.1 Reading Register Values from Arbitrary Enclaves ...... 48
4.3.2 Stealing Intel Secrets ...... 52
4.4 Evaluating Existing Countermeasures ...... 55
4.5 Is SGX Broken? ...... 57
4.5.1 Intel’s Secrets ...... 57
4.5.2 Defense via Centralized Attestation Services ...... 60
4.6 Summary ...... 61

5. HYPERRACE: Hyper-Threading Side-Channel Mitigation ...... 62

5.1 Overview ...... 62
5.1.1 Motivation ...... 62
5.1.2 Design Summary ...... 64
5.2 Physical-core Co-Location Tests ...... 66
5.2.1 Straw-man Solutions ...... 66
5.2.2 Co-Location Test via Data Race Probability ...... 68
5.3 Security Analysis of Co-location Tests ...... 76
5.3.1 Security Model ...... 77
5.3.2 Security Analysis ...... 79
5.3.3 Empirical Security Evaluation ...... 88
5.4 Protecting Enclave Programs with HYPERRACE ...... 91

5.4.1 Safeguarding Enclave Programs ...... 91
5.4.2 Implementation of HYPERRACE ...... 93
5.5 Performance Evaluation ...... 93
5.5.1 nbench ...... 94
5.5.2 Cryptographic Libraries ...... 98
5.6 Summary ...... 99

6. Securing TEEs with Verifiable Execution Contracts ...... 101

6.1 Overview ...... 102
6.1.1 Limitations of Existing Defenses ...... 102
6.1.2 Verifiable Execution Contracts as Defense ...... 103
6.2 Execution contracts ...... 104
6.2.1 Construction of Execution Contracts ...... 105
6.2.2 Security Guarantees ...... 110
6.2.3 Remaining Challenges ...... 113
6.3 Verifiability ...... 113
6.3.1 Available Signals ...... 113
6.3.2 Verifiability Models ...... 114
6.3.3 Verification of Proposed Contracts ...... 116
6.4 Implementation ...... 119
6.4.1 Enforcing Execution Contracts ...... 119
6.4.2 Verifying Execution Contracts ...... 121
6.5 Evaluation ...... 123
6.5.1 Security Evaluation ...... 123
6.5.2 Performance Evaluation ...... 124
6.6 Execution Contracts without Memory Confidentiality ...... 129
6.6.1 Threat Analysis ...... 130
6.6.2 Defeating Memory Leaks with Execution Contracts ...... 131
6.6.3 -Level Mitigation ...... 133
6.6.4 Preventing Replay Attacks ...... 136
6.7 Discussion ...... 137
6.8 Summary ...... 138

7. Conclusion ...... 139

Bibliography ...... 141

List of Tables

Table Page

2.1 MESI cache line states...... 12

2.2 Existing threats to SGX ...... 17

3.1 Configuration of the testbed, available per logical core when Hyper-Threading is enabled...... 26

4.1 SGXPECTRE Attack Type-I gadgets in popular SGX runtime libraries. . . . 45

4.2 SGXPECTRE Attack Type-II gadgets in popular SGX runtimes...... 47

4.3 Attestation results [47] ...... 61

5.1 Hyper-Threading side channels...... 63

5.2 Time intervals (in cycles) of T0 and T1...... 80

5.3 Instruction latencies (in cycles) caused by disabling caching...... 86

5.4 Evaluation of false positive rates...... 90

5.5 Evaluation of false negative rates...... 91

5.6 Memory overhead (nbench)...... 97

6.1 Security analysis ...... 111

6.2 Estimation of Type II errors (false negative rates)...... 124

List of Figures

Figure Page

3.1 entries...... 22

4.1 A simple example of SGXPECTRE Attacks. The gray blocks represent code or data outside the enclave. The white blocks represent enclave code or data. 31

4.2 Poisoning BTB from the same or a different process ...... 33

4.3 EENTER and ECall table lookup ...... 35

4.4 Best scenarios for winning a race condition. Memory accesses D1, I1, D2, D3 are labeled next to the related instructions. The address translation and data accesses are illustrated on the right: The 4 blocks on top denote the units holding the address translation information, including TLBs, paging structures, caches (for PTEs), and the memory; the 4 blocks at the bottom denote the units holding data/instruction. The shadow blocks represent the units from which the address translation or data/instruction access are served. 37

4.5 Exploiting Intel SGX SDK. The blocks with dark shadows represent in- structions or data located in untrusted memory. Blocks without shadows are instructions inside the target enclave or the .data segment of the enclave memory...... 49

4.6 Intel’s secrets and key derivation...... 58

5.1 Data races when threads are co-located/not co-located...... 69

5.2 Co-location detection code...... 71

5.3 The basic idea of the data race design. Monitoring the memory operations of the two threads on V . LD: load; ST: store...... 73

5.4 The model of thread T0 and thread T1. •: load; : store...... 77

5.5 Demonstration of the cross-core communication time. There is no data race if the dummy instructions take time shorter than 190 cycles...... 81

5.6 The effects of frequency changing on execution speed, cache latencies, and cross-core communication time...... 82

5.7 The histogram of wbinvd execution time over 1,000,000 measurements. . . 85

5.8 Normalized number of iterations of nbench applications when running with a busy looping program on the co-located logical core...... 95

5.9 Runtime overhead due to AEX detection; q = Inf means one AEX detection per basic block; q = 20/15/10/5 means one additional AEX detection every q instructions within a basic block...... 96

5.10 Runtime overhead of performing co-location tests when q = 20...... 98

5.11 Overhead of crypto algorithms...... 99

6.1 Resource reservation contracts ...... 105

6.2 Runtime interaction contracts ...... 107

6.3 Normalized number of iterations per second...... 125

6.4 Performance gain when Hyper-Threading Control contract and AEX-free execution window are applied...... 126

6.5 Overhead for defeating both same-core and cross-core side-channel attacks. 127

6.6 The running time of the SPEC CPU2006 benchmark suite under various CAT settings, on a CPU with a 13.75 MB 11-way set-associative LLC. Each way can be assigned to a specific set of cores...... 128

Chapter 1: Introduction

1.1 Overview

The growing demands for secure data-intensive computing and rapid development of

hardware technologies bring in a new generation of hardware support for scalable trusted

execution environments (TEE), with the most prominent example being Intel Software

Guard Extensions (SGX). SGX is a hardware extension available in recent Intel processors.

It is designed to improve the application security by removing the privileged code from

the trusted computing base (TCB). At a high level, SGX provides software applications

shielded execution environments, called enclaves, to run private code and operate sensitive

data, where both the code and data are isolated from the rest of the software systems. Even

privileged software such as the operating systems and hypervisors are not allowed to directly

inspect or manipulate the memory inside the enclaves. Software applications adopting Intel

SGX are partitioned into sensitive and non-sensitive components. The sensitive components

run inside the SGX enclaves (hence called enclave programs) to harness the SGX protection, while non-sensitive components run outside the enclaves and interact with the system

software.

Although SGX is still in its infancy, the promise of shielded execution has encouraged

researchers and practitioners to develop various new applications to utilize these features

(e.g., [22, 39, 59, 60, 67, 77, 78, 93, 96]), and new software tools or frameworks (e.g., [24,

25,41,52,57,69,73,75,80,82,88]) to help developers adopt this emerging programming

paradigm. Most recently, SGX has been adopted by commercial public clouds, such as

Azure confidential computing [66] and Alibaba ECS Bare Metal Instance [9], aiming to

protect cloud data security even with compromised operating systems or hypervisors, or

even “malicious insiders with administrative privilege”.

Despite its security promises, today’s SGX design has been demonstrated to be vulnerable to various side-channel attacks [38, 53, 73, 91]. One example of such side

channels is the page-fault channels [73, 91] in which the adversary with full control of

the OS can induce page faults (by manipulating the page tables inside the kernel) during

an enclave program’s runtime, so as to identify the secret data the program’s page access

pattern depends upon. Besides, traditional micro-architectural side channels also exist in the

SGX context, including the CPU cache attacks [26,35,38,68], branch target buffer (BTB)

attacks [53], etc.

Countermeasures have been proposed to mitigate these attacks [30,36,71,72]. Shinde et

al. [72] and Shih et al. [71] address page-fault based side channels. Cloak targets cache side

channels [36]. Déjà Vu focuses on side-channel attacks that need to frequently interrupt the

execution of the victim enclave [30].

However, current understanding of the attack vectors and the corresponding countermea-

sures is insufficient, especially when the adversary could exploit hardware features, such as

Hyper-Threading and speculative execution, to enhance existing side-channel attacks, and

thus reduce the effectiveness of existing defenses.

In this dissertation, we aim to

• Further explore the attack vectors when Hyper-Threading and speculative execution

are supported. Specifically, in Section 1.2 and Chapter 3, we propose HT-SPM

attacks to demonstrate how Hyper-Threading could assist to evade existing defenses.

Moreover, inspired by the recently disclosed CPU vulnerabilities due to speculative

execution [40], we propose SGXPECTRE Attacks (in Section 1.3 and Chapter 4) to

completely compromise the confidentiality of SGX enclave programs developed using

most SGX runtime libraries (e.g., Intel SGX SDK, Rust-SGX, Graphene-SGX), and

even steal Intel secrets from Intel signed privileged enclaves.

• Propose holistic software solutions to the problem. In particular, we design and implement HYPERRACE (in Section 1.4 and Chapter 5) to close all Hyper-Threading

side channels. The idea of HYPERRACE is to ask the OS to reserve the sibling hyper

thread for the enclave program. Reliably verifying such a scheduling arrangement

performed by the untrusted OS becomes the key challenge of HYPERRACE. We

then extend the idea of HYPERRACE and propose the concept of verifiable execution

contracts (in Section 1.5 and Chapter 6), which request the privileged software to

provide a benign execution environment for enclaves within which launching attacks

becomes infeasible.

1.2 HT-SPM: Hyper-Threading Assisted Sneaky Page Monitoring Attacks

Side channels can be categorized as same-core side channels and cross-core side

channels, depending on whether the attack program, which collects information through

any of these side channels, is run on the same core executing the enclave program or not.

Same-core side channels include shared units within one physical core, such as L1 cache,

branch prediction units, and translation lookaside buffers, while cross-core side channels

include shared units between different physical cores such as last-level cache and DRAM.

Cross-core side channels in SGX are no different from those in other contexts, which

tend to be noisy and often harder to exploit in practice (e.g., to synchronize with the victim).

By comparison, the noise-free and easy-to-exploit same-core side channels are uniquely

threatening under the SGX threat model. Conventional ways to exploit same-core channels

are characterized by a large number of exceptions or interrupts to frequently transfer the

control of a core back and forth between an enclave process and the attacker-controlled OS

kernel, through a procedure called Asynchronous Enclave Exits (AEX). Such AEX-based

side-channel attacks have been intensively studied [38,53,73,91] and new defense proposals

continue to be made, often based upon detection of high frequency AEXs [30, 71].

To demonstrate that a large number of AEXs are not a necessary condition of side-

channel attacks, we develop a new memory-based attack, called Hyper-Threading assisted

sneaky page monitoring (HT-SPM). HT-SPM attacks work by setting and resetting a page’s

accessed flag in the page table entry (PTE) to monitor when the page is visited. Unlike

the page-fault attacks [91], in which a page fault is generated each time when a page is

accessed, manipulation of the accessed flags does not trigger any interrupt directly. However,

the attack still needs to flush the translation lookaside buffer (TLB) from time to time by

triggering interrupts to force the CPU to look up page tables and set accessed flags in

the PTEs. Nevertheless, we found that these interrupts can be completely eliminated in

such attacks. Particularly, in Chapter 3, we present a technique that utilizes Intel’s Hyper-Threading capability to flush the TLBs through an attack process sharing the same CPU core with the enclave code, which eliminates the need for interrupts when Hyper-Threading is on.
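The monitoring loop described above can be illustrated with a toy model. This is plain Python with simulated data structures; the `PTE` class and page numbers are illustrative stand-ins for real kernel page-table state, not an actual exploit:

```python
# Toy model of sneaky page monitoring: the attacker repeatedly clears the
# accessed flags of the victim's PTEs and polls them to recover the page
# access trace, without ever inducing a page fault in the victim.
# All structures here are simulated stand-ins for real kernel state.

class PTE:
    def __init__(self):
        self.accessed = False  # the "A" bit, set by the CPU on a page walk

def victim_step(page_table, page):
    # A real CPU sets the accessed flag during the page-table walk
    # (only when the translation is not already cached in the TLB).
    page_table[page].accessed = True

def monitor(page_table, victim_trace):
    observed = []
    for page in victim_trace:
        victim_step(page_table, page)          # victim touches a page
        # Attacker polls all accessed flags, records set ones, clears them.
        # In the real attack, clearing requires a TLB flush, which HT-SPM
        # performs from the sibling hyper-thread instead of via interrupts.
        for n, pte in enumerate(page_table):
            if pte.accessed:
                observed.append(n)
                pte.accessed = False
    return observed

page_table = [PTE() for _ in range(8)]
print(monitor(page_table, [3, 1, 4, 1, 5]))  # recovers the page access trace
```

The point of the sketch is that recovering the trace requires only flag polling and TLB flushes, never a fault or interrupt delivered to the victim.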

1.3 SGXPECTRE: Speculative Execution Enabled Side-Channel Attacks

In SGX, the CPU itself, as part of the TCB, plays a crucial role in the security promises.

However, the recently disclosed CPU vulnerabilities due to the out-of-order and speculative

execution [40] have raised many questions and concerns about the security of SGX. Par-

ticularly, the so-called Meltdown [54] and Spectre attacks [51] have demonstrated that an

unprivileged application may exploit these vulnerabilities to extract memory content that is

only accessible to privileged software. The developers have been wondering whether SGX will hold its original security promises after the disclosure of these hardware bugs [14]. It is

therefore imperative to answer this important question and understand its implications to

SGX.

As such, we set off our study with the goal of comprehensively understanding the security

impact of these CPU vulnerabilities on SGX. We particularly study Spectre-like attacks

against SGX enclaves, resulting in the SGXPECTRE Attacks. We aim to answer the following research questions:

(1) Is SGX vulnerable to Spectre attacks? (2) As Spectre attacks rely on vulnerable code

patterns to be exploitable, do such code patterns exist in real-world enclave programs? (3)

What are the consequences of the attacks? (4) Is SGX completely broken due to these

hardware bugs? The answers to these questions are critically important to the adoption of

the SGX technology and commercialization of SGX-based applications in the future; they

are also valuable to the research community in understanding SGX’s threat model.

Our study leads to the SGXPECTRE Attacks, a new breed of the Spectre attacks on SGX.

We show that SGXPECTRE Attacks can completely compromise the confidentiality of SGX

enclaves. In particular, because vulnerable code patterns exist in most SGX runtime libraries

(e.g., Intel SGX SDK, Rust-SGX, Graphene-SGX) and are difficult to eliminate, the

adversary could perform SGXPECTRE Attacks against any enclave programs. Particularly, we demonstrate end-to-end attacks to steal Intel secrets from Intel signed enclaves.
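The general shape of a Spectre-style leak can be conveyed with a deliberately simplified model. Here the "cache" is just a set of touched probe-array lines and the timing side of the attack is replaced by membership tests; the array contents and the secret byte are made up for illustration:

```python
# Toy model of a Spectre-style leak: a bounds check is "speculatively"
# bypassed, an out-of-bounds secret byte indexes a probe array, and the
# resulting cache footprint reveals the byte. Real attacks measure access
# latency; here the "cache" is simply a set of touched probe-array lines.

ARRAY = [1, 2, 3, 4]          # in-bounds data
SECRET = 42                   # byte just past the array, out of bounds

def speculative_gadget(index, cache):
    # Architecturally, index >= len(ARRAY) would be rejected. During
    # (mis)speculation the check is bypassed and the load happens anyway.
    value = SECRET if index >= len(ARRAY) else ARRAY[index]
    cache.add(value)          # secret-dependent fetch of probe_array[value]
    # Speculation is then squashed: no architectural result remains,
    # but the cache footprint persists.

def attacker_probe(cache):
    # Flush+Reload phase: find which probe line became cached.
    return [line for line in range(256) if line in cache]

cache = set()
speculative_gadget(len(ARRAY), cache)   # out-of-bounds, transient access
print(attacker_probe(cache))            # -> [42], the secret byte
```

The model captures why squashing speculative results does not help: the leak travels through microarchitectural state (the cache set above) rather than through architectural values.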

1.4 HYPERRACE: Hyper-Threading Side-Channel Mitigation

As demonstrated by HT-SPM, AEX-detection-based protection can be evaded by exploiting a set of side channels enabled or assisted by Hyper-Threading. To the best

of our knowledge, no prior work has successfully mitigated Hyper-Threading side channels

in SGX.

Chapter 5 reports a study that aims at filling this gap, understanding and addressing the

security threats from Hyper-Threading side channels in the SGX setting, and deriving novel

protection to close all Hyper-Threading side channels. In addition, our solution seamlessly

integrates with a method to detect AEXs from within the enclave, and thus completely

eliminates all same-core side channels on SGX.
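One known way to detect AEXs from inside an enclave is to plant a marker in the enclave's State Save Area (SSA): an AEX makes the CPU dump register state there, clobbering the marker. The following is a toy model of that idea with fully simulated state; the marker value and SSA layout are illustrative:

```python
# Toy model of in-enclave AEX detection via an SSA marker. The enclave
# writes a known marker into its State Save Area; an AEX overwrites the
# SSA with saved register state, so a changed marker reveals the exit.
# All hardware behavior here is simulated.

MARKER = 0xDEADBEEF

class Enclave:
    def __init__(self):
        self.ssa = [0] * 4            # simulated SSA frame

    def arm(self):
        self.ssa[0] = MARKER          # plant marker before a code region

    def aex(self, saved_rip):
        self.ssa[0] = saved_rip       # hardware saves state on an AEX

    def aex_detected(self):
        return self.ssa[0] != MARKER

enclave = Enclave()
enclave.arm()
print(enclave.aex_detected())         # False: no interrupt yet
enclave.aex(saved_rip=0x401000)       # OS interrupts the enclave
print(enclave.aex_detected())         # True: marker was clobbered
```

In an instrumented enclave, the marker is re-armed and re-checked at frequent program points, so any interrupt between two checks is noticed.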

Since disabling Hyper-Threading will introduce significant performance loss, we aim to

design an alternative solution, i.e., creating a shadow thread from the enclave program and

asking the OS to schedule it on the other logical core, so that no other process can share the

same physical core as the enclave program. However, it is very challenging to reliably verify

such a scheduling arrangement performed by the untrusted OS. To make this approach work, we need an effective physical-core co-location test to determine whether two threads are

indeed scheduled on the same physical core.

To address this problem, we present a unique technique that utilizes contrived data races

between two threads of the same enclave program to calibrate their inter-communication

speed using the speed of their own executions. More specifically, data races are created

by instructing both threads to simultaneously read from and write to a shared variable. By

6 carefully constructing the read-write sequences (Section 5.2), it is ensured that when both

threads operate on the same core, they will read from the shared variable the value stored by

the other thread with very high probabilities. Otherwise, when the threads are scheduled to

different cores, they will, with high probabilities, only observe values stored by themselves.
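The intuition behind the co-location test can be captured in a small deterministic model: a store becomes visible to the sibling thread only after some propagation latency, and the fraction of reads that observe the sibling's value separates the co-located case from the cross-core case. All timing numbers below are illustrative, not measured:

```python
# Toy model of the contrived-data-race co-location test. Each thread
# writes its own ID to a shared variable and then reads it back. A store
# becomes visible to the sibling thread only after `latency` time units
# (small for a shared L1 when co-located, large for cross-core traffic).

def run_test(latency, rounds=100, step=10):
    observed_other = [0, 0]                   # reads that saw the sibling
    stores = {0: (-1, None), 1: (-1, None)}   # thread -> (time, value)
    for r in range(rounds):
        for tid in (0, 1):
            t = r * step + tid * (step // 2)  # threads act interleaved
            stores[tid] = (t, tid)            # write own ID
            ot, ov = stores[1 - tid]          # sibling's last store
            # The read at time t sees the sibling's value only if that
            # store has already propagated: ot + latency <= t.
            if ov is not None and ot + latency <= t:
                observed_other[tid] += 1
    return min(observed_other) / rounds       # fraction of "sibling" reads

print(run_test(latency=3))    # co-located: sibling value almost always seen
print(run_test(latency=190))  # cross-core: sibling value never seen in time
```

Because the threads time the race using their own execution speed, an OS that slows the clock or the cores slows both sides equally, which is the property the real test relies on.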

Using this technique, we designed and implemented an LLVM-based tool, called HYPERRACE, which compiles an enclave program from the source code and instruments it to

conduct frequent AEX detections and co-location tests during the execution of the enclave

program. The resulting binary is an enclave program that automatically protects itself

from all Hyper-Threading side-channel attacks (and other same-core side-channel attacks),

completely closing such side channels.

1.5 Securing TEEs with Verifiable Execution Contracts

A fundamental limitation of many existing mitigations is that they are inherently signature-based detection techniques, which are vulnerable to evasion attacks. For example, Varys detects side-channel attacks when they trigger AEXs more than 100 times per

second [62]. It is impossible to select a lower threshold because regular timer interrupts are delivered to the CPU core running the enclave program at a rate of at least 100 Hz.

However, this allows the adversary to launch attacks at a rate lower than 100 Hz and remain

undetected. The root cause of such evasion attacks is that the signatures of side-channel

attacks (e.g., AEXs) are not unique to attack activities; some normal system activities share

the same traits. Even for HYPERRACE, the gaps between two consecutive co-location tests

are not protected well.

Chapter 6 introduces the concept of verifiable execution contracts, aiming to create a

contract between the OS and the enclaves that describes a guaranteed execution environment

for the enclaves, within which launching attacks against enclaves becomes infeasible. Similar to HYPERRACE, the basic idea of verifiable execution contracts is to ask the OS to behave in a way (defined in the contracts) that helps enclaves mitigate threats. For example, by requesting the OS to manage the page tables of the enclave such that no page fault would occur, page-fault-based attacks can be mitigated completely. As a straightforward extension, execution contracts can be developed to prevent attacks due to enclaves’ OS-dependency (dubbed OS-dependent attacks), such as replay attacks against stateful enclaves and communication delay attacks. However, since the OS is untrusted and might violate the execution contracts, the ability to reliably verify whether the execution contracts are fulfilled correctly becomes the main challenge to be addressed in our work.

Chapter 2: Background and Threat Model

In this chapter, we introduce the background knowledge of Intel SGX, existing threats to

Intel SGX and related countermeasures. Related work will also be summarized along with the background introduction.

2.1 Intel SGX

Intel SGX is a hardware extension in recent Intel processors offering stronger application security by providing primitives such as memory isolation, memory encryption, sealed storage, and remote attestation. An important concept in SGX is the secure enclave. An enclave is an execution environment created and maintained by the processor so that only applications running in it have a dedicated memory region that is protected from all other software components. Both confidentiality and integrity of the memory inside enclaves are protected from the untrusted system software.

Entering and exiting enclaves. To enter the enclave mode, the software executes the

EENTER leaf function by specifying the address of Thread Control Structure (TCS) inside the enclave. TCS holds the location of the first instruction to execute inside the enclave.

Multiple TCSs can be defined to support multi-threading inside the same enclave. Registers used by the untrusted program may be preserved after EENTER. The enclave runtime needs to determine the proper control flow depending on the register values.

Asynchronous Enclave eXit (AEX). When interrupts, exceptions, and VM exits happen

during the enclave mode, the processor will securely save the execution state in the State

Save Area (SSA) of the current enclave thread, and replace it with a synthetic state to prevent

information leakage. After the interrupts or exceptions are handled, the execution will be

returned (through IRET) from the kernel to an address external to enclaves, which is known

as Asynchronous Exit Pointer (AEP). The ERESUME leaf function will be executed to transfer

control back to the enclave by filling the RIP with the copy saved in the SSA.

Memory isolation. Memory isolation of enclave programs is a key design feature of Intel

SGX. To maintain backward-compatibility, Intel implements such isolation via extensions

to existing processor architectures. Intel SGX reserves a range of continuous physical

memory exclusively for enclave programs and their control structures, dubbed Processor

Reserved Memory (PRM). The extended memory management units (MMU) of the CPU

prevent accesses to the PRM from all programs outside enclaves, including the OS kernel, virtual machine hypervisors, SMM code or Direct Memory Accesses (DMA). Enclave Page

Cache (EPC) is a subset of the PRM memory range. The EPC is divided into 4KB pages

and managed similarly to the rest of the physical memory pages. Each EPC page can be

allocated to one enclave at one time.

The virtual address space of each enclave program has an Enclave Linear Address Range

(ELRANGE), which is reserved for enclaves and mapped to the EPC pages. Sensitive

code and data is stored within the ELRANGE. Page tables responsible for translating virtual addresses to physical addresses are managed by the untrusted system software. The

translation lookaside buffer (TLB) works for EPC pages in traditional ways. When the

CPU transitions between non-enclave mode and enclave mode, through EENTER or EEXIT

instructions or Asynchronous Enclave Exits (AEXs), TLB entries associated with the current

Process-Context Identifier (PCID) as well as the global identifier are flushed, preventing

non-enclave code learning information about address translation inside the enclaves.

Memory encryption. To support an ELRANGE larger than the EPC, EPC pages can be “swapped”

out to regular physical memory. This procedure is called EPC page eviction. The confiden-

tiality and integrity of the evicted pages are guaranteed through authenticated encryption.

The hardware Memory Encryption Engine (MEE) is integrated with the memory controller

and seamlessly encrypts the content of the EPC page that is evicted to a regular physical

memory page. A Message Authentication Code (MAC) protects the integrity of the encryp-

tion and a nonce associated with the evicted page. The encrypted page can be stored in

the main memory or swapped out to secondary storage similar to regular pages. But the

metadata associated with the encryption needs to be kept by the system software properly

for the page to be “swapped” into the EPC again.
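The eviction workflow can be sketched with a toy model built from Python's standard library. The SHA-256-based keystream, the key, and the nonce handling below are illustrative stand-ins for the MEE's actual cipher and metadata, chosen only to show the encrypt-then-MAC structure:

```python
# Toy model of EPC page eviction: the "MEE" encrypts the page with a
# keystream and MACs the ciphertext plus a per-eviction nonce, so the OS
# can store the blob anywhere but cannot tamper with it undetected.
import hashlib
import hmac

KEY = b"platform-memory-encryption-key"   # illustrative key material

def keystream(nonce, length):
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(KEY + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def evict(page, nonce):
    ct = bytes(p ^ k for p, k in zip(page, keystream(nonce, len(page))))
    mac = hmac.new(KEY, nonce + ct, hashlib.sha256).digest()
    return ct, mac                        # stored in untrusted memory

def load_back(ct, mac, nonce):
    # Verify integrity before the page re-enters the EPC; the nonce ties
    # the blob to one specific eviction.
    expected = hmac.new(KEY, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("EPC page tampered with or replayed")
    return bytes(c ^ k for c, k in zip(ct, keystream(nonce, len(ct))))

page = b"secret enclave page contents"
ct, mac = evict(page, nonce=b"\x00" * 8)
assert load_back(ct, mac, nonce=b"\x00" * 8) == page
# Flipping one ciphertext bit makes load_back raise ValueError.
```

This also shows why the metadata mentioned above must be kept: without the correct nonce and MAC, the page can never be validated back into the EPC.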

Microcode update and CPU security version. Besides hardware modifications, SGX

instructions are implemented in microcode. And the microcode update version is reflected in

a 128-bit value called CPU security version (CPUSVN). CPUSVN is included in the derivation of various enclave specific secrets and used as an indicator of whether the SGX implementation

of the underlying platform is up-to-date.
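The role of CPUSVN in key derivation can be illustrated with a hedged sketch: HMAC stands in for the undocumented hardware derivation, and all key material and identities below are invented for the example:

```python
# Toy sketch of SVN-based key derivation: enclave-specific keys mix in
# CPUSVN, so a microcode update (a new CPUSVN) yields different keys,
# tying sealed secrets to the patch level of the platform.
import hashlib
import hmac

ROOT_KEY = b"fused-root-key"              # illustrative root secret

def derive_seal_key(cpusvn: bytes, enclave_identity: bytes) -> bytes:
    return hmac.new(ROOT_KEY, cpusvn + enclave_identity,
                    hashlib.sha256).digest()

old = derive_seal_key(b"\x01" * 16, b"MRENCLAVE:demo")
new = derive_seal_key(b"\x02" * 16, b"MRENCLAVE:demo")
print(old != new)   # True: updating CPUSVN changes every derived key
```

The consequence is the one stated in the text: a verifier that sees a key (or an attestation derived from it) bound to an old CPUSVN can tell that the platform's SGX implementation is not up-to-date.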

Local and remote attestation. SGX provides local and remote attestation to enclaves

to prove to another entity that a claimed enclave is running on a trusted SGX platform.

Local attestation is used to convince another “local” enclave running on the same platform.

Specifically, the EREPORT instruction is used by an attested enclave to generate an attestation

signature using the signing key of the target enclave, and the EGETKEY instruction is used by

the target enclave to derive the signing key for verification. Remote attestation is used

to convince a remote entity. SGX adopts an anonymous signature scheme, called Intel

Enhanced Privacy ID (EPID), for producing attestation signatures. Intel introduces two

privileged enclaves, called the Provisioning Enclave (PvE) and the Quoting Enclave (QE) to

facilitate this process.

Sealed storage. SGX provides a mechanism called sealing for enclaves to encrypt and

integrity-protect secrets to be stored outside the enclave. The symmetric encryption key,

called the seal key, can be derived using the EGETKEY instruction within the enclave.

SGX trusted platform services. Leveraging Intel Converged Security and Management

Engine (CSME), Intel provides a privileged enclave, called Platform Service Enclave (PSE),

to provide trusted services, including trusted time and monotonic counters, to application

enclaves. Trusted time service is used to measure the elapsed time (in units of seconds)

between two reads, and trusted monotonic counters are used to address replay attacks.

2.2 Intel Processor Internals

2.2.1 Cache and Memory Hierarchy

Modern processors are equipped with various buffers and caches to improve their

performance. Relevant to our discussion are cache coherence protocols and the store buffer.

• Cache coherence protocols. Beginning with the Pentium processors, Intel processors

use the MESI cache coherence protocol to maintain the coherence of cached data [11].

Cache Line State              M (Modified)  E (Exclusive)  S (Shared)           I (Invalid)
This line is valid?           Yes           Yes            Yes                  No
Copies exist in other         No            No             Maybe                Maybe
processors' caches?
A read to this line           Cache hit     Cache hit      Cache hit            Goes to system bus
A write to this line          Cache hit     Cache hit      Read for ownership   Read for ownership

Table 2.1: MESI cache line states.

Each cache line in the L1 data cache and the L2/L3 unified caches is labeled as being

in one of the four states defined in Table 2.1. When writing to a cache line labeled as

Shared or Invalid, a Read For Ownership (RFO) operation will be performed, which

broadcasts invalidation messages to other physical cores to invalidate the copies in their

caches. After receiving acknowledgement messages from other physical cores, the write

operation is performed and the data is written to the cache line.

• Store Buffer. The RFO operations could incur long delays when writing to an invalid

cache line. To mitigate these delays, store buffers were introduced. Writes are

pushed into the store buffer and committed once the acknowledgement messages

arrive. Since the writes are buffered, subsequent reads to the same address may not see

the most up-to-date value in cache. To solve this problem, a technique called store-to-load

forwarding is applied to forward data from the store buffer to later reads.
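The interaction of the two mechanisms above can be sketched in a toy model. It is purely illustrative and heavily simplified; the `Core` class, its methods, and the explicit `ack_rfo` step are our own invention, not Intel's implementation:

```python
# Toy model of MESI plus a store buffer with store-to-load forwarding.
# A write to a Shared/Invalid line is buffered until RFO acknowledgements
# arrive; a later load to the same address is served from the youngest
# buffered store rather than from the stale cache.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Core:
    def __init__(self):
        self.state = I            # MESI state of the tracked cache line
        self.store_buffer = []    # pending (addr, value) writes
        self.data = {}            # committed cache contents

    def read(self, addr, others):
        for a, v in reversed(self.store_buffer):   # store-to-load forwarding
            if a == addr:
                return v
        if self.state == I:                        # miss: fetch and set state
            shared = any(c.state != I for c in others)
            for c in others:                       # other copies downgrade
                if c.state in (M, E):
                    c.state = S
            self.state = S if shared else E
        return self.data.get(addr, 0)

    def write(self, addr, value, others):
        if self.state in (M, E):                   # no RFO needed
            self.state = M
            self.data[addr] = value
        else:                                      # S/I: buffer while RFO is in flight
            self.store_buffer.append((addr, value))

    def ack_rfo(self, others):
        # Acknowledgements arrived: invalidate other copies, drain the buffer.
        for c in others:
            c.state = I
        for a, v in self.store_buffer:
            self.data[a] = v
        self.store_buffer.clear()
        self.state = M

c0, c1 = Core(), Core()
c0.read(0x40, [c1])            # c0: Exclusive
c1.read(0x40, [c0])            # both now Shared
c1.write(0x40, 7, [c0])        # buffered: RFO in flight
print(c1.read(0x40, [c0]))     # 7, forwarded from the store buffer
c1.ack_rfo([c0])
print(c0.state, c1.state)      # Invalid Modified
```

Note how the forwarded load observes the new value before the RFO completes, which is exactly the visibility gap that store-to-load forwarding closes.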

2.2.2 Hardware Extensions of Intel Processors

Hyper-Threading. Intel Hyper-Threading is Intel’s proprietary implementation of simulta-

neous multithreading (SMT). With Hyper-Threading support, a single physical core provides

two logical cores (or hyper threads) that could execute two code streams concurrently. The

two hyper threads on the same physical core share various resources such as translation

lookaside buffers (TLB) and branch prediction units (BPU).

Cache Allocation Technology. Intel Cache Allocation Technology (CAT) is a mechanism

for the OS or hypervisor to configure the amount of a shared resource, e.g., the Last Level Cache (LLC),

available to different applications. Particularly, applications can be assigned to a set of

Classes of Service (COS) provided by the processor. Each COS has a capacity bitmask

(CBM) that can be configured to indicate the cache space that the applications in that COS

are limited to. The cache spaces of different COSs might be either overlapped or isolated.

When an application is scheduled on a logical core, the OS assigns its COS to that core, thus

restricting its cache usage.
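As a sketch of how a CBM maps to cache ways, the following assumes a hypothetical 20-way LLC; the COS masks are made-up examples, not values from this dissertation:

```python
# Each set bit in a COS's capacity bitmask (CBM) grants that class of
# service one LLC way; overlapping masks share ways, disjoint masks
# isolate them. (Hypothetical 20-way LLC.)

def ways(cbm):
    """Return the set of LLC way indices that a CBM grants."""
    return {i for i in range(20) if cbm >> i & 1}

cos0 = 0b00000000000000001111   # ways 0-3
cos1 = 0b00000000000011110000   # ways 4-7: disjoint from COS 0
cos2 = 0b00000000000000111100   # ways 2-5: overlaps both

print(sorted(ways(cos0) & ways(cos1)))  # [] -> isolated
print(sorted(ways(cos0) & ways(cos2)))  # [2, 3] -> shared ways
```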

Transactional Synchronization Extensions. Intel Transactional Synchronization Exten-

sions (TSX) is Intel’s implementation of hardware transactional memory (HTM). It is

originally introduced for simplifying concurrent programming. Critical sections could be

executed within transactions. When a transaction succeeds, all updates will be committed

atomically. If the transactional execution fails, a process called transaction abort will

roll back the execution and discard all updates. Transactions might abort due to various

reasons. Particularly, when memory content read in a transaction is evicted from the LLC,

the transaction would abort with a status denoted as CONFLICT.
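The commit/abort semantics can be modeled abstractly. The sketch below captures only the semantics described above, not the TSX ISA; the function name and the status constants are ours:

```python
# Minimal model of transactional commit/abort: updates go into a write
# set that is applied atomically on commit and discarded on abort, e.g.
# when a line in the read set is evicted from the LLC (CONFLICT).

COMMITTED, CONFLICT = "committed", "conflict"

def run_transaction(memory, updates, evicted_lines, read_set):
    write_set = dict(updates)            # buffered, not yet visible
    if read_set & evicted_lines:         # a read line left the LLC mid-transaction
        return CONFLICT                  # abort: write set discarded
    memory.update(write_set)             # commit: all updates at once
    return COMMITTED

mem = {"x": 0, "y": 0}
print(run_transaction(mem, {"x": 1, "y": 2}, evicted_lines=set(), read_set={"x"}))
print(mem)                               # both updates visible
print(run_transaction(mem, {"x": 9}, evicted_lines={"x"}, read_set={"x"}))
print(mem)                               # unchanged: the aborted write is discarded
```

The all-or-nothing property shown here is what T-SGX and DÉJÀ VU (Section 2.5) repurpose for defense.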

2.2.3 Out-of-order and Speculative Execution

Out-of-order execution. Modern processors implement deep pipelines, so that multiple

instructions can be executed at the same time. Because instructions do not take equal time

to complete, the order of the instructions’ execution and their order in the program may

differ. This form of out-of-order execution requires special care for instructions whose operands have inter-dependencies, as these instructions may only access memory in the order

constrained by the program logic. To handle the potential data hazards, instructions are

retired in order, resolving any inaccuracy due to the out-of-order execution at the time of

retirement.

Speculative execution. Speculative execution shares the same goal as out-of-order exe-

cution, but differs in that speculation is made to speed up the program’s execution when

the control flow or data dependency of the future execution is uncertain. One of the most

important examples of speculative execution is branch prediction. When a conditional branch or indirect branch

instruction is encountered, because checking the branch condition or resolving branch

targets may take time, predictions are made, based on its history, to prefetch instructions

first. If the prediction is correct, speculatively executed instructions may retire; otherwise the

mis-predicted execution will be rewound. The micro-architectural component that enables

speculative execution is the branch prediction unit (BPU), which consists of several hardware

components that help predict conditional branches, indirect jumps and calls, and function

returns. For example, branch target buffers (BTB) are typically used to predict indirect

jumps and calls, and return stack buffers (RSB) are used to predict near returns. These

micro-architectural components, however, are shared between software running on different

security domains (e.g., user space vs. kernel space, enclave mode vs. non-enclave mode),

thus leading to the security issues that we present in Chapter 4.

Implicit caching. Implicit caching refers to the caching of memory elements, either data

or instructions, that are not due to direct instruction fetching or data accessing. Implicit

caching may be caused in modern processors by “aggressive prefetching, branch prediction,

and TLB miss handling” [11]. For example, mis-predicted branches will lead to the fetching

and execution of instructions, as well as data memory reads or writes from these instructions,

that are not intended by the program. Implicit caching is one of the root causes of the CPU vulnerabilities studied in Chapter 4.

2.3 Threat Model

In this dissertation, we consider an adversary with system privileges on the machine

equipped with an SGX-enabled processor. Specifically, we assume the adversary has

every capability an OS may have over a hosted application (excluding those restricted by

SGX), including but not limited to:

• Complete OS Control: We assume the adversary has complete control of the entire OS,

including re-compiling the OS kernel; rebooting the OS with arbitrary arguments

as needed; manipulating kernel data structures, such as page tables.

• Interacting with the targeted enclave: We assume the adversary is able to interact with

the targeted enclave, including scheduling the enclave program to any logical cores;

terminating/restarting and suspending/resuming the enclave program; interrupting its

execution through interrupts; intercepting exception handling inside enclaves.

• Launching and controlling another enclave: we assume the adversary is able to run

another enclave that she completely controls in the same process or another process.

This implies that the enclave can poison any BTB entries used by the targeted enclave.

The goal of the attack is to learn the memory content inside the enclave. We assume the

binary code of the targeted enclave program is already known to the adversary and does not

change during the execution. Therefore, we assume that the adversary is primarily interested

in extracting the secrets that have been provisioned into the enclaves, either by Intel or by

regular enclave developers.

2.4 Existing Threats to SGX

In this section, we summarize existing threats to SGX to be addressed in this dissertation.

These threats are categorized into two types: side-channel leakage and OS-dependent threats.

Side-channel leakage. While SGX provides strong confidentiality and integrity guarantees,

side-channel attacks through micro-architecture, e.g., CPU cache, are still a main type of

Micro-architectural side channels:

Side Channel    Shared        Exploitable when interleaved
FPU             Same core     No
Store buffer    Same core     No
TLBs            Same core     No
BPU             Same core     Yes
L1/L2 cache     Same core     Yes
LLC             Cross core    Yes
DRAM            Cross core    Yes
Page table      Cross core    Yes

OS-dependent threats:

Threat                 Cause
Replay                 OS launches the enclave
Communication delay    OS handles the communication

Table 2.2: Existing threats to SGX

threats. In these side-channel attacks, the adversary needs to either (1) share the same

micro-architectural units with the victim concurrently, or (2) share these units with the victim in an interleaved manner. Table 2.2 shows major known micro-architectural side

channels. It also indicates whether they have to be conducted on the same CPU core or can

be performed on a different core (i.e., cross core), and whether they are exploitable when

interleaving the execution of the victim and the adversary. According to Table 2.2, there are

three categories of side channels against SGX enclaves:

• Floating-point units (FPU) [20], store buffers [21,76] and translation lookaside buffers

(TLB) are shared on the same core, and they cannot be exploited by interleaved

execution, because FPU and store buffers can only be exploited by concurrent execution

and TLBs are flushed when the processor leaves the enclave mode. To exploit these side

channels, the adversary needs to leverage Hyper-Threading to run the attack code on the

sibling hyper thread of the victim thread for concurrent exploitation.

• Branch prediction units (BPU) [19] and L1/L2 caches [17,18,63,64,95] are shared by the

entire CPU core, and not flushed at context switches. Therefore, they could be utilized

from the sibling hyper thread or exploited by interleaved execution.

• LLC [37,55], DRAM [65], and page tables [85,91] are shared cross core, so attacks using

these resources could be launched from a different core. Moreover, they could also be

exploited by interleaved execution. Particularly, a privileged adversary could make use

of page table side channels, such as introducing page faults (resulting in the interleaving

of the enclave and OS’s page fault handler), or monitoring particular bits, e.g., accessed,

of page table entries.

OS-dependent threats. Peculiar to SGX's threat model, the management of an enclave's

execution relies on the OS, which is potentially adversarial. The OS is responsible for

launching and executing the SGX enclaves, managing the enclave’s memory mappings,

and handling all communication between the enclave and its external world. Such a strong

dependence not only facilitates the aforementioned side-channel threat but also introduces

new threats, as shown in Table 2.2.

One threat is the replay attack, particularly targeting stateful enclave applications. The

malicious OS tries to run the enclave with an outdated state, aiming to break the enclave

program’s integrity. Though SGX trusted platform services provide trusted monotonic

counters to address such attacks, their practicality is questioned [57].

Another threat is introduced since the OS manipulates the communication between

the enclave and its external world. The adversary could introduce arbitrary delay to the

communication. This affects time-sensitive applications. For example, consider a time-based

digital rights management (DRM) application. A secret is provisioned along with a lease duration.

Although the trusted timer provided by the SGX trusted platform service could be used to

prevent access to the secret after the lease expires, the adversary could forward a

request to the timer within the lease duration but delay the response to a later time, so that

the secret can be used outside the valid lease duration.

2.5 Effectiveness of Existing Defenses

Deterministic multiplexing. Shinde et al. [72] propose a compiler-based approach to

opportunistically place all secret-dependent control flows and data flows into the same pages,

so that page-level attacks will not leak sensitive information. However, this approach does

not consider cache side channels or DRAM side channels, leaving the defense vulnerable to

cache attacks and DRAMA.

Hiding page faults with transactional memory. T-SGX [71] prevents information leakage

about page faults inside enclaves by encapsulating the program’s execution inside hardware-

supported memory transactions. Page faults will cause transaction aborts, which will be

handled by the abort handler inside the enclave first. The transaction abort handler will notice

the abnormal page fault and decide whether to forward the control flow to the untrusted OS

kernel. As such, the page fault handler can only see that the page fault happens on the page where the abort handler is located (via register CR2). The true faulting address is hidden.

However, T-SGX cannot prevent memory side-channel attacks enabled by the accessed flags.

According to the Intel Software Developer's Manual [11], transaction abort is not

strictly enforced when the accessed flags and dirty flags of the referenced page table

entries are updated. This means there is no security guarantee that memory accesses inside a

transactional region are not leaked through updates of the page table entries.

Secure processor design. Sanctum [32] is a new hardware design that aims to protect

against both last-level cache attacks and page-table based attacks. As each Sanctum enclave has

19 its own page tables, page access patterns become invisible to the malicious OS. Therefore,

the page-fault attacks and SPM attacks will fail. However, Sanctum does not prevent

cross-enclave DRAMA attack. As a matter of fact, Sanctum still relies on the OS to assign

DRAM regions to enclaves, create page table entries and copy code and data into the enclave

during enclave initialization. Since the OS knows the exact memory layout of the enclave, the

attacker can run an attack process in a different DRAM region that shares the same

DRAM row as the target enclave address.

Timed execution. Chen et al. [30] propose a compiler-based approach, called DÉJÀ VU,

to measure the execution time of an enclave program at the granularity of basic blocks in a

control-flow graph. Execution time larger than a threshold indicates that the enclave code

has been interrupted and AEX has occurred. The intuition behind it is that execution time

measured at the basic block level will not suffer from the variations caused by different inputs.

Due to the lack of timing measurements in SGX v1 enclaves, DÉJÀ VU constructs a software

clock inside the enclave which is encapsulated inside Intel Transactional Synchronization

Extensions (TSX). Therefore, the clock itself will not be interrupted without being detected.

It was shown that DÉJÀ VU can detect AEX with high fidelity. Therefore, any of the

side-channel attack vectors that induce high volume of AEX will be detected by DÉJÀ VU.

However, attacks not involving AEXs, such as T-SPM or HT-SPM, will

bypass DÉJÀ VU completely.

Enclave Layout Randomization. SGX-Shield [69] implements fine-grained ASLR when an enclave program is loaded into the SGX memory. However, the

malicious OS could still learn the memory layout after observing memory access patterns in

a long run as SGX-Shield does not support live re-randomization.

Chapter 3: HT-SPM: Hyper-Threading Assisted Sneaky Page Monitoring Attacks1

In this chapter, we detail our study of Hyper-Threading side-channel attacks. Specifically, we present HT-SPM attacks, which can evade existing AEX-detection based defenses. We first give an overview of the attack in Section 3.1. The design is detailed in Section 3.2

and the evaluation is presented in Section 3.3.

3.1 Overview

A memory reference in the modern Intel CPU architectures involves a sequence of

micro-operations: the virtual address generated by the program is translated into the physical

address, by first consulting a set of address translation caches (e.g., TLBs and various

paging-structure caches) and then walking through the page tables in the memory. The

resulting physical address is then used to access the cache (e.g., L1, L2, L3) and DRAM to

complete the memory reference.

Page tables. Page tables are multi-level data structures stored in main memory, serving

address translation. Every page-table walk involves multiple memory accesses. Different

from regular memory accesses, page-table lookups are triggered directly by the micro-code of the

processor, without involving the re-order buffer [23]. The entry of each level

1This chapter is excerpted from [87]

Figure 3.1: Page table entries.

stores the pointer to (i.e., the physical address of) the memory page that contains the next level of the page table. The structure of a PTE is shown in Figure 3.1. Specifically, bit 0 is the present flag, indicating whether a physical page has been mapped to the virtual page; bit 5 is the accessed flag, which is set by the processor every time a page-table walk leads to a reference of this page table entry; bit 6 is the dirty flag, which is set when the corresponding page has been updated. Page frame reclaiming algorithms rely on the dirty flag to make frame reclamation decisions.

As the page tables are located inside the OS kernel and controlled by the untrusted system software, they can be manipulated to attack enclaves. However, as mentioned earlier, because the EPC page permissions are also protected by the EPCM, malicious system software cannot arbitrarily manipulate the EPC pages to compromise their integrity and confidentiality.

However, it has been shown in previous work [91] that by clearing the present flag in the corresponding PTEs, the malicious system software can collect traces of page accesses from the enclave programs, inferring secret-dependent control flows or data flows. Nevertheless, manipulating the present flag is not the only attack vector against enclave programs.
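The flags above can be decoded with simple bit tests. The bit positions follow Figure 3.1; the `decode_pte` helper and the sample entry value are made up for illustration:

```python
# Decoding the PTE flags: present = bit 0, accessed = bit 5, dirty = bit 6.

PRESENT, ACCESSED, DIRTY = 1 << 0, 1 << 5, 1 << 6

def decode_pte(pte):
    return {
        "present":  bool(pte & PRESENT),
        "accessed": bool(pte & ACCESSED),
        "dirty":    bool(pte & DIRTY),
        "frame":    pte >> 12,   # physical frame number (ignoring high attribute bits)
    }

pte = 0x0000000012345063         # hypothetical entry with P/A/D all set
print(decode_pte(pte))

# The SPM attack repeatedly clears the accessed flag and watches it flip back:
pte &= ~ACCESSED                 # attacker resets the flag in the PTE
assert not decode_pte(pte)["accessed"]
pte |= ACCESSED                  # a later page-table walk sets it again
assert decode_pte(pte)["accessed"]
```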

When a page-table walk results in a reference of the PTE, the accessed flag of the entry will be set to 1. As such, code run in non-enclave mode will be able to detect the page table updates and learn that the corresponding EPC pages have just been visited by the enclave code. However, a page-table walk will also update TLB entries, so that future

22 reference to the same page will not update the accessed flags in PTEs, until the TLB entries

are evicted by other address translation activities. Flushing TLB entries without triggering

AEXs becomes the key point of our attack.

Address Translation Caches. Address translation caches are hardware caches that facilitate

address translation, including TLBs and various paging-structure caches. TLB is a multi-

level set-associative hardware cache that temporarily stores the translation from virtual page

numbers to physical page numbers. Specifically, the virtual address is first divided into three

components: TLB tag bits, TLB-index bits, and page-offset bits. The TLB-index bits are

used to index a set-associative TLB and the TLB-tag bits are used to match the tags in each

of the TLB entries of the searched TLB set. Similar to L1 caches, the L1 TLB for data and

instructions are split into dTLB and iTLB. An L2 TLB, typically larger and unified, will

be searched upon L1 TLB misses. Recent Intel processors allow selectively flushing TLB

entries at context switches. This is enabled by the Process-Context Identifier (PCID) field

in the TLB entries to avoid flushing the entries that will be used again. If both levels of

TLBs miss, a page table walk will be initiated. The virtual page number is divided into,

according to Intel’s terminology, PML4 bits, PDPTE bits, PDE bits, and PTE bits, each of which is responsible for indexing one level of page tables in the main memory. Due to the

long latency of page-table walks, if the processor is also equipped with paging structure

caches, such as PML4 cache, PDPTE cache, PDE cache, these hardware caches will also

be searched to accelerate the page-table walk. The PTEs can be first searched in the cache

hierarchy before the memory access [23].
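The address splitting described above can be illustrated as follows, assuming 4KB pages, a 16-set dTLB (matching Table 3.1), 9-bit per-level page-table indices, and an arbitrary sample address:

```python
# Splitting a virtual address into the TLB lookup fields and the
# four page-walk indices (PML4, PDPTE, PDE, PTE).

def split_va(va, tlb_sets=16):
    page_offset = va & 0xFFF            # bits 0-11
    vpn = va >> 12                      # virtual page number
    return {
        "tlb_index": vpn % tlb_sets,    # selects the TLB set
        "tlb_tag":   vpn // tlb_sets,   # matched against entries in that set
        "pte":       vpn & 0x1FF,       # VA bits 12-20: last-level index
        "pde":       (vpn >> 9) & 0x1FF,
        "pdpte":     (vpn >> 18) & 0x1FF,
        "pml4":      (vpn >> 27) & 0x1FF,
        "offset":    page_offset,
    }

print(split_va(0x7F5A_1234_5678))
```

The TLB-index field is what HT-SPM's cleaner threads target: probing attacker-controlled addresses that map to the same sets evicts the victim's entries.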

When Hyper-Threading is enabled, code running in the enclave mode may share the

same set of TLBs and paging-structure caches with code running in non-enclave mode.

Therefore, the enclave code’s use of such resources will interfere with that of the non-enclave

code, creating chances of evicting the enclave's TLB entries through carefully designed behavior of the non-enclave code.

3.2 Design

To attack the virtual memory, a page-fault side-channel attacker first restricts access to all pages, which induces page faults whenever the enclave process touches any of these pages, thereby generating a sequence of its page visits. A problem here is that this approach is heavyweight, causing an interrupt for each page access. This often leads to a performance slowdown by one or two orders of magnitude [91]. As a result, such an attack could be detected by looking at its extremely high frequency of page faults (i.e., AEXs) and anomalously low performance observed remotely. All existing solutions, except those requiring hardware changes, either rely on detecting interrupts or try to remove the page trace of a program (e.g., by putting all the code on one page). Little has been done to question whether such defenses are sufficient.

To show that excessive AEXs are not a necessary condition for conducting memory side-channel attacks, in this section we elaborate on sneaky page monitoring (SPM), a new paging attack that can achieve comparable effectiveness with much less frequent AEXs.

The SPM attack manipulates and monitors the accessed flags on the pages of an enclave process to identify its execution trace. Specifically, we run a system-level attack process outside an enclave to repeatedly inspect each page table entry's accessed flag, record when it is set (from 0 to 1), and reset it once this happens. The page-access trace recovered in this way is a sequence of page sets, each of which is a group of pages visited (with their accessed flags set to 1) between two consecutive inspections. This attack is lightweight since it does not need any AEX to observe the pages the first time they are visited.
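The trace-collection loop can be simulated as below. This is our own sketch; it assumes the relevant TLB entries are evicted between inspections, so that each page visit sets the accessed flag again:

```python
# Simulation of SPM trace collection: the victim's page visits set the
# accessed flags; each inspection records the set of pages whose flags
# flipped to 1 and then clears them, yielding a trace of page *sets*
# rather than an exact page sequence. No AEX is needed: the flags are
# read and cleared from outside the enclave.

def spm_trace(visit_batches):
    """visit_batches: pages the enclave touches between two inspections."""
    accessed = {}        # page -> accessed flag, as seen in the PTEs
    trace = []
    for batch in visit_batches:
        for page in batch:                # victim executes: flags get set
            accessed[page] = 1
        observed = {p for p, f in accessed.items() if f == 1}
        trace.append(sorted(observed))
        for p in observed:                # attacker resets the flags
            accessed[p] = 0
    return trace

print(spm_trace([[3, 7], [7, 9], [3]]))   # [[3, 7], [7, 9], [3]]
```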

24 However, as mentioned earlier, after a virtual address is translated, its page number is

automatically added to a TLB. As a result, the accessed flag of that page will not be set

again when the page is visited later. To force the processor to access the PTE (and update

the flag), the attacker has to invalidate the TLB entry proactively. The simplest way to do

so is by generating an inter-processor interrupt (IPI) from a different CPU core to trigger a

TLB shootdown, which causes an AEX from the enclave, resulting in flushing of all TLB

entries of the current PCID.

Further, we found that when Hyper-Threading is turned on for a processor, we can clear

up the TLBs without issuing TLB shootdowns, which renders all existing interrupt-based

protection ineffective. Hyper-Threading runs two virtual cores on a physical core to handle

the workloads from two different OS processes. This resource-sharing is transparent to the

OS and therefore does not trigger any interrupt. The processes running on the two virtual

cores share some of the TLBs, which allows the attacker to remove some TLB entries outside

the enclave, without causing any interrupts. As a result, in the presence of Hyper-Threading, we can run an attack process together with an enclave process, to continuously probe the virtual addresses in conflict with the TLB entries the latter uses, in an attempt to evict these

entries and force the victim process to walk its page tables. Using this technique, which we

call HT-SPM, we can remove most interrupts, or even eliminate them entirely, during an attack.

3.3 Evaluation

Our analysis was performed on a Dell OptiPlex 7040 with a Skylake i7-6700 processor

and 4 physical cores, with 16GB memory. The configuration of the cache and memory

hierarchy is shown in Table 3.1. It runs Ubuntu 16.04 with kernel version 4.2.8. During our

experiments, we patched the OS when necessary to facilitate the attacks, as an OS-level

25 adversary would do. We used the latest Graphene-SGX Library OS [10, 80] compiled using

GCC 5.4.0 with default compiling options to port unmodified libraries.

          Size    Sets × Ways
iTLB      64      8 × 8
dTLB      64      16 × 4
L2 TLB    1536    128 × 12
iCache    32KB    64 × 8
dCache    32KB    64 × 8
L2 Cache  256KB   1024 × 4
L3 Cache  8MB     8192 × 16

          Size    Channels × DIMMs × Ranks × Banks × Rows
DRAM      8GB×2   2 × 1 × 2 × 16 × 2^15

Table 3.1: Configuration of the testbed, available per logical core when Hyper-Threading is enabled.

HT-SPM on Hunspell. Hunspell is a popular spell checking tool used by software packages

like Apple’s OS X and Google Chrome. It stores a dictionary in a hash table, which uses

linked lists to link the words with the same hash values. Each linked list spans across multiple

pages, so searching for a word often generates a unique page-visit sequence. Prior study [91]

shows that by monitoring page faults, the attacker outside an enclave can fingerprint the

dictionary lookup function inside the enclave, and further determine the word being checked

from the sequence of accessing different data pages (for storing the dictionary).

We ran HT-SPM on Hunspell, in a scenario when a set of words were queried on the

dictionary. We conducted the experiments on the Intel Skylake i7-6700 processor, which is

characterized by multi-level TLBs (see Table 3.1). The experiments show that the dTLB

and L2 TLB are fully shared across logical cores. Our attack process includes 6 threads:

2 cleaners operating on the same physical core as the Hunspell process in the enclave for

26 evicting its TLB entries and 4 collectors for inspecting the accessed flags of memory pages.

The cleaners probed all 64 and 1536 entries of the dTLB and L2 TLB every 4978 cycles

and the collectors inspected the PTEs once every 128 cycles. In the experiment, we let

Hunspell check 100 words inside the enclave, with the attack process running outside. The

collectors, once seeing the fingerprint of the spell function, continuously gathered traces

for data-page visits, from which we successfully recovered the exact page visit set for 88 words. The attack incurred a slowdown of 39.1% and did not fire a single TLB shootdown.

Chapter 4: SGXPECTRE: Speculative Execution Enabled Side-Channel Attacks2

In the previous chapter, we presented a Hyper-Threading side-channel attack that triggers

no extra AEXs and thus bypasses existing AEX-detection based mitigation schemes. In

this chapter, we present SGXPECTRE, another side-channel attack that exploits speculative

execution to compromise the SGX’s confidentiality guarantee, and investigate its security

implication on SGX ecosystem:

First, we explore techniques to conduct SGXPECTRE Attacks. SGXPECTRE is the term we coined for the SGX variants of the Spectre attacks, to differentiate them from other Spectre variants. At a high level, SGXPECTRE exploits the race condition between

the speculatively executed memory references and the latency of the branch resolution,

in order to generate side-channel observable cache traces and consequently read memory

content. Specifically, we explore how branch targets can be injected into SGX enclaves,

how registers inside enclaves can be controlled by the outside world, how information can

be leaked through side channels, and how the adversary could increase the probability of winning the race condition. These techniques are the key components for successfully

performing SGXPECTRE Attacks. To the best of our knowledge, they have never been

studied in previous works.

2This chapter is excerpted from [28]

Second, we develop techniques to automate the search for vulnerabilities in enclave

binaries. We observe SGXPECTRE Attacks are enabled by two types of code gadgets

in the enclave binary. To help the enclave developers detect vulnerabilities in their code, we develop binary analysis tools to symbolically execute enclave code and automatically

identify such gadgets from enclave binaries. As a result, we found both types of gadgets exist

in widely used SGX runtimes, such as Intel SGX SDK, Rust-SGX SDK, and Graphene-SGX

library OS. Therefore any enclave program built with these runtimes would be vulnerable to

SGXPECTRE Attacks. To our knowledge, our tool is the first to perform symbolic execution

on enclave binaries (which we have open-sourced on GitHub). It is also the first tool to

automatically detect software vulnerabilities that enable Spectre-like attacks. We expect our

study will inspire future research.

Third, we demonstrate end-to-end attacks to validate the fidelity of SGXPECTRE Attacks and extract Intel's secrets. Particularly, we show that the adversary could learn the

content of the enclave memory as well as its register values from a victim enclave. An even

more alarming consequence is that SGXPECTRE Attacks can be leveraged to steal secrets

belonging to Intel SGX platforms, such as provisioning keys, seal keys, and attestation keys.

For example, we have demonstrated that SGXPECTRE Attacks are able to read memory

from the quoting enclave developed by Intel and extract Intel’s seal key, which can be used

to decrypt the sealed EPID blob to extract the attestation key (i.e., EPID private key). With

an attestation key, the adversary could compromise a large group of SGX platforms that

share the same EPID public key. Our work was one of the first to demonstrate the extraction

of Intel’s secrets.

Fourth, we investigate the security implication of SGXPECTRE Attacks on the SGX

ecosystem. We enumerate all derived keys and in-memory secrets of Intel’s SGX platforms,

and study how Intel mitigates the threats to these in-memory secrets by having them depend on the version of the microcode of the SGX platform. This chapter contributes to the overall understanding of the security implications of SGXPECTRE Attacks and similar attacks targeting the confidentiality of SGX platforms.

This chapter is organized as follows: Section 4.1 presents a systematic exploration of attack vectors in enclaves and techniques that enable practical attacks. Section 4.2 presents a symbolic execution tool for automatically searching instruction gadgets in enclave programs. Section 4.3 shows end-to-end SGXPECTRE Attacks against enclave runtimes that lead to a complete breach of enclave confidentiality. Section 4.4 discusses and evaluates countermeasures against the attacks. Section 4.5 discusses the security implications of side channels on SGX platforms. This chapter is summarized in Section 4.6.

4.1 SGXPECTRE Attacks

4.1.1 A Simple Example

Steps of an SGXPECTRE Attack are illustrated in Figure 4.1. Step 1 is to poison the branch target buffer, such that when the enclave program executes a branch instruction at a specific address, the predicted branch target is the address of enclave instructions that may leak secrets. For example, in Figure 4.1, to trick the ret instruction at address 0x02560 in the enclave to speculatively return to the secret-leaking instructions located at address

0x07642, the code to poison the branch prediction executes an indirect jump from the source address 0x02560 to the target address 0x07642 multiple times. We will discuss branch target injection in more detail in Section 4.1.2.

Step 2 is to prepare a CPU environment to increase the chance of speculatively executing the secret-leaking instructions before the processor detects the mis-prediction and flushes

Figure 4.1: A simple example of SGXPECTRE Attacks. The gray blocks represent code or data outside the enclave. The white blocks represent enclave code or data.

the pipeline. Such preparation includes flushing the victim's branch target address (to delay

the retirement of the targeted branch instruction or return instruction) and depleting the RSB

(to force the CPU to predict return address using the BTB). Flushing branch targets cannot

use the clflush instruction, as the enclave memory is not accessible from outside (We will

discuss alternative approaches in Section 4.1.5). The code for depleting the RSB (shown in

Figure 4.1) pushes the address of a ret instruction 16 times and returns to itself repeatedly to drain all RSB entries.

Step 3 is to set the register values used by the speculatively executed secret-leaking instructions, such that they will read enclave memory targeted by the adversary and leave cache traces that the adversary could monitor. In this simple example, the adversary sets r14 to 0x106500, the address of a 2-byte secret inside the enclave, and sets r15 to 0x610000, the base address of a monitored array outside the enclave. The enclu instruction with rax=2 is executed to enter the enclave. We will discuss methods to pass values into the enclaves in

Section 4.1.3.

Step 4 is to trigger the enclave code. Because of the BTB poisoning, instructions at address 0x07642 will be executed speculatively while the target of the ret instruction at address 0x02560 is being resolved. The instruction “movzwq (%r14), %rbx” loads the

2-byte secret data into rbx, and “mov (%r15, %rbx, 1), %rdx” touches one entry of the monitored array dictated by the value of rbx.

Step 5 is to examine the monitored array using a FLUSH-RELOAD side channel and extract the secret values. Techniques to do so are discussed in detail in Section 4.1.4.
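The decision logic of the final FLUSH-RELOAD measurement step can be sketched in a few lines. This is only a simulation of the recovery logic with synthetic latencies; the hit threshold is an assumption that a real attack would calibrate per machine:

```python
# Simulation of the FLUSH-RELOAD recovery step. A real attack times
# each reload (e.g., with rdtscp); here the latencies are synthetic.
THRESHOLD = 100  # cycles; below this we assume a cache hit (calibrated)

def recover_secret(reload_times):
    """Return the indices of monitored entries that reloaded fast,
    i.e., the candidate secret values touched during speculation."""
    return [i for i, t in enumerate(reload_times) if t < THRESHOLD]

# Entry 0x41 was touched speculatively, so only it reloads quickly.
times = [300] * 256
times[0x41] = 60
assert recover_secret(times) == [0x41]
```

The same loop, run over the full monitored range, turns reload latencies into the secret value that steered the speculative access.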

4.1.2 Injecting Branch Targets into Enclaves

The branch prediction units in modern processors typically consist of:

• Branch target buffer: When an indirect jump/call or a conditional jump is executed, the

target address will be cached in the BTB. The next time the same indirect jump/call

is executed, the target address in the BTB will be fetched for speculative execution.

Modern x86-64 architectures typically support 48-bit virtual addresses and 40-bit physical

Figure 4.2: Poisoning BTB from the same process or a different process

address [11,49]. For space efficiency, many Intel processors, such as Skylake, use only

the lower 32 bits of a virtual address as the index and tag of a BTB entry.

• Return stack buffer: When a near Call instruction with a non-zero displacement [3] is

executed, an entry with the address of the instruction sequentially following it will be

created in the return stack buffer (RSB). The RSB is not affected by far Call, far Ret, or

Iret instructions. Most processors that implement RSB have 16 entries [34]. On Intel

Skylake or later processors, when RSB underflows, BTBs will be used for prediction

instead.

Poisoning BTBs from outside. To temporarily alter the control-flow of the enclave code by injecting branch targets, the adversary needs to run BTB poisoning code outside the targeted enclave, which could be done in one of the following ways (as illustrated in Figure 4.2).

• Branch target injection from the same process. The adversary could poison the BTB by

using code outside the enclave but in the same process. Since the BTB uses only the lower

32 bits of the source address to index a BTB entry, the adversary could reserve a 2^32 =

[3] Call instructions with zero displacements will not affect the RSB, because they are common code constructions for obtaining the current RIP value. These zero-displacement calls do not have matching returns.

4GB memory buffer, and execute an indirect jump instruction (within the buffer) whose

source address (e.g., 0x7fff00002560) is the same as the branch instruction in the target

enclave (i.e., 0x02560) in the lower 32 bits, and target address (e.g., 0x7fff00007642) is

the same as the secret-leaking instructions (i.e., 0x07642) inside the target enclave in the

lower 32 bits.

• Branch target injection from a different process. The adversary could inject the branch

targets from a different process. Although this attack method requires a context switch

between the execution of the BTB poisoning code and the targeted enclave program,

the advantage of this method is that the adversary could encapsulate the BTB poisoning

code into another enclave that is under his control. This allows the adversary to

perfectly shadow the branch instructions of the targeted enclave program (i.e., matching

all bits in the virtual addresses).
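The lower-32-bit aliasing that the same-process variant relies on can be checked directly. This is a sketch; btb_alias is a hypothetical helper, and the 32-bit indexing is the Skylake behavior described above:

```python
# BTB entries on the discussed parts are indexed and tagged by only the
# lower 32 bits of the virtual address, so addresses that agree in
# those bits collide in the BTB.
MASK32 = (1 << 32) - 1

def btb_alias(addr_a, addr_b):
    """True if two virtual addresses fall into the same BTB entry."""
    return (addr_a & MASK32) == (addr_b & MASK32)

# The shadow branch at 0x7fff00002560 aliases the enclave branch at
# 0x02560, and its target 0x7fff00007642 aliases the gadget at 0x07642.
assert btb_alias(0x7fff00002560, 0x02560)
assert btb_alias(0x7fff00007642, 0x07642)
assert not btb_alias(0x7fff00002560, 0x07642)
```

This is why a 4GB buffer suffices for same-process injection, while cross-process injection can match all address bits.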

It is worth noting that address space layout randomization can be disabled by the adversary to facilitate the BTB poisoning attacks. On a Lenovo Thinkpad X1 Carbon (4th

Gen) laptop with an i5-6200U processor (Skylake), we have verified that for indirect jump/call, the BTB could be poisoned either from the same process or a different process. For the return instructions, we only observed successful poisoning using a different process (i.e., perfect branch target matching). To force return instructions to use BTB, the

RSB needs to be depleted before executing the target enclave code. Interestingly, as shown in Figure 4.1, a near call is made in enclave_entry(), which could have filled the RSB, but we could still inject the return target of the return instruction at 0x02560 via the BTB. We speculate that this is due to an architecture-specific implementation choice. A more reliable way to deplete the RSB is through the use of AEX as described in Section 4.3.1.

Figure 4.3: EENTER and ECall table lookup

4.1.3 Controlling Registers in Enclaves

Because all registers are restored by hardware after ERESUME, the adversary is not able to control any register inside the enclave when the control returns back to the enclave after an AEX. In contrast, most registers can be set before the EENTER leaf function and remain controlled by the adversary after entering the enclave mode until modified by the enclave code. Therefore, the adversary might have a chance to control some registers in the enclave after an EENTER.

The SGX developer guide [12] defines ECall and OCall to specify the interaction between the enclave and external software. An ECall, or “Enclave Call”, is a function call to enter enclave mode; an OCall, or “Outside Call”, is a function call to exit the enclave mode. Returning from an OCall is called an ORet. Both ECalls and ORets are implemented through EENTER by the SGX SDK. As shown in Figure 4.3, the function enter_enclave() is called by the enclave entry point, enclave_entry(). Then depending on the value of

the edi register, do_ecall() or do_oret() will be called. The do_ecall() function is triggered to call trts_ecall() and get_function_address() in a sequence and eventually look up the ECall table. Both ECall and ORet can be exploited to control registers in enclaves.
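The dispatch path can be modeled as below. The function names mirror Figure 4.3, but the ORet sentinel value and the table layout are simplified assumptions, not the SDK's actual data structures:

```python
# Simplified model of the EENTER dispatch in Figure 4.3: edi selects
# between an ECall and an ORet; ECalls are resolved via the ECall table.
ECMD_ORET = -2  # assumed sentinel; a non-negative edi is an ECall index

def enter_enclave(edi, ecall_table):
    if edi == ECMD_ORET:
        return "do_oret"  # resume the interrupted ECall after an OCall
    return do_ecall(edi, ecall_table)

def do_ecall(index, ecall_table):
    # trts_ecall() / get_function_address(): bounds-check, then look up.
    if not (0 <= index < len(ecall_table)):
        raise IndexError("invalid ECall index")
    return ecall_table[index]

table = ["ecall_busy_loop"]
assert enter_enclave(0, table) == "ecall_busy_loop"
assert enter_enclave(ECMD_ORET, table) == "do_oret"
```

Both paths matter to the attack: registers left unsanitized on either path remain adversary-controlled when the first branch instruction executes.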

4.1.4 Leaking Secrets via Side Channels

The key to the success of SGXPECTRE Attacks lies in the fact that speculatively executed instructions trigger implicit caching, which is not properly rewound when the incorrectly issued instructions are discarded by the processor. Therefore, these side effects of speculative execution on CPU caches can be leveraged to leak information from inside the enclave.

Cache side-channel attacks against enclave programs have been studied recently [26,

35, 38, 68], all of which demonstrated that a program running outside the enclave may use

PRIME-PROBE techniques [79] to infer secrets from the enclave code, only if the enclave code has secret-dependent memory access patterns. Though more fine-grained and less noisy, FLUSH-RELOAD techniques [92] cannot be used in SGX attacks since enclaves do not share memory with the external world.

Different from these studies, however, SGXPECTRE Attacks may leverage these less noisy FLUSH-RELOAD side channels to leak information. Because the enclave code can access data outside the enclave directly, an SGXPECTRE Attack may force the speculatively executed memory references inside enclaves to touch memory locations outside the enclave, as shown in Figure 4.1. The adversary can flush an array of memory before the attack, such as the array from address 0x610000 to 0x61ffff, and then reload each entry and measure the reload time to determine if the entry has been touched by the enclave code during the speculative execution.

Figure 4.4: Best scenarios for winning a race condition. Memory accesses D1, I1, D2, D3 are labeled next to the related instructions. The address translation and data accesses are illustrated on the right: The 4 blocks on top denote the units holding the address translation information, including TLBs, paging structures, caches (for PTEs), and the memory; the 4 blocks at the bottom denote the units holding data/instruction. The shadow blocks represent the units from which the address translation or data/instruction access are served.

Other than cache side-channel attacks, previous work has demonstrated BTB side-

channel attacks, TLB side-channel attacks, DRAM-cache side-channel attacks, and page-

fault attacks against enclaves. In theory, some of these avenues may also be leveraged by

SGXPECTRE Attacks. For instance, although TLB entries used by the enclave code will be

flushed when exiting the enclave mode, a PRIME-PROBE-based TLB attack may learn that a

TLB entry has been created in a particular TLB set when the program runs in the enclave

mode. Similarly, BTB and DRAM-cache side-channel attacks may also be exploitable in

this scenario. However, page-fault side channels cannot be used in SGXPECTRE Attacks

because the speculatively executed instructions will not raise exceptions.

4.1.5 Winning a Race Condition

At the core of an SGXPECTRE Attack is a race between the execution of the branch instruction and the speculative execution: data leakage will only happen when the branch instruction retires later than the speculative execution of the secret-leaking code. Figure 4.4 shows a desired scenario for winning such a race condition in an SGXPECTRE Attack: The branch instruction has one data access D1, while the speculative execution of the secret-leaking code has one instruction fetch I1 and two data accesses D2 and D3. To win the race condition, the adversary should ensure that the memory accesses of I1, D2 and D3 are fast enough. However, because I1 and D2 fetch memory inside the enclave, and as TLBs and paging structures used inside the enclaves are flushed at AEX or EEXIT, the adversary could at best perform the address translation of the corresponding pages from caches (i.e., use cached copies of the page table). Fortunately, it can be achieved by performing Step 4 in

Figure 4.1 multiple times. It is also possible to preload the instructions and data used in

I1 and D2 into the L1 cache to further speed up the speculative execution. As D3 accesses memory outside the enclave, it is possible to preload the TLB entry of the corresponding page. However, data of D3 should be loaded from the memory.

Meanwhile, the adversary should slow down D1 by forcing its address translation and data fetch to happen in the memory. However, this step has been proven technically challenging.

First, it is difficult to effectively flush the branch target (and the address translation data) to memory without using the clflush instruction. Second, because the return address is stored in the stack frames, which are very frequently used during the execution, evicting return addresses must be done frequently. In the attack described in Section 4.3, we leveraged an additional page fault to suspend the enclave execution right before the branch instruction and flush the return target by evicting all cache lines in the same cache set.
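Since clflush cannot target enclave memory, eviction has to go through cache-set contention. The following sketch shows how congruent addresses can be chosen; the cache geometry constants are illustrative assumptions, and a real eviction set must also deal with physical indexing and the replacement policy:

```python
# Build a set of addresses that map to the same cache set as a target,
# so touching all of them evicts the target's line. Assumed example
# geometry: 64-byte lines, 64 sets, 8 ways.
LINE_BITS = 6
SET_BITS = 6
WAYS = 8

def cache_set(addr):
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

def eviction_set(target, buf_base, ways=WAYS):
    """Pick `ways` addresses in our own buffer that are congruent with
    `target`; buf_base is assumed aligned to the set-index period."""
    stride = 1 << (LINE_BITS + SET_BITS)  # set index repeats each 4 KB
    first = buf_base + (cache_set(target) << LINE_BITS)
    return [first + i * stride for i in range(ways)]

evset = eviction_set(0x02560, buf_base=0x700000)
assert all(cache_set(a) == cache_set(0x02560) for a in evset)
```

Reading every address in the set fills all ways of that set, pushing the cached return target (and its translation data) back to memory and slowing D1.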

4.2 Attack Gadgets Identification

In this section, we show that enclave programs developed with existing SGX SDKs

are vulnerable to SGXPECTRE Attacks. In particular, we have developed an automated

program analysis tool that symbolically executes the enclave code to examine code patterns

in SGX runtimes, and have identified those code patterns in every runtime library we have

examined, including Intel’s SGX SDK [3], Rust-SGX [33], and Graphene-SGX [80]. In this

section, we present how we search these gadgets in greater detail.

4.2.1 Types of Gadgets

In order to launch SGXPECTRE Attacks, two types of code patterns are needed. The first

type of code patterns consists of a branch instruction that can be influenced by the adversary

and several registers that are under the adversary’s control when the branch instruction is

executed. The second type of code patterns consists of two memory references close to each

other and collectively reveal some enclave memory content through cache side channels.

Borrowing the term used in return-oriented programming [70] and Spectre attacks [51], we

use gadgets to refer to these patterns. More specifically, we name them Type-I gadgets and

Type-II gadgets, respectively.

Type-I gadgets: branch target injection. A gadget is a sequence of instructions that are

executed sequentially during one run of the enclave program (but not necessarily consecutive

in the memory layout). A Type-I gadget is such an instruction sequence that starts from

the entry point of EENTER (dubbed enclave_entry()) and ends with one of the following

instructions: (1) near indirect jump, (2) near indirect call, or (3) near return. EENTER is

the only method for the adversary to take control of registers inside enclaves. During an

EENTER, most registers are preserved by the hardware; they are left to be sanitized by the

enclave software. If any of these registers is not overwritten by the software before one of

the three types of branch instructions is met, a Type-I gadget is found. An example of a

Type-I gadget is shown in Listing 4.1, which is excerpted from libsgx_trts.a of Intel

SGX SDK. In particular, line 49 in Listing 4.1 (the retq at 3627) is the first return instruction encountered

by an enclave program after EENTER. When this near return instruction is executed, several

registers can still be controlled by the adversary, including rbx, rdi, rsi, r8, r9, r10, r11,

r14, and r15.

Type-II gadgets: secret extraction. A Type-II gadget is a sequence of instructions that

starts from a memory reference instruction that loads data in the memory pointed to by

register regA into register regB, and ends with another memory reference instruction whose

target address is determined by the value of regB. When the control flow is redirected to a

Type-II gadget, if regA is controlled by the adversary, the first memory reference instruction will load regB with the value of the enclave memory chosen by the adversary. Because the

entire Type-II gadget is speculatively executed and eventually discarded when the branch

instruction in the Type-I gadget retires, the secret value stored in regB will not be learned

by the adversary directly. However, as the second memory reference will trigger the implicit

caching, the adversary can use a FLUSH-RELOAD side channel to extract the value of regB.

An example of a Type-II gadget is illustrated in Listing 4.2, which is excerpted from the

libsgx_tstdc.a library of Intel SGX SDK. Assuming rsi is a register controlled by the

adversary, the first instruction (line 3) reads the content of the memory pointed to by

rsi+0x38 to edi. Then the value of rbx+rdi×8 is stored in rdi (line 5). Finally, the

memory address at rdi+0x258 is loaded to be compared with rsi (line 6). To narrow down

the range of rdi+0x258, it is desired that rbx is also controlled by the adversary. We use

regC to represent these base registers like rbx.
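The dataflow through the gadget in Listing 4.2 can be replayed with a toy memory model, which shows why controlling regA (here rsi) and regC (here rbx) is enough to steer the second, cache-visible access. All addresses and values below are made up for illustration:

```python
# Toy replay of the Type-II gadget in Listing 4.2:
#   mov 0x38(%rsi),%edi       regB <- enclave word selected via regA
#   lea (%rbx,%rdi,8),%rdi    regD <- regC + 8*regB
#   cmp 0x258(%rdi),%rsi      memory touched at regD + 0x258
def type2_gadget(mem, rsi, rbx, trace):
    rdi = mem[rsi + 0x38]      # secret-dependent load (regA = rsi)
    rdi = rbx + rdi * 8        # regC = rbx narrows the probed range
    trace.append(rdi + 0x258)  # address observable via FLUSH-RELOAD
    return rdi

mem = {0x106500 + 0x38: 0x41}  # a "secret" word inside the enclave
trace = []
type2_gadget(mem, rsi=0x106500, rbx=0x610000, trace=trace)
# The probed address encodes the secret value 0x41.
assert trace == [0x610000 + 8 * 0x41 + 0x258]
```

Inverting the final address arithmetic over the observed cache trace recovers the loaded secret, even though the gadget itself is squashed.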

4.2.2 Symbolically Executing SGX Code

Although a skillful developer can manually read the source code or even the disassembled

binary code of an enclave program and runtime libraries to identify exploitable gadgets,

such an effort is very tedious and error-prone. It is highly desirable to leverage automated

software tools to scan an enclave binary to detect any gadgets, and eliminate them before

deploying them to untrusted SGX machines.

To this end, we devise a dynamic symbolic execution technique to enable automated

identification of SGXPECTRE Attack gadgets. Symbolic execution [50] is a program testing

and debugging technique in which symbolic inputs are supplied instead of concrete inputs.

Symbolic execution abstractly executes a program and concurrently explores multiple

execution paths. The abstract execution of each execution path is associated with a path

constraint that represents multiple concrete runs of the same program that satisfy the path

conditions. Using symbolic execution techniques, we can explore multiple execution paths

in enclave programs to find gadgets of SGXPECTRE Attacks.

Symbolic execution of an enclave function. We design a tool built atop angr [74], a

popular binary analysis framework, to perform the symbolic execution. To avoid the path

explosion problem in symbolically executing a large enclave program (or a large SGX runtime

such as Graphene-SGX), our tool allows the user to specify an arbitrary enclave function to

start the symbolic execution. During the symbolic execution, machine states are maintained

internally to represent the status of registers, stacks, and the memory; instructions update the

machine states represented with symbolic values while the execution makes forward progress.

The exploration of an execution path terminates when the execution returns to this entry func-

tion or detects a gadget. To symbolically execute an SGX enclave binary, we have extended

angr to handle: (1) the EEXIT instruction, by putting the address of the enclave entry point,

enclave_entry(), in the rip register of its successor states; and (2) instructions that are

not already supported by angr, such as xsave, xrstore, and repz.

4.2.3 Gadget Identification

Identifying Type-I gadgets. The key requirement of a Type-I gadget is that before the

execution of the indirect jump/call or near return instruction, the values of some registers are

controlled (directly or indirectly) by the adversary, which can only be achieved via EENTER.

We consider two types of Type-I gadget separately: ECall gadgets and ORet gadgets.

To detect ECall gadgets, the symbolic execution starts from the enclave_entry()

function and stops when a Type-I gadget is found. During the path exploration, the edi register

is set to a value that leads to an ECall.

To detect ORet gadgets, the symbolic execution starts from a user-specified function

inside the enclave. Once an OCall is encountered, the control flow is transferred to

enclave_entry() and the edi register is set to a value that leads to an ORet. At this

point, all other registers are considered controlled by the adversary and thus are assigned

symbolic values. An ORet gadget is found if an indirect jump/call or near return instruction

is encountered and some of the registers still have symbolic values. The symbolic execution

continues if no gadgets are found until the user-specified function finishes.

Identifying Type-II gadgets. To identify Type-II gadgets, our tool scans the entire enclave

binary and looks for memory reference instructions (i.e., mov and its variants, such as movd

and movq) that load register regB with data from the memory location pointed to by regA.

Both regA and regB are general registers, such as rax, rbx, rcx, rdx, r8 - r15. Once one

of such instructions is found, the following N instructions (e.g., N = 10) are examined to

see if there is another memory reference instruction (e.g., mov, cmp) that accesses a memory

location pointed to by register regD. If so, the instruction sequence is a potential Type-II

gadget. It is desired to have a register regC used as the base address for the second memory

reference. However, we also consider gadgets that do not involve regC, because they are

also exploitable.

Once we have identified a potential gadget, it is executed symbolically using angr. The

symbolic execution starts from the first instruction of a potential Type-II gadget, and regB

and regC are both assigned symbolic values. At the end of the symbolic execution of the

potential gadget, the tool checks whether regD contains a derivative value of regB, and when regC is used as the base address of the second memory reference, whether regC still

holds its original symbolic values. A potential gadget is a true gadget if the checks pass. We

use either [regA, regB, regC] or [regA, regB] to represent a Type-II gadget.
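The scan described above can be approximated with a line-oriented pattern match over objdump-style disassembly. This is a toy stand-in for the angr-based tool, implementing only the "load through regA into regB, then a memory reference based on regB" pattern:

```python
import re

# Toy Type-II scan: a mov that loads regB from memory addressed via
# regA, followed within `window` instructions by another memory access
# whose base register derives from regB.
LOAD = re.compile(r'mov[a-z]*\s+(?:-?0x[0-9a-f]+)?\(%(\w+)\),%(\w+)')
MEMREF = re.compile(r'(?:mov|cmp|lea)[a-z]*\s+.*\(%(\w+)')

def norm(reg):
    """Map 32-bit register names to 64-bit ones (edi -> rdi, r9d -> r9)."""
    if reg.startswith('e'):
        return 'r' + reg[1:]
    return reg[:-1] if re.fullmatch(r'r\d+d', reg) else reg

def find_type2(lines, window=10):
    gadgets = []
    for i, line in enumerate(lines):
        m = LOAD.search(line)
        if not m:
            continue
        reg_a, reg_b = norm(m.group(1)), norm(m.group(2))
        for later in lines[i + 1:i + 1 + window]:
            m2 = MEMREF.search(later)
            if m2 and norm(m2.group(1)) == reg_b:
                gadgets.append((i, reg_a, reg_b))
                break
    return gadgets

# The Listing 4.2 gadget is reported with regA=rsi, regB=rdi.
listing = ["mov 0x38(%rsi),%edi", "mov %rdi,%rcx",
           "lea (%rbx,%rdi,8),%rdi", "cmp 0x258(%rdi),%rsi"]
assert find_type2(listing) == [(0, 'rsi', 'rdi')]
```

Unlike this textual sketch, the real tool then symbolically executes each candidate to confirm that the second access is actually a derivative of regB.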

4.2.4 Experimental Results of Gadget Detection

We run our symbolic execution tool on three well-known SGX runtimes: the official Intel

Linux SGX SDK (version 2.1.102.43402), Rust-SGX SDK (version 0.9.1), and Graphene-

SGX (commit bf90323). In all cases, a minimal enclave with a single empty ECall was

developed for analysis, because gadgets detected in a minimal enclave binary will appear

in any enclave code developed using these SDKs. When the enclave binary becomes more

complex, the size of the resulting enclave binary will grow to include more components

of the SDK libraries, and the number of available gadgets will also increase. For example,

a simple OCall implementation of printf() introduces three more Type-II gadgets. In

addition, the code written by the enclave author might also introduce extra exploitable

gadgets.

To detect ECall Type-I gadgets, the symbolic execution starts from the enclave_entry()

function in all three runtime libraries. To detect ORet Type-I gadgets, in Intel SGX SDK

and Rust-SGX SDK, we started our analysis from the sgx_ocall() function, which is the

interface defined to serve all OCalls. In contrast, Graphene-SGX has more diverse OCall

sites. In total, there are 37 such sites as defined in enclave_ocalls.c. Unlike in other

cases where the symbolic analysis completes instantly due to small function sizes, analyzing

these 37 OCall sites consumes more time: the median running time of analyzing one OCall

site was 39 seconds; the minimum analysis time was 8 seconds; and the maximum was 340

seconds.

The results for Type-I gadgets are summarized in Table 4.1. In Table 4.1, column 2

shows the type of the gadget, whether it is an indirect jump, indirect call, or return;

column 3 shows only the gadget’s end address (because Type-I gadgets always start at the

enclave_entry()), which is the address of a branch instruction, represented using the

name of the function in which the instruction is located and its offset; column 4 shows the registers that are

under the control of the adversary when the branch instructions are executed. For example,

the first entry in Table 4.1 shows an indirect jump gadget, which is located in do_ecall()

(with an offset of 0x118). By the time of the indirect jump, the registers that are still under

the control of the adversary are rdi, r8, r9, r10, r11, r14 and r15.

Due to space limits, we list only Type-II gadgets of the form [regA, regB, regC] (which

means at the time of memory reference, two registers, regB and regC, are controlled by

the adversary) in Table 4.2. We have found 6 gadgets in Intel’s SGX SDK, 6 gadgets in

Rust-SGX, and 18 gadgets in Graphene-SGX. For Type-II gadgets of the form [regA, regB], we have found 6, 86, and 180 such gadgets in these three runtime libraries, respectively.

Category / End Address / Controlled Registers

Intel SGX SDK
  indirect jump:
    <do_ecall>:0x118    rdi, r8, r9, r10, r11, r14, r15
  indirect call: (none)
  return:
    :0xc     rbx, rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x16    rbx, rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x9     rbx, rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    <_ZL16init_stack_guardPv>:0x21    rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x21    rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x62    rbx, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x2b    rsi, r8, r9, r10, r11, r12, r14, r15
    :0x11    r8, r9, r10, r11, r12, r14, r15
    :0x46    rbx, r8, r9, r10, r11, r12, r14, r15

Rust SGX SDK
  indirect jump:
    :0x118    rdi, r9, r10, r11, r12, r13, r14, r15
  indirect call: (none)
  return:
    <_ZL14do_init_threadPv>:0x109    rdi, r9, r10, r11, r12, r13, r14, r15
    :0x21    rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x63    rsi, r8, r9, r10, r11, r12, r13, r14, r15
    <_ZL16init_stack_guardPv>:0x21    rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    <_ZL16init_stack_guardPv>:0x69    rdi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x55    rbx, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x2b    rsi, r8, r9, r10, r11, r12, r13, r14, r15
    :0xa0    rbx, rdx, rsi, r9, r10, r11, r14, r15
    :0xc     rdx, rdi, r8, r9, r10, r11, r12, r14, r15
    :0x9     rbx, rdi, rsi, r8, r9, r10, r11, r12, r13, r14, r15
    <__morestack>:0xe    r8, r9, r10, r11
    :0x64    r8, r9, r10, r11
    <__memcpy>:0xa3    rax, rbx, rdi, r9, r10, r11, r14, r15
    <__memset>:0x1d    rax, rbx, rdx, rdi, r9, r10, r11, r14, r15
    <__intel_cpu_features_init_body>:0x42b    rbx, rdx, rdi, r9, r10, r11, r14, r15

Graphene-SGX
  indirect jump: (none)
  indirect call:
    <_DkGenericEventTrigger>:0x20    r9, r10, r11, r13, r14, r15
    <_DkGetExceptionHandler>:0x30    rdi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x84    r8, r9, r10, r11, r12, r13, r14, r15
    <_DkHandleExternelEvent>:0x55    rdi, r8, r9, r10, r11, r12, r13, r14, r15
  return:
    <_DkSpinLock>:0x27    rbx, rdi, r8, r9, r10, r11, r12, r13, r14, r15
    :0x23    rdi, rsi, r8, r12, r13, r14
    :0xcd    rdi, rsi, r8
    :0xd5    rdx, rdi, rsi, r8

Table 4.1: SGXPECTRE Attack Type-I gadgets in popular SGX runtime libraries.

Listing 4.1: An example of a Type-I gadget
0000000000003662:
  3662: cmp    $0x0,%rax
  3666: jne    3709
  366c: xor    %rdx,%rdx
  366f: mov    %gs:0x8,%rax
  3678: cmp    $0x0,%rax
  367c: jne    368d
  367e: mov    %rbx,%rax
  3681: sub    $0x10000,%rax
  3687: sub    $0x2b0,%rax
  368d: xchg   %rax,%rsp
  368f: push   %rcx
  3690: push   %rbp
  3691: mov    %rsp,%rbp
  3694: sub    $0x30,%rsp
  3698: mov    %rax,-0x8(%rbp)
  369c: mov    %rdx,-0x18(%rbp)
  36a0: mov    %rbx,-0x20(%rbp)
  36a4: mov    %rsi,-0x28(%rbp)
  36a8: mov    %rdi,-0x30(%rbp)
  36ac: mov    %rdx,%rcx
  36af: mov    %rbx,%rdx
  36b2: callq  1f20
  ...

0000000000001f20:
  1f20: push   %r13
  1f22: push   %r12
  1f24: mov    %rsi,%r13
  1f27: push   %rbp
  1f28: push   %rbx
  1f29: mov    %rdx,%r12
  1f2c: mov    %edi,%ebx
  1f2e: mov    %ecx,%ebp
  1f30: sub    $0x8,%rsp
  1f34: callq  b60
  ...

0000000000000b60:
  b60:  sub    $0x8,%rsp
  b64:  callq  361b
  ...

000000000000361b:
  361b: lea    0x213886(%rip),%rcx    # 216ea8
  3622: xor    %rax,%rax
  3625: mov    (%rcx),%eax
  3627: retq

Listing 4.2: An example of a Type-II gadget
0000000000005c10:
  ...
  607f: mov    0x38(%rsi),%edi
  6082: mov    %rdi,%rcx
  6085: lea    (%rbx,%rdi,8),%rdi
  6089: cmp    0x258(%rdi),%rsi
  ...

Start Address / Gadget Instructions

Intel SGX SDK
  :0x8a     mov 0x38(%rsi),%r9d; mov %r9,%rcx; lea (%rdi,%r9,8),%r9; cmp 0x258(%r9),%rsi
  :0x299    mov 0x38(%r8),%r9d; mov %r9,%rcx; lea (%rdi,%r9,8),%r9; cmp 0x258(%r9),%r8
  :0x180b   mov 0x38(%rdx),%r12d; mov %r12,%rcx; add $0x4a,%r12; cmp 0x8(%rsi,%r12,8),%rdx
  :0x399    mov 0x38(%r8),%edi; mov %rdi,%rcx; lea (%rbx,%rdi,8),%rdi; cmp 0x258(%rdi),%r8
  :0x46f    mov 0x38(%rsi),%edi; mov %rdi,%rcx; lea (%rbx,%rdi,8),%rdi; cmp 0x258(%rdi),%rsi
  :0x341    mov 0x38(%rsi),%r10d; mov %r10,%rcx; lea (%rbx,%r10,8),%r10; cmp %rsi,0x258(%r10)

Rust-SGX SDK
  :0x8a     mov 0x38(%rsi),%r9d; mov %r9,%rcx; lea (%rdi,%r9,8),%r9; cmp 0x258(%r9),%rsi
  :0x299    mov 0x38(%r8),%r9d; mov %r9,%rcx; lea (%rdi,%r9,8),%r9; cmp 0x258(%r9),%r8
  :0x1eb    mov 0x38(%rsi),%r9d; mov %r9,%rcx; lea (%r12,%r9,8),%r9; cmp 0x258(%r9),%rsi
  :0x180b   mov 0x38(%rdx),%r12d; mov %r12,%rcx; add $0x4a,%r12; cmp 0x8(%rsi,%r12,8),%rdx
  :0x391    mov 0x38(%r8),%edi; mov %rdi,%rcx; lea (%rbx,%rdi,8),%rdi; cmp 0x258(%rdi),%r8
  :0x467    mov 0x38(%rsi),%edi; mov %rdi,%rcx; lea (%rbx,%rdi,8),%rdi; cmp 0x258(%rdi),%rsi

Graphene-SGX
  :0x97     mov 0x2f0(%r8),%rax; mov (%rax,%rdx,4),%eax
  :0x177    mov 0x2d0(%r8),%rax; mov (%rax,%rdx,4),%r15d
  :0x200    mov 0x2d8(%r8),%rax; mov (%rax,%r15,4),%r15d
  :0x98     mov 0x10(%r12),%rcx; movslq %r9d,%rdi; mov %rdi,%rsi; imul (%rcx,%rdx,8),%rsi
  :0x13     mov 0x10(%rdi),%rax; mov %rsi,%rdx; mov %esi,%ecx; shr $0x6,%rdx; mov (%rax,%rdx,8),%rax
  :0x32     mov 0x10(%r12),%rax; mov %r13,%rcx; and $0x3f,%ecx; shl %cl,%rbx; lea (%rax,%r14,8),%rdx; mov $0xfffffffffffffffe,%rax; rol %cl,%rax; and (%rdx),%rax
  :0x4a     mov 0x10(%r13),%rdx; sub %rbx,%rax; lea (%rdx,%rax,8),%rax; mov -0x8(%rax),%rcx
  :0x8a     mov 0x10(%r13),%rsi; mov $0x40,%edi; mov %r12d,%r8d; sub %r12d,%edi; xor %eax,%eax; mov (%rsi,%rbx,8),%rdx
  :0x7c     mov 0x10(%rdi),%rax; mov -0x8(%rax,%rdx,8),%rdi
  :0xa0     mov 0x10(%r15),%rdx; mov (%r14),%rsi; mov -0x58(%rbp),%rdi; mov (%rdx,%r13,8),%r8
  :0x91     mov 0x10(%rdi),%rcx; mov -0x8(%rcx,%rdx,8),%rsi
  :0x100    mov 0x10(%r13),%rax; mov %r11,%rdx; add 0x10(%rbx),%rdx; mov 0x10(%r12),%rsi; mov %r14,%rdi; mov (%rax,%r11,1),%rcx
  :0x37     mov 0x10(%rsi),%r11; xor %ecx,%ecx; mov -0x8(%r11,%r10,8),%r9
  :0x129    mov 0x10(%r14),%rax; lea 0x0(,%rdx,8),%ecx; mov (%rax,%r8,1),%rax
  :0x108    mov 0xc(%rbx),%edi; add $0x4,%r8; add $0x10,%rbx; mov %rdi,%rdx; movzbl %dh,%edx; movzbl (%rsi,%rdx,1),%ecx
  :0x1e8    mov 0x1c(%rbx),%r8d; add $0x20,%rbx; add $0x4,%rdi; mov %r8,%rdx; movzbl %dh,%edx; movzbl (%rsi,%rdx,1),%r9d
  :0x238    mov -0x14(%rbx),%edx; mov %ecx,(%rbx); xor -0x1c(%rbx),%ecx; mov %ecx,0x4(%rbx); xor -0x18(%rbx),%ecx; xor %ecx,%edx; mov %ecx,0x8(%rbx); movzbl %dl,%ecx; mov %edx,0xc(%rbx); movzbl (%rsi,%rcx,1),%r9d
  :0x2c8    mov 0x14(%rbx),%edi; add $0x18,%rbx; add $0x4,%r8; mov %rdi,%rdx; movzbl %dh,%edx; movzbl (%rsi,%rdx,1),%ecx

Table 4.2: SGXPECTRE Attack Type-II gadgets in popular SGX runtimes.

4.3 Stealing Enclave Secrets

In this section, we demonstrate two end-to-end SGXPECTRE Attacks against SGX

enclave programs. In the first example, we show how SGXPECTRE Attacks could read

register values from an arbitrary enclave program developed using the Intel SGX SDK [3]. In

the second example, we demonstrate the extraction of Intel’s secrets (e.g., attestation keys)

using SGXPECTRE Attacks. Both experiments were conducted on a Lenovo Thinkpad X1

Carbon (4th Gen) laptop with an Intel Core i5-6200U processor and 8GB memory.

4.3.1 Reading Register Values from Arbitrary Enclaves

We first demonstrate an attack that enables the adversary to read arbitrary register values

inside an arbitrary enclave program written with Intel SGX SDK [3], because this is Intel’s

official SDK. Rust-SGX was developed based on the official SDK and thus can be exploited

in the same way. For demonstration purposes, the enclave program we developed has

only one ECall function that runs in a busy loop. We verified that our own code does

not contain any Type-I or Type-II gadgets in itself. The exploited gadgets, however, are

located in the runtime libraries of SDK version 2.1.102.43402 (compiled with gcc version

5.4.0 20160609), which are listed in Listing 4.1 and Listing 4.2.

This attack is possible because during AEX, the values of registers are stored in the

SSA before exiting the enclave. As the SSA is also a memory region inside the enclave

and its address is fixed when loading the enclave, the privileged adversary could leverage

SGXPECTRE Attacks to read register values in the SSA during an AEX. This attack is

especially powerful as it allows the adversary to frequently interrupt the enclave execution with AEX [84] and take snapshots of its SSAs to single-step trace its register values during

its execution.

Figure 4.5: Exploiting Intel SGX SDK. The blocks with dark shadows represent instructions or data located in untrusted memory. Blocks without shadows are instructions inside the target enclave or the .data segment of the enclave memory.

In particular, the attack is shown in Figure 4.5. In Step ①, the targeted enclave code is

loaded into an enclave that is created by a malicious program controlled by the adversary.

After EINIT, the malicious program starts a new thread (denoted as the victim thread) to

issue EENTER to execute the enclave code. The enclave code only runs in a busy loop. But

in reality, the enclave program might complete a remote attestation and establish trusted

communication with its remote owner. In Step ②, the adversary triggers frequent interrupts

to cause AEXs from the targeted enclave. During an AEX, the processor stores its register values into the SSA, exits the enclave and invokes the system software’s interrupt handler.

Before the control is returned to the enclave program via ERESUME, the adversary pauses the victim thread’s execution at the AEP, a sequence of instructions in the untrusted runtime library

that takes control after IRet.

In Step ③, the main thread of the adversary-controlled program sets (through a kernel

module) the reserved bit in the PTE of the enclave memory page that holds g_enclave_state,

a global variable used by Intel SGX SDK to track the state of the enclave, e.g., initialized or

crashed states. As shown in Listing 4.1, this global variable is accessed right before the ret

instruction of the Type-I gadget (i.e., the memory referenced by rcx in the instruction “mov

(%rcx),%eax”). In Step ④, the main thread poisons the BTB, prepares registers (i.e., rsi

and rdi), and executes EENTER to trigger the attack. Note that rbx will be set to rdi by the

time the ret instruction is executed (line 34 in Listing 4.1), so that we can control

rsi and rbx when speculatively executing the Type-II gadget. To poison the BTB, the adversary

creates an auxiliary enclave program in another process containing an indirect jump whose

source address equals the address of the ret instruction in the Type-I gadget and whose

target address is the same as the start address of the Type-II gadget in the victim enclave. The process

that runs in the auxiliary enclave is pinned onto the same logical core as the main thread.

To trigger the BTB poisoning code, the main thread calls sched_yield() to relinquish the

logical core to the auxiliary enclave program.

In Step ⑤, after the main thread issues EENTER to get into the enclave mode, the Type-I

gadget will be executed immediately. Because a reserved bit in the PTE is set, a page fault

is triggered when the enclave code accesses the global variable g_enclave_state. In the

page fault handler, the adversary clears the reserved bit in the PTE, evicts the stack frame

that holds the return address of the ret instruction from cache by accessing 2,000 memory

blocks whose virtual addresses have the same lower 12-bits as the stack address. The RSB

is depleted right before ERESUME from the fault handling, so that it will remain empty until

the ret instruction of the Type-I gadget is executed. In Step ⑥, due to the extended delay of

reading the return address from memory, the processor speculatively executes the Type-II

gadget (as a result of the BTB poisoning and RSB depletion). After the processor detects

the mis-prediction and flushes speculatively executed instructions from the pipeline, the

enclave code continues to execute. However, because rdi is set as a memory address in

our attack, it is an invalid value for the SDK as rdi is used as the index of the ECall table.

The enclave execution will return with an error quickly after the speculative execution. This

artifact allows the adversary to repeatedly probe into the enclave. In Step ⑦, the adversary

uses FLUSH-RELOAD techniques to infer the memory location accessed inside the Type-II

gadget. One byte of the SSA can thus be leaked. The main thread then repeats Step ③ to Step

⑦ to extract the remainder of the SSA.

In our Type-I gadget, the get_enclave_state() function is very short as it contains

only 4 instructions. Since calling into this function will load the stack into the L1 cache,

it is very difficult to flush the return address out of the cache to win the race condition. In

fact, our initial attempts to flush the return address all failed. Triggering page faults to flush

the return address resolves the issue. However, directly introducing page faults in every

stack access could greatly increase the amount of time to carry out the attack. Therefore,

instead of triggering page faults on the stack memory, the page fault is enforced on the

global variable g_enclave_state which is located on another page. In this way, we can

flush the return address with only one page fault in each run.

In our Type-II gadget, the first memory access reads 4 bytes (32 bits). It is unrealistic

to monitor 2^32 possible values using FLUSH-RELOAD. However, if we know the value of

the lower 24 bits, we can adjust the base of the second memory access (i.e., rbx) to map the 256

possible values of the highest 8 bits to the cache lines monitored by the FLUSH-RELOAD

code. Once all 32 bits of the targeted memory are learned, the adversary shifts the target

address by one byte to learn the value of a new byte. We found in practice that it is not hard

to find some initial consecutive known bytes. For example, the unused bytes in an enclave

data page will be initialized as 0x00, as they are used to calculate the measurement hash.

Particularly, we found that there are 4 reserved bytes (in the EXINFO structure) in the SSA

right before the GPRSGX region (which stores registers). Therefore, we can start from the

reserved bytes (all 0s), and extract the GPRSGX region from the first byte to the last. As

shown in Figure 4.5, all register values, including rax, rbx, rcx, rdx, r8 to r15, rip, etc,

can be read from the SSA very accurately. To read all registers in the GPRSGX region (184

bytes in total), our current implementation takes 414 to 3677 seconds to finish. On average,

each byte can be read in 6.6 seconds. We believe our code can be further improved.

Although the demonstrated attack only targets register values, we note that reading other

enclave memory follows exactly the same steps. The primary constraint is that the attack is

much more convenient if three consecutive bytes are known. To read the .data segments,

due to data alignment, some bytes are reserved and initialized as 0s, which can be used to

bootstrap the attack. In addition, some global variables have limited data ranges, rendering

most bytes known. To read the stack frames, the adversary could begin with a relatively

small address which is likely unused and thus is known to be initialized with 0xcc. In this way, the adversary can start reading the stack frames from these known bytes.

4.3.2 Stealing Intel Secrets

Next, we show how to steal Intel secrets, such as seal keys and attestation keys, from

Intel’s prebuilt and signed quoting enclave, i.e., libsgx_qe.signed.so (version 2.1.2).

All the attacks described below have been empirically validated on a Lenovo Thinkpad X1

Carbon (4th Gen) laptop with an Intel Core i5-6200U processor.

The demonstrated attack involves first extracting seal keys of the quoting enclave and

then decrypting the sealed storage blob that holds the attestation keys. More particularly, the adversary

could use SGXPECTRE Attacks to read the seal keys from the enclave memory when it is

being used during sealing or unsealing operations. In our demonstration, we targeted Intel

SDK API sgx_unseal_data() used for unsealing a sealed blob. The sgx_unseal_data()

API works as follows: first, it calls the sgx_get_key() function to derive the seal key and then

stores it temporarily on the stack in the enclave memory. Second, with the seal key, it calls

sgx_rijndael128GCM_decrypt() function to decrypt the sealed blob. Finally, it clears

the seal key (by setting the memory range storing the seal key on the stack to 0s) and returns.

Hence, to read the seal key, the adversary suspends the execution of the victim enclave when function sgx_rijndael128GCM_decrypt() is being called, by setting the reserved

bit of the PTE of the enclave code page containing sgx_rijndael128GCM_decrypt().

The adversary then launches SGXPECTRE Attacks to read the stack and extract the seal key.

To decrypt the sealed blob, the adversary could export the seal key and then use an

AES-128-GCM decryption algorithm implemented by herself. This may happen outside the

enclave or on a different machine, because the SGX hardware is no longer involved in the

process. We have validated the attacks on our testbed.

Extracting the seal key of the quoting enclave. The quoting enclave has two ECall func-

tions: verify_blob(), which is used to verify the sealed EPID blob, and get_quote(), which is used to generate a quote on behalf of an attested enclave for remote attestation.

Particularly, verify_blob() calls an internal function verify_blob_internal(), which

further calls the sgx_unseal_data() API to unseal the EPID blob. So we targeted the verify_blob() ECall function, suspended its execution when sgx_rijndael128GCM_decrypt()

was being called, and read the stack to obtain the quoting enclave’s seal key. These steps

have been described in the previous paragraphs.

Extracting attestation key. After running the provisioning protocol with Intel’s provision-

ing service, an attestation key (i.e., EPID private key) is created and then sealed in an EPID

blob by the provisioning enclave and stored in non-volatile memory. Though the location

of the non-volatile memory is not documented, during remote attestation, SGX still relies

on the untrusted OS to pass the sealed EPID blob into the quoting enclave. This offers

the adversary a chance to obtain the sealed EPID blob. With the extracted seal key of the

quoting enclave, we could decrypt the sealed EPID blob to extract the EPID private key.

After obtaining the attestation key, the adversary could use this EPID private key to

generate an anonymous group signature and pass the remote attestation. This means the

adversary can now impersonate any machine in the EPID group. Moreover, the adversary

could also use the attestation key completely outside the enclave and trick ISVs into

believing that their code runs inside an enclave. This attack has been validated in our experiments,

by generating a valid signature of a quote from an ISV’s enclave without running it on SGX.

We note here that one challenge we have addressed in attacking the Intel signed quoting

enclave is that the TCS number of the quoting enclave is set to 1, which means the adversary

has to use the same TCS to enter the enclaves. SGXPECTRE Attacks are still possible as

the number of SSAs per TCS is 2, which is designed to allow the victim to run exception

handlers within the enclave when the exception could not be resolved outside the enclave

during AEXs. However, this also enables the adversary to EENTER into the enclave during an

AEX, thus launching the SGXPECTRE Attack to steal the secrets being used by the quoting

enclave.

4.4 Evaluating Existing Countermeasures

Hardware patches. To mitigate branch target injection attacks, Intel has released microcode

updates to support the following three features [13].

• Indirect Branch Restricted Speculation (IBRS): IBRS restricts the speculation of indirect

branches [16]. Software running in a more privileged mode can set a model-specific

register (MSR), IA32_SPEC_CTRL.IBRS, to 1 by using the WRMSR instruction, so that

indirect branches will not be controlled by software that was executed in a less privileged

mode or by a program running on the other logical core of the physical core. By default,

on machines that support IBRS, branch prediction inside the SGX enclave cannot be

controlled by software running in the non-enclave mode.

• Single Thread Indirect Branch Predictors (STIBP): STIBP prevents branch target injection

from software running on the neighboring logical core, which can be enabled by setting

the IA32_SPEC_CTRL.STIBP MSR to 1.

• Indirect Branch Predictor Barrier (IBPB): IBPB is an indirect branch control command

that establishes a barrier to prevent the branch targets after the barrier from being

controlled by code before the barrier. The barrier can be established by setting the

IA32_PRED_CMD.IBPB MSR.

Particularly, IBRS provides a default mechanism that prevents branch target injection. To validate the claim, we developed the following tests: First, to check if the BTB is cleansed

during EENTER or EEXIT, we developed a dummy enclave code that trains the BTB to predict

address A for an indirect jump. After training the BTB, the enclave code uses EEXIT and a

subsequent EENTER to switch the execution mode once and then executes the same indirect

jump but with address B as the target. Without the IBRS patch, the latter indirect jump will

speculatively execute instructions at address A. However, with the patch, instructions at

address A will not be executed.

Second, to test if the BTB is cleansed during ERESUME, we developed another dummy

enclave code that will always encounter an AEX (by introducing page faults) right before

an indirect call. In the AEP, another BTB poisoning enclave code will be executed before

ERESUME. Without the patch, the indirect call speculatively executed the secret-leaking

gadget. The attack failed after patching.

Third, to test the effectiveness of the hardware patch under Hyper-Threading, we tried

poisoning the BTB using a program running on the logical core sharing the same physical

core. The experiment setup was similar to our end-to-end case study in Section 4.3, but

instead of pinning the BTB poisoning enclave code onto the same logical core, we pinned it

onto the sibling logical core. We observed some secret bytes leaked before the patch, but no

leakage after applying the patch.

Therefore, from these tests, we can conclude that SGX machines with the microcode patch cleanse the BTB during EENTER and during ERESUME, and also prevent branch target injection via Hyper-Threading; thus they are immune to SGXPECTRE Attacks.

Retpoline. Retpoline is a pure software-based solution to Spectre attacks [81], which has

been developed for major compilers, such as GCC [89] and LLVM [27]. Because modern

processors have implemented separate predictors for function returns, such as Intel’s return

stack buffer [42–46] and AMD’s return-address stack [49], it is believed that these return

predictors are not vulnerable to Spectre attacks. Therefore, the key idea of retpoline is to

replace indirect jumps or indirect calls with returns to prevent branch target injection.

However, in recent Intel Skylake/Kabylake processors, on which SGX is supported, when the RSB is depleted, the BPU will fall back to generic BTBs to predict a function

return. This allows poisoning of return instructions. Therefore, Retpoline is useless by itself

in preventing SGXPECTRE Attacks.

4.5 Is SGX Broken?

In the previous sections, we have shown that SGXPECTRE Attacks lead to confidentiality

breaches for both Intel’s enclaves and developers’ enclaves. In this section, we aim to

understand the security implications of SGXPECTRE Attacks (as well as other similar

attacks due to speculative or out-of-order execution [83]): Is SGX completely broken under

these threats?

4.5.1 Intel’s Secrets

As demonstrated in this chapter, all secrets in the memory (or registers saved in the

SSA during AEX) can be extracted by SGXPECTRE Attacks. We believe all secrets that are

exposed in the enclave memory (even only temporarily) can be exfiltrated by these attacks.

While all secrets in developers’ enclaves are exposed, not all of Intel’s secrets can be

stolen in the same manner. Specifically, Intel’s secrets for its SGX platforms can be found

in Figure 4.6. Next, we explain in detail how these secrets are affected by SGXPECTRE

Attacks.

Intel’s root secrets. For Intel’s infrastructure services to trust an SGX machine, during the

manufacturing process, Intel generates a root provisioning key at its internal key generation

facility, and burns it into the e-fuse of an SGX machine. The root provisioning key is also

stored in Intel’s database to be referenced by Intel’s provisioning service. As such, the

root provisioning key serves as a shared secret that is only known by Intel and the underlying

SGX machine [48]. A 128-bit root seal key is generated inside the processor chip during

Figure 4.6: Intel’s secrets and key derivation.

the manufacturing process [11]. This root seal key is not known to Intel. To improve security, these two keys can only be accessed through the EGETKEY and EREPORT instructions, and are never exported to enclaves’ protected memory.

Derived keys. Derived secrets of an SGX platform include the provisioning key, the provisioning seal key, the report keys, the seal keys, and the EINIT token key. The provisioning key is the secret used to establish trust during the provisioning protocol; the provisioning seal key is a symmetric key used to generate an encrypted backup copy of the attestation key to be stored in Intel’s provisioning service; the EINIT token key is used only by the launch enclave to sign the EINIT token of a legitimate enclave. A report key is a symmetric key possessed by each enclave. The EREPORT instruction is used by an enclave to generate a report of its execution context and produce a CMAC tag of the report using the report key of a specified enclave (e.g., the quoting enclave). However, the report key is not exported

58 to the memory in this process as it is kept secret to the specified enclave; it can only be

exported to the memory by the owner enclave using the EGETKEY instruction. A seal key

is used by an enclave to encrypt and decrypt the sealed storage; therefore, there could be

multiple seal keys for each enclave, which can be identified using their KEYID. It is also

possible for enclaves from the same ISV to share the seal keys. All of these derived secrets

can be exposed in the enclave memory, either by developers’ enclaves or Intel’s enclaves

(e.g., quoting enclaves and launch enclaves). Therefore, SGXPECTRE Attacks are able to

extract all of them.

EPID signing keys for Attestation. In the EPID provisioning protocol, the SGX platform

first sends a message to the Intel provisioning service that contains the platform identifier

(PPID) and the trusted computing base (TCB) version. Upon receiving the message, Intel’s

provisioning service verifies the PPID and selects an EPID group for the SGX platform. Intel

assigns SGX platforms with the same CPU type and the same TCB version the same EPID

group, which contains millions of machines [48]. Intel’s provisioning service then sends the

EPID group public key to the SGX platform. With the group public key, the provisioning

enclave runs an EPID join protocol with Intel’s provisioning service to generate an EPID

private key. The EPID private key is sealed using Intel’s seal key for later use. Each TCB version only requires running the provisioning protocol once. Breach of one EPID private

key might invalidate the entire EPID group. Unfortunately, the EPID private key is also an

in-memory secret that can be extracted by SGXPECTRE Attacks, as demonstrated in this

chapter.

Summary. The relationships among these keys are illustrated in Figure 4.6. The gray boxes

represent secrets only in the hardware and firmware, and the white boxes represent secrets

exposed to memory. Dashed boxes represent values that are known to the platform, some of

which are used to derive secrets. We can see that all of Intel’s secrets, except the root seal key

and root provisioning key, can be exposed by SGXPECTRE Attacks.

4.5.2 Defense via Centralized Attestation Services

Defenses by Intel’s attestation service. Although the IBRS microcode patch defeats

SGXPECTRE Attacks, unpatched processors remain vulnerable. The key to the security of

the SGX ecosystem is whether attestation measurements and signatures from processors without the IBRS patch can be detected during remote attestation. Indeed, Intel’s attestation

service arbitrates every attestation request from the ISV, detects attestation signatures

generated from unpatched CPUs, and responses to the ISV with an error message indicating

outdated CPUSVN (see Table 4.3). Therefore, the combination of the microcode patches and

defenses by Intel’s attestation service has been an effective defense against SGXPECTRE

Attacks (and also the attack [83]).

Implications for developer enclaves. Despite the defense, developers should be aware of

the security implications of running (or having run) an enclave on unpatched SGX processors.

First, any secret provisioned to an unpatched processor can be leaked. This includes secrets

in enclaves that are provisioned before the remote attestation, or after the remote attestation

if the ISV chooses to ignore the error message returned by the attestation service. Moreover,

because the ISV enclave’s seal key can be compromised by SGXPECTRE Attacks, any secret

sealed by an enclave run on an unpatched processor can be decrypted by the adversary.

Furthermore, any legacy sealed secrets become untrustworthy, as they could be forged by

the adversary using the stolen seal key.

Second, as the EPID private key used in the remote attestation can be extracted by the

attacker, the attacker can provide a valid signature for any SGX processors in the EPID

Result                   Description                                                              Trustworthy
OK                       EPID signature was verified correctly and the TCB level of the SGX       Yes
                         platform is up-to-date.
SIGNATURE_INVALID        EPID signature was invalid.                                              No
GROUP_REVOKED            EPID group has been revoked.                                             No
SIGNATURE_REVOKED        EPID private key used has been revoked by signature.                     No
KEY_REVOKED              EPID private key used has been directly revoked (not by signature).      No
SIGRL_VERSION_MISMATCH   SigRL version does not match the most recent version of the SigRL.       No
GROUP_OUT_OF_DATE        EPID signature was verified correctly, but the TCB level of the SGX      Up to ISV
                         platform is outdated.
CONFIGURATION_NEEDED     EPID signature was verified correctly, but additional configuration      Up to ISV
                         of the SGX platform may be needed.

Table 4.3: Attestation results [47]

group [48]. With the attestation key, it is also possible for the attacker to run the enclave code

entirely outside the enclave and forge a valid signature to fool the ISV. As shown in Table 4.3,

Intel currently relies on ISVs to make their own decisions after receiving this error message. An

error message during attestation with GROUP_OUT_OF_DATE or CONFIGURATION_NEEDED

implies that the enclave cannot be trusted at all.

4.6 Summary

In this chapter, we studied techniques to perform SGXPECTRE Attacks, developed a

symbolic execution tool to automatically detect exploitable gadgets, demonstrated end-to-

end attacks to show how secrets (including Intel’s secrets) can be extracted, and discussed

the security implications on the SGX ecosystems. Our study concludes that SGXPECTRE

Attacks are powerful enough to extract any in-memory secrets from SGX enclaves (including register values that are stored in memory during AEX), but also points out that Intel’s control of enclave

attestation provides a layer of defense that effectively mitigates such vulnerabilities via

microcode updates.

Chapter 5: HYPERRACE: Hyper-Threading Side-Channel Mitigation4

In previous chapters, we presented two side-channel attacks, one leverages Hyper-

Threading while the other abuses speculative execution. From this chapter on, we begin to

explore countermeasures to address as many side-channel attacks as possible. Particularly,

in this chapter, we propose a software solution, HYPERRACE, to close all Hyper-Threading

side channels. The chapter is laid out as follows. Section 5.1 presents an overview of HY-

PERRACE; Section 5.2 describes our physical-core co-location test technique; Section 5.3

presents the security analysis of the co-location tests; Section 5.4 elaborates the design and

implementation of HYPERRACE; Section 5.5 provides the results of performance evaluation

on our prototype and Section 5.6 summarizes this chapter.

5.1 Overview

In this section, we highlight the motivation of this chapter and an overview of HYPER-

RACE’s design.

5.1.1 Motivation

Although Hyper-Threading improves the overall performance of processors, it makes

defenses against side-channel attacks in SGX more challenging. The difficulty is exhibited

in the following two aspects:

4This chapter is excerpted from [29]

Side Channels    Shared    Cleansed at AEX    Hyper-Threading only
Caches           Yes       Not flushed        No
BPUs             Yes       Not flushed        No
Store Buffers    No        N/A                Yes
FPUs             Yes       N/A                Yes
TLBs             Yes       Flushed            Yes

Table 5.1: Hyper-Threading side channels.

Introducing new attack vectors. When the enclave program executes on a CPU core that

is shared with the malicious program due to Hyper-Threading, a variety of side channels

can be created. In fact, most of the shared resources listed in Section 2.2.2 can be exploited

to conduct side-channel attacks. For example, prior work has demonstrated side-channel

attacks on shared L1 D-cache [63,64], L1 I-cache [17,18,95], BTBs [19], FPUs [20], and

store buffers [21]. These attack vectors still exist on SGX processors.

Table 5.1 summarizes the properties of these side channels. Some of them can only be

exploited with Hyper-Threading enabled, such as the FPUs, store buffers, and TLBs. This is

because the FPU and store-buffer side channels are only exploitable by concurrent execution

(thus N/A in Table 5.1), and TLBs are flushed upon AEXs. Particularly interesting are the

store-buffer side channels. Although the two logical cores of the same physical core have

their own store buffers, a false dependency due to 4K-aliasing introduces an extra delay to

resolve read-after-write hazards between the two logical cores [21, 76]. The remaining vectors,

such as BPU and caches, can be exploited with or without Hyper-Threading. But Hyper-

Threading side channels provide unique opportunities for attackers to exfiltrate information without frequently interrupting the enclaves.

Creating challenges in SGX side-channel defenses. First, because Hyper-Threading

enabled or Hyper-Threading assisted side-channel attacks do not induce AEX to the target

enclave, these attacks are much stealthier. For instance, many of the existing solutions to

SGX side-channel attacks detect the incidences of attacks by monitoring AEXs [30, 71].

However, as shown in Chapter 3, Hyper-Threading enables the attacker to flush the TLB

entries of the enclave program so that new memory accesses trigger one complete page table walk and update the accessed flags of the page table entries. This allows attackers to monitor

updates to accessed flags without triggering any AEX, completely defeating defenses that

only detect AEXs.

Second, Hyper-Threading invalidates some defense techniques that leverage Intel’s

Transactional Synchronization Extensions (TSX)—Intel’s implementation of hardware

transactional memory. While studies have shown that TSX can help mitigate cache side

channels by concealing SGX code inside of hardware transactions and detecting cache line

eviction in its write-set or read-set (an artifact of most cache side-channel attacks) [36],

it does not prevent an attacker who shares the same physical core when Hyper-Threading

is enabled. As such, Hyper-Threading imposes unique challenges on defense mechanisms

of this kind.

While disabling Hyper-Threading presents itself as a feasible solution, the significant

performance loss introduced by disabling Hyper-Threading prevents it from being adopted

in practice.

5.1.2 Design Summary

To prevent Hyper-Threading side-channel leaks, we propose to create an auxiliary

enclave thread, called the shadow thread, to occupy the other logical core on the same physical

core. By taking over the entire physical core, the Hyper-Threading-enabled or -assisted

attacks can be completely thwarted.

Specifically, the proposed scheme relies on the OS to schedule the protected thread and

its shadow thread to the same physical core at the beginning, which is then verified by the

protected thread before running its code. Because thread migration between logical cores

requires context switches (which induce AEX), the protected thread periodically checks the

occurrence of AEX at runtime (through SSA, see Section 5.4.1) and whenever an AEX is

detected, verifies its co-location with the shadow thread again, and terminates itself once a violation is detected.

Given that the OS is untrusted, the key challenge here is how to reliably verify the co-location

of the two enclave threads on the same physical core, in the absence of a secure clock. Our

technique is based upon a carefully designed data race to calibrate the speed of inter-thread

communication with the pace of execution (Section 5.2).

Design goals. Our design targets same-core side-channel attacks that are conducted from

the same physical core where the enclave program runs:

• Hyper-Threading side-channel attacks from the other logical core of the same physical

core, by exploiting one or more attack vectors listed in Table 5.1.

• AEX side-channel attacks, such as exception-based attacks (e.g., page-fault attacks [73,

91]), through manipulating the page tables of the enclave programs, and other interrupt-

based side-channel attacks (e.g., those exploiting cache [38] or branch prediction

units [53]), by frequently interrupting the execution of the enclave program using

Inter-processor interrupts or APIC timer interrupts.

5.2 Physical-core Co-Location Tests

In this section, we first present a number of straw-man solutions for physical-core co-

location tests and discuss their limitations, and then describe a novel co-location test using

contrived data races.

5.2.1 Straw-man Solutions

A simple straw-man solution to testing physical-core co-location is to establish a covert

channel between the two enclave threads that only works when the two threads are scheduled

on the same physical core.

Timing-channel solutions. One such solution is to establish a covert timing channel using

the L1 cache that is shared by the two threads. For instance, a simple timing channel can

be constructed by measuring the PROBE time of a specific cache set in the L1 cache in

a PRIME-PROBE protocol [64], or the RELOAD time of a specific cache line in a FLUSH-

RELOAD protocol [92]. One major challenge of establishing a reliable timing channel in

SGX is to construct a trustworthy timing source inside SGX, as SGX version 1 does not

have rdtsc/rdtscp supports and SGX version 2 provides rdtsc/rdtscp instructions to

enclave but allows the OS to manipulate the returned values. Although previous work has

demonstrated that software clocks can be built inside SGX [30, 68, 87], manipulating the

speed of such clocks by tuning CPU core frequency is possible [30]. Fine-grained timing

channels for measuring subtle micro-architectural events, such as cache hits/misses, in a

strong adversary model is fragile. Besides, timing-channel solutions are also vulnerable to

man-in-the-middle attacks, which will be described shortly.

Timing-less solutions. A timing-less scheme has been briefly mentioned by Gruss et

al. [36]: First, the receiver of the covert channel initiates a transaction using hardware

transactional memory (i.e., Intel TSX) and places several memory blocks into the write-set

of the transaction (by writing to them). These memory blocks are carefully selected so

that all of them are mapped to the same cache set in the L1 cache. When the sender of

the covert channel wishes to transmit a 1 to the receiver, it accesses another memory block

also mapped to the same cache set in the L1 cache; this memory access will evict the

receiver’s cache line from the L1 cache. Because Intel TSX is a cache-based transactional

memory implementation, which means the write-set is maintained in the L1 cache, evicting

a cache-line in the write-set from the L1 cache will abort the transaction, thus notifying the

receiver. As suggested by Gruss et al., whether or not two threads are scheduled on the same

physical core can be tested using the error rate of the covert channel: 1.6% when they are on the

same core vs. 50% when they are not on the same core.

Man-in-the-middle attacks. As acknowledged by Gruss et al. [36], the aforementioned

timing-less solution may be vulnerable to man-in-the-middle attacks. In such attacks, the

adversary can place another thread to co-locate with both the sender thread and the receiver

thread, and then establish covert channels with each of them separately. On the sender

side, the adversary monitors the memory accesses of the sender using side channels (e.g.,

the exact one that is used by the receiver), and once memory accesses from the sender are

detected, the signal will be forwarded to the receiver thread by simulating the sender on

the physical core where the receiver runs. The timing-channel solutions discussed in this

section are also vulnerable to such attacks.

Covert-channel (both timing and timing-less) based co-location tests are vulnerable to

man-in-the-middle attacks because these channels can be used by any software components

in the system, e.g., the adversary outside SGX enclaves can mimic the sender’s behavior.

Therefore, in our research, we aim to derive a new solution to physical-core co-location tests

that does not suffer from such drawbacks—by observing memory writes inside enclaves that

cannot be performed by the adversary. We will detail our design in the next subsection.

5.2.2 Co-Location Test via Data Race Probability

Instead of building micro-architectural covert channels between the two threads that are

supposed to occupy the two logical cores of the same physical core (an approach that is particularly vulnerable to man-in-the-middle attacks), we propose a novel co-location test that verifies

the two threads’ co-location status by measuring their probability of observing data races on

a shared variable inside the enclave.

In this section, we first illustrate the idea using a simplified example, and then refine the

design to meet the security requirements. A hypothesis testing scheme is then described to

explain how co-location is detected by comparing the observed data race probability with

the expected one.

An illustrating example. To demonstrate how data race could be utilized for co-location

tests, consider the following example:

1. An integer variable, V , shared by two threads is allocated inside the enclave.

2. Thread T0 repeatedly performs the following three operations in a loop: writing 0 to

V (using a store instruction), waiting N (e.g., N = 10) CPU cycles, and then reading

V (using a load instruction).

3. Thread T1 repeatedly writes 1 to V (using a store instruction).

There is a clear data race between these two threads, as they write different values to the

same variable concurrently. When these two threads are co-located on the same physical

core, thread T0 will read 1, the value written by thread T1, from the shared variable V with a high probability (close to 100%). In contrast, when these two threads are located on different physical cores, thread T0 will observe value 1 with very low probability (i.e., close to zero).

Figure 5.1: Data races when threads are co-located/not co-located.

Such a drastic difference in the probability of observing data races is caused by the

location in which the data races take place. As shown in Figure 5.1, when the two threads

are co-located, data races happen in the L1 cache. Specifically, both thread T0 and T1 update

the copy of V in the L1 data cache. However, the frequency of thread T0’s updates to the

shared variable V is much lower than that of T1, because the additional read and N-cycle waiting in thread T0 slow down its execution. Therefore, even though the load instruction

in thread T0 can be fulfilled by a store-to-load forwarding from the same logical core, when

the load instruction retires, the copy of V in the L1 cache almost always contains the value stored

by thread T1, invalidating the value obtained from store-to-load forwarding [31]. As such,

the load instruction in thread T0 will read value 1 from V with a very high probability.

However, when the two threads are not co-located—e.g., thread T0 runs on physical

core C0 and thread T1 runs on physical core C1—the data races happen in the L1 cache of

physical core C0. According to the cache coherence protocol, after thread T0 writes to V , the

corresponding cache line in C0’s L1 cache, denoted by CL0, transitions to the Modified state.

If T0’s load instruction is executed while CL0 is still in the same state, thread T0 will read

its own value from CL0. In order for thread T0 to read the value written by thread T1, one

necessary condition is that CL0 is invalidated before the load instruction of thread T0 starts

to execute. However, this condition is difficult to meet. When thread T1 writes to V , the

corresponding cache line in C1’s L1 cache, denoted by CL1, is in the Invalidate state due to

T0’s previous store. T1’s update will send an invalidation message to CL0 and transition CL1

to the Modified state. However, because the time needed to complete the cache coherence

protocol is much longer than the time interval between thread T0’s write and the following

read, CL0 is very likely still in the Modified state when the following read is executed. Hence,

thread T0 will read its own value from variable V with a high probability.

A refined data-race design. The above example illustrates the basic idea of our physical-

core co-location tests. However, to securely utilize data races for co-location tests under

a strong adversarial model (e.g., adjusting CPU frequency, disabling caching), the design

needs to be further refined. Specifically, the refined design aims to satisfy the following

requirements:

• Both threads, T0 and T1, observe data races on the same shared variable, V , with high

probabilities when they are co-located.

• When T0 and T1 are not co-located, at least one of them observes data races with low

probabilities, even if the attacker is capable of causing cache contention, adjusting

CPU frequency, or disabling caching.

Figure 5.2: Co-location detection code.

To meet the first requirement, T0 and T1 must both write and read the shared variable.

In order to read the value written by the other thread with high probabilities, the interval

between the store instruction and the load instruction must be long enough to give the

other thread a large window to overwrite the shared variable. Moreover, when the two

threads are co-located, their execution time in one iteration must be roughly the same and

remain constant. If a thread runs much faster than the other, it will have a low probability of

observing data races, as its load instructions are executed more frequently than the store

instructions of the slower thread. To satisfy the second requirement, instructions that have a

non-linear slowdown when under interference (e.g., cache contention) or execution distortion

(e.g., CPU frequency change or cache manipulation) should be included.

The code snippets of refined thread T0 and T1 are listed in Figure 5.2. Specifically, each

co-location test consists of n rounds, with k data race tests per round. What follows is the

common routine of T0 and T1:

1. Initialize the round index %rdx to n (running the test for n rounds), and reset the counter

%rcx, which is used to count the number of data races (the number of times observing

the other thread’s data).

2. Synchronize T0 and T1. Both threads write their round index %rdx to the other thread’s

sync_addr and read from each others’ sync_addr. If the values match (i.e., they are

in the same round), T0 and T1 begin the current round of co-location test.

3. At the beginning of each round, set the test index %rsi to b0 + k for T0 and to b1 + k

for T1. Therefore, T0 will write b0 + k, b0 + k − 1, b0 + k − 2, ···, b0 + 1 to the shared

variable; T1 will write b1 + k, b1 + k − 1, b1 + k − 2, ···, b1 + 1. Since [b0, b0 + k] does not overlap with [b1, b1 + k], either thread, when it writes its %rsi to V and then reads from it, knows whether the value it reads was written by the other thread. After that, initialize the address of the shared variable V in %r8.

4. For T0, store the content of %rsi to V , determine whether a data race happens, and

update %rcx if so. For T1, determine whether a data race happens, update %rcx if so,

and then store %rsi to V . A data race is counted if and only if contiguous values

written by the other thread are read from V , which indicates that the two threads run

at the same pace.

5. Record the data race in a counter using the conditional move (i.e., CMOV) instruction.

This avoids fluctuations in the execution time due to conditional branches.

Figure 5.3: The basic idea of the data race design. Monitoring the memory operations of the two threads on V . LD: load; ST: store.

6. Execute the padding instructions to (1) make the execution time of T0 and T1 roughly

the same; (2) increase the interval between the store instruction and the load instruc-

tion; (3) create non-linear distortion in the execution time when being manipulated

(see discussions in Section 5.3).

7. Decrease %rsi by 1 and check whether it hits b0 (for T0) or b1 (for T1), which indicates

the end of the current round. If so, go to step 8. Otherwise, go to step 4.

8. Decrease %rdx by 1 and check whether it becomes 0. If so, all rounds of tests finish;

Otherwise, go to step 2.
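The counting rule of steps 4 and 5 can be illustrated with a short sketch. This is our own simplified Python model of one round from one thread's perspective (the real implementation does this in assembly, using CMOV to keep the execution time constant); the helper name and the concrete values are hypothetical.

```python
def count_races(reads, b_other, k):
    """Count data races in one round, from one thread's perspective.

    reads: the k values this thread loaded from V (one per test).
    A race is counted only when the loaded value was written by the
    other thread (i.e., lies in (b_other, b_other + k]) AND is
    contiguous with the previously loaded value from the other
    thread -- evidence that the two threads run at the same pace.
    """
    races = 0
    prev = None  # last value seen from the other thread, if any
    for v in reads:
        from_other = b_other < v <= b_other + k
        if from_other and prev is not None and v == prev - 1:
            races += 1
        prev = v if from_other else None
    return races

# T0 read the other thread's countdown in four of five tests, but one
# test (value 7 here) saw its own write, breaking the contiguous run.
print(count_races([105, 104, 103, 7, 101], b_other=100, k=5))  # -> 2
```

Only contiguous pairs of the other thread's values are counted, so a thread racing ahead of (or lagging behind) its partner accumulates few races even if it occasionally reads the partner's data.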

The time for one data race test for thread T0 and T1 is roughly the same when both threads are running on the same physical core. As shown in Figure 5.3, when the two threads are co-located, since the interval from load to store (line 22 to 24 for T0, line 22 to 39 for T1) is much shorter than the interval between store and load (line 24 to 52 then jump to 21 for T0, line 39 to 54, including the serializing instruction lfence, then jump to 21 for T1), there is a high probability that the store operation from the other thread will fall into the interval between the store and load. As a result, each thread becomes much more likely to see the other’s data than its own. In contrast, when the two threads are not co-located, the communication time between the two physical cores is longer than the interval between store and load: that is, even when one thread’s store is performed in the other’s store

to load interval, the data of the store will not be seen by the other due to the delay caused

by the cache coherence protocol. Therefore, data races will not happen.

Testing co-location via statistical hypothesis testing. To determine whether two threads

are co-located on the same physical core, we perform the following hypothesis test.

During each round of a co-location test, k samples are collected by each thread. We

consider the k samples as k − 1 unit tests; each unit test consists of two consecutive samples:

if both samples observe data races (and the observed counter values are also consecutive),

the unit test passes; otherwise it fails. We take the i-th (i = 1,2,...,k − 1) unit test from

each round (of the n rounds), and then consider these n unit tests as n independent Bernoulli trials. Then, we have k − 1 groups of Bernoulli trials. We will conduct k − 1 hypothesis tests

for each of the two threads as follows, and consider the co-location test as passed if any of

the k − 1 hypothesis tests accepts its null hypothesis:

We denote the j-th unit test as a binary random variable Xj, where j = 1,2,...,n; Xj = 1

indicates the unit test passes, and Xj = 0 otherwise. We assume when the two threads are

co-located, the probability of each unit test passing is p. Therefore, when they are co-located,

P(Xj = 1) = p. We denote the actual ratio of passed unit tests in the n tests as pˆ. The null

and alternative hypotheses are as follows:

H0: p̂ ≥ p; the two threads are co-located.

H1: p̂ < p; the two threads are not co-located.

Because Xj is a test during round j and threads T0 and T1 are synchronized before

each round, we can consider X1, X2, ···, Xn independent random variables. Therefore, the sum of these n random variables, i.e., X = ∑_{j=1}^{n} Xj, follows a Binomial distribution with parameters n and p. The mean of the Binomial distribution is E(X) = np and the variance is D(X) = np(1 − p). When n is large, the distribution of X can be approximated by a normal distribution N(np, np(1 − p)). Let the significance level be α. Then

    Pr[ (X − np) / √(np(1 − p)) < −uα ] = α.

We will reject H0 and decide that the two threads are not co-located, if

    X < np − uα · √(np(1 − p)).

In our prototype implementation, we parameterized n, p, and α. For example, when

n = 256 and α = 0.01, uα = 2.33. From the measurement results given in Table 5.4

(Section 5.3), the probabilities for T0 and T1 to see data races with co-location are p0 = 0.969 and p1 = 0.968, respectively. So we have for both T0 and T1

Pr [X < 242] = 0.01.

In other words, in the hypothesis test, we reject the null hypothesis if less than 242 unit tests

(out of the 256 tests) pass in T0 (or T1).

Here the probability of a type I error (i.e., falsely rejecting the null hypothesis) is about

1%. The probability of a type II error is the probability of falsely accepting H0 when the

alternative hypothesis H1 is true. For example, when X follows a normal distribution of

N(np,np(1 − p)) and p = 0.80, the probability of a type II error in T0 and T1 will be (let

Z = √X−np ∼ N(0,1)): np(1−p) h i

Pr X ≥ 242 X ∼ N(np,np(1 − p)) " # 242 − 256 · 0.80 = Pr Z ≥ Z ∼ N(0,1) < 0.01%. p256 · 0.80 · (1 − 0.80)
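The threshold of 242 passing unit tests and the type II error bound can be reproduced numerically. The sketch below is our own illustration, using only the constants stated in the text (n = 256, uα = 2.33 for α = 0.01, p0 = 0.969, p1 = 0.968); the helper name is hypothetical.

```python
import math

def min_passes(n, p, u_alpha):
    """Smallest integer X that does NOT reject H0:
    we reject iff X < n*p - u_alpha * sqrt(n*p*(1-p))."""
    return math.ceil(n * p - u_alpha * math.sqrt(n * p * (1 - p)))

n, u_alpha = 256, 2.33                 # alpha = 0.01
print(min_passes(n, 0.969, u_alpha))   # -> 242 (thread T0)
print(min_passes(n, 0.968, u_alpha))   # -> 242 (thread T1)

# Type II error: accepting H0 although the true pass probability
# is only p' = 0.80.
p2 = 0.80
z = (242 - n * p2) / math.sqrt(n * p2 * (1 - p2))  # ~5.81
beta = 0.5 * math.erfc(z / math.sqrt(2))           # Pr[Z >= z], Z ~ N(0,1)
print(beta < 1e-4)  # -> True, i.e., below 0.01%
```

Both thresholds round up to the same integer cutoff, which is why the text uses 242 for T0 and T1 alike.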

Practical considerations. The above calculation only provides us with theoretic estimates

of the type I and type II errors of the hypothesis tests. In practice, because system events

cannot be truly random and independent, approximations have to be made. Particularly, the

two threads are only synchronized between rounds, and the k samples in each round are

collected without re-synchronization. Therefore, although samples in different rounds

can be considered independent, the k samples within the same round may be dependent.

Second, within each round, a truly random variable X requires T0 and T1 to start to monitor

data races uniformly at random, which is difficult to achieve in such fine-grained data

race measurements. We approximate the true randomness using the pseudo-randomness

introduced in the micro-architecture events (e.g., data updates in the L1 cache reflected in

memory reads) during the synchronization. To account for the dependence between unit

tests in the same round and the lack of true randomness of each unit test, we select the

i-th unit test from each round to form the i-th n-sample hypothesis test, and consider the

co-location test as passed if any of the k − 1 hypothesis tests accepts its null hypothesis. We will empirically evaluate how this design choice impacts the type I errors and type II errors

in Section 5.3.3.
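The grouping strategy above can be sketched as follows. This is a hypothetical helper of our own, not the dissertation's code: each round yields k boolean samples, consecutive pairs form k − 1 unit tests, and the i-th unit test from every round is collected into the i-th group of n Bernoulli trials.

```python
def unit_tests(samples):
    """One round's k samples -> k-1 unit tests; unit test i passes
    iff samples i and i+1 both observed (contiguous) data races."""
    return [a and b for a, b in zip(samples, samples[1:])]

def group_trials(rounds):
    """rounds: n lists of k samples each. Returns k-1 groups; the
    i-th group holds the i-th unit test of every round and is treated
    as n independent Bernoulli trials in one hypothesis test."""
    return [list(g) for g in zip(*(unit_tests(r) for r in rounds))]

# Two rounds (n = 2) of three samples (k = 3) each.
groups = group_trials([[True, True, False],
                       [True, False, True]])
print(groups)  # -> [[True, False], [False, False]]
```

The co-location test then runs one hypothesis test per group and passes if any group accepts the null hypothesis, which is exactly the design choice evaluated empirically in Section 5.3.3.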

5.3 Security Analysis of Co-location Tests

In this section, we provide an analysis of the security of the co-location tests. To do so, we first establish the relationship between the execution time of the communicating threads

and the data race probability. We next empirically estimate the execution time of the threads

under a variety of execution conditions that the adversary may create (e.g., Priming caches,

disabling caching, adjusting CPU frequencies, etc.) and then apply the measurement results

to analytically prove that, under all attacker-created conditions we have considered, the data

race probability cannot reach the same level as that when the two threads are co-located.

Figure 5.4: The model of thread T0 and thread T1. •: load; : store.

Finally, we empirically performed attacks against our proposed scheme and demonstrated

that none of the attacks could pass the co-location tests.

5.3.1 Security Model

To establish the relationship between the execution time of the communicating threads

and the probability of data races, we first construct execution models of thread T0 and thread

T1 (see their code snippets in Figure 5.2). Particularly, we abstract the execution of T0 and

T1 as sequences of alternating load and store operations on the shared variable V . After

each load or store operation, some delays are introduced by the padding instructions. We

use I^i_w, where w ∈ {store, load} and i ∈ {0,1}, to denote a code segment between two

instructions for thread Ti: when w is load, the segment is from load to store (line 22 to

24 for T0, line 22 to 39 for T1; see Figure 5.2); when w is store, the segment begins with

the store instruction and ends with the first load encountered (line 24 to 52 then jump to

21 for T0, line 39 to 54, then jump to 21 for T1).

The execution time of these code segments depends on their instructions and the memory

hierarchy v on which the data access (to variable V ) operation w is performed (i.e., memory

access latency). Therefore, the execution time of the code segment I^i_w is denoted by T(I^i_w, v), where i ∈ {0,1} and v ∈ {L1, L2, LLC, Memory}. We further denote T^i_{w,v} = T(I^i_w, v) for short.

As such, the period of thread Ti's one iteration of the store and load sequence (line 22 to 52, then jump to 21 for T0; line 22 to 54, then jump to 21 for T1) is R^i_v = T^i_{load,v} + T^i_{store,v}, i.e.,

the time between two adjacent load instructions’ retirements of thread Ti when the data

accesses take place in memory hierarchy v.

We use variable Gv,u, where u ∈ {c,nc}, to denote the communication time, i.e., the

time that the updated state of V appears in the other thread’s memory hierarchy v, after one

thread modifies the shared variable V , if two threads are co-located (u = c) or not co-located

(u = nc).

Consider a data race happening in memory hierarchy v. If T^i_{store,v} < G_{v,u}, i ∈ {0,1}, then during the time thread T_{i⊕1}'s updated state of V is propagated to thread Ti's memory hierarchy v, Ti has updated V and fetched data from v at least once. As a result, data races will not happen. In contrast, if T^i_{store,v} ≥ G_{v,u}, a data race will happen if the data value of V is propagated from thread T_{i⊕1} to Ti's memory hierarchy v during T^i_{store,v}.

Further, if T^i_{store,v} ≥ R^{i⊕1}_v, at least one store from thread T_{i⊕1} will appear in v during T^i_{store,v}. Then data races will be observed by thread Ti. If T^i_{store,v} < R^{i⊕1}_v, the data race probability of thread Ti will be T^i_{store,v}/R^{i⊕1}_v, since the faster the store-load operations of Ti compared with the other thread's iteration, the less likely Ti will see the other's data. Hence, we have the data race probability of thread Ti (i ∈ {0,1}):

    p_i = 0,                                  if T^i_{store,v} < G_{v,u};
    p_i ≈ min(T^i_{store,v}/R^{i⊕1}_v, 1),    if T^i_{store,v} ≥ G_{v,u}.      (5.1)

It is worth noting that when the two threads run at drastically different paces, the faster

thread will have a low probability to observe data races, as its load instructions are executed

more frequently than the store instructions of the slower thread. Therefore, we implicitly

assume that R^0_v is close to R^1_v. This implicit requirement has been encoded in our design of the unit tests: the way we count data races requires two consecutive data races to read consecutive counter values from the other thread.

Necessary conditions to pass the co-location tests: To summarize, in order to pass the

co-location tests, an adversary would have to force the two threads to execute in manners that

satisfy the following necessary conditions: (1) They run at similar paces; that is, R^0_v/R^1_v is close to 1. (2) The communication speed must be faster than the execution speed of the threads; that is, T^i_{store,v} ≥ G_{v,u}, where i ∈ {0,1}. (3) T^i_{store,v}/R^{i⊕1}_v must be close to 1, where i ∈ {0,1}, to ensure high probabilities of observing data races.

5.3.2 Security Analysis

In this section, we systematically analyze the security of the co-location tests by investigating empirically whether the above necessary conditions can be met when the two

threads are not co-located. Our empirical analysis is primarily based on a Dell Optiplex

7040 machine equipped with a Core i7-6700 processor. We also conducted experiments

on machines with a few other processors, such as E3-1280 V5, i7-7700HQ, i5-6200U (see

Table 5.4).

We consider the scenarios in which the two threads T0 and T1 are placed on different

CPU cores by the adversary and the data races are forced to take place on the memory

hierarchy v, where v ∈ {L1/L2, LLC, Memory}. We discuss these scenarios respectively.

5.3.2.1 L1/L2 Cache Data Races

We first consider the cases where v ∈ {L1, L2}. This may happen when the adversary simply schedules T0 and T1 on two cores without cache intervention (e.g., cache PRIMEing or disabling caching). However, the adversary is capable of altering the CPU frequency on which T0 or T1 runs to manipulate T^i_{store,v} and G_{v,nc}.

                       T0                          T1
                  T^0_{store,v}   R^0_v     T^1_{store,v}   R^1_v
Caching Enabled      95.90        96.30        88.70        98.69
Caching Disabled     1.32e+5      1.35e+5      1.34e+4      2.57e+4

Table 5.2: Time intervals (in cycles) of T0 and T1.

Latency of cache accesses. We use the pointer-chasing technique [55,79] to measure cache

access latencies. Memory load operations are chained as a linked list so that the address

of the next pointer depends on the data of the previous one. Thus the memory accesses are

completely serialized. In each measurement, we access the same number of cache-line sized

and aligned memory blocks as the number of ways of the cache at the specific cache level, so

that every memory access induces cache hits on the target cache level and cache misses on

all lower-level caches. According to the result averaged over 10,000,000 measurements, the

average values of the cache access latencies for the L1/L2/L3 caches were 4, 10, and 40 cycles,

respectively.
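The pointer-chasing construction can be sketched as follows. This is an illustrative Python model of our own (the actual measurement code is native, uses cache-line-sized and aligned memory blocks, and times the walk with the timestamp counter); it only shows how chaining the loads into a single random cycle serializes the accesses.

```python
import random

def make_chain(n, seed=0):
    """nxt[i] is the index of the next block to load; the indices form
    one random cycle, so each load's address depends on the data
    returned by the previous load (fully serialized accesses)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt):
    """Walk the whole chain once; in the native version the per-step
    cost is the cache-hit latency being measured."""
    i, visited = 0, set()
    for _ in range(len(nxt)):
        visited.add(i)
        i = nxt[i]
    return i, visited

end, visited = chase(make_chain(8))
print(end == 0, len(visited) == 8)  # back at the start, all blocks touched
```

Because the chain is one cycle over all blocks, a walk of n steps touches every block exactly once and returns to the start, which is what makes per-access latency measurable as (total time)/n in the native version.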

Cross-core communication time. We developed a test program with two threads: Thread

Ta repeatedly writes to a shared variable in an infinite loop, without any additional delays

between two consecutive writes. Thread Tb runs on a different physical core; after writing to the shared variable, it executes a few dummy instructions to inject a delay, and then

reads from the variable to check for data race. The execution time of the dummy instructions

can be used to measure the communication time: When dummy instructions are short, Tb will observe no data race; but when the execution time of the dummy instructions increases

to a certain threshold, Tb will start to observe data races. We draw the histogram of 10,000,000

measurements (Figure 5.5). The solid bars represent measurements in which data races were not observed (i.e., Tb reads its own data) and the shaded bars represent measurements where data races were observed (i.e., Tb reads Ta's data).

Figure 5.5: Demonstration of the cross-core communication time. There is no data race if the dummy instructions take time shorter than 190 cycles.

From the experiment, we see that when the execution time of the dummy instructions is less than 190 cycles, data races were

190 cycles.

Effects of frequency changes. In our experiments, we managed the CPU frequency with

the support of Hardware-Controlled Performance states (HWP). Specifically, we first enabled HWP by writing to the IA32_PM_ENABLE MSR, then configured the frequency range by writing to the IA32_PM_REQUEST MSR. To understand the relation between instruction latencies and the CPU frequency, we evaluated the latency of L1/L2/L3 cache accesses,

the latency of executing nop,load, and store instructions, respectively, and the latency

81 L1 latency load;lfence latency L2 latency load/nop/store latency 1000 L3 latency cross-core comm. latency 850 699 497 387 169.7 316 135.5 266 232 94.1 190 74.8 100 63.1 60.8 50.5 52.3 45.1 36.0 39.9 28.0 42.56 22.9 19.4 35.4 16.8 14.8 24.1 time (cycles) 18.78 17.75 15.3 10 13.5 13.1 11.3 4.3 9.5 10.0 3.4 7.5 6.25 2.5 5.5 4.5 4.0 1.9 1.6 1.3 1.2 1.0 1 800 1000 1400 1800 2200 2600 3000 3400 CPU frequency (MHz)

Figure 5.6: The effects of frequency changing on execution speed, cache latencies, and cross-core communication time.

of executing the store; lfence instruction sequence, under different CPU frequencies.

We also measured the cross-core communication speed under these frequencies. The

measurements were conducted in a tight loop, averaged over 10,000,000 tests. The results

are plotted in Figure 5.6. The results suggest that when the CPU frequency changes from 3.40 GHz to 800 MHz, the instruction execution speed (4.3×), cache access latencies (4.25×–4.44×), and cross-core communication time (4.47×) are all slowed down by a similar factor.

Discussion. For v ∈ {L1, L2}, we have G_{v,c} ≤ 12 cycles (the latency of an L2 access) and G_{v,nc} > 190 cycles (the latency of cross-core communication). According to Table 5.2, T^0_{store,v} = 95.90 and T^1_{store,v} = 88.70. Therefore, G_{v,c} < T^i_{store,v} < G_{v,nc}, i ∈ {0,1}. As such, data races will happen only if the two threads are co-located. Altering the CPU

frequency will not change the analysis. According to Figure 5.6, frequency changes have similar effects on T^i_{store,v} and G_{v,nc}. That is, when the CPU frequency is reduced, both T^i_{store,v} and G_{v,nc} will increase, with similar derivatives. As a result, when the adversary

places T0 and T1 on different cores, and reduces the frequency of these two cores, their

communication speed will be slowed down at the same pace as the slowdown of the

execution.
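Plugging the measured numbers into Eq. (5.1) makes this argument concrete. The sketch below is our own back-of-the-envelope check, using T^i_{store,v} and R^i_v from Table 5.2 (caching enabled) and the measured communication times (G_{v,c} ≤ 12 cycles via the shared L2, G_{v,nc} ≈ 190 cycles cross-core); variable names are ours.

```python
# Measured values, in cycles (Table 5.2, caching enabled; Section 5.3.2.1).
T0_STORE, R0 = 95.90, 96.30   # thread T0: store-to-load window, iteration
T1_STORE, R1 = 88.70, 98.69   # thread T1
G_CO, G_CROSS = 12, 190       # same-core (L2) vs. cross-core latency

def race_probability(t_store, g, r_other):
    # Eq. (5.1): no race if the store-to-load window is shorter than
    # the communication time; otherwise ~ t_store / r_other, capped at 1.
    return 0.0 if t_store < g else min(t_store / r_other, 1.0)

# Co-located: both threads observe races with high probability.
print(race_probability(T0_STORE, G_CO, R1))   # ~0.97
print(race_probability(T1_STORE, G_CO, R0))   # ~0.92
# Not co-located: both store windows are below 190 cycles, so no races.
print(race_probability(T0_STORE, G_CROSS, R1))  # -> 0.0
```

The co-located estimates are close to the empirical probabilities reported in Table 5.4 (p0 = 0.969, p1 = 0.968), while the non-co-located probability collapses to zero, which is the gap the hypothesis test exploits.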

5.3.2.2 LLC Data Races

We next consider the case where v = LLC. This may happen when the adversary

PRIMEs the private caches used by T0 and T1 (from co-located logical cores) to evict the

shared variable V to the LLC.

Effects of cache PRIMEs. The data races can occur on the shared LLC when the copies of

V in the private L1 and L2 caches are invalidated, which can only be achieved by having an

attacking thread frequently PRIMEing the shared L1/L2 caches from the co-located logical

core. To counter such attacks, thread T0 and T1 both include in their padding instructions

redundant load instructions (i.e., line 46 to 49 of T0 and line 42 to 50 of T1 in Figure 5.2).

These load instructions precede the load instruction that measures data races, thus they

effectively pre-load V into the L1/L2 caches to prevent the adversary’s PRIMEs of related

cache lines. This mechanism defends against attempts to PRIME not only the local L1/L2 caches, but also the TLBs and paging-structure caches.

Discussion. According to our measurement study, the time needed to PRIME one cache

set in L1 and one cache set in L2 (to ensure that V is not in L1 and L2 cache) is at least

10 × (w_L2 − 1) + 40 × 1 cycles (where w_L2 is the number of cache lines in one L2 cache set), which

is significantly larger than the interval between the pre-load instructions and the actual load

83 instruction (i.e., 1 cycle). Moreover, because CPU frequency changes are effective on both

logical cores of the same physical core, altering CPU frequency will not help the adversary.

Therefore, we conclude that data races cannot happen in the LLC.
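As a sanity check on the timing argument, the lower bound on PRIMEing one L1 set and one L2 set can be computed from the measured hit latencies. The associativity value w_L2 = 4 below is our assumption for the Skylake client parts used in these experiments, not a number stated in the text.

```python
L2_HIT, LLC_HIT = 10, 40   # measured cache-hit latencies (cycles)
W_L2 = 4                   # assumed L2 associativity (Skylake client)

# Lower bound from the text: 10 * (w_L2 - 1) + 40 * 1 cycles --
# w_L2 - 1 fills served from L2 plus one access forced to the LLC.
prime_lower_bound = L2_HIT * (W_L2 - 1) + LLC_HIT * 1
print(prime_lower_bound)   # -> 70 cycles

# The window between the pre-load instructions and the measuring load
# is only ~1 cycle, so the PRIME cannot complete inside it.
print(prime_lower_bound > 1)  # -> True
```

Even under this conservative assumption the PRIME cost exceeds the pre-load window by nearly two orders of magnitude, supporting the conclusion above.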

5.3.2.3 Data Races in Main Memory

We next consider the case where v = Memory. This may happen when the adversary

(1) PRIMEs the caches, (2) invalidates the caches, or (3) disables the caching.

Latency of cache invalidation instructions. According to Intel software developer’s man-

ual [11, Chapter 8.7.13.1], the wbinvd instruction executed on one logical core can invalidate

the cached data of the other logical core of the same physical core. Directly measuring

the latency of cache invalidation using the wbinvd instruction is difficult. Instead, we

measure the execution time of wbinvd to approximate the latency of cache invalidation.

This is reasonable because wbinvd is a serializing instruction. Specifically, we conducted the

following experiment: we ran wbinvd in a tight loop 1,000,000 times and measured the

execution time of each loop, which is shown in Figure 5.7. We observe that in some cases

the latency is as high as 2 × 10^6 cycles, which typically happens early in the experiments, while most of the time the latency is only 1 × 10^6 cycles. We believe this is because dirty

cache lines need to be written back to the memory in the first few tests, but later tests usually

encounter already-empty caches.

Effects of disabling caching. The attacker can disable caching on a logical core by setting

the CD bit of control registers. According to Intel Software Developer’s Manual [11, Chapter

8.7.13.1], “the CD flags for the two logical processors are ORed together, such that when

any logical processor sets its CD flag, the entire cache is nominally disabled.” This allows

the adversary to force an enclave thread to enter the no-fill caching mode.

Figure 5.7: The histogram of wbinvd execution time over 1,000,000 measurements.

According to

Intel’s manual [11, Sec. 11.5.3 and Table 11-5], after setting the CD bit, the caches need to

be flushed with the wbinvd instruction to ensure system memory coherency. Otherwise, cache

hits on reads will still occur and data will be read from valid cache lines. The adversary

can also disable caching of the entire PRM by setting the PRMRR [8, Chapter 6.11.1], as “all

enclave accesses to the PRMRR region always use the memory type specified by the PRMRR,

unless the CR0.CD bit on one of the logical processors on the core running the enclave is set.”

It is worth noting that the PRMRR_BASE and PRMRR_MASK MSRs are set in an early booting

stage, and cannot be updated after the system boots.

We measured the latency of the nop, load, store instructions, and the load;lfence

instruction sequence, respectively, in tight loops (averaged over 10,000,000 measurements) with the caching enabled and disabled. The results are shown in Table 5.3. The slowdowns

Instructions      Caching enabled   Caching disabled   Slowdown
nop                    1.00               901            901×
load                   1.01              1266           1253×
store                  1.01               978            968×
load; lfence          14.82              2265            153×

Table 5.3: Instruction latencies (in cycles) caused by disabling caching.

were calculated by comparing the latency with caching disabled and enabled. It can be

seen that the slowdowns of nop, load, and store instructions are around 1000×. But the

slowdown of the load;lfence instruction sequence is only two orders of magnitude. This result leads to the non-linear distortion of T1 when caching is disabled (see Figure 5.2), which is also shown in Table 5.2: T^0_{store,v} and T^1_{store,v} are on the same order of magnitude when caching is enabled but become drastically different when caching is disabled (i.e.,

1.32e+5 vs. 1.34e+4).

Discussion. A prerequisite of observing data races in the memory is that the load operations

miss L1/L2/LLC caches. This may be achieved using one of the following mechanisms:

• Evicting the shared variable to memory on-the-fly. The adversary could leverage two

approaches to evict the shared variable to memory: (1) Flushing cache content using

the wbinvd instruction. However, as the latency of the instruction (on the order of

10^6 cycles) is too large (see Figure 5.7), it cannot effectively evict the shared variable

to memory. In fact, during the execution of the wbinvd instruction, caches can still

be filled normally. We have empirically confirmed that co-location tests that happen

during the execution of the wbinvd instruction are not affected. (2) Evicting the cache

content using PRIME-PROBE techniques. However, according to our measurement

study, the time needed to PRIME one cache set in LLC is at least 40 × wLLC cycles

(wLLC is the number of cache lines in one LLC cache set), which is significantly larger

than the interval between the pre-load instructions and the actual load instruction

(i.e., 1 cycle). Even if the adversary could distribute the task of cache PRIMEs to

multiple threads running on different CPU cores, which is by itself challenging due

to cache conflicts among these threads, the speed gap is too large for such attacks to succeed. We will empirically verify this in Section 5.3.3.

• Disabling caching. We have examined several approaches to disable caching: First,

the adversary can disable caching by editing PRMRR, which will be effective after

system reboots. Second, the adversary can interrupt the co-location tests before the

load instructions and flush the cache content using the wbinvd instruction or PRIME-

PROBE operations (though interruption of the co-location tests will be detected and

thus trigger a restart of the co-location tests). Third, the adversary can disable the caching of

the two physical cores on which T0 and T1 execute by setting the CD bits of the

control registers. However, none of these methods can pass the co-location tests. This is

because we use load instructions as paddings in thread T0, and use load followed by

lfence instructions as paddings in thread T1. If caching is disabled, the slowdown of

“load; lfence” is much smaller than the other instructions, since the former already

serializes the load operations (see Table 5.3). As a result, the relative speed of the

two threads changes significantly (see Table 5.2). Particularly, as R^0_v/R^1_v is no longer close to 1, the co-location tests will not pass.

• Altering CPU frequency when caching is disabled. We further consider the cases of

changing CPU frequency after disabling caching by setting the CD bits. Suppose the

frequency change slows down threads T0 and T1 by constant factors of c0 and c1, respectively. Then T^0_{store,v} = c0 · 1.32×10^5, R^0_v = c0 · 1.35×10^5, T^1_{store,v} = c1 · 1.34×10^4, and R^1_v = c1 · 2.57×10^4, according to Table 5.2. Then, based upon Equa. (5.1), the data race probabilities of T0 and T1 are p̂0 = min((c0 · 1.32×10^5)/(c1 · 2.57×10^4), 1) and p̂1 = min((c1 · 1.34×10^4)/(c0 · 1.35×10^5), 1), respectively. Since p̂0 · p̂1 ≤ (c0 · 1.32×10^5)/(c1 · 2.57×10^4) · (c1 · 1.34×10^4)/(c0 · 1.35×10^5) ≈ 0.51, the probability for a thread to observe the data race will not exceed √0.51 ≈ 71.4%, which has a near-zero probability of passing our co-location test.

• Nonlinear CPU frequency changes. The only remaining possibility for the adversary

to fool the co-location test is to change the CPU frequency nonlinearly so that T^0_{store,v}, T^0_{load,v}, T^1_{store,v}, and T^1_{load,v} change independently. However, the CPU frequency transition latency we could achieve on our testbed is between 20µs and 70µs (measured using the method proposed by Mazouz et al. [58]), which is on the same order of magnitude as R^1_v when caching is disabled (and thus much larger than R^1_v when caching is enabled), making it very difficult, if not impossible, to introduce the desired

nonlinear frequency change during the co-location tests.

In summary, when the data races take place in the memory through any of the methods we discussed above, the attacker cannot achieve a high probability of observing data races in

both T0 and T1. The hypothesis tests will fail in all cases.

5.3.3 Empirical Security Evaluation

We empirically evaluated the accuracy of the co-location tests. As the primary goal of

the co-location test is to raise alarms when the two threads are not co-located, we define a

false positive as a false alarm (i.e., the co-location test fails) when the two threads are indeed

scheduled on the same physical core, and a false negative as a missed detection (i.e., the

co-location test passes) of the threads’ separation.

False positive rates. A false positive of the co-location tests is approximately the combined type I error of two hypothesis tests (from T0 and T1, respectively). We ran the same code shown in Figure 5.2 on four different processors (i.e., i7-6700, E3-1280 v5, i7-7700HQ, and i5-6200U) without modification. The empirical probabilities of passing unit tests by T0 and T1 on these processors are listed in Table 5.4. These values are estimated by conducting

25,600,000 unit tests. Then, with parameter n = 256 and the corresponding values of p0 and p1, we ran co-location tests with α = 0.01, α = 0.001, and α = 0.0001, respectively. The false positive rates are reported in Table 5.4. Although the empirical values are close to the theoretical values of α, there are cases where the empirical values are 3× the theoretical ones (i.e., on i5-6200U with α = 0.0001). This is probably because of the lack of true randomness and independence in our statistical tests (explained in Section 5.2.2). However, these values are on the same order of magnitude. We believe it is reasonable to select a desired α value to approximate false positives in practice.

False negative rates. A false negative of the co-location test is approximately the type II error of the hypothesis test. We particularly evaluated the following four scenarios:

1. The adversary simply places the two threads on two physical cores without interfering

with their execution.

2. The adversary simply places the two threads on two physical cores, and further reduces

the frequency of the two physical cores to 800 MHz.

3. The adversary simply places the two threads on two physical cores, and further

disables caching on the cores on which the two threads run, by setting the CD flag.

                          false positive rates (α =)
CPU           p0      p1      0.01     0.001    1e−4
i7-6700       0.969   0.968   0.005    5e−4     4e−5
E3-1280 V5    0.963   0.948   0.004    4e−4     5e−5
i7-7700HQ     0.965   0.950   0.005    5e−4     2e−4
i5-6200U      0.968   0.967   0.006    0.001    3e−4

Table 5.4: Evaluation of false positive rates.

4. The adversary simply places the two threads on two physical cores, and creates 6

threads that concurrently PRIME the same LLC cache set to which the shared variable

V is mapped.

We ran 100,000 co-location tests for each scenario. The tests were conducted on the i7-6700 processor, with parameters n = 256, p0 = 0.969, p1 = 0.968, and α = 0.0001. Results are shown in Table 5.5. Columns 2 and 3 of the table show p̂0 and p̂1, the probabilities of passing unit tests under the considered scenarios, respectively. We can see that in all cases, the probabilities of observing data races from T0 and T1 are very low (e.g., 0.03% to 2.2%).

In all cases, the co-location tests fail, which suggests we have successfully detected that the two threads are not co-located. We only show results with α = 0.0001 because larger

α values (e.g., 0.01 and 0.001) will lead to even lower false negative rates. In fact, with the data collected in our experiments, we could not achieve any false negatives even with a much smaller α value (e.g., 1e−100). This result suggests it is reasonable to select a rather small α value to reduce false positives while preserving security guarantees. We leave the decision to the user of HYPERRACE.

Scenario   p̂0       p̂1       false negative rates (α = 1e−4)
1          0.0004    0.0007    0.000
2          0.0003    0.0008    0.000
3          0.0153    0.0220    0.000
4          0.0013    0.0026    0.000

Table 5.5: Evaluation of false negative rates.

5.4 Protecting Enclave Programs with HYPERRACE

In this section, we introduce the overall design and implementation of HYPERRACE that

leverages the physical core co-location test presented in the previous sections.

5.4.1 Safeguarding Enclave Programs

HYPERRACE is a compiler-assisted tool that compiles a program from source code into

a binary that runs inside enclaves and protects itself from Hyper-Threading side-channel

attacks (as well as other same-core side-channel attacks).

At the high-level, HYPERRACE first inserts instructions to create a new thread (i.e., the

shadow thread) at runtime, which shares the same enclave with the original enclave code

(dubbed the protected thread). If the enclave program itself is already multi-threaded, one

shadow thread needs to be created for each protected thread.

HYPERRACE then statically instruments the protected thread to insert two types of

detection subroutines at proper program locations, so that these subroutines will be triggered

periodically and frequently at runtime. The first type of subroutines is designed to let the

enclave program detect AEXs that take place during its execution. The second type of

subroutines performs the aforementioned physical-core co-location tests. The shadow thread

is essentially a loop that spends most of its time waiting to perform the co-location test.

At runtime, the co-location test is executed first when the protected thread and the

shadow thread enter the enclave, so as to ensure the OS indeed has scheduled the shadow

thread to occupy the same physical core. Once the test passes, while the shadow thread runs

in a busy loop, the protected thread continues the execution and frequently checks whether

an AEX has happened. Once an AEX has been detected, which may be caused by either

a malicious preemption or a regular timer interrupt, the protected thread will instruct the

shadow thread to conduct another co-location test and, if passes, continue execution.

AEX detection. HYPERRACE adopts the technique introduced by Gruss et al. [36] to detect

AEX at runtime, through monitoring the State Save Area (SSA) of each thread in the enclave.

Specifically, each thread sets up a marker in its SSA, for example, writing 0 to the address within SSA that is reserved for the instruction pointer register RIP. Whenever an AEX

occurs, the current value of RIP overrides the marker, which will be detected by inspecting

the marker periodically.

When an AEX is detected, the markers will be reset to 0. A co-location test will then be performed to check the co-location of the two threads, because an AEX may indicate

a privilege-level switch—an opportunity for the OS kernel to reschedule one thread to a

different logical core. By the end of the co-location test, AEX detection will be performed

again to make sure no AEX happened during the test.

Co-location test. To check the co-location status, HYPERRACE conducts the physical-core

co-location test described in Section 5.2 between two threads. Since the shared variable in

the test is now in the enclave memory, the adversary has no means to inspect or modify its value. Once the co-location status has been verified, subsequent co-location tests are only

needed when an AEX is detected.

5.4.2 Implementation of HYPERRACE

HYPERRACE is implemented by extending the LLVM framework. Specifically, the

enclave code is compiled using Clang [1], a front-end of LLVM [5] that translates C code

into LLVM intermediate representation (IR). We developed an LLVM IR optimization pass

that inserts the AEX detection code (including a conditional jump to the co-location test

routine if an AEX is detected) into every basic block. Further, we insert one additional AEX

detection code every q instructions within a basic block, where q is a parameter we could

tune. Checking AEX in every basic block guarantees that secret-dependent control flows are

not leaked due to side-channel attacks; adding additional checks prevents data-flow leakage.

We will evaluate the effects of tuning q in Section 5.5.

The shadow thread is created outside the enclave and system calls are made to set the

CPU affinity of the protected thread and the shadow thread prior to entering the enclave. We

use spin locks to synchronize the co-location test routines for the protected thread and the

shadow thread. Specifically, the shadow thread waits at the spin lock until the protected

thread requests a co-location test. If the co-location test fails, the enclave program reacts

according to a pre-defined policy, e.g., retries r times and, if all fail, terminates.

5.5 Performance Evaluation

In this section, we evaluate the performance overhead of HYPERRACE. All experiments were conducted on a Dell Optiplex 7040 machine with an Intel Core i7-6700 processor and

32GB memory. The processor has four physical cores (8 logical cores). The parameter α of

the co-location tests was set to 1e−6; p0, p1, and n were the same as in Section 5.3.

5.5.1 nbench

We ported nbench [6], a lightweight benchmark application for CPU and memory performance testing, to run inside SGX and applied HYPERRACE to defend it against

Hyper-Threading side-channel attacks.

Contention due to Hyper-Threading itself. Before evaluating the performance overhead of HYPERRACE, we measured the execution slowdown of nbench due to contention from the co-located logical core. This slowdown is not regarded as an overhead of HYPERRACE, because the performance of an enclave program is expected to be affected by resource contention from other programs; a co-located thread running a completely unrelated program is normal.

We set up two experiments: In the first experiment, we ran nbench applications with a shadow thread (busy looping) executing on a co-located logical core; in the other experiment, we ran nbench with the co-located logical core unused. In both cases, the nbench applications were compiled without HYPERRACE instrumentation. In Figure 5.8, we show the normalized number of iterations per second for each benchmark application when a shadow thread causes resource contention; the normalization was performed by dividing by the number of iterations per second when the benchmark occupies the physical core by itself.

As shown in Figure 5.8, the normalized number of iterations ranges from 67% to 98%.

For instance, the benchmark numeric sort runs 1544.1 iterations per second with a shadow thread while 1635.2 iterations per second without it, which leads to a normalized value of 1544.1/1635.2 = 0.944. The following evaluations do not include the performance degradation due to the Hyper-Threading contention.

Figure 5.8: Normalized number of iterations of nbench applications when running with a busy looping program on the co-located logical core.

Overhead due to frequent AEX detection. The performance overhead of HYPERRACE

consists of two parts: AEX detection and co-location tests. We evaluated these two parts

separately because the frequency of AEX detection depends on the program structure (e.g.,

control-flow graph) while the frequency of the co-location tests depends on the number of

AEXs detected. We use the execution time of non-instrumented nbench applications (still

compiled using LLVM) with a shadow thread running on the co-located logical core as the

baseline in this evaluation.

To evaluate the overhead of AEX detection, we short-circuited the co-location tests even when AEXs were detected in HYPERRACE. Hence no co-location tests were performed.

Figure 5.9 shows the overhead of AEX detection. Note that q = Inf means that there is

only one AEX detection at the beginning of every basic block; q = 5 suggests that if there

Figure 5.9: Runtime overhead due to AEX detection; q = Inf means one AEX detection per basic block; q = 20/15/10/5 means one additional AEX detection every q instructions within a basic block.

are more than 5 instructions per basic block, a second AEX detection is inserted; q = 20,

q = 15, and q = 10 are defined similarly. Since each instrumentation for AEX detection (by

checking SSA) consists of two memory loads (one SSA marker for each thread) and two

comparisons, when the basic blocks are small, the overhead tends to be large. For example,

the basic blocks in the main loop of the assignment benchmark contain only 3 or 4 instructions each, so the overhead of HYPERRACE on assignment is large (i.e., 1.29×) even with q = Inf. Generally, the overhead increases as more instrumentations

are added. With q = Inf, the overhead ranges from 0.8% to 129.3%, with a geometric mean

of 42.8%; when q = 5, the overhead ranges from 3.5% to 223.7%, with a geometric mean of

101.8%.

            Original   q = 20    q = 15    q = 10    q = 5
Bytes       207,904    242,464   246,048   257,320   286,448
Overhead    -          16.6%     18.3%     23.7%     37.7%

Table 5.6: Memory overhead (nbench).

Overhead due to co-location tests. The overhead of co-location tests must be evaluated when the number of AEXs is known. HYPERRACE triggers a co-location test when an AEX

happens in one of the two threads or both. By default, the operating system generates timer

interrupts and other types of interrupts to each logical core. As such, we observe around 250

AEXs on either of these two threads per second. To evaluate the overhead with increased

numbers of AEXs, we used high-resolution timers in the kernel (i.e., hrtimer) to

induce interrupts to cause more AEXs. The overhead is calculated by measuring the overall

execution time of one iteration of the nbench applications, which includes the time to

perform co-location tests when AEXs are detected.

We fixed the instrumentation parameter at q = 20 in the tests. The evaluation results

are shown in Figure 5.10. The overhead of AEX detection has been subtracted from the

results. From the figure, we can tell that the overhead of co-location tests is small compared

to that of AEX detection. With 250 AEXs per second, the geometric mean of the overhead

is only 3.5%; with 1000 AEXs per second, the geometric mean of the overhead is 16.6%.

The overhead grows almost linearly with the number of AEXs.

Memory overhead. The memory overhead of the enclave code is shown in Table 5.6.

We compared the code size without instrumentation and that with instrumentation under

different q values. The memory overhead ranges from 16.6% to 37.7%.

Figure 5.10: Runtime overhead of performing co-location tests when q = 20.

5.5.2 Cryptographic Libraries

We also applied HYPERRACE to the Intel SGX SSL cryptographic library [4] and

measured the performance overhead of eight popular cryptographic algorithms. We run

each algorithm repeatedly for 10 seconds and calculated the average execution time for

one iteration. Figure 5.11 gives the overhead (for both AEX detection and co-location tests) when instrumented every q = 20 instructions per basic block, with no extra AEXs introduced

(the default 250 AEXs per second).

The overhead for AES_decrypt algorithm is small (around 2%) compared to other

algorithms since its dominating basic blocks are relatively large. In contrast, the overhead

for ECDH_compute_key and ECDSA_sign are relatively large (i.e., 102.1% and 83.8%)

because elliptic curve algorithms consist of many small basic blocks. The overhead for other

Figure 5.11: Overhead of crypto algorithms.

evaluated algorithms ranges from 14.6% to 49.6%. The geometric mean is 36.4%. The size of the compiled static trusted library libsgx_tsgxssl_crypto.a grew from 4.4 MB to 6.6 MB, resulting in a memory overhead of 50%.

5.6 Summary

In summary, HYPERRACE is a tool for protecting SGX enclaves from Hyper-Threading side-channel attacks. The main contribution of our work is the proposal of a novel physical- core co-location test using contrived data races between two threads running in the same enclave. Our design guarantees that when the two threads run on co-located logical cores of the same physical core, they will both observe data races on a shared variable with a close-to-one probability. Our security analysis and empirical evaluation suggest that the adversary is not able to schedule the two threads on different physical cores while keeping

the same probability of data races that are observed by the enclave threads. Performance evaluation with nbench and the Intel SGX SSL library shows that the performance overhead due to program instrumentation and runtime co-location tests is modest.

Chapter 6: Securing TEEs with Verifiable Execution Contracts

In the previous chapter, we introduced a mitigation scheme for closing Hyper-Threading side channels. In this chapter, we aim to design a defense mechanism that addresses all threats described in Section 2.4. We propose the concept of verifiable execution contracts, which define a guaranteed execution environment for running SGX enclaves on an untrusted OS, such that side-channel attacks and OS-dependent attacks become infeasible. Specifically, we design three categories of execution contracts and detail the required modifications of the OS kernel to fulfill the execution contracts and the enclave code instrumentation to verify the execution contracts at runtime. Then, we analyze the security gained by the execution contracts, and demonstrate through a set of examples how various side-channel attacks against enclaves can be prevented. We also introduce a prototype implementation of the proposed verifiable execution contracts and evaluate their effectiveness and efficiency.

This chapter is organized as follows. Section 6.1 gives an overview of verifiable ex- ecution contracts. Section 6.2 presents the details of various execution contracts and the security guarantees they could provide. Section 6.3 discusses how to reliably verify the pro- posed execution contracts. Section 6.4 describes the implementation details and Section 6.5 presents the evaluation results of security and performance. Section 6.6 discusses potential mitigations of attacks that could break SGX’s confidentiality guarantee using verifiable

101 execution contracts when additional microcode updates are available. Section 6.7 presents a

discussion of limitations and application scenarios, and Section 6.8 summarizes this chapter.

6.1 Overview

In this section, we first describe the limitations of existing mitigation schemes. Then, we will present an overview of our proposed solution, verifiable execution contracts, to address

these limitations.

6.1.1 Limitations of Existing Defenses

Defenses have been proposed to address the aforementioned threats. Déjà Vu [30] aims to detect page-fault based attacks. Cloak [36] makes use of TSX to address cache side-

channel attacks. HyperRace [29] mitigates same-core side-channel attacks enabled by

Hyper-Threading, and Varys [62] protects enclaves from cache timing and page table side-

channel attacks. Essentially, these methods resemble signature-based intrusion detection

techniques. They need to correctly depict the characteristics of side-channel attack activities

to establish their signatures, and then monitor the system behavior at runtime to compare it with these signatures.

However, as mentioned by Wagner et al. [86], “signature-based schemes are typically

trivial to bypass simply by varying the attack slightly.” This is also true for these side-channel

attack detection schemes. For example, Oleksenko et al. assumed an AEX rate of 100Hz

as “normal” [62], because regular interrupts will be delivered to the CPU core running the

enclave program at such a rate. However, this allows the adversary to launch attacks at a

rate lower than 100Hz and remain undetected. Although the fidelity of the attack is low, the

adversary could repeatedly sample the execution traces of the same algorithm and improve the

attack.

The root cause of such evasion attacks is that the signatures of side-channel attacks (e.g.,

AEXs) are not unique to attack activities. Some normal system activities share the same

traits. Thus, to avoid false alarms, it is inevitable to tune the intrusion detection system

towards lower false positives and higher false negatives. Therefore, sneaky attacks can be

performed by mimicking system activities.

Our solution to this challenge is to constrain normal operating system operations that

resemble side-channel attacks, such that system activities are not falsely detected as attacks.

As such, the attack detection system can be tuned to eliminate false negatives while keeping very low false positives. Thus, mimicry attacks become very difficult, if not impossible.

The philosophy behind this idea is to request the operating system to trade off performance

and compatibility of its operations for stronger security guarantees of trusted execution

environments.

6.1.2 Verifiable Execution Contracts as Defense

In this chapter, we advocate for a new concept, which we call execution contracts, to

describe the execution environment for an SGX enclave provided by the underlying OS. To

defend SGX enclaves against various malicious attacks from the OS, the enclave owner first

negotiates an execution contract with the hosting OS, so that the enclave is guaranteed to be executed in a setting in which the attacks cannot be performed.

The rationale is that since enclave programs typically have higher security requirements,

it is reasonable to ask the hosting OS to take the burden of restricting its regular operations

and complying with a contract that is more favorable to the enclaves. Such a customized

execution environment aims to minimize or even eliminate potential attack windows, making

it quite difficult, if not impossible, for the adversary to launch attacks. For example, if

the service provider agrees to modify its OS so that no page fault would occur during the

enclave's execution, page-fault based attacks would be eliminated completely. In this chapter, we envision the following three categories of execution contracts that may be provided:

• Resource reservation contracts reserve certain resource for the enclave exclusively to

prevent concurrently executed side-channel attacks.

• Runtime interaction contracts restrict the interaction between the OS and the enclave

during runtime to mitigate side-channel attacks by interleaved execution.

• Service contracts request the OS not to abuse its role in managing enclaves and handling enclaves' communication.

These contracts will be detailed in Section 6.2. Note that some threats might not be removed by a single contract; we will show later that a combination of multiple contracts can not only mitigate various attacks but also provide better performance than adopting a single contract.

Verifiable execution contracts. As the execution contracts are provided by the untrusted

OS, the ability for the enclave program to verify that the contracts are fulfilled correctly

(without the help from a privileged software component) is crucial. We will use the term

verifiable execution contracts to represent the subset of all execution contracts that could be verified. We will discuss the verification methods of the verifiable execution contracts in

Section 6.3.

6.2 Execution contracts

In this section, we mainly describe three categories of execution contracts and how they

could be leveraged to mitigate threats against SGX enclaves.

Figure 6.1: Resource reservation contracts

6.2.1 Construction of Execution Contracts

We consider three categories of execution contracts: resource reservation contracts, runtime interaction contracts, and service contracts. Resource reservation contracts ask the OS to reserve execution resources for the enclaves exclusively; runtime interaction contracts regulate the interaction between the OS and the enclaves at runtime; service contracts determine how the OS serves enclave applications' functionalities. The first two categories of contracts mainly target side-channel threats, while the last aims to address

OS-dependent threats.

6.2.1.1 Resource Reservation Contracts

As discussed in Section 2.4, side channels are usually introduced due to the shared re- sources between the victim and the adversary. Resource reservation contracts are introduced to mitigate side channels with a focus on concurrent exploits, as shown in Figure 6.1. In this chapter, we introduce two such contracts: Hyper-Threading Control to address same-core side channels and LLC Reservation to deal with cross-core side channels.

Hyper-Threading Control: When the enclave code is executed on one hyper-thread, no

other processes or kernel threads may run on the sibling hyper-thread.

Hyper-Threading enables the adversary to run the attack code on the sibling hyper-

thread of the victim code to spy on the victim concurrently, via same-core side channels,

such as BPU, store buffer, TLB, and L1/L2 caches. The Hyper-Threading Control contract

is introduced to close these side channels, by requesting the OS to either disable Hyper-

Threading before system booting or enable Hyper-Threading with a guarantee that once

a program enters enclave mode, its sibling hyper-thread is either left unused or reserved for the code

of the same enclave.

LLC Reservation: The OS must reserve a specified portion of the last-level cache for

the enclave code to execute exclusively.

This contract deals with cross-core LLC attacks. LLC is shared between multiple cores,

and thus it enables cross-core LLC attacks. By reserving a portion of the LLC to the enclave

program, it guarantees that the enclave program does not share the LLC with any other

programs during its execution. Thus, side-channel attacks that exploit the shared LLC

can be mitigated. As the EPC memory of an enclave is by design not shared with other

enclaves or software outside the enclaves, reuse-based side channels (e.g., Flush+Reload,

Flush+Flush) are not possible. LLC partitioning can eliminate contention-based side channels

(e.g., Prime+Probe, Evict+Reload, Evict+Time). Methods to achieve LLC partitioning

include cache-set partitioning (or page coloring) [94,97], or cache-way partitioning using

Intel Cache Allocation Technology [56], etc.

Figure 6.2: Runtime interaction contracts

6.2.1.2 Runtime Interaction Contracts

Besides concurrent exploits, some side channels could be exploited by interleaving the execution of the victim and the adversary. Existing attacks usually require a high interleaving frequency to get fine-grained execution information [84], leading to frequent context switches. To preempt the execution of the enclaves, the adversary needs either to generate frequent interrupts to the CPU core on which the enclave runs or manipulate the page tables of the enclave memory to trigger page faults when certain memory pages are accessed by the enclave. These approaches have been used in most prior work on SGX side channel attacks [26, 35, 38, 53, 68, 87, 91]. Hence, we propose to introduce Runtime interaction contracts to regulate the interaction between the enclave and the OS, as shown in Figure 6.2.

Exception Regulation: Exceptions (e.g., page faults) cannot occur during the enclave code’s critical sections.

As demonstrated in [91], the OS could arbitrarily introduce page faults to infer the page-level access pattern of the enclave. We propose this contract to constrain the OS's behavior. Specifically, the OS should ensure that (1) all EPC pages of the enclave are loaded before calling into the enclave, (2) no EPC page of the enclave will be swapped out during execution, (3) the reserved flag or the present flag of the page table entries is not manipulated to intentionally trigger page faults, and (4) the segmentation-level and page-level access permissions are correctly configured.

In SGX1, since the entire memory range can be loaded in advance, page faults can be eliminated completely. In SGX2, dynamic memory allocation and commitment might be implemented based on page faults [90]. In this case, the enclave knows that page faults could only occur when it requests to allocate memory via APIs such as malloc(). Therefore, by excluding such API calls from the execution of the critical sections of the enclave, it is possible to remove all observable traces from the corresponding side channels.

Interrupt Disabling: The OS must disable all interrupts, including I/O interrupts and local timer interrupts, for the cores when the enclave runs inside its critical sections.

While interrupts (i.e., timer interrupts and I/O interrupts) are quite normal in any OS, they increase the difficulty of side-channel attack detection by SGX enclaves. Defenders are forced to tolerate a certain rate of “benign” AEXs in their signature-based detection schemes. However, such “benign” AEXs could be leveraged by the adversary to monitor the enclave’s execution without being detected. While I/O interrupts could be redirected to other cores without causing additional problems, disabling local timer interrupts interferes with the scheduling mechanisms, as an enclave cannot be preempted before it finishes its critical section. This is a trade-off between security and performance/compatibility. In cases where multiplexing has high priority, we would request the enclave developers to break their enclave code into small critical sections, allowing the OS to schedule other enclaves or processes in between the critical sections.

6.2.1.3 Service Contracts

We introduce service contracts to address OS-dependent threats against SGX enclaves.

Currently, the SGX trusted platform service provides two trusted services: monotonic coun-

ters and trusted timers. Monotonic counters are managed in a secure database maintained

by the PSE. The integrity and replay protection of this secure database are ensured by the

replay-protected storage in the CSME. The monotonic counter could be used to address

replay attacks. However, it has a large latency and supports only a limited number of writes [57]. Therefore, it is not well-suited for applications that require frequent use of such monotonic

counters. The trusted timer service is provided by PSE using the CSME Protected Real-Time

Clock based timer. The trusted timer enables the enclaves to calculate the time interval (in

seconds) between two timer reads. However, as the untrusted OS forwards the requests to

and responses from the trusted timer service, the enclave code is only guaranteed to obtain a

lower bound of the time interval measurement. For example, when two reads indicate a time

interval of 2 seconds, it only ensures that the actual time interval is at least 2 seconds, because the adversary could delay the response to the second read arbitrarily. We introduce two service contracts, Check Point and Timely Response, to address these problems.

Check Point: The “checkpoint” function called by an enclave to store its application

state must be completed by the OS before destructing the enclave.

The Check Point contract prevents replay attacks against stateful enclave applications.

Specifically, it tries to enforce an inc-then-store method of counter-based replay protection [57]: the enclave will first increment its trusted counter after it is initialized or restored

from a previously stored state. After updating its internal states, the enclave then seals the

updated states together with the counter value. In this way, multiple state updates require only one counter operation. However, if the enclave crashes before storing the sealed states

and counter value, it cannot be restored. Hence, the Check Point contract requests the OS

to ensure the execution of a checkpoint function, which would store the updated states and

counter value, before destructing the enclave.
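The inc-then-store flow can be sketched as follows; the monotonic counter and the sealed tuple are in-memory mocks standing in for the PSE counter service and SGX sealed storage.

```python
# Sketch of the inc-then-store replay protection behind the Check Point
# contract. MonotonicCounter and the sealed tuple are in-memory mocks of
# the PSE monotonic counter and SGX sealed storage.

class MonotonicCounter:
    """Mock of a PSE monotonic counter."""
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

def restore(sealed, counter):
    """Unseal the stored state and check it against the trusted counter."""
    state, version = sealed
    if version != counter.value:
        raise RuntimeError("replay detected: stale sealed state")
    return state

def update_and_checkpoint(sealed, counter, new_state):
    """Inc-then-store: bump the counter first, then seal state + counter.
    If the enclave is destroyed between the two steps, restore() fails,
    which is why the contract requires the OS to let the checkpoint
    function complete before destructing the enclave."""
    restore(sealed, counter)        # validate the current state
    version = counter.increment()   # (1) increment the trusted counter
    return (new_state, version)     # (2) seal new state with counter value
```

Replaying an older sealed blob then fails the version check in restore(), which is exactly the inconsistency the contract is designed to surface.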

Timely Response: The OS must process the enclave’s service requests, e.g., system

calls, network I/O, disk I/O, etc., in a timely manner: the delay of these operations must be controlled within a specified time threshold.

We introduced this contract to deal with the possible delays in reading the trusted timer

from local hardware, or from a remote time server. If the remote server implements a

trusted wall clock timer, the enclave could request a wall clock time from the server with a

controlled error. Other system services can also be guaranteed to finish within a bounded

time, maintaining the correct system abstraction. The enclave will estimate the elapsed time

between the time the request is issued and the time the response is returned, and compare

the estimated elapsed time with a specified time threshold to detect possible violations.

6.2.2 Security Guarantees

As shown in Table 6.1, enforcing some of the above execution contracts will help the

enclave mitigate the threats listed in Table 2.2. It is worth noting that most of the mitigation mechanisms listed in the table require enforcing both of the runtime interaction contracts,

i.e., Exception Regulation and Interrupt Disabling. This is because, with these two contracts combined, enclaves are guaranteed to execute in an AEX-free execution window, within which the enclave's execution will not be preempted by any other process. We will soon see that such a condition is important for preventing most side-channel leakage and OS

manipulation.

Micro-architectural side channels:

    Side Channels    Mitigations          Deterministic?
    FPU              HC (+ ER + ID)       depends on HC
    Store buffer     HC (+ ER + ID)       depends on HC
    TLBs             HC (+ ER + ID)       depends on HC
    BPU              HC + ER + ID         depends on HC
    L1/L2 cache      HC + ER + ID         depends on HC
    LLC              HC + LR + ER + ID    No
    Page table       HC + ER + ID         depends on HC

OS-dependent threats:

    Threats                Mitigations       Deterministic?
    Replay                 CP                Yes
    Communication delay    TR (+ ER + ID)    Yes

HC: Hyper-Threading Control, LR: LLC Reservation, ER: Exception Regulation, ID: Interrupt Disabling, CP: Check Point, TR: Timely Response.

Table 6.1: Security analysis

FPU, store buffers, TLBs. Per-core CPU resources, such as FPU, store buffers, TLBs, are

only exploitable by malicious software running concurrently on the sibling hyper-thread. The Hyper-Threading Control contract eliminates such attacks completely by disabling

Hyper-Threading or reserving the sibling hyper-thread during enclaves’ execution. In the

latter case, because CPU scheduling and migration could take place when the control of a

CPU is trapped into the kernel, the enforcement of the contract (and hence its verification) needs to take place only right after a context switch. As such, the runtime interaction contracts, i.e., Exception Regulation and Interrupt Disabling, need to be combined with

Hyper-Threading Control.

BPU, L1/L2 caches. Per-core resources like the BPU and L1/L2 caches may be shared by two

programs when they run on the same core. With Hyper-Threading enabled, they can be

exploited in a concurrent manner. Without Hyper-Threading, they are still exploitable in an

interleaving manner. This is because these resources are not cleansed at context switch. To

ensure security, the Hyper-Threading Control contract, the Exception Regulation contract,

and the Interrupt Disabling contract should be enforced.

LLC and paging structures. LLC is shared by multiple cores, which means it could be

leveraged by the adversary from a different physical core. The execution contract that can

prevent LLC exploitation from a different core is the LLC Reservation contract. However, as

the LLC can also be exploited from the same core in an interleaved manner or from a sibling hyper-thread in a concurrent manner, completely closing LLC side channels requires a combination

of the Hyper-Threading Control contract, the Exception Regulation contract, the Interrupt

Disabling contract, and the LLC Reservation contract.

Paging structures can be exploited in two ways: page faults [91] and page table updates [87]. In the latter case, the adversary frequently clears the accessed bit of the PTEs; memory accesses that lead to page-table walks will then set the accessed bit of the corresponding PTEs,

leaving observable traces to the adversary. However, to repeatedly observe side-channel

leaks from victim’s accessing of the same page, the TLBs entries for the PTE must be flushed

(to enforce a page-table walk). The known methods to flush the TLBs of a core occupied by the enclave code include interrupting its execution and switching CR3 in kernel mode, or priming the TLB entries (in the same manner as priming cache lines in a cache side-channel attack)

from a sibling hyper-thread. Hence, the Hyper-Threading Control contract together with the

Exception Regulation and Interrupt Disabling contracts (that create AEX-free execution win-

dows) are sufficient to address cross-core page-table side channel attacks. In fact, the same

set of execution contracts also addresses page-fault side-channel attacks [91]. Therefore, the

adversary cannot conduct any page-access side-channel analysis by manipulating the PTEs

if these three execution contracts are followed.

6.2.3 Remaining Challenges

While we have shown that execution contracts effectively defeat side-channel and OS-

dependent attacks, two technical challenges remain:

• Verification without system support. Due to SGX's threat model, any information outside the ELRANGE, including inputs provided by the OS, is untrusted. The limited trusted

information about the execution environment makes verification within the enclave very

challenging. We discuss how enclave code can verify the compliance of execution

contracts in Section 6.3.

• OS modification. Modifying the OS to fulfill the execution contracts could also be challenging. We discuss how to overcome some of these challenges in Section 6.4.

6.3 Verifiability

6.3.1 Available Signals

To verify an execution contract, we need to consider two aspects: what can be used for

verification and how to use them. For an execution contract to be verifiable, any violation should produce some signals that can be observed within the enclave. Note that what the enclave can trust is the memory content within the ELRANGE and the underlying hardware. We categorize the signals into two types:

• Deterministic signals. If a signal caused by contract violation cannot be cleared by the

adversary, we call the signal a deterministic signal. Such signals could be checked after the enclave's execution to detect whether any violation happened during the execution. For example, if the enclave sets a marker in the SSA, any AEX will overwrite the marker, and the adversary has no means to restore it.

• Probabilistic signals. If a signal can be cleared over time (e.g., by the adversary), it

is considered a probabilistic signal. Such signals need to be monitored continuously

to increase the chance of violation detection. For example, in HyperRace [29], the

monitored signal is the probability of observing data races, which can be cleared by the

adversary if the co-location is resumed.

6.3.2 Verifiability Models

Now that we know what signals to monitor, the next step is to produce the verification result

from the observed signals. According to the signal types being used, we introduce two verification models:

• Deterministic verification model. For deterministic signals, the verification method is

straightforward: check the signal before and/or after the execution of critical sections.

Once the specific signal is detected, we could confirm a violation.

• Probabilistic verification model. Verification of execution contracts using probabilistic

signals needs to perform hypothesis testing.

Verification with probabilistic signals via hypothesis testing. Let X_i (where i = 1, 2, ..., n) denote the i-th trial of detecting the signal; X_i = 1 indicates that the signal is observed and X_i = 0 otherwise. Since the violation behaviors could be quite different and thus might result in quite different probabilities of observing the signals, we choose to measure the probability of observing the signal when the contract is kept (which should be small or even equal to 0), and consider abnormal probabilities as indicators of violations. Let p denote the probability that the signal is observed when the contract is kept, and let p̂ be the actual ratio of observing signals in n tests. We have the null and alternative hypotheses:

    H0: p̂ ≤ p; the contract is kept.
    H1: p̂ > p; the contract is violated.

Assume the tests X_i (where i = 1, 2, ..., n) are independent random variables. The sum of the n random variables, X = Σ_{i=1}^{n} X_i, follows a binomial distribution with parameters n and p, which can be approximated by a normal distribution N(np, np(1 − p)) when n is large. Given a Type I error (i.e., false positive) rate α, we have

    Pr[ (X − np) / √(np(1 − p)) > u_α ] = α,    (6.1)

from which we could calculate a threshold

    X̄ = np + u_α √(np(1 − p)),    (6.2)

such that, when X > X̄, we will reject the null hypothesis and consider the contract violated.

To calculate the Type II error (i.e., false negative) rate, consider the contract being violated in a way such that signals are observed with probability p′. The probability of accepting H0 (that the contract is not violated) when H1 is true, i.e., the contract is in fact violated, is

    Pr[ X ≤ X̄ | X ∼ N(np′, np′(1 − p′)) ]
        = Pr[ Z ≤ (X̄ − np′) / √(np′(1 − p′)) | Z ∼ N(0, 1) ].    (6.3)

Note that for a specific execution contract, there could be various types of violations

resulting in different kinds of signals, deterministic and/or probabilistic. When only deterministic signals are used, the verification results are also deterministic and thus free of false negatives. Otherwise, the verification results might suffer from false negatives. We aim to

reduce false negatives in our detection.

6.3.3 Verification of Proposed Contracts

6.3.3.1 Verification of Runtime Interaction Contracts

To verify the Exception Regulation contract or the Interrupt Disabling contract, ideally, the SGX enclave would be notified whenever an AEX takes place. However, the current SGX design does not offer such hardware support.

We propose to verify the Exception Regulation contract or the Interrupt Disabling

contract using the following method: after the enclave code has entered a code region where

AEXs are undesired (i.e., an AEX-free execution window), it places a marker (e.g., by setting

the memory field storing RIP to 0) in its SSA. If an AEX takes place, the marker value will

be overwritten [36]. At the end of the AEX-free execution window, the enclave checks again

the marker in the SSA. If it has not changed since the last check, the execution window is guaranteed to have been AEX-free. Otherwise, either the Exception Regulation contract or the Interrupt Disabling contract is violated. It is challenging to accurately distinguish violations of the two contracts, as the exit reason reported in the EXITINFO field of the SSA can be

overwritten by subsequent AEXs and will be lost if not examined in time.
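The marker mechanism can be illustrated with a small simulation; the SSA and the hardware's AEX behavior are mocked, and the marker value 0 mirrors the RIP-field marker described above.

```python
# Simulation of the SSA-marker check for AEX-free execution windows.
# MockSSA stands in for the State Save Area; on a real AEX the hardware
# overwrites the saved RIP field, which serves as the deterministic signal.

MARKER = 0  # the enclave sets the RIP field in the SSA to 0

class MockSSA:
    def __init__(self):
        self.rip = None
    def aex(self, interrupted_rip):
        self.rip = interrupted_rip  # hardware saves the real RIP on an AEX

def begin_aex_free_window(ssa):
    ssa.rip = MARKER                # place the marker

def window_was_aex_free(ssa):
    return ssa.rip == MARKER        # unchanged marker => no AEX occurred
```

Because a real instruction pointer is never 0, any AEX leaves the marker overwritten, and the adversary cannot write into the SSA to restore it.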

6.3.3.2 Verification of Resource Reservation Contracts

For the Hyper-Threading Control contract, disabling Hyper-Threading can be verified during remote attestation by checking the CPUSVN. However, since disabling Hyper-Threading will introduce a significant performance loss, we aim to design an alternative solution. The idea of reserving the sibling hyper-thread and then verifying the reservation result has been studied

in [29,36,62]. Specifically, these works use co-location tests (by measuring probabilistic signals: TSX aborts [36], data race probability [29], and memory access timing [62])

to verify that the scheduling requirement is observed. Since the monitored signals are

probabilistic, the enclave has to periodically perform the tests. Another observation is that

context switches are necessary to reschedule processes to CPU cores. Hence, these works combine co-location tests with AEX detection, and their verification results are based on two observations: failed co-location tests or high AEX rates.

In this chapter, we choose data-race based co-location tests due to the thorough security analysis presented in HyperRace [29] and the low false negative rate (the authors reported that no false negatives were detected during their experiments). Further, when combined with an AEX-free execution window, the adversary cannot change the co-location state during the enclave's execution without introducing an AEX. Hence, only one co-location test is needed at the beginning of the AEX-free execution window, and the Hyper-Threading Control contract is considered observed if no AEX is detected at the end of the AEX-free execution window.

The verification of the LLC Reservation contract is more challenging. Since the LLC

slice is reserved for the enclave, eviction of cache lines caused by any process other than the

enclave would be considered as a violation. To detect LLC cache line evictions, we propose

to use TSX. Specifically, for an enclave thread accessing memory addresses that are potential

targets of LLC-based attacks, we launch an auxiliary enclave thread (which is also used in

Hyper-Threading Control) to repeatedly construct TSX-based transactions and load memory

addresses used by the main enclave thread in these transactions. When the adversary evicts

some of the monitored memory addresses out of LLC, the transaction will abort, which can

be used to detect such attacks. While the enclave itself might introduce self-contention and

evict its own cache lines, as long as the memory monitored inside the transactions is small

(see our implementation), self-contention will be rare. Therefore, the verification can be conducted via hypothesis testing (see Section 6.3.2). Empirical evaluation of the approach will be provided in Section 6.5.1.

6.3.3.3 Verification of Service Contracts

To verify the Check Point contract for executing stateful enclave programs, the developer

needs to first identify the code region that updates the enclave states. Before entering this

code region, the state of the enclave is restored from a sealed storage and the version of the

state (maintained as a monotonic counter) is checked (by comparing with the one in the

sealed storage) and incremented. At the end of this code region, the state of the enclave is

stored in the sealed memory together with the value of the monotonic counter. A violation of

the contract would lead to inconsistency of counter values: no sealed data with the specific

monotonic counter value can be provided.

The verification of the Timely Response contract is by checking whether an estimated

response time exceeds a predefined threshold. The response delay can be estimated using

a software clock that is implemented inside the enclave. The challenge is to implement a

software clock in enclave reliably, without trusting the OS. Our solution is to implement the

clock as a counter that is incremented in a loop at constant rate. To prevent the clock from

being interrupted by the OS, the loop is protected by the Interrupt Disabling and Exception

Regulation contracts to execute inside an AEX-free execution window. The only way for the OS to slow down the clock, therefore, is to lower the CPU frequency of the core. However, existing work has shown that the impact of CPU frequency on the accuracy of the clock is

small [30].

There are two ways to implement the software clock: When the auxiliary thread (e.g.,

in the Hyper-Threading Control contract, the sibling hyper-thread is reserved to run an

auxiliary thread of the same enclave) is available, the software clock can be implemented using the auxiliary thread. Before the enclave code invokes an OCall (for system calls or I/O

operations), the software clock is created in the auxiliary thread and the reading is recorded.

After the OCall returns, the elapsed time could be estimated by comparing the difference

between the current clock value and the recorded one.
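The auxiliary-thread clock can be sketched with ordinary threads; the tick rate here is uncalibrated and a sleep stands in for the OCall round trip, so this is an illustration of the protocol rather than an enclave implementation.

```python
# Sketch of the in-enclave software clock: an auxiliary thread increments
# a counter in a tight loop; the tick difference around an OCall bounds
# the response delay. The sleep stands in for an OCall round trip, and
# the tick-to-wall-clock calibration is omitted.
import threading
import time

class SoftwareClock:
    def __init__(self):
        self.ticks = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
    def _run(self):
        while not self._stop.is_set():
            self.ticks += 1     # incremented at a (roughly) constant rate
    def start(self):
        self._thread.start()
    def stop(self):
        self._stop.set()
        self._thread.join()

clock = SoftwareClock()
clock.start()
before = clock.ticks
time.sleep(0.01)                # stands in for an OCall (e.g., a syscall)
elapsed = clock.ticks - before  # compare against a calibrated threshold
clock.stop()
```

In the real design the counting loop itself runs inside an AEX-free execution window, so the OS cannot pause the clock without being detected.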

When the auxiliary thread is not available, the software clock can be implemented in

the same enclave thread. Instead of making OCalls to implement system calls and I/Os,

the enclave code leverages the switchless feature [15] to issue the service requests to the

OS. Specifically, within an AEX-free execution window, the enclave writes a request (e.g.,

making a system call) to a predefined memory address outside its ELRANGE and asks a host

program that runs concurrently on another core to process the request on its behalf. The

result will be returned to another predefined memory address (outside the enclave) which the

enclave monitors. While the request is being processed, the enclave maintains the software

clock to estimate the delay of the OS service.
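The switchless path can be sketched in the same spirit; a shared dictionary stands in for the predefined memory outside the ELRANGE, and a worker thread for the untrusted host program. The request string is purely illustrative.

```python
# Sketch of the switchless request path: the enclave writes a request to
# untrusted shared memory and polls for the response while counting clock
# ticks. The dict mocks memory outside ELRANGE; the worker mocks the host.
import threading

shared = {"req": None, "resp": None}  # stands in for untrusted memory

def host_worker():
    while shared["req"] is None:      # host polls for a request
        pass
    shared["resp"] = "handled:" + shared["req"]   # OS serves the request

host = threading.Thread(target=host_worker, daemon=True)
host.start()

ticks = 0
shared["req"] = "read(fd)"            # enclave issues the service request
while shared["resp"] is None:         # poll for the response...
    ticks += 1                        # ...while maintaining the clock
host.join()
# `ticks` estimates the OS's response delay; the Timely Response contract
# is considered violated if it exceeds a calibrated threshold.
```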

6.4 Implementation

We implemented a prototype of our verifiable execution contract framework, which

consists of two components: a kernel component to enforce the execution contracts and an enclave component to verify the execution contracts. Linux kernel 4.4.40 and the Intel SGX

SDK (version 2.2.100.45311) were used in our implementation.

6.4.1 Enforcing Execution Contracts

To enforce the execution contracts, we have developed loadable kernel modules to extend

the original functionality of Linux kernels, modified the boot loader configuration files to

alter the boot parameters, and modified the Intel SGX Driver.

Enforcing resource reservation contracts. The Hyper-Threading Control contract does

not require particular modification on the OS side. Disabling Hyper-Threading can be

achieved by configuring the BIOS. For the other option, the host program could configure

on which logical cores to run the enclave threads.

The LLC Reservation contract requires the OS to reserve a portion of the LLC exclusively for the enclave. Specifically, on Intel processors that support Intel CAT, the OS could assign the logical cores running the enclave code to a class of service (COS) with an isolated capacity bitmask (CBM).

Enforcing runtime interaction contracts. Fulfilling the runtime interaction contracts is more challenging: it requires modifications to the Intel SGX SDK and the Intel SGX Driver, as well as the implementation of a kernel module.

To fulfill the Exception Regulation contract, the OS needs to map virtual addresses

in ELRANGE to EPCs in advance and retain the mappings throughout the execution of the

enclave code, so that the enclave will not encounter any page faults. Specifically, we modified the Intel SGX SDK such that, whenever an ECall function is invoked, a request

to load all EPC pages of the called enclave will be made before entering the enclave mode.

The request handling is implemented as an input/output control (ioctl) function in the Intel SGX driver, which checks all pages in the ELRANGE of the calling enclave. If any page is found to be swapped out, the

driver will swap it back. Since the range of the PRM is limited, when there are not enough

available EPC pages to hold the swapped-back pages, EPC pages of other enclaves will be

swapped out to make room for the calling enclave.

To fulfill the Interrupt Disabling contract, the OS should only disable interrupts on the

cores during the enclave’s critical sections and be able to re-enable interrupts afterwards.

Specifically, we used the scheduler option isolcpus in the Linux kernel configuration to isolate CPUs from the kernel scheduler, so that the OS will not schedule other tasks on the logical core that runs the enclave code, which could otherwise trigger interrupts. However, this alone could

not eliminate local timer interrupts. To control local timer interrupts, we implemented a

kernel module to provide APIs for configuring the local Advanced Programmable Interrupt

Controller (APIC). Specifically, we disabled and enabled the APIC using the APIC software

enable/disable flag in the spurious-interrupt vector register, according to Intel manual [11].

Hence, before entering the enclave’s critical sections, the local APIC will be disabled via the

kernel module. After leaving the enclave’s critial sections, the local APIC will be enabled

again.

Enforcing service contracts. For service contracts, modifications are only needed for

enclave programs. No modification is needed for OS kernels. The OS only needs to behave

normally, such as maintaining the normal execution of the enclave program and handling

the enclave’s requests in a timely manner.

6.4.2 Verifying Execution Contracts

We implemented a shared library, called libsgx_vec, to provide APIs for enclave devel-

opers to verify the execution contracts.

Verifying runtime interaction contracts. We provide two APIs for the enclave developers

to label the beginning and end of an AEX-free execution window. At the beginning of the

AEX-free execution window, the markers in the SSAs of the enclave threads are set; at the end of

the AEX-free execution window, the markers are examined. If the markers have been altered, an AEX has taken place within the window, which violates the execution contracts of Exception Regulation

and Interrupt Disabling.

Verifying resource reservation contracts. For the Hyper-Threading Control contract, disabling Hyper-Threading can be verified using remote attestation [2]. However, completely

disabling Hyper-Threading induces a high performance penalty. Reserving the sibling hyper-thread for the same enclave is preferable in practice. Therefore, we adopted the

data race based co-location test scheme introduced in HyperRace [29] in our implementa-

tion. A shadow thread is created and scheduled by the OS scheduler to run on the sibling

hyper-thread. Note that only one co-location test needs to be performed at the beginning

of the AEX-free execution window, since AEX-free execution window guarantees that no

AEX could take place within the interval. Due to the AEX-free guarantee, there is no need

to instrument every basic block of the source code to frequently detect the occurrences of

AEXs, which is a major source of performance overhead of HyperRace [29].

To verify that the LLC is reserved, the shadow thread runs in a loop that creates TSX transactions and measures their abort rate. Within each TSX transaction, the addresses of the enclave function that is currently being executed are loaded. Transaction aborts are used as the indicator of LLC side-channel attacks. To synchronize between

the shadow thread and the main enclave thread, two macros are provided, which are to be

placed at the beginning and the end of each function, so that the range of the instruction

addresses of each function is recorded at the function entrance. Memory addresses in this

range will be loaded inside the TSX transaction. After entering a TSX transaction, the

cache lines of the monitored memory range will be loaded one by one in a loop, e.g., 1000

iterations per transaction. While read-only data could also be protected in the same way as

the code, writable data would trigger aborts if accessed concurrently by the main enclave

thread. To address this, the enclave developer needs to explicitly use TSX transactions to process the writable secrets that could be leaked via data access patterns, as described

in [36].

6.5 Evaluation

In this section, we evaluate the performance and security of our proposed scheme.

To perform the evaluation, we ported nbench, a set of lightweight CPU and memory

performance benchmarks, to run inside SGX enclaves. nbench includes 10 benchmark

applications. It measures and reports the number of iterations of each benchmark performed per second.

6.5.1 Security Evaluation

Since the verification of execution contracts using deterministic signals produces neither false negatives nor false positives, we evaluate only the verification schemes using probabilistic signals. For the Hyper-Threading Control contract, the robustness of the co-location tests was evaluated in [29]. Therefore, we only need to evaluate the security of the LLC

Reservation contract. Due to the lack of a machine that supports both Intel SGX and Intel

CAT, we evaluated the security of the LLC Reservation contract on a PowerEdge R440

server with an Intel(R) Xeon(R) Silver 4114 processor (with CAT support) and 32GB of memory. The enclave was

built and executed in the simulation mode.

With Intel CAT in place, we first measured the probability of transaction aborts due

to self-contention. We observed only 11 aborts due to self-contention in a 1000-second

execution of nbench with a total of 398,314,223 transactions. Hence, we estimated that the

probability of transaction aborts due to self-contention is p = 11/398,314,223 ≈ 2.76e−8.

We implemented an enclave thread to simulate a powerful side-channel attacker. The

enclave thread periodically evicts a random code address of the function currently being executed, using the clflush instruction. We measured the Type II error (false negative) rates under different eviction frequencies.

    Type I error rate (α):                 0.01       0.001      1e−4
    Threshold (X̄):                         0          0          1

    clflush frequency    p′          Type II error rates
    30.0                 6.25e−5     3.95e−7    4.29e−7    4.59e−7
    294                  6.23e−4     4.0e−56    4.4e−56    4.7e−56

Table 6.2: Estimation of Type II errors (false negative rates).

When we perform one hypothesis test every second, n = 398,314,223/1000 ≈ 398,314 transactions are performed and used for the test. Under different values of the Type I

error α, we could calculate the threshold X¯ according to Equation (6.2). The corresponding

Type II error rates with regards to different clflush frequency are also estimated and reported

in Table 6.2. Smaller Type II error rates mean lower chance an attack could escape the

detection. We can see that, with α = 1e − 4, when the attacker performs cache-line eviction

at a rate of 30 per second, which is considerably smaller than the rates used in demonstrated

attacks [62], the Type II error rates is only 4.59e−7, which means it is almost impossible

for the attacker to evade detection.
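These estimates can be reproduced from Equations (6.2) and (6.3); the sketch below uses Python's statistics.NormalDist for the normal quantile and CDF, with p, n, and p′ taken from the measurements above.

```python
# Reproducing the Table 6.2 estimates via the normal approximation in
# Equations (6.2) and (6.3). p, n, and p' are the measured values from
# the text; the code is an illustration, not the evaluation harness.
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)

def threshold(n, p, alpha):
    """Equation (6.2): X_bar = n*p + u_alpha * sqrt(n*p*(1-p))."""
    u_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return n * p + u_alpha * (n * p * (1 - p)) ** 0.5

def type2_error(n, p_prime, x_bar):
    """Equation (6.3): Pr[X <= X_bar] for X ~ N(n*p', n*p'(1-p'))."""
    mu = n * p_prime
    sigma = (n * p_prime * (1 - p_prime)) ** 0.5
    return STD_NORMAL.cdf((x_bar - mu) / sigma)

n = 398_314               # transactions per one-second hypothesis test
p = 11 / 398_314_223      # abort probability due to self-contention
x_bar = threshold(n, p, alpha=1e-4)
beta = type2_error(n, p_prime=6.25e-5, x_bar=x_bar)
# beta comes out around 4.6e-7, close to the 4.59e-7 reported in Table 6.2.
```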

6.5.2 Performance Evaluation

Performance evaluation was conducted on a 5480 laptop with an Intel Core i7-7820HQ processor and 8GB of memory. The processor has 4 physical cores, i.e., 8 logical cores when Hyper-Threading is enabled.

Figure 6.3: Normalized number of iterations per second.

Contention due to a busy sibling hyper-thread. Since occupying the sibling hyper-thread

is necessary for closing both same-core and cross-core side-channel attacks, we first evaluated how a busy sibling hyper-thread could affect the performance of nbench. Specifically, we experimented with two scenarios: one with an idle sibling hyper-thread and another with

a sibling hyper-thread repeatedly executing the pause instruction. We report the normalized

number of iterations by computing the ratio between the number of iterations with a busy

sibling hyper-thread and that with an idle sibling hyper-thread. The results are shown in

Figure 6.3. The normalized number of iterations ranges from 66% to 96%, with a geometric

mean of 80.8%. Note that the level of contention may vary if the sibling hyper-thread runs a different application. Here we used a relatively light workload, i.e., repeatedly executing pause instructions, which is similar to a shadow thread waiting on a spin-lock.


Figure 6.4: Performance gain when Hyper-Threading Control contract and AEX-free execu- tion window are applied.

In this way, we could measure the performance overhead due to the execution contracts more precisely, excluding the performance degradation caused by Hyper-Threading contention.

Overhead for defeating same-core side-channel attacks. As analyzed in Section 6.2.2, to defeat same-core side-channel attacks, the Hyper-Threading Control contract and the two runtime interaction contracts are needed. Our evaluation shows that the performance of the protected application is actually improved. On one hand, due to the AEX-free execution window, the cost of context switches is saved. On the other hand, there is no need to periodically detect AEXs within the AEX-free execution window; only one co-location test

(two memory writes to the markers in the SSAs of both the main thread and the shadow


Figure 6.5: Overhead for defeating both same-core and cross-core side-channel attacks.

thread and two memory reads to check the markers) is needed for each AEX-free execution window. Figure 6.4 shows the performance gain, reflecting the extra percentage of iterations that could be performed per second. The performance gain ranges from 0.1% to 6%, with a geometric mean of 1.8%.
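The per-window co-location test (two writes, two reads) can be modeled abstractly as follows. The dictionaries stand in for the marker fields in the two threads' SSAs, and the marker-clobbering effect of an AEX is simulated rather than real; all names are illustrative.

```python
def write_markers(ssa_main, ssa_shadow, nonce):
    """The two memory writes at the start of an AEX-free window:
    each thread's marker slot receives a fresh nonce."""
    ssa_main["marker"] = nonce
    ssa_shadow["marker"] = nonce

def check_markers(ssa_main, ssa_shadow, nonce):
    """The two memory reads at the end of the window: a mismatch
    means an AEX occurred and the window is considered compromised."""
    return ssa_main["marker"] == nonce and ssa_shadow["marker"] == nonce

def aex(ssa):
    """Simulated AEX: the hardware state save overwrites the SSA,
    destroying the marker value stored there."""
    ssa["marker"] = "clobbered-by-state-save"

ssa_a, ssa_b = {}, {}
write_markers(ssa_a, ssa_b, nonce=0xA1)
clean = check_markers(ssa_a, ssa_b, nonce=0xA1)        # no AEX: passes

write_markers(ssa_a, ssa_b, nonce=0xB2)
aex(ssa_b)                                             # interrupt hits the shadow thread
compromised = not check_markers(ssa_a, ssa_b, nonce=0xB2)
```

This is why only four memory operations per window suffice: the hardware itself destroys the markers on any AEX, so no periodic polling is needed.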

Overhead for defeating both same-core and cross-core side-channel attacks. To defeat both same-core and cross-core side-channel attacks, all the resource reservation contracts and runtime interaction contracts (i.e., Hyper-Threading Control, LLC Reservation, Exception Regulation, and Interrupt Disabling) need to be enforced and verified. As shown in


Figure 6.6: The running time of the SPEC CPU2006 benchmark suite under various CAT settings, on a CPU with a 13.75MB, 11-way set-associative LLC. Each way can be assigned to a specific set of cores.

Figure 6.5, the performance overhead ranges from 10.7% to 55.8%, with a geometric mean of 29.1%.5

In sum, adopting the Hyper-Threading Control contract and the AEX-free execution window to defeat same-core side-channel attacks slightly improves performance due to the elimination of expensive context switches. When combined with LLC Reservation to defeat both same-core and cross-core side-channel attacks, an overhead with a geometric mean of 29.1% is incurred.

5In the idea benchmark, one frequently called function, mul, is inlined to reduce the overhead of setting the memory range of TSX transactions for verifying the LLC Reservation contract. Similar changes were made to the SetCompBit and GetCompBit functions of huffman.

Impact on the entire OS. Intuitively, applying execution contracts will have negative impacts on the performance of the entire OS and other applications sharing the OS. We believe

the impact is modest. First, Interrupt Disabling and Exception Regulation contracts do not

affect the execution of the programs on other cores. Although the Exception Regulation contract prioritizes the enclave's memory usage, the performance gain of the enclave program compensates for the performance loss of others. The Hyper-Threading Control contract reduces the performance of the entire machine only if it is fulfilled by disabling Hyper-Threading entirely (e.g., by 30%). However, compared to disabling Hyper-Threading completely, reserving the sibling hyper-thread only while the enclave runs significantly improves the system performance.

To evaluate the effect of the LLC Reservation contract on other concurrently executed

processes, we ran the SPEC CPU2006 benchmark suite [7] and measured the running time.

The results are shown in Figure 6.6. We can see that the performance of applications like

hmmer, sjeng and h264ref stays almost the same even when a large portion of the LLC

(e.g., 10/11) is reserved for enclaves. The performance of the other applications is almost

unaffected by the LLC Reservation contract when a small portion of the LLC (e.g., 4/11) is

reserved. As more LLC ways are reserved, the performance of these applications starts

to degrade, but not significantly. As such, we anticipate that enforcing these execution

contracts still preserves the performance and usability of the entire system.
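The way-partitioning behind Figure 6.6 can be expressed as CAT capacity bitmasks. The helper below only computes the masks (Intel CAT requires each mask to be a contiguous run of bits); actually programming them into the hardware, e.g., through the Linux resctrl interface, is not shown, and the function name is ours.

```python
def cat_masks(total_ways: int, enclave_ways: int):
    """Split an LLC with `total_ways` ways into two contiguous,
    non-overlapping CAT capacity bitmasks: one covering the ways
    reserved for enclave cores, the other covering the rest."""
    assert 0 < enclave_ways < total_ways
    # Enclave ways occupy the high end of the mask, the rest the low end,
    # so both masks are contiguous as CAT requires.
    enclave_mask = ((1 << enclave_ways) - 1) << (total_ways - enclave_ways)
    others_mask = (1 << (total_ways - enclave_ways)) - 1
    return enclave_mask, others_mask

# Reserving 4 of the 11 ways for enclaves, one setting from Figure 6.6:
enclave_mask, others_mask = cat_masks(11, 4)
```

Because the two masks never overlap, enclave and non-enclave cores never share LLC ways, which is what makes eviction-based cross-core probing infeasible under the contract.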

6.6 Execution Contracts without Memory Confidentiality

The recently disclosed Meltdown and Spectre attacks enable a malicious program to

read memory content outside its security domain (e.g., userspace reading kernel data). Their variants, SGXPECTRE (Chapter 4) and Foreshadow [83], specifically target Intel SGX to read

enclave memory content, completely breaking the confidentiality guarantee of SGX. Though

microcode patches have been released to mitigate the SGXPECTRE and Foreshadow attacks, the fix does not remove the root cause of the vulnerabilities: speculatively executing instructions beyond security boundary checks and legitimate control flows. Therefore, variants of such attacks may be discovered in the future. In this section, we further assume that SGX's confidentiality guarantee for memory content may be broken, and particularly consider how to leverage verifiable execution contracts to secure enclaves.

6.6.1 Threat Analysis

We first analyze the SGXPECTRE and Foreshadow attacks. SGXPECTRE abuses the BPU to force the victim enclave to speculatively execute secret-leaking gadgets. One key requirement here is to pollute the BPU and flush the valid branch targets. In the attack, the adversary interrupts the execution of an enclave thread to prepare the attack environment.

The Foreshadow attack targets enclave secrets that reside in the L1 cache. Hence, the adversary has to either occupy the sibling hyper-thread for concurrent exploitation or use EWB and ELD to load the enclave data into the L1 cache of another physical core controlled by the adversary.

Given the above discussion, we assume that, to compromise the confidentiality of an enclave by exploiting speculative-execution-based vulnerabilities (e.g., the SGXPECTRE and Foreshadow attacks), an adversary must meet one of the following two requirements:

• R1: Hyper-Threading is enabled on the target CPU and the malicious code is able to

execute on the same physical core concurrently as the victim enclave.

• R2: For cross-core attacks, the adversary needs to explicitly interfere with the execution of the enclave, either by directly interrupting its execution (e.g., via page faults or scheduling interrupts), or by means that could potentially trigger interrupts, such as swapping out the enclave's pages, which might trigger page faults.

Nevertheless, if the adversary could read the LLC content without introducing any side effect to the enclave, or learn some secrets from the states left after the enclave completes its execution, our solution would fail. For example, [61] provided a proof-of-concept attack where the preparation could be done completely before the enclave's execution, and secrets could be learned after the execution finishes normally. However, this example is specially crafted: the enclave function uses a memory address outside the ELRANGE (which is thus completely under the control of the adversary) in a branch condition computation. We anticipate that such poor designs should be avoided during development.

6.6.2 Defeating Memory Leaks with Execution Contracts

6.6.2.1 Confidentiality during AEX-free execution window

With the proposed resource reservation contracts and runtime interaction contracts, we can see that the adversary could not learn secrets within the AEX-free execution window without being detected:

• For R1, the Hyper-Threading Control contract occupies the sibling hyper-thread and thus prevents the adversary from launching attacks from the sibling hyper-thread to learn the enclave's secrets concurrently without being detected.

• For R2, the AEX-free execution window prevents the adversary from performing cross-core attacks. Direct interrupts will be captured. Page swapping could also be detected with high probability if the pages containing secrets are accessed frequently enough, which should be easy since EWB and ELD need to load a whole page, i.e., 4096 bytes, into the L1 cache.

However, the enclave memory content during the gaps between AEX-free execution windows is not protected. For example, after remote attestation, a shared ECDH key used to protect the communication between the remote party and the attested enclave will be generated and stored in the enclave memory. Before the invocation of the next ECall function, the shared ECDH key will be a natural target. In SGX2, a page fault due to dynamic memory allocation (which breaks the execution into two AEX-free execution windows) might allow the adversary to peek into the enclave memory while the secrets are being processed.

6.6.2.2 Bridging the gaps

To address these gaps that are not covered by our proposed verifiable execution contracts, we propose to leverage the sealing mechanism to protect secrets. The key observation is that the seal key is generated within the CPU core and thus there is no need to store it in memory. Hence we could introduce sealing-protected gaps:

• When new secrets are derived within the AEX-free execution window, these secrets will be sealed and the seal key will be cleared after the sealing process.

• When the enclave needs to access existing (sealed) secrets within the AEX-free execution

window, the sealed secrets will be unsealed, processed and then re-sealed again.

Note that each time a new seal key (with a new KeyID) should be used, to prevent the adversary from decrypting the secrets using a previously leaked seal key.
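The sealing-protected gap can be sketched as follows. A toy key derivation and a toy XOR stream cipher stand in for EGETKEY and the SGX SDK's AES-GCM sealing; `root_key`, the function names, and the key-clearing steps are all illustrative, not the real SGX interfaces.

```python
import hashlib
import os

def derive_seal_key(root_key: bytes, key_id: bytes) -> bytes:
    """Toy stand-in for EGETKEY: derive a per-KeyID seal key."""
    return hashlib.sha256(root_key + key_id).digest()

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (its own inverse); real sealing uses AES-GCM."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "little")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def seal(root_key: bytes, secret: bytes):
    key_id = os.urandom(16)                 # fresh KeyID on every sealing
    seal_key = derive_seal_key(root_key, key_id)
    blob = toy_cipher(seal_key, secret)
    seal_key = None                         # clear the key right after sealing
    return key_id, blob

def unseal(root_key: bytes, key_id: bytes, blob: bytes) -> bytes:
    seal_key = derive_seal_key(root_key, key_id)
    secret = toy_cipher(seal_key, blob)
    seal_key = None                         # clear the key right after use
    return secret

root = b"per-enclave root (illustrative)"
key_id, blob = seal(root, b"shared ECDH key")
# Inside the next AEX-free window: unseal, use, then re-seal afresh.
secret = unseal(root, key_id, blob)
new_key_id, new_blob = seal(root, secret)   # fresh KeyID, fresh ciphertext
```

Because a fresh KeyID is drawn on every sealing, a seal key leaked from one gap cannot decrypt blobs produced in later gaps.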

Hence, it is safe to conclude: if at the end of an AEX-free execution window, no AEX is detected, the confidentiality of the secrets is guaranteed.

However, while the enclave ensures it has processed the secrets during the execution without being attacked, it is still challenging to convince the remote party that the execution results are trustworthy, because the local attestation and remote attestation mechanisms could both be compromised by the adversary. The adversary could learn the report keys (which are used to sign reports for local attestation) and/or the EPID member private keys (which are used to sign the quotes for remote attestation). Even if the enclaves could detect that they are attacked, with the leaked keys the adversary could fake any enclave and thus any result.

6.6.3 Microcode-Level Mitigation

While the current SGX design seems unable to address these problems, we propose solutions under the assumption that Intel could extend the functionality of SGX instructions to protect the attestation keys of both the local attestation and remote attestation processes, thereby enabling the remote party to detect attacks using our proposed verifiable execution contracts. We propose to secure local attestation and remote attestation separately.

Securing local attestation. We first deal with local attestation. When an attested enclave tries to attest itself to a target enclave, it calls EREPORT to sign its attestation data. The EREPORT instruction derives the report key of the target enclave and signs the report data. Note that during this process, the report key is not exposed to memory. However, when the target enclave tries to verify the attestation data, it uses EGETKEY to derive the report key to verify the signature. In this procedure, the report key is exported to the target enclave's memory, offering the adversary a window to steal the report key. Once the adversary obtains the report key of a target enclave, she could pretend to be any enclave to gain the trust of the target enclave. For example, with the report key of Intel's quoting enclave, a malicious program could pretend to be any valid enclave to deceive the quoting enclave and pass remote attestation.

To address this problem, we propose to introduce a new SGX leaf instruction called EVERIFY, which takes as input a report and the corresponding report signature, derives the report key, verifies the report signature, and outputs the verification result (e.g., a single bit indicating success or failure). EVERIFY should be used in place of EGETKEY for report verification. The extra

overhead to implement EVERIFY should be low, because the main components, i.e., deriving

report key and generating the CMAC of the report, are already implemented in EREPORT.

With the introduction of EVERIFY, the report keys will never be exported to the enclave’s

memory and the local attestation can be protected.
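The intended semantics of the proposed EVERIFY leaf can be sketched as follows, with HMAC-SHA256 standing in for the AES-CMAC that EREPORT actually uses. The essential point this models is that verification exposes only a pass/fail bit, so the report key never appears in enclave memory; all names are illustrative.

```python
import hashlib
import hmac

def ereport(target_report_key: bytes, report: bytes) -> bytes:
    """Stand-in for EREPORT: MAC the report under the target enclave's
    report key (derived inside the CPU in real SGX)."""
    return hmac.new(target_report_key, report, hashlib.sha256).digest()

def everify(own_report_key: bytes, report: bytes, tag: bytes) -> bool:
    """Sketch of the proposed EVERIFY leaf: re-derive the report key,
    recompute the MAC, and expose only a single success/failure bit."""
    expected = hmac.new(own_report_key, report, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

report_key = b"report key derived in-CPU (illustrative)"
report = b"MRENCLAVE || ATTRIBUTES || report_data"
tag = ereport(report_key, report)
```

In real hardware both the derivation and the comparison would happen inside the instruction, so software never handles the key material at all.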

Securing remote attestation. In the current SGX remote attestation design, the attestation key is exposed in Intel's quoting enclave and used to sign the attestation data, which gives the adversary a chance to extract the attestation key. To address the problem, we propose to

extend the functionality of EREPORT (or introduce a new instruction) to allow the derivation

of a special report key using the root provisioning key only, to sign the attestation data. In

this way, the attestation key will never be exposed to the enclave memory. The drawback

is that the verifier needs the same attestation key for verification. Intel needs to provide a

service for such verification. Moreover, the privacy guarantee provided to SGX platforms by the EPID scheme will be lost. Further, EREPORT uses CMAC, which is based on symmetric keys, for message authentication, so Intel has to perform the verification process on behalf of

the clients because the verifying key (which is also the signing key) cannot be revealed.

Whether it is feasible to implement an asymmetric message authentication code, or even an

EPID-like group signature using microcode is beyond the scope of this chapter.

With the above extensions securing both local and remote attestation, we can mitigate

the attacks that can read the entire memory. Here is an example enclave application:

• The ISV client generates an ECDH key pair, sends the ECDH public key and a nonce to

the SGX platform.

• The SGX platform launches the ISV enclave; the ISV enclave generates an ECDH key pair and derives the shared ECDH key, within an AEX-free execution window. Sealing-

protected gaps are adopted to protect these secrets. If no AEX is detected, the enclave

produces a report with its ECDH public key and nonce (encrypted using the shared key)

as report data using EREPORT.

• The report is passed to Intel’s quoting enclave and verified by EVERIFY. The quoting

enclave then signs the report via EREPORT using the report key derived from the root provi-

sioning key.

• The signed report is then returned to the client, and forwarded to Intel for verification. A

secure communication channel is created between the client and the enclave.

• If the verification passes, the client could then send secrets to be processed encrypted

using the shared keys to the enclave.

• The encrypted inputs are decrypted and processed, and the results are encrypted, within

the AEX-free execution window. Note that the shared ECDH key is unsealed within the

AEX-free execution window for decryption and encryption, and then re-sealed after these

operations. If no AEX is detected, the ISV enclave produces a report for the encrypted

results. The report is then signed by Intel’s quoting enclave and returned to the client.

• After verification by Intel, the client could assure that the results are reliable and the

secrets are not leaked.
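The key-agreement core of this protocol can be sketched with a toy Diffie-Hellman exchange. The real protocol uses ECDH; the small prime-field group below is purely illustrative and not cryptographically sound.

```python
import secrets

# Toy DH group standing in for the ECDH curve; NOT secure parameters.
P = 2**127 - 1     # a Mersenne prime, used here only for illustration
G = 3

def keypair():
    """Generate a (private, public) pair in the toy group."""
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

client_priv, client_pub = keypair()     # done by the ISV client
enclave_priv, enclave_pub = keypair()   # done inside the AEX-free window

# Each side derives the shared key from the other's public value;
# only the public values ever cross the untrusted channel.
shared_client = pow(enclave_pub, client_priv, P)
shared_enclave = pow(client_pub, enclave_priv, P)
```

In the protocol above, `enclave_priv` and the derived shared key are exactly the secrets that must live only inside AEX-free windows or sealing-protected gaps.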

6.6.4 Preventing Replay Attacks

While replay attacks can be addressed by trusted monotonic counters, in current SGX design, the SGX platform trusted service itself is vulnerable if memory confidentiality is breached:

• Seal keys of the ISV enclave might be obtained by the adversary to forge valid sealed

data with old state and an updated counter value to bypass the monotonic counter check.

• The adversary might attack the platform service enclave (PSE) that provides the trusted

monotonic timer service, to get the credentials for communication with the Intel Con-

verged Security and Management Engine (CSME), which manages the replay-protected

storage. In this way, the adversary could manipulate the monotonic counter values to

launch replay attacks.
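The first bullet can be demonstrated concretely: with a leaked seal key, the adversary can re-seal stale state under the current counter value, and the monotonic-counter check passes. Sealing is modeled below as a simple MAC over state plus counter; the real sealed-blob format differs, and all names are illustrative.

```python
import hashlib
import hmac

def seal_state(seal_key: bytes, state: bytes, counter: int) -> bytes:
    """Toy sealed blob: state || counter, authenticated with the seal key."""
    body = state + counter.to_bytes(8, "little")
    return body + hmac.new(seal_key, body, hashlib.sha256).digest()

def load_state(seal_key: bytes, blob: bytes, trusted_counter: int):
    """Accept the blob only if the MAC verifies and the embedded counter
    matches the trusted monotonic counter; return the state or None."""
    body, tag = blob[:-32], blob[-32:]
    if not hmac.compare_digest(hmac.new(seal_key, body, hashlib.sha256).digest(), tag):
        return None
    counter = int.from_bytes(body[-8:], "little")
    return body[:-8] if counter == trusted_counter else None

key = b"ISV seal key (illustrative)"
current = seal_state(key, b"balance=10", 2)   # legitimate state at counter 2

# Adversary with the leaked seal key re-seals OLD state under counter 2:
forged = seal_state(key, b"balance=99", 2)
```

The forged blob carries the current counter value, so the freshness check alone cannot reject it; this is why the seal key itself must be protected, as argued above.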

With the aforementioned extensions of SGX instructions, such replay attacks can be mitigated as follows:

• The communication between the ISV enclave and the PSE can be protected using the

extended local attestation.

• The CSME can be viewed as a remote client of the PSE enclave: when the PSE requests to update the monotonic counter database status in the CSME, extended remote attestation needs to be performed to guarantee that the update request is generated by a PSE that is not under attack. Further, the remote attestation key could be provisioned into the CSME, so that subsequent remote attestations could be avoided.

6.7 Discussion

Limitations. Our current design cannot fully address DRAM side channels [65]. The LLC Reservation contract can partially mitigate DRAM attacks, as the most effective DRAM attacks [87] require proactively evicting cache lines used by enclaves out of the LLC. However, when the enclave code has a memory footprint larger than the LLC, the adversary could passively learn the DRAM accesses, as cache lines used by the enclave will be self-evicted. Another limitation is that our current LLC Reservation design covers only code pages. The same verification method can be applied to read-only data pages directly. Dealing with writable

data pages is more challenging, as TSX transactions will abort when the enclave writes to

the same page from the sibling hyper-thread.

AMD SEV. While our work is driven by the protection of Intel SGX, similar concepts can be applied to AMD's Secure Encrypted Virtualization (SEV). The expected challenges are due to the differences between the abstraction interfaces that the untrusted system software provides to SGX- and SEV-protected TEE software. As SEV is still in its early stage, we

leave the study of execution contracts for SEV to future work.

Security without enclave confidentiality. The recently discovered SgxPectre attacks [28]

and Foreshadow attacks [83] could read the entire enclave memory content, rendering

all software solutions to side-channel threats ineffective. While these issues have been

temporarily addressed, concerns about potential vulnerabilities remain. Section 6.6 discusses

a potential solution that defeats these attacks using the concept of verifiable execution

contracts.

6.8 Summary

In this chapter, we proposed the concept of verifiable execution contracts, which define a contract between the OS and the enclaves describing a guaranteed execution environment that prevents side-channel attacks and OS-dependent attacks. We designed three types of execution contracts, implemented prototypes for enforcing them in the OS kernel, derived methods for the enclaves to verify compliance with them, and analyzed how existing attacks could be thwarted under these contracts.

Chapter 7: Conclusion

As a new generation of hardware support for scalable trusted execution environments

(TEE), Intel SGX has attracted much attention from both academia and industry. However,

its security promises have been questioned since its emergence. In this dissertation, we

explore one particular type of its security problems, i.e., side-channel attacks, and conclude

that current understanding of the attack vectors and the corresponding countermeasures is

insufficient as the adversary could exploit various hardware features and vulnerabilities.

This conclusion is evidenced by the demonstration of (1) a new memory-based attack, called

Hyper-Threading assisted sneaky page monitoring attacks, which does not trigger any AEX

directly, thus bypassing existing AEX-based detection schemes; and (2) a Spectre-like

attack, called SGXPECTRE Attacks, which compromises the confidentiality of SGX enclaves

completely by exploiting speculative execution vulnerabilities. On the other hand, we aim

to mitigate existing attacks and new attacks through the design and implementation of (1) an LLVM-

based tool, HYPERRACE, to automatically instrument SGX enclave programs, protecting

them from all Hyper-Threading side channels; and (2) a new concept, verifiable execution

contracts, that defines a contract requesting the OS to provide a guaranteed execution

environment for the enclaves, within which launching attacks against enclaves becomes

infeasible.

There are several research problems to be addressed in future research. First, as already demonstrated by the two new side-channel attacks introduced in this dissertation, we believe that partially restricting the privileged adversary's behavior is necessary to achieve comprehensive protection. Verifiable execution contracts have demonstrated the potential to address various side-channel attacks without significant performance overhead. However, in our current design, extra verifiable execution contracts are needed to completely address

LLC and DRAM attacks. Second, as demonstrated by the SGXPECTRE Attacks, no software-only solution is feasible before a hardware patch. While hardware patches for specific attacks always have a long latency, exploring hardware modifications that could facilitate software solutions to future attacks is important. Particularly, when secrets exposed in enclave memory might be leaked by future attacks, designing software solutions (with possible microcode support) to detect such leakage would be significant.

Bibliography

[1] Clang: a C language family frontend for LLVM. http://clang.llvm.org/.

[2] INTEL-SA-00161. https://www.intel.com/content/www/us/en/ security-center/advisory/intel-sa-00161.html?wapkw=intel-sa-00161.

[3] Intel SGX SDK. https://github.com/intel/linux-sgx.

[4] Intel software guard extensions SSL. https://github.com/intel/ intel-sgx-ssl.

[5] The LLVM compiler infrastructure. https://llvm.org/.

[6] Nbench-byte benchmarks. http://www.math.cmu.edu/~florin/bench-32-64/ nbench/.

[7] SPEC CPU 2006. https://www.spec.org/cpu2006/.

[8] Intel software guard extensions programming reference. https://software.intel. com/sites/default/files/managed/48/88/329298-002.pdf/, 2014. Order Number: 329298-002, October 2014.

[9] ECS bare metal instance: Elastic & scalable physical servers - alibaba cloud, 2017. https://www.alibabacloud.com/product/ebm.

[10] Graphene / Graphene-SGX Library OS - a library OS for Linux multi-process applica- tions, with Intel SGX support. https://github.com/oscarlab/graphene/, 2017. Accessed May 16, 2017.

[11] Intel 64 and IA-32 architectures software developer’s manual, combined vol- umes:1,2A,2B,2C,3A,3B,3C and 3D. https://software.intel.com/sites/ default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf, 2017. Order Number: 325462-063US, July 2017.

[12] Intel software guard extensions developer guide. https://download.01.org/ intel-sgx/linux-2.0/docs/Intel_SGX_Developer_Guide.pdf, 2017. Intel SGX Linux 2.0 Release.

[13] Intel analysis of speculative execution side channels, 2018. Revision 1.0, January 2018.

[14] Intel developer zone: Forums. https://software.intel.com/en-us/forum, 2018.

[15] Intel software guard extensions developer reference for Linux* OS. https://download.01.org/intel-sgx/linux-2.4/docs/Intel_SGX_ Developer_Reference_Linux_2.4_Open_Source.pdf, 2018.

[16] Speculative execution side channel mitigations. http://kib.kiev.ua/x86docs/ SDMs/336996-001.pdf, 2018. Revision 1.0, January 2018.

[17] Onur Aciiçmez. Yet another microarchitectural attack: exploiting I-Cache. In 2007 ACM workshop on Computer security architecture, pages 11–18, 2007.

[18] Onur Aciiçmez, Billy Bob Brumley, and Philipp Grabher. New results on instruction cache attacks. In 12th international conference on Cryptographic hardware and embedded systems, pages 110–124, 2010.

[19] Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. Predicting secret keys via branch prediction. In 7th Cryptographers’ track at the RSA conference on Topics in Cryptology, pages 225–242, 2007.

[20] Onur Aciicmez and Jean-Pierre Seifert. Cheap hardware parallelism implies cheap security. In Workshop on Fault Diagnosis and Tolerance in Cryptography, pages 80–91, 2007.

[21] Ahmad Moghimi, Thomas Eisenbarth, and Berk Sunar. MemJam: A false dependency attack against constant-time crypto implementations. arXiv:1711.08002, 2017. https://arxiv.org/abs/1711.08002.

[22] Ittai Anati, Shay Gueron, Simon P Johnson, and Vincent R Scarlata. Innovative technology for cpu based attestation and sealing. In 2nd International Workshop on Hardware and Architectural Support for Security and Privacy. ACM, 2013.

[23] Andy Glew, Glenn Hinton, and Haitham Akkary. Method and apparatus for performing page table walks in a processor capable of processing speculative instructions. US Patent 5680565 A, 1997.

[24] Sergei Arnautov, Bohdan Trach, Franz Gregor, Thomas Knauth, Andre Martin, Chris- tian Priebe, Joshua Lind, Divya Muthukumaran, Dan O’Keeffe, Mark L. Stillwell, David Goltzsche, Dave Eyers, Rüdiger Kapitza, Peter Pietzuch, and Christof Fetzer. Scone: Secure linux containers with intel SGX. In 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 2016.

[25] A. Baumann, M. Peinado, and G. Hunt. Shielding applications from an untrusted cloud with Haven. ACM Transactions on Computer Systems, 33(3), August 2015.

[26] Ferdinand Brasser, Urs Müller, Alexandra Dmitrienko, Kari Kostiainen, Srdjan Capkun, and Ahmad-Reza Sadeghi. Software grand exposure: SGX cache attacks are practical. In USENIX Workshop on Offensive Technologies, 2017.

[27] Chandler Carruth. Retpoline patch for LLVM. https://reviews.llvm.org/ D41723, 2018.

[28] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre Attacks: Stealing Intel secrets from SGX enclaves via speculative execution. In 4th IEEE European Symposium on Security and Privacy (EuroS&P), 2019.

[29] Guoxing Chen, Wenhao Wang, Tianyu Chen, Sanchuan Chen, Yinqian Zhang, Xi- aoFeng Wang, Ten H. Lai, and Dongdai Lin. Racing in hyperspace: Closing hyper- threading side channels on sgx with contrived data races. In 2018 IEEE Symposium on Security and Privacy (SP), volume 00, pages 388–404.

[30] Sanchuan Chen, Xiaokuan Zhang, Michael Reiter, and Yinqian Zhang. Detecting privileged side-channel attacks in shielded execution with DEJA VU. In 12th ACM Symposium on Information, Computer and Communications Security, 2017.

[31] Muntaquim F Chowdhury and Douglas M Carmean. Method, apparatus, and system for maintaining processor ordering by checking load addresses of unretired load instructions against snooping store addresses, November 19 2002. US Patent 6,484,254.

[32] Victor Costan, Ilia Lebedev, and Srinivas Devadas. Sanctum: Minimal hardware exten- sions for strong software isolation. In 25th USENIX Security Symposium. USENIX Association, 2016.

[33] Yu Ding, Ran Duan, Long Li, Yueqiang Cheng, Yulong Zhang, Tanghui Chen, Tao Wei, and Huibo Wang. Poster: Rust SGX SDK: Towards memory safety in Intel SGX enclave. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 2491–2493, New York, NY, USA, 2017. ACM.

[34] Agner Fog. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering, 2017.

[35] Johannes Götzfried, Moritz Eckert, Sebastian Schinzel, and Tilo Müller. Cache attacks on Intel SGX. In EUROSEC, 2017.

[36] Daniel Gruss, Julian Lettner, Felix Schuster, Olya Ohrimenko, Istvan Haller, and Manuel Costa. Strong and efficient cache side-channel protection using hardware transactional memory. In USENIX Security Symposium, 2017.

[37] Daniel Gruss, Raphael Spreitzer, and Stefan Mangard. Cache template attacks: Au- tomating attacks on inclusive last-level caches. In USENIX Security Symposium, pages 897–912, 2015.

[38] Marcus Hähnel, Weidong Cui, and Marcus Peinado. High-resolution side channels for untrusted operating systems. In USENIX Annual Technical Conference, pages 299–312, 2017.

[39] Matthew Hoekstra, Reshma Lal, Pradeep Pappachan, Vinay Phegade, and Juan Del Cuvillo. Using innovative instructions to create trustworthy software solutions. In 2nd International Workshop on Hardware and Architectural Support for Security and Privacy. ACM, 2013.

[40] Jann Horn. Reading privileged memory with a side- channel. https://googleprojectzero.blogspot.com/2018/01/ reading-privileged-memory-with-side.html, 2018.

[41] Tyler Hunt, Zhiting Zhu, Yuanzhong Xu, Simon Peter, and Emmett Witchel. Ryoan: A distributed sandbox for untrusted computation on secret data. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 2016.

[42] Intel. Method and apparatus for implementing a speculative return stack buffer. US5964868, 1999.

[43] Intel. Method and apparatus for predicting target addresses for return from instructions utilizing a return address cache. US Patent, Intel Corporation, US6170054, 2001.

[44] Intel. Return address predictor that uses branch instructions to track a last valid return address. US Patent, Intel Corporation, US6253315, 2001.

[45] Intel. System and method of maintaining and utilizing multiple return stack buffers. US Patent, Intel Corporation, US6374350, 2002.

[46] Intel. Return register stack target predictor. US Patent, Intel Corporation, US6560696, 2003.

[47] Intel. Attestation service for intel software guard extensions (Intel SGX): API doc- umentation. https://software.intel.com/sites/default/files/managed/ 7e/3b/ias--spec.pdf, 2018.

[48] Simon Johnson, Vinnie Scarlata, Carlos Rozas, Ernie Brickell, and Frank Mckeen. Intel Software Guard Extensions: EPID Provisioning and Attestation Services. Technical report, Intel, Tech. Rep, 2016.

[49] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 23(2):66–76, March 2003.

[50] James C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385– 394, July 1976.

[51] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative execution. In 40th IEEE Symposium on Security and Privacy (S&P'19), 2019.

[52] Dmitrii Kuvaiskii, Oleksii Oleksenko, Sergei Arnautov, Bohdan Trach, Pramod Bha- totia, Pascal Felber, and Christof Fetzer. Sgxbounds: Memory safety for shielded execution. In 12th European Conference on Computer Systems. ACM, 2017.

[53] Sangho Lee, Ming-Wei Shih, Prasun Gera, Taesoo Kim, Hyesoon Kim, and Mar- cus Peinado. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. In 26th USENIX Security Symposium, pages 557–574, 2017.

[54] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium, 2018.

[55] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee. Last-level cache side-channel attacks are practical. In 36th IEEE Symposium on Security and Privacy, May 2015.

[56] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B. Lee. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pages 406–418. IEEE, 2016.

[57] Sinisa Matetic, Kari Kostiainen, Aritra Dhar, David Sommer, Mansoor Ahmed, Arthur Gervais, Ari Juels, and Srdjan Capkun. ROTE: Rollback protection for trusted execution. Cryptology ePrint Archive, Report 2017/048, 2017. http://eprint.iacr.org/2017/048.pdf.

[58] Abdelhafid Mazouz, Alexandre Laurent, Benoît Pradelle, and William Jalby. Evaluation of CPU frequency transition latency. Computer Science - Research and Development, 29(3):187–195, Aug 2014.

[59] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos Rozas, Hisham Shafi, Vedvyas Shanbhogue, and Uday Savagaonkar. Innovative instructions and software model for isolated execution. In 2nd International Workshop on Hardware and Architectural Support for Security and Privacy. ACM, 2013.

[60] Olga Ohrimenko, Felix Schuster, Cedric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. Oblivious multi-party machine learning on trusted processors. In 25th USENIX Security Symposium. USENIX Association, 2016.

[61] Dan O’Keeffe, Divya Muthukumaran, Pierre-Louis Aublin, Florian Kelbert, Christian Priebe, Josh Lind, Huanzhou Zhu, and Peter Pietzuch. Sgxspectre. https://github.com/lsds/spectre-attack-sgx, 2018.

[62] Oleksii Oleksenko, Bohdan Trach, Robert Krahn, Mark Silberstein, and Christof Fetzer. Varys: Protecting SGX enclaves from practical side-channel attacks. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 227–240, Boston, MA, 2018. USENIX Association.

[63] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: the case of AES. In 6th Cryptographers’ track at the RSA conference on Topics in Cryptology, pages 1–20, 2006.

[64] Colin Percival. Cache missing for fun and profit. In 2005 BSDCan, 2005.

[65] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. DRAMA: Exploiting DRAM addressing for cross-CPU attacks. In the 25th USENIX Security Symposium, 2016.

[66] Mark Russinovich. Introducing Azure confidential computing, 2017. https://azure.microsoft.com/en-us/blog/introducing-azure-confidential-computing/.

[67] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. Mainar-Ruiz, and M. Russinovich. VC3: Trustworthy data analytics in the cloud using SGX. In 36th IEEE Symposium on Security and Privacy, 2015.

[68] Michael Schwarz, Samuel Weiser, Daniel Gruss, Clémentine Maurice, and Stefan Mangard. Malware guard extension: Using SGX to conceal cache attacks. In Michalis Polychronakis and Michael Meier, editors, Detection of Intrusions and Malware, and Vulnerability Assessment: 14th International Conference, DIMVA 2017, Bonn, Germany, July 6-7, 2017, Proceedings. Springer International Publishing, 2017.

[69] Jaebaek Seo, Byoungyoung Lee, Seongmin Kim, Ming-Wei Shih, Insik Shin, Dongsu Han, and Taesoo Kim. SGX-Shield: Enabling address space layout randomization for SGX programs. In The Network and Distributed System Security Symposium, 2017.

[70] Hovav Shacham. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In 14th ACM Conference on Computer and Communications Security, 2007.

[71] Ming-Wei Shih, Sangho Lee, Taesoo Kim, and Marcus Peinado. T-SGX: Eradicating controlled-channel attacks against enclave programs. In Proceedings of the 2017 Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2017.

[72] Shweta Shinde, Zheng Leong Chua, Viswesh Narayanan, and Prateek Saxena. Preventing page faults from telling your secrets. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 317–328. ACM, 2016.

[73] Shweta Shinde, Dat Le Tien, Shruti Tople, and Prateek Saxena. Panoply: Low-TCB Linux applications with SGX enclaves. In The Network and Distributed System Security Symposium, 2017.

[74] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Audrey Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy, 2016.

[75] Raoul Strackx and Frank Piessens. Ariadne: A minimal approach to state continuity. In 25th USENIX Security Symposium. USENIX Association, 2016.

[76] Dean Sullivan, Orlando Arias, Travis Meade, and Yier Jin. Microarchitectural minefields: 4k-aliasing covert channel and multi-tenant detection in IaaS clouds. In Network and Distributed Systems Security (NDSS) Symposium, 2018.

[77] Sandeep Tamrakar, Jian Liu, Andrew Paverd, Jan-Erik Ekberg, Benny Pinkas, and N. Asokan. The circle game: Scalable private membership test using trusted hardware. In ACM on Asia Conference on Computer and Communications Security. ACM, 2017.

[78] Florian Tramer, Fan Zhang, Huang Lin, Jean-Pierre Hubaux, Ari Juels, and Elaine Shi. Sealed-glass proofs: Using transparent enclaves to prove and sell knowledge. Cryptology ePrint Archive, Report 2016/635, 2016. https://eprint.iacr.org/2016/635.

[79] Eran Tromer, Dag Arne Osvik, and Adi Shamir. Efficient cache attacks on AES, and countermeasures. J. Cryptol., 23(2):37–71, January 2010.

[80] Chia-Che Tsai, Donald E. Porter, and Mona Vij. Graphene-SGX: A practical library OS for unmodified applications on SGX. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 645–658, Santa Clara, CA, 2017. USENIX Association.

[81] Paul Turner. Retpoline: a software construct for preventing branch-target-injection. https://support.google.com/faqs/answer/7625886, 2018.

[82] D. Tychalas, N. G. Tsoutsos, and M. Maniatakos. SGXCrypter: IP protection for portable executables using Intel's SGX technology. In 22nd Asia and South Pacific Design Automation Conference, 2017.

[83] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution. In 27th USENIX Security Symposium (USENIX Security 18), pages 991–1008, Baltimore, MD, 2018. USENIX Association.

[84] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A practical attack framework for precise enclave execution control. In Proceedings of the 2nd Workshop on System Software for Trusted Execution, SysTEX'17, pages 4:1–4:6, New York, NY, USA, 2017. ACM.

[85] Jo Van Bulck, Nico Weichbrodt, Rüdiger Kapitza, Frank Piessens, and Raoul Strackx. Telling your secrets without page faults: Stealthy page table-based attacks on enclaved execution. In Proceedings of the 26th USENIX Security Symposium. USENIX Association, 2017.

[86] David Wagner and Paolo Soto. Mimicry attacks on host-based intrusion detection systems. In Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS '02, pages 255–264, New York, NY, USA, 2002. ACM.

[87] Wenhao Wang, Guoxing Chen, Xiaorui Pan, Yinqian Zhang, XiaoFeng Wang, Vincent Bindschaedler, Haixu Tang, and Carl A. Gunter. Leaky cauldron on the dark land: Understanding memory side-channel hazards in SGX. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.

[88] Samuel Weiser and Mario Werner. SGXIO: Generic trusted I/O path for Intel SGX. arXiv preprint, arXiv:1701.01061, 2017. https://arxiv.org/abs/1701.01061.

[89] David Woodhouse. Retpoline patch for GCC. http://git.infradead.org/users/dwmw2/gcc-retpoline.git, 2018.

[90] Bin Cedric Xing, Mark Shanahan, and Rebekah Leslie-Hurd. Intel software guard extensions (Intel SGX) software support for dynamic memory allocation inside an enclave. In Proceedings of the Hardware and Architectural Support for Security and Privacy 2016, HASP 2016, pages 11:1–11:9, New York, NY, USA, 2016. ACM.

[91] Yuanzhong Xu, Weidong Cui, and Marcus Peinado. Controlled-channel attacks: Deterministic side channels for untrusted operating systems. In Security and Privacy (SP), 2015 IEEE Symposium on, pages 640–656. IEEE, 2015.

[92] Y. Yarom and K. E. Falkner. FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack. In USENIX Security Symposium, pages 719–732, 2014.

[93] F. Zhang, E. Cecchetti, K. Croman, A. Juels, and E. Shi. Town crier: An authenticated data feed for smart contracts. In 23rd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

[94] Yinqian Zhang, Ari Juels, Alina Oprea, and Michael K. Reiter. HomeAlone: Co- residency detection in the cloud via side-channel analysis. In IEEE Symposium on Security and Privacy, 2011.

[95] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Cross-VM side channels and their use to extract private keys. In ACM Conference on Computer and Communications Security, 2012.

[96] Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gon- zalez, and Ion Stoica. Opaque: An oblivious and encrypted distributed analytics platform. In 14th USENIX Symposium on Networked Systems Design and Implementa- tion. USENIX Association, 2017.

[97] Ziqiao Zhou, Michael K. Reiter, and Yinqian Zhang. A software approach to de- feating side channels in last-level caches. In ACM Conference on Computer and Communications Security, 2016.
