Memory-Based Side-Channel Attacks and Countermeasures

A Dissertation Presented by

Zhen Hang Jiang

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University Boston, Massachusetts

July 2019

To my parents, wife, brother, and sister.

Contents

Acknowledgments

Abstract of the Dissertation

1 Introduction
  1.1 Motivation
  1.2 Existing Memory-Based Side-Channel Attacks and Countermeasures
  1.3 Dissertation Overview
  1.4 Dissertation Contribution

2 Information Leakage in Memory Coalescing Unit
  2.1 Introduction
  2.2 Related Work
  2.3 Background
    2.3.1 GPU Memory Architecture
    2.3.2 AES GPU Implementation
  2.4 Correlation Timing Attack
    2.4.1 SIMT Architecture Leakage
    2.4.2 AES Encryption Leakage
    2.4.3 Correlation Timing Attack on GPU AES Implementation
    2.4.4 Attack on Highly Occupied GPU
    2.4.5 Discussion
  2.5 Countermeasures
  2.6 Summary

3 Information Leakage in Shared Memory Banks
  3.1 Introduction
  3.2 Background
    3.2.1 AES Encryption
    3.2.2 Nvidia GPU
    3.2.3 Single Instruction Multiple Threads Execution Model
  3.3 Threat Model
  3.4 Bank Conflicts-Based Side-Channel Timing Channel
  3.5 Differential Timing Attack

    3.5.1 Mapping Between the AES Lookup Tables and GPU Shared Memory Banks
    3.5.2 Collecting Data
    3.5.3 Calculating the Shared Memory Bank Index
    3.5.4 Recovering Key Bytes
    3.5.5 More Realistic Attack Scenarios
  3.6 Timing Analysis on Other Architectures
  3.7 Discussions and Countermeasures
    3.7.1 Multi-Key Implementation As Countermeasure
  3.8 Summary

4 Information Leakage in L1 Cache Banks
  4.1 Introduction
  4.2 Background
    4.2.1 AES Encryption
    4.2.2 Intel Cache Architecture
    4.2.3 Cache Timing Attacks
    4.2.4 Countermeasures against Cache Timing Attacks
    4.2.5 L1 Cache Bank and CacheBleed Attack
  4.3 Cache Bank Timing
    4.3.1 Threat Model
    4.3.2 The Cache Bank Timing Channel
    4.3.3 Attacking AES Encryption
  4.4 Countermeasures
  4.5 Summary

5 The Countermeasure - MemPoline
  5.1 Introduction
  5.2 Background and Related Work
    5.2.1 Microarchitecture of the Memory Hierarchy
    5.2.2 Data Memory Access Footprint
    5.2.3 Vulnerable Ciphers
  5.3 Threat Model
  5.4 Our Countermeasure - MemPoline
    5.4.1 Design Overview
    5.4.2 Define the Data Structures
    5.4.3 Initialization - Loading Original Sensitive Data
    5.4.4 Epochs of Permuting
    5.4.5 Security Analysis
    5.4.6 Operations Analysis
    5.4.7 Implementation - API
  5.5 Evaluation
    5.5.1 Experimental Setup
    5.5.2 Security Evaluation of AES
    5.5.3 Performance Evaluation
  5.6 Summary

6 Conclusion

Bibliography

Acknowledgments

I would like to express my deepest gratitude to my advisor, Professor Yunsi Fei, and my dissertation committee members, Professors David Kaeli, Adam Ding, and Thomas Wahl, for their invaluable advice and continual support throughout my PhD study at Northeastern University. Finally, my sincere appreciation goes to my wife, for her encouragement and being the consummate partner in all aspects of life, and my parents, brother, and sister, for their unconditional and constant love and support.

Abstract of the Dissertation

Memory-Based Side-Channel Attacks and Countermeasures

by Zhen Hang Jiang

Doctor of Philosophy in Computer Engineering
Northeastern University, July 2019

Dr. Yunsi Fei, Advisor

Recent years have seen various side-channel timing attacks demonstrated on both CPUs and GPUs, in diverse settings such as desktops, clouds, and mobile systems. These attacks observe events on shared resources in the memory hierarchy through timing information, infer the secret-dependent memory access pattern from those events, and finally retrieve the secret through statistical analysis. We generalize these attacks as memory-based side-channel attacks.

In this dissertation, we identify several side-channel vulnerabilities in memory resources on both GPU and CPU platforms, and propose novel side-channel attacks that exploit these vulnerabilities for secret retrieval. Specifically, we examine the memory coalescing unit and the shared memory unit on GPU platforms, and the L1 cache bank on CPU platforms. These microarchitectural resources, indispensable for performance optimization, inadvertently leak an application's memory access pattern. We craft memory-based side-channel attacks to capture such leakage and exploit it to successfully recover the entire 16-byte key of the Advanced Encryption Standard (AES).

As memory-based side-channel attacks are very powerful and many common microarchitectural resources on various systems are vulnerable, defenses against them should be actively sought. Based on the insight that all existing memory-based side-channel attacks (including our proposed ones) exploit the fixed mapping between content and memory resources, we propose a novel software countermeasure, MemPoline, against memory-based side-channel attacks.

MemPoline hides the secret-dependent memory access pattern by moving sensitive data around randomly within a memory space. Although an adversary may still observe events on microarchitectural resources, the randomness prevents her from retrieving useful secret information. We implement efficient permutations directed by parameters, significantly lighter weight than the prior Oblivious RAM technology, yet achieving similar security. The countermeasure only requires changes in the source code, and has the advantages of being general (algorithm-agnostic), portable (independent of the underlying architecture), and compatible (a user-space approach that works with any operating system or hypervisor).

The contributions of this dissertation include the identification of several new memory-based side channels on CPUs and GPUs, which are weaker than the traditional CPU cache side channel but reside on different microarchitectural resources and are therefore orthogonal to cache side-channel countermeasures. The proposed software countermeasure addresses the root cause of memory-based side-channel attacks and effectively protects cryptographic implementations on both CPUs and GPUs against all these memory-based attacks with a minimal performance impact.

Chapter 1

Introduction

This dissertation focuses on memory-based side-channel attacks, which exploit the memory access footprint inferred from observable microarchitectural events, and on countermeasures that prevent these attacks. In this chapter, we start with the motivation for investigating memory-based side-channel attacks beyond the existing work, and then give an overview of the attacks and countermeasures proposed in this dissertation. Finally, we summarize the contributions of this dissertation.

1.1 Motivation

Cryptography plays a crucial role in providing three fundamental security properties, confidentiality, integrity, and authenticity, through various cryptographic functions including encryption, hashing, signing, and authentication. Rather than relying on "security by obscurity," information security relies only on the keys being secret, while the algorithms and even the implementations are open and standardized. Hence, adequately protecting the secret key is critical to delivering the security guarantee. Since the very first successful key-recovery demonstration of Differential Power Analysis (DPA) [1] by Kocher et al., side-channel attacks have changed the notion of "security" for cryptographic algorithms despite their mathematically proven security. Various side channels, including the most common ones of power consumption and electromagnetic (EM) emanation, have been leveraged to break cryptographic engines, such as the Advanced Encryption Standard (AES) and RSA, on many platforms, such as FPGAs [2], ASICs [3], and GPUs [4]. While this type of attack requires physical access to a targeted system to obtain the physical side-channel information, memory-based side-channel attacks can be mounted remotely, presenting a serious cyber threat to cryptographic software, servers, and cloud services.

Memory-based side-channel attacks, which exploit the memory access footprint inferred from observable microarchitectural events, have gained popularity in the side-channel security community and become a serious cyber threat not only to cryptographic implementations but also to general software bearing secrets. For example, researchers have successfully demonstrated recovering a full encryption key [5, 6, 7] and logging keyboard events [8, 9, 10] using memory-based side-channel attacks. Most memory-based side-channel attacks target one particular memory resource, the cache structure, and exploit the significant difference between cache hit and cache miss access times.

With the introduction of programmable shader cores and high-level programming frameworks [11, 12], GPUs have been integrated into complex heterogeneous computer systems for accelerating applications. Given their ability to provide high throughput and efficiency, GPUs are now being leveraged to offload cryptographic workloads from CPUs [13, 14, 15, 16, 17, 18]. This move to the GPU allows cryptographic processing to achieve up to 28X higher throughput [13]. While an increasing number of security systems are deploying GPUs, the security of GPU execution has not been well studied. In this dissertation, we take the first step and thoroughly analyze two memory resources on GPUs, the memory coalescing unit and the banked shared memory unit, discover side-channel timing leakage in both, and devise two memory-based side-channel attacks that successfully break 16-byte AES encryption on a GPU.

Similar to the GPU's banked shared memory unit, the L1 cache of modern complex processors is also banked, in order to provide high bandwidth for superscalar processors and to reduce power consumption. Rather than being a monolithic microarchitectural module, the L1 cache is composed of multiple cache banks, which allow multiple concurrent accesses to different banks at one time. However, when two or more accesses target the same bank, a bank conflict arises and the accesses are processed in a serialized manner. The subtle timing difference between parallel and serial cache bank accesses can be exploited to leak sensitive information. Based on this timing difference, we design another memory-based side-channel attack to recover the 16-byte AES encryption key.

Despite numerous countermeasures [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], none of them can prevent all existing memory-based side-channel attacks. Protecting vulnerable applications against different memory-based side-channel attacks is challenging and can be costly, thus calling for more general countermeasures that work across architectures and applications. We propose a software countermeasure, MemPoline, to provide a just-in-need security level to defend against memory-based side-channel attacks. Specifically, we use a parameter-based permutation function to shuffle the memory space progressively. Results show that our countermeasure can effectively mitigate all these known memory-based side-channel attacks with significantly low performance degradation.


Timing Channel | Cache Miss/Hit                          | L1 Cache Bank Conflict     | Shared Memory Bank Conflict | Memory Coalescing
Platform       | CPU                                     | CPU                        | GPU                         | GPU
Attacks        | Bernstein Timing Attack [31],           | CacheBleed [34],           | Jiang, et al.               | Jiang, et al.
               | Prime+Probe [32, 33], Evict+Time [32],  | Jiang, et al. [ICCAD 2017] | [GLSVLSI 2017]              | [HPCA 2016]
               | Flush+Reload [6], Flush+Flush [9], ...  |                            |                             |

Table 1.1: Memory-Based Side-Channel Attacks

1.2 Existing Memory-Based Side-Channel Attacks and Countermeasures

Attack. The cache is a critical structure for performance that reduces the speed gap between the main memory storage and the computation (on CPU or GPU cores) by utilizing the spatial and temporal locality exhibited in program code and data. As caches store only a portion of memory content, a memory request can be served directly by the cache in the case of a cache hit, and otherwise by the off-chip memory (a cache miss). The timing difference between a cache hit and a miss can be hundreds of cycles, and hence it forms a strong timing side channel that many memory-based side-channel attacks exploit. However, as the memory subsystem becomes more complex, there exist many other vulnerable memory resources. We classify the existing memory-based side-channel attacks and our three proposed attacks according to the memory resource that each utilizes, and present them in Table 1.1. In this dissertation, we identify and explore memory-based side channels other than the common and strong one that exploits the timing difference between a cache hit and a miss. Memory-based side-channel attacks can be classified into access-driven and time-driven attacks.


For a time-driven attack [32, 35, 31], the adversary observes the total execution time of the victim under different inputs and uses statistical methods over a large number of samples to infer the secret. For an access-driven attack [32, 33, 34, 9], the adversary intentionally creates contention with the victim on certain shared resources to infer the victim's memory access footprint. Such an attack consists of three steps: 1. preset - the adversary sets the shared resource to a certain state; 2. execution - the victim runs; 3. measurement - the adversary checks the state of the resource using timing information.

Countermeasure. While the number of memory-based side-channel attacks continues to grow, various countermeasures have been proposed. Hardware-based countermeasures modify the cache architecture and policies, and can be efficient [19, 20, 21, 29, 28]. However, they are invasive, require hardware redesign, and oftentimes only address a specific attack. Software countermeasures [23, 24, 25, 36] require no hardware modification and make changes at different levels of the software stack, e.g., the source code, the binary code, or the operating system. They are favorable for existing computer systems and have the potential to be general, portable, and compatible.

The software implementation of the Oblivious RAM (ORAM) scheme shown in prior work [37] has been demonstrated to be successful in mitigating cache side-channel attacks. The ORAM scheme [38, 39] was originally designed to hide a client's data access pattern to remote storage from an untrusted server by repeatedly shuffling and encrypting data. Raccoon [37] repurposes ORAM to prevent the memory access pattern from leaking through cache side channels. The Path-ORAM scheme [39] uses a small client-side private storage to store a position map for tracking the real locations of the data, and assumes the server cannot monitor the access pattern. However, in side-channel attacks, all access patterns can be monitored, and indexing into a position map is itself insecure against memory-based side-channel attacks. Instead of indexing, Raccoon [37], which focuses on control-flow obfuscation and uses ORAM for storing data, streams in the position map to look for the real data location, so that it provides a strong security guarantee. However, since it relies on ORAM for storing data, its memory access runtime is O(N) given N data elements, and the ORAM-related operations can incur more than 100x performance overhead. We propose a software countermeasure, MemPoline, to address the side-channel security issue of Path-ORAM [39] and the performance issue of both prior works [39, 37]. MemPoline adopts the ORAM idea of shuffling, but implements a much more efficient permutation scheme. In our scheme, the permutation is directed by a parameter; thus, we only need to keep the parameter value private (instead of a position map) to track the real dynamic locations of data. For our countermeasure, the memory access runtime is O(1), significantly lower than the O(log(N)) of Path-ORAM [39] and the O(N) of Raccoon [37].
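To make the parameter-directed permutation idea concrete, the following is a minimal C++ sketch of one possible instantiation, assuming a power-of-two table size and a simple XOR-based permutation; all names here are illustrative only, and the actual MemPoline design is presented in Chapter 5.

#include <cstdint>
#include <utility>

// Sketch only: logical index i is stored at physical slot (i XOR r), so a
// lookup is O(1) and only the parameter r must be kept secret.
struct PermutedTable {
    uint32_t *slots;   // physical storage; size is a power of two
    uint32_t mask;     // size - 1
    uint32_t r;        // secret permutation parameter

    uint32_t read(uint32_t i) const { return slots[(i ^ r) & mask]; }

    // Start a new epoch: migrate every entry from parameter r to newR.
    void repermute(uint32_t newR) {
        uint32_t d = r ^ newR;              // how old and new slots differ
        for (uint32_t i = 0; d != 0 && i <= mask; i++)
            if (i < (i ^ d))                // visit each swap pair once
                std::swap(slots[i], slots[i ^ d]);
        r = newR;
    }
};

Because only r is secret, a lookup consults no per-element position map, which is what keeps each access O(1) and free of secret-dependent indexing into auxiliary structures.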


1.3 Dissertation Overview

The dissertation consists of two major parts. The first part explores vulnerable memory resources other than the cache structure that leaks side-channel information via cache hit and miss accesses, and the second part proposes a software-based countermeasure to mitigate memory-based side-channel attacks for applications running on systems with vulnerable memory resources.

In Chapters 2 and 3, we examine vulnerable memory resources on GPU platforms. Specifically, we thoroughly analyze two memory resources on GPUs: the memory coalescing unit in Chapter 2 and the banked shared memory unit in Chapter 3. We discover side-channel timing leakage in these two resources, and devise two memory-based side-channel attacks to successfully break 16-byte AES encryption on various GPU platforms.

In Chapter 4, we analyze the banked L1 cache of modern complex CPUs. We derive a memory-based side-channel attack that exploits the subtle timing difference between parallel and serial cache bank accesses to recover the 16-byte AES encryption key.

In Chapter 5, we propose a software countermeasure, MemPoline, to provide a just-in-need security level to defend against memory-based side-channel attacks. Specifically, we use a parameter-based permutation function to shuffle the memory space progressively and obfuscate memory accesses. We apply MemPoline to both the T-table implementation of AES and the sliding-window implementation of RSA, and evaluate it against various memory-based attacks on both CPU and GPU platforms. Results show that the countermeasure can effectively mitigate known memory-based side-channel attacks with significantly less performance degradation than other ORAM-based countermeasures. We conclude the dissertation in Chapter 6.

1.4 Dissertation Contribution

In this dissertation, we propose a number of new memory-based side-channel attacks and a general countermeasure. We thoroughly examine several microarchitectural units in terms of their timing leakage, reverse-engineer the partial structure and behavior of these units, and identify their vulnerability to side-channel attacks. Our work significantly raises the awareness of side-channel security in the broader computer architecture community across various platforms. The contributions of the dissertation to the areas of computer architecture and side-channel security include:

1. Memory Coalescing Unit: We discover the very first memory resource on GPUs that can leak the memory access footprint of an application. We overcome the challenges for memory-based side-channel attacks introduced by the GPU's architectural features and design an effective memory coalescing side-channel attack against AES encryption. This attack is time-driven, non-invasive, and non-interfering, and only measures the total execution time of the GPU under different data inputs. We demonstrate that even a slight timing difference can render memory resources on a GPU vulnerable to memory-based side-channel attacks.

2. Shared Memory Banks: We discover another memory resource on GPUs that can leak the memory access footprint of an application: the shared memory banks. We design another effective time-driven memory-based attack that exploits only the interaction among parallel threads through the shared memory banks. No prior work has investigated this memory resource on GPUs.

3. L1 Cache Bank: There is a very subtle timing side channel in the L1 cache banks of CPUs, caused by the small stalling delay due to conflicts between concurrent access requests to the same bank. We design an access-driven cache bank attack with a spy process and a concurrent victim process, supported by Hyper-Threading. Observing the total execution time of the spy process allows malicious users to infer the memory access pattern of the victim process through their contention on the L1 cache banks. Since all the existing countermeasures target cache side-channel attacks that rely on the cache miss penalty, none of them can prevent our new cache bank attack, as it is orthogonal to other cache attacks and yields a different side-channel granularity: cache bank versus cache line.

4. MemPoline: We propose a software-based countermeasure to mitigate existing memory-based side-channel attacks across different memory resources and platforms. The countermeasure is built on top of a novel, efficient, and effective technique that randomizes a memory space at runtime so as to obfuscate a program's memory access pattern. We apply the countermeasure to multiple ciphers on different platforms (CPUs and GPUs), evaluate its resilience, and demonstrate that it can defeat all known memory-based side-channel attacks, both empirically and theoretically.

Chapter 2

Information Leakage in Memory Coalescing Unit

2.1 Introduction

With the introduction of programmable shader cores and high-level programming frameworks [12, 11], GPUs have become fully programmable parallel computing devices. Compared to modern multi-core CPUs, a GPU can deliver significantly higher throughput by executing workloads in parallel over thousands of cores. As a result, GPUs have quickly become the accelerator of choice for a large number of applications, including physics simulation, biomedical analytics, and signal processing. Given their ability to provide high throughput and efficiency, GPUs are now being leveraged to offload cryptographic workloads from CPUs [13, 14, 15, 16, 17, 18]. This move to the GPU allows cryptographic processing to achieve up to 28X higher throughput [13]. While an increasing number of security systems are deploying GPUs, the security of GPU execution has not been well studied. Pietro et al. identified that information leakage can occur throughout the memory hierarchy due to the lack of memory-zeroing operations on a GPU [40]. Previous work has also identified vulnerabilities of GPUs using software methods [41, 42]. While there has been a large number of studies on side-channel security on other platforms, such as CPUs and FPGAs, little attention has been paid to the side-channel vulnerability of GPU devices.

Timing attacks have been demonstrated to be one of the most powerful classes of side-channel attacks [31, 35, 5, 6, 33, 7]. Timing attacks exploit the relationship between input data and the time (i.e., number of cycles) for the system to process/access the data. For example, in a cache collision attack [35], the attacker exploits the difference in CPU cycles needed to serve a cache miss versus a hit, and considers the cache locality produced by a unique input data set.

There is no prior work evaluating timing attacks on GPUs. To the best of our knowledge, our work is the first to consider timing attacks deployed at the architecture level on a GPU. The GPU's Single Instruction Multiple Threads (SIMT) execution model prevents us from simply applying prior timing attack methods developed for CPUs to GPUs. A GPU can perform multiple encryptions concurrently, and each encryption competes for hardware resources with other threads, providing the attacker with confusing timing information. Also, under SIMT, the attacker is not able to time-stamp each encryption individually; the timing information the attacker obtains is dominated by the longest running encryption. Given these challenges in GPU architectures, most existing timing attack methods become infeasible.

In this chapter, we demonstrate that information leakage can be extracted from execution on a SIMT-based GPU to fully recover the encryption secret key. Specifically, we first observe that the kernel execution time is linearly proportional to the number of unique cache line requests generated during kernel execution. In the L1 cache memory controller of a GPU, memory requests are queued and processed in First-In-First-Out (FIFO) order, so the time to process all memory requests depends on the number of memory requests. As AES encryption generates memory requests to load its S-box/T-table entries, the addresses of these memory requests depend on the input data and the encryption key. Thus, the execution time of an encryption kernel is correlated with the key. By leveraging this relationship, we can recover all 16 AES secret key bytes on an Nvidia Kepler GPU. Although we demonstrate this attack on a specific Nvidia GPU, other GPUs also have the same exploitable leakage.

We have set up the client-server infrastructure shown in Figure 2.1. In this setting, the attacker (client) sends messages to the victim (encryption server) through the internet; the server employs its GPU for encryption and sends the encrypted messages back to the attacker. For each message, the encrypted ciphertext is known to the attacker, as well as the timing information. If the timing data measured is clean (mostly attributable to the GPU kernel computation), we are able to recover all 16 key bytes using one million timing samples. In a more practical attack setting where there is CPU noise in our timing data, we are still able to fully recover all the key bytes by collecting a larger number of samples and filtering out the noise. Our attack results show that modern SIMT-based GPU architectures are vulnerable to timing side-channel attacks.

The rest of the chapter is organized as follows. In Section 2.2, we discuss related work. In Section 2.3, we provide an overview of our target GPU memory architecture and our AES GPU implementation. In Section 2.4, the architecture leakage model is first presented, followed by our attack method that exploits the leakage for complete AES key recovery. We discuss potential countermeasures in Section 2.5. Finally, the chapter is summarized in Section 2.6.


Figure 2.1: The attack environment

2.2 Related Work

Timing attacks utilize the relationship between data and the time taken by a system to access/process the data. Multiple attacks have been demonstrated successfully by exploiting cache access latencies, which leak secrets through either cache contention or cache reuse [35, 5, 6]. In order to create cache contention or cache reuse, attackers need to have their own process (a spy process) coexisting with the targeted process (the victim process) on the same physical machine. This way, the spy process can evict or reuse cache contents created by the victim process to introduce different cache access latencies. We refer to this class of attacks as offensive attacks. Another kind, the non-offensive attack, has also been demonstrated successfully by Bernstein [31]. Unlike offensive attacks, Bernstein's timing attack does not interfere with the victim process. His attack exploits the relationship between the time for an array lookup and the array index.

The attack strategy commonly deployed in CPU-based timing attack methods consists of producing one block of ciphertext at a time and profiling the associated time to process that block. However, on a GPU it would be highly inefficient to perform only one block encryption and produce one block of ciphertext at a time, given the GPU's massive computational resources.


In a real-world scenario, the encryption workload would contain multi-block messages, and for each data sample the GPU timing attack would produce many blocks of ciphertext. The key difference is that the GPU scenario will only collect a single timing value for the multiple blocks. Although many successful attack methods have been demonstrated on CPU platforms, these methods cannot be directly applied to the GPU platform due to a lack of accurate timing information and nondeterminism in thread scheduling. Our timing attack method targets a GPU and is non-offensive like Bernstein's. We exploit the inherent parallelism present on the GPU, as well as its memory behavior, in order to recover the secret key.

2.3 Background

In this section, we discuss the memory hierarchy and memory handling features of Nvidia's Kepler GPU architecture [43]. Note that not all details of the Kepler memory system are publicly available; we leverage information that has been provided by Nvidia, as well as many details of the microarchitecture that we have been able to reverse engineer. We also describe the AES implementation we have evaluated on the Kepler and the configuration of the target hardware platform used in this work.

2.3.1 GPU Memory Architecture

2.3.1.1 Target Hardware Platform

Our encryption server is equipped with an Nvidia Tesla K40 GPU. This Kepler-family device includes 15 streaming multiprocessors (SMXs). Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units (LSUs). In CUDA terminology, a group of 32 threads is called a warp. Each SMX has four warp schedulers and eight instruction dispatch units, which means four warps can be scheduled and executed concurrently, and each warp can issue two independent instructions in one GPU cycle [43].
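These device attributes can be confirmed at runtime with the standard CUDA runtime API; the small host program below is our own sanity check, not part of the dissertation's server code.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0: the Tesla K40 in our setup
    std::printf("%s: %d SMXs, warp size %d, L2 size %d bytes\n",
                prop.name, prop.multiProcessorCount, prop.warpSize,
                prop.l2CacheSize);
    return 0;
}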

2.3.1.2 GPU Memory Hierarchy

Kepler provides an integrated off-chip DRAM memory, called device memory. The CPU transfers data to and from the device memory before and after it launches the kernel. Global memory, texture memory, and constant memory reside in the device memory.

Data residing in the device memory is shared among all of the SMXs. Each SMX has L1/shared, texture, and constant caches, which are used to cache data from global memory, texture memory, and constant memory, respectively. These caches are placed very close to the physical cores, so they have much lower latency than the corresponding memories. The texture memory and cache are optimized for spatial memory access patterns. The constant memory and cache are designed for broadcasting a single constant value to all threads in a warp. Global memory, together with the L1 and L2 caches and the coalescing units (units that coalesce multiple global memory accesses from a warp into memory transactions), provides fast general-purpose memory accesses. The hierarchy of the L1/L2 caches and global memory is similar to that found on modern multi-core CPUs. The L1 and L2 caches on a GPU are much smaller than those found on a CPU; however, GPU caches have much higher bandwidth, which is needed to support a large number of cores.

2.3.1.3 Memory Request Handling

On a Kepler GPU, a global memory load instruction for a warp generates 32 memory requests if none of the threads is masked. All 32 memory requests are sent to the coalescing units, which reorder pending memory accesses, trying to reduce the memory requests down to a number of unique cache line requests. These cache line requests are issued to the L1 cache controller, one per cycle, a process referred to as memory issue serialization [44]. If the requested data is present in the L1 cache, the data is loaded into the specified register and the cache line request is resolved in one GPU cycle by the LSU. On a miss, the request is queued in a Miss Status Holding Register (MSHR), one per cycle. If any incoming cache line request matches an outstanding cache line miss queued in the MSHR, the request is merged into a single MSHR entry. All requests queued in the MSHR are processed in FIFO order and forwarded to the next-level memory controllers (L2 or device memory). Upon receiving the requested data, the LSU needs to load this data into the register file and release the MSHR entry, one per cycle. This process is referred to as writeback serialization [44].
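The timing consequence of this design is easiest to see in a simple software model. The sketch below is our own illustration (not Nvidia's documented implementation) of how a warp's 32 addresses reduce to unique cache line requests, each of which then costs one serialized issue (and, on a miss, one serialized writeback) cycle.

#include <cstdint>
#include <set>

// Model: count the unique cache line requests generated by one warp-wide load.
int uniqueCacheLines(const uint64_t addr[32], uint64_t lineBytes = 64) {
    std::set<uint64_t> lines;
    for (int t = 0; t < 32; t++)
        lines.insert(addr[t] / lineBytes);  // drop the offset within the line
    return (int)lines.size();               // requests are issued one per cycle
}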

2.3.2 AES GPU Implementation

In this chapter, we evaluate 128-bit Electronic Codebook (ECB) mode AES encryption based on T-tables, which operates on 16-byte data blocks using a 16-byte secret key. The encryption implementation we use was ported from the OpenSSL 0.9.7 library into CUDA. We transformed an entire block encryption into a single GPU kernel, so that each thread in the GPU can process one block encryption independently, as shown in Figure 2.2.


Figure 2.2: The GPU AES implementation used in this work.

The encryption key scheduling step expands the 16-byte secret key into 160 bytes of round keys for the ten rounds of operation. In the initial round, the 16-byte plaintext is XORed with the first round key to generate the initial state. In the T-table version of AES, the SubByte, ShiftRow, and MixColumn operations are integrated into lookups on four T-tables. Rounds 1-9 simply perform T-table lookups and add round keys. In the last round, a special T-table, T4, integrates only the SubByte with the ShiftRow and does not involve a MixColumn operation. Our attack focuses on the last round of AES, whose operations are shown in Equation (2.1), where each T4 table lookup returns a 4-byte value that is indexed by a one-byte value (the trailing subscript _0 to _3 selects one byte of the 4-byte result), c0-c15 are the output ciphertext bytes, and {t0, t1, ..., t15} are the input bytes to the last round:

c0 = T4[t3]_0 ⊕ k0
c1 = T4[t6]_1 ⊕ k1
c2 = T4[t9]_2 ⊕ k2
c3 = T4[t12]_3 ⊕ k3
c4 = T4[t7]_0 ⊕ k4
c5 = T4[t10]_1 ⊕ k5
c6 = T4[t13]_2 ⊕ k6
c7 = T4[t0]_3 ⊕ k7
c8 = T4[t11]_0 ⊕ k8                (2.1)
c9 = T4[t14]_1 ⊕ k9
c10 = T4[t1]_2 ⊕ k10
c11 = T4[t4]_3 ⊕ k11
c12 = T4[t15]_0 ⊕ k12
c13 = T4[t2]_1 ⊕ k13
c14 = T4[t5]_2 ⊕ k14
c15 = T4[t8]_3 ⊕ k15

The generation of each ciphertext byte involves a table lookup (which returns a 4-byte value), byte positioning (taking one byte out of the four by byte masking and shifting), and an add-key. When implemented on Nvidia GPUs, the generation of each ciphertext byte is realized in CUDA with load and store instructions, in addition to logic instructions. Although the order of table lookups for the cipher bytes is {0, 1, 2, ..., 15}, the order of the CUDA load and store instructions for the cipher bytes may differ depending on how the program is compiled. For example, the CUDA compiler, nvcc, by default enables -O3 optimization, which reorganizes the CUDA instructions to avoid data dependency stalls and thus can hide some of the latency of memory access instructions. When optimization is disabled with the -O0 flag, the table lookup for each byte is directly translated into CUDA instructions and the order is preserved.

In this GPU-based AES implementation, one GPU thread performs AES encryption on one block of data. For a 32-block message, one warp of 32 threads can launch 32 encryptions in parallel. As the number of blocks per message increases, the GPU throughput increases.
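For illustration, a single last-round output byte from Equation (2.1) might be computed by a CUDA device function like the one below; the function and parameter names are ours, and the byte-selection order is an assumption rather than the dissertation's exact code.

#include <cstdint>

// One last-round byte: T4 lookup, byte positioning, then add-key.
__device__ uint8_t lastRoundByte(const uint32_t *T4, uint8_t t,
                                 uint8_t k, int pos) {
    uint32_t v = T4[t];                         // 4-byte global memory load
    uint8_t b = (v >> (8 * (3 - pos))) & 0xff;  // byte masking and shifting (assumed order)
    return b ^ k;                               // XOR with the round key byte
}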


Our server system is dedicated to performing AES encryptions, so we have configured the GPU to achieve high throughput. As the constant cache stores data from constant memory that can be shared by the threads in a warp, we use this space to store the round keys for AES encryption. All threads in a warp access the same round key at the same time, which allows the constant cache to broadcast the value to all of the threads in the executing warp. Although the T-tables are also constant values, they would not benefit as much from constant memory, because each thread generates different memory accesses and the constant cache would have to return them sequentially, wasting valuable resources. Therefore, we chose to place the T-tables in global memory, reducing the number of memory requests by leveraging the coalescing units. Also, the T-table data can be shared across SMXs through the L2 cache and across warps in the same SMX through the L1 caches.

The L1 cache and the shared memory share 64KB of physical memory, and Nvidia allows developers to choose the division on a per-kernel basis. There are three available options: 16KB shared memory and 48KB L1 cache, 32KB shared memory and 32KB L1 cache, and 48KB shared memory and 16KB L1 cache. For our AES encryption kernel, the threads in a warp do not share any data, so the shared memory is not used during encryption. Thus, we want to minimize the size of the shared memory and maximize the use of the L1 cache; the best configuration is 16KB shared memory and 48KB L1 cache.

During server initialization, the five T-tables are copied into global memory, and the round keys are copied into constant memory. Constant data remains in the device memory until the application exits. During encryption, all memory requests to global memory access the L2 cache. However, by default, global memory load/store operations bypass the L1 cache without being cached. We found that enabling L1 caching can increase encryption performance, so in this work we always configure the server program with L1 caching enabled.
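For reference, this configuration could be requested as in the sketch below, using the standard CUDA runtime API; the kernel name is hypothetical and the snippet is ours, not the dissertation's exact code.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void aesEncryptKernel(const uint8_t *in, uint8_t *out) { /* ... */ }

void configureGPU() {
    // Request the 48KB L1 / 16KB shared memory split for the AES kernel.
    cudaFuncSetCacheConfig(aesEncryptKernel, cudaFuncCachePreferL1);
}
// On Kepler, global loads are cached in L1 only when the kernel is compiled
// with the nvcc flag -Xptxas -dlcm=ca.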

2.4 Correlation Timing Attack

We design a correlation timing attack that exploits the relationship between the kernel execution time and the number of unique cache line requests generated during kernel execution. The attack uses one ciphertext byte and one key byte guess to compute the number of unique cache line requests that would be generated for the targeted ciphertext byte during its table lookup. In the attack, we encrypt many messages and collect the timing information for each message (referred to as a trace, or data sample). We then correlate the calculated number of cache line requests with the timing samples. If we guess the right key byte, we should expect to find a strong correlation between the timing and the correct number of unique cache line requests; otherwise, if we guess the wrong key byte, the resulting correlation should be low.

In this section, we first explore the architecture leakage present on an Nvidia Kepler-family K40 device. To assess timing leakage vulnerabilities, we evaluate the success rates of correlation timing attacks using both clean measurements and noisy measurements. For clean measurements, the attacker is able to measure the warp execution time within a kernel, so the only sources of inaccuracy are the GPU's internal hardware. With noisy measurements, the attacker is only able to measure when a message is received and returned by the server; in this case, the noise sources also include processing on the server CPU, which introduces some non-deterministic delay into our measurements. We consider the quality of the timing data we collect to better understand how noisy measurements can impact key recovery.

2.4.1 SIMT Architecture Leakage

With SIMT execution on a GPU, when a warp issues a load instruction, 32 memory requests from the 32 threads are generated and sent to the coalescing units (assuming all of the threads are active). These memory requests are translated into unique cache line requests, which are merged with existing cache line requests in the MSHR. The time taken to serve all 32 requests from a warp is proportional to the number of unique cache line requests sent to the L1 cache controller, due to both memory issue serialization and writeback serialization.

To determine whether there is a linear relationship between a SIMT load instruction's execution time and the number of unique cache lines accessed, we develop the test kernel shown in Kernel 1. In this kernel, we measure the execution time for a warp of 32 threads to perform the load and store instructions. Each thread is assigned an index from the indices array, uses the index to load a 4-byte element from a big array A, and stores the element into the result variable. With SIMT execution of 32 threads, the contents of the indices array determine the total number of unique cache lines referenced during the kernel execution, which essentially samples the data array A. For example, if all the indexes are the same, Kernel 1 will only request one unique cache line.

The indices array is created using Algorithm 2. Given a specified number of unique cache lines needed (e.g., 6), Algorithm 2 generates the first six indexes with a stride of the cache line size, accessing six distinct cache lines, and the remaining 26 indexes are all the same as the sixth one. With Algorithm 2, we can sweep the number of unique cache lines from 1 to 25 and generate the corresponding indices arrays to use in Kernel 1.


Kernel 1 The kernel to measure memory access time

index ← indices[tid]
time ← CLOCK()
result[tid] ← A[index]
time ← CLOCK() − time

Algorithm 2 Generating memory access indices that result in a preset number of cache line accesses for Kernel 1

numCacheLine ← userInput
indices ← []
curCacheLineIdx ← 0
for i = 1:32 do
    indices[i] ← curCacheLineIdx ∗ stride
    if curCacheLineIdx < numCacheLine − 1 then
        curCacheLineIdx ← curCacheLineIdx + 1
    end if
end for
SHUFFLE(indices)
return indices
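For concreteness, a runnable CUDA rendering of Kernel 1 might look as follows; this is our reconstruction with names of our choosing, not the exact kernel used in the experiments.

__global__ void timedLoad(const int *indices, const int *A, int *result,
                          long long *elapsed) {
    int tid = threadIdx.x;            // one warp: threads 0..31
    int index = indices[tid];
    long long start = clock64();      // per-SM cycle counter
    result[tid] = A[index];           // the coalesced load under test
    long long stop = clock64();
    if (tid == 0)
        *elapsed = stop - start;      // warp-level timing, as in Kernel 1
}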


Given the GPU's serialized memory request handling, we expect the execution time to be linearly proportional to the number of unique cache line requests. In Figure 2.3, we plot timing data for memory accesses while varying the number of cache line requests, under three strides: 32, 64, and 128 bytes. We can see that they are linearly proportional, and the slope of the lines indicates how much execution time is consumed per unique cache line access. A similar result was also reported in prior work [44]. The lines for stride sizes 64 and 128 bytes are exactly the same, implying that the effective cache line size is 64 bytes. According to the Nvidia online literature [11], the cache line size of the Kepler L1 data cache is 128 bytes; we confirmed this with Nvidia. What we also learned is that there are microarchitectural features in the L1 cache that are responsible for the observed behavior. As a result, we have elected to use 64 bytes as the cache line size in our attacks on the K40 device.


Figure 2.3: Nvidia GPU: Timing for 1 to 25 cache line requests, under stride 32, 64, and 128 bytes.

The Pearson correlation value [45] between the execution time and the number of unique cache lines is found to be around 0.96. This strong correlation suggests that the execution time can leak information about the array indices used in Kernel 1.



Figure 2.4: AMD GPU: Timing for 1 to 25 cache line requests.

Not only does the Nvidia Kepler GPU exhibit this type of leakage; AMD GPUs also show the same leakage. We performed the same timing analysis, running programs written in OpenCL on an AMD R9-290X GPU, with the stride set to 64 bytes (the AMD GPU L1 cache line size); the result is shown in Figure 2.4. We find its correlation value to be around 0.93.

Since SIMT and memory coalescing are crucial features for high-performance GPUs, this kind of correlation will persist in various GPUs, including the Nvidia and AMD GPUs that we have experimented with; disabling either feature would significantly degrade performance. We therefore expect to find this correlation on GPUs from other manufacturers as well. Inspecting both Figure 2.3 and Figure 2.4, it is very clear that the execution time is directly proportional to the number of cache line requests on both families of GPUs. Given a fixed kernel, the key information (which determines the number of cache line requests) can therefore be leaked by the execution time of the kernel. This observation inspires us to carry out a correlation timing attack on an AES implementation running on state-of-the-art GPUs.


2.4.2 AES Encryption Leakage

As shown in Figure 2.1, the attacker and victim computers are connected via a network. This setup is the same as the one described by Bernstein [31], except that the encryption is performed on the GPU. The goal of the attacker is to recover the 16-byte secret key used by the encryption server, using the known ciphertexts and the collected timing information. Noise in this setup can be minimized if the execution time of the kernels themselves is measured. However, the measurement noise (i.e., inaccuracies) produced in a more practical setting will not inhibit the attack: as shown by Brumley et al. [46], the attacker can simply collect a larger number of traces and average out the noise.

Suppose that the attacker sends a 32-block message to the server, and the server launches a warp of 32 threads to encrypt the received data. After some time, the attacker receives the 32-block encrypted message, along with the timing information for the warp execution, which is stored as one timing trace (sample) as shown below:

{c0-15^1, c0-15^2, ..., c0-15^32, T^1}

There are ten rounds in AES encryption, and each round performs 16 table lookups for each block of data. The index of each table lookup determines which cache line will be loaded. Thus, the entire encryption time depends on the indices of the 160 table lookups. We collected one million 32-block messages and their associated timings, and recorded all indices used for the 160 table lookups during each block encryption. From all the indices used in the warp, we compute the number of unique cache line requests. We plot the average execution time against the number of unique cache line requests in Figure 2.5, as well as the sample counts used to calculate the average time in Figure 2.6. Although the line in Figure 2.5 does not appear as linear as the one shown in Figure 2.3, it is clear that as the number of cache line requests increases, the average time also increases. The correlation between the number of cache line requests and the recorded execution time is 0.0596.

In a real attack, it is impossible to compute all of the indices used during one encryption without knowing the entire 16-byte key, due to the strong cryptographic confusion and diffusion functions, and it is computationally infeasible to enumerate the entire key space (2^128 ≈ 3.4 × 10^38). However, in the last round, each lookup table index can be computed from one byte of the key and the corresponding byte of ciphertext, independently from the other ciphertext bytes. Thus, we can examine how much leakage can be observed in one byte.


Figure 2.5: The average recorded time versus the total number of cache line requests in a message encryption, with one million samples.

From Equation (2.1), we can write each byte of ciphertext as follows (byte positioning is ignored for simplicity):

cj = T4[ti] ⊕ kj        (2.2)

Using an inverse lookup table, we can find the ith byte of the input state to the last round, ti, if we know the true round key byte, kj:

ti = T4^-1[cj ⊕ kj]        (2.3)

Figure 2.6: Sample counts versus the number of cache line requests, with one million samples.



Given the GPU's SIMT execution model, for a 32-block message we have 32 threads running simultaneously, and therefore:

ti^1 = T4^-1[cj^1 ⊕ kj]
ti^2 = T4^-1[cj^2 ⊕ kj]
...
ti^32 = T4^-1[cj^32 ⊕ kj]

The values of the table lookup indexes ti^1, ti^2, ..., ti^32 determine the number of unique cache lines that will be requested. Since each element in the T4 table is 4 bytes and the size of a cache line is 64 bytes, there are 16 T4 table elements in one cache line (assuming the T4 table is aligned in memory). Therefore, the memory access requests can be turned into cache line requests by dropping the lowest 4 bits of ti^1, ti^2, ..., ti^32, and so we have the following cache line requests:

⟨ti^1⟩ = ti^1 >> 4
⟨ti^2⟩ = ti^2 >> 4
...
⟨ti^32⟩ = ti^32 >> 4

The number of unique cache lines is the number of unique values among ⟨ti^1⟩, ⟨ti^2⟩, ..., ⟨ti^32⟩. This process of calculating the number of unique cache lines accessed from the ciphertext bytes is implemented in Algorithm 3.

Algorithm 3 Calculating the number of unique cache line requests in the last round for a given key byte guess.

kj ← guess
cache_line_cnt ← 0
holder ← [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
for i = 0:31 do        %% i is the thread id
    holder[T4^-1[cipher[i][j] ⊕ kj] >> 4] ++
end for
for i = 0:15 do
    if holder[i] != 0 then
        cache_line_cnt ++
    end if
end for
return cache_line_cnt

We generated one million 32-block messages and received the one million encrypted messages, along with their associated timings. For each 32-block encrypted message, we used Algorithm 3 to calculate the number of unique cache line requests generated by the T4[t3] table lookups, assuming we know the value of k3. Figure 2.7 shows the timing distribution over the number of cache line requests. We find the Pearson correlation value to be 0.0443. We also fit a line to the timing distribution, with a slope of 14 cycles and an offset of 26,503 cycles, where the slope is taken as the signal of the leaking timing channel.
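The slope and offset quoted above come from an ordinary least-squares fit; a minimal C++ helper of our own (not code from the dissertation) is sketched below.

#include <vector>

// Least-squares fit y = slope * x + offset, e.g., time vs. cache line count.
void fitLine(const std::vector<double> &x, const std::vector<double> &y,
             double &slope, double &offset) {
    double n = (double)x.size(), sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (size_t i = 0; i < x.size(); i++) {
        sx += x[i]; sy += y[i];
        sxy += x[i] * y[i]; sxx += x[i] * x[i];
    }
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    offset = (sy - slope * sx) / n;
}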


Figure 2.7: Timing distribution over the number of cache line requests, calculated for one million encrypted messages, using the true value of the 3rd key byte.

Since we use only one table lookup out of all 160 table lookups in one block encryption, we should expect the Pearson correlation to be bounded by the previously calculated correlation value for all 160 table lookups in Figure 2.5. Although the correlation value is small, it is still significantly higher than the correlation value obtained when using a wrong value for the 3rd key byte: if we assume the 3rd key byte to be 0, we find the correlation to be 0.0012, which is 36.9 times lower than the correlation value calculated using the right key. Although the correlation value is small, the linear relationship between the number of unique cache line requests and the encryption execution time suggests that the encryption time is leaking information about the individual 9th-round cipher state. Since the 9th-round cipher state can be computed using the ciphertext (known to the attacker) and key bytes, the encryption time ultimately leaks individual key bytes.


2.4.3 Correlation Timing Attack on GPU AES Implementation

As we can see from Equation (2.1), the table lookup for each ciphertext byte is independent of the others, and each key byte is used exclusively for its corresponding ciphertext byte. This allows us to attack one key byte at a time in a divide-and-conquer manner. For each possible value of a key byte, we use Algorithm 3 to calculate the number of cache line requests for each 32-block message. Since the timing is linearly proportional to the number of cache line requests in the last round, we compute the Pearson correlation of the timing versus the number of cache line requests. When we guess the correct key byte, we have the correct number of unique cache line requests, and the resulting correlation should be the highest; when we guess a wrong key byte, the resulting correlation should be low. Therefore, the key byte guess with the highest correlation value among all possible values should be the correct key byte. In this section, we test our correlation timing attack on the targeted Nvidia Kepler GPU. All of the experiments discussed in this section use one million traces, which can be collected within 30 minutes.
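Putting Algorithm 3 and the correlation test together, the per-byte key search can be sketched in host-side C++ as below; all helper names are ours, and T4inv stands in for the inverse last-round table that the attacker is assumed to have.

#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

extern const uint8_t T4inv[256];  // inverse last-round table (assumed available)

using Trace = std::array<std::array<uint8_t, 16>, 32>;  // 32 ciphertext blocks

// Algorithm 3: unique cache line count for ciphertext byte j under one guess.
int countUniqueCacheLines(const Trace &c, int j, uint8_t guess) {
    int holder[16] = {0};
    for (int i = 0; i < 32; i++)                  // i is the thread id
        holder[T4inv[c[i][j] ^ guess] >> 4]++;
    int cnt = 0;
    for (int b = 0; b < 16; b++)
        if (holder[b] != 0) cnt++;
    return cnt;
}

double pearson(const std::vector<double> &x, const std::vector<double> &y) {
    double n = (double)x.size(), sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
    for (size_t i = 0; i < x.size(); i++) {
        sx += x[i]; sy += y[i]; sxy += x[i] * y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
    }
    return (n * sxy - sx * sy) /
           (std::sqrt(n * sxx - sx * sx) * std::sqrt(n * syy - sy * sy));
}

// Return the guess whose predicted counts correlate best with the timings.
int recoverKeyByte(int j, const std::vector<Trace> &traces,
                   const std::vector<double> &times) {
    int best = 0;
    double bestCorr = -1.0;
    for (int g = 0; g < 256; g++) {
        std::vector<double> counts;
        for (const Trace &t : traces)
            counts.push_back(countUniqueCacheLines(t, j, (uint8_t)g));
        double corr = pearson(counts, times);
        if (corr > bestCorr) { bestCorr = corr; best = g; }
    }
    return best;
}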

2.4.3.1 Attack Using Clean Measurements

In this experiment, we first demonstrate the feasibility of our attack. We therefore minimize the noise by time-stamping the kernel start and end as part of the AES kernel, which provides us with clean timing traces. The result of our attack is shown in Figure 2.8, where the correct value for each key byte is circled. The correct key bytes stand out in the plots when compared to the other 255 possible key values, meaning we have successfully recovered all 16 bytes of the last round key.

We also analyze the success rate for k5 to see the number of traces needed to reach a given success probability (success rate). The result is shown in Figure 2.9, which includes both the measured and predicted success rates. Each point on the measured success rate curve is the average of 100 timing attack trials using different timing traces. The predicted success rate is calculated using the methodology presented by Fei et al. [47], in which the Signal-to-Noise Ratio (SNR) is obtained from real measurements (Figure 2.7) and used to predict the probability of recovering the correct key value. The predicted success rate tracks our measured results precisely. Since computing the predicted success rate takes less than 30 minutes while computing the measured success rate over 100 trials takes around 1,500 minutes, we will use the predicted success rate hereafter. From Figure 2.9, both the measured and predicted success rates reach 50% using as few as 20,000 traces; with 70,000 traces, the predicted success rate converges to 1.



Figure 2.8: Correlation attack result under clean measurements for 32-block messages.

Since the success rate directly depends on the SNR, we show the SNR value for each key byte in Table 2.1. As validated in Figure 2.8, the correlation value for the correct guess varies from key byte to key byte. Thus, the SNR value for each key byte also differs, as shown in Table 2.1, because there exists a linear relationship between the SNR and the correlation when the correlation value is small [47]. Although the same lookup operation and the same attack analysis are applied to each key byte, some key bytes, such as k5, k8, k12, and k13, have much smaller SNR values than the others. This observation leads us to investigate the effect of optimization during GPU compilation of the program on the timing attack.



Figure 2.9: Success rate of k5 using clean measurements.

k0  0.01      k4  0.0399    k8   0.0034    k12  0.0050
k1  0.01      k5  0.0064    k9   0.0105    k13  0.0082
k2  0.0168    k6  0.0305    k10  0.0190    k14  0.0214
k3  0.0395    k7  0.0379    k11  0.0399    k15  0.0392

Table 2.1: The Signal-to-Noise Ratio (SNR) for each key byte.

To explore this issue deeper, we first examined how the server program is compiled for the GPU. We compiled our server program with the highest level of optimization (-O3, the default in nvcc), which reorganizes some of the CUDA instructions in the kernel to avoid data dependency stalls. Before the table lookups for c5, c8, c12, and c13, other stalled load and store instructions may be congesting the GPU hardware resources, creating variable wait times for these table lookups. We inspected the executable Shader ASSembly (SASS) code using the Nvidia disassembler.

The code is shown in Listing 2.1 and includes part of the last round operation.

Line 2a38 performs the table lookup for c5, but the loaded value is not used until line 2a78, so the GPU stalls at line 2a78 only if the requested data is still unavailable at that point. In addition, there are multiple load and store instructions before line 2a38, which may congest the memory system and make the duration of the load instruction for c5 nondeterministic. Thus, we see very little correlation and a low SNR for those key bytes. If we disable optimization during compilation, the CUDA instructions are not reordered, each table lookup stalls on its own data dependency, and the timing becomes more deterministic and predictable. This is shown in Listing 2.2: line 9ef0 is the load instruction for c5, and its requested data is needed immediately by the following instruction. The same holds for c4 at line 9fa0. Overall, the non-optimized program runs much slower, with many stalls: 120,000 GPU cycles versus 27,000 GPU cycles for the optimized program. Without optimization, the loads of the individual ciphertext bytes do not interfere with each other, and we observe a timing variance of 2,354 GPU cycles squared, versus 128,000 GPU cycles squared with optimization. The execution time of the load instructions is then directly proportional to the number of unique cache lines accessed. Therefore, performing the same attack on the unoptimized server code, we expect nearly the same correlation value for each key byte, and the correlation is higher (around 0.06). The result is shown in Figure 2.10.

Listing 2.1: Optimized SASS Code

/*2a00*/ ST.E.U8 [R6+0x3], R18;
/*2a08*/ ST.E.U8 [R6+0x2], R14;
/*2a10*/ IADD.X R13, RZ, c[0xe][0x24];
/*2a18*/ IMAD.U32.U32 R16.CC, R17, R0, c[0xe][0x20];
/*2a20*/ LD.E R11, [R10];
/*2a28*/ LD.E R2, [R2];
/*2a30*/ IMAD.U32.U32.HI.X R17, R17, R0, c[0xe][0x24];
/*2a38*/ LD.E R12, [R12];
/*2a40*/ LD.E.64 R14, [R8+0x8];
/*2a48*/ LD.E R16, [R16];
/*2a50*/ LOP.AND R19, R22, 0xff;
/*2a58*/ LOP.AND R22, R26, 0xff;
/*2a60*/ LOP32I.AND R3, R11, 0xff000000;
/*2a68*/ LOP32I.AND R2, R2, 0xff0000;
/*2a70*/ SHR.U32 R11, R28, 0x15;


/*2a78*/ LOP.AND R10, R12, 0xff00;
...

Listing 2.2: Non-Optimized SASS Code

...
/*9ef0*/ LD.E.64 R4, [R6];
/*9ef8*/ LOP.AND R4, R4, 0xff00;
/*9f08*/ LOP.AND R5, R5, RZ;
/*9f10*/ LOP.XOR R4, R22, R4;
/*9f18*/ LOP.XOR R5, R23, R5;
/*9f20*/ MOV32I R6, 0x0;
/*9f28*/ MOV32I R7, 0x0;
/*9f30*/ MOV R6, R6;
/*9f38*/ MOV R7, R7;
/*9f48*/ LOP.AND R8, R38, 0xff;
/*9f50*/ LOP.AND R9, R39, RZ;
/*9f58*/ SHF.L.U64 R3, R8, 0x3, R9;
/*9f60*/ SHL R0, R8, 0x3;
/*9f68*/ MOV R8, R0;
/*9f70*/ MOV R9, R3;
/*9f78*/ IADD R6.CC, R6, R8;
/*9f88*/ IADD.X R7, R7, R9;
/*9f90*/ MOV R8, R6;
/*9f98*/ MOV R9, R7;
/*9fa0*/ LD.E.64 R6, [R8];
/*9fa8*/ LOP.AND R6, R6, 0xff;
...

Although we see much more consistent and higher correlation values in the attack result when the server code is not optimized, it is unlikely that a high-performance encryption engine would use unoptimized code. Therefore, we focus on optimized server code to test our attack. Running against the optimized server code, we are still able to recover all of the key bytes, and we gain a better understanding of how optimization can begin to thwart timing attacks through interference between loads.



Figure 2.10: No optimization: Correlation attack result for 32-block messages.

2.4.3.2 Attack Using Noisy Measurement

In practice, it is more common for a server CPU to time-stamp the incoming and outgoing messages than to time-stamp within the GPU kernel. With the same number of traces (one million) but this timing collection method, we are able to recover 10 out of the 16 key bytes, as shown in Figure 2.11. Moving from clean to noisy measurements, the variance in the timing data increases from 128 thousand to 5.8 million GPU cycles squared, i.e., a large amount of noise is introduced. The correlation for each key byte is reduced by more than 3X compared to our previous results with more accurate timing measurements. Although the added noise hampers the attack, it does not thwart the attack.



Figure 2.11: Correlation attack result with noisy measurements for 32-block messages.

With a large number of traces, we can still achieve a 100% success rate, at 3 million traces. The attacker can also use filtering to clean up the timing information and reduce the number of traces needed to reach a 100% success rate. We applied a Percentile Filter, as described by Crosby et al. [48], to our timing data: the attacker sends the same 32-block message 100 times, obtaining 100 timing samples along with one 32-block encrypted message. Through experiments, we found that using the 40th-percentile time among the 100 timing samples produces the best attack result, as shown in Figure 2.13. By applying this simple noise reduction method, we obtain even better results than those from clean measurements; a sketch of the filter is shown below.
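The following is a minimal C sketch of such a percentile filter; the function names are ours, and the 0.4 percentile is simply the value that worked best in our experiments.

#include <stdlib.h>

/* Compare doubles for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort n repeated timings of one message and keep the p-th percentile
 * (e.g., n = 100 and p = 0.4 for the 40th-percentile time). */
static double percentile_filter(double *times, int n, double p) {
    qsort(times, n, sizeof(double), cmp_double);
    return times[(int)(p * (n - 1))];
}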



Figure 2.12: Predicted success rate for the 0th key byte using filtered vs. unfiltered data.

The improvement arises because even the clean measurements suffer from GPU-internal noise sources, such as uncertainty in the warp scheduler. With noise filtering, most of these noise sources are removed, resulting in much cleaner timing information. The filtered success rate shown in Figure 2.12 converges to 1 at 40,000 traces, much earlier than the unfiltered case, which requires 3 million traces. This simple filtering method significantly improves the attack's effectiveness.

2.4.4 Attack on Highly Occupied GPU

Our experimental results suggest that GPU architectures with SIMT processing and a coalescing unit produce a linear relationship between the number of unique cache line requests and the execution time. This relationship makes the GPU highly vulnerable to correlation timing attacks. Adding noise would confuse attackers, but it does not fully thwart a timing attack: by collecting a large number of traces, attackers are still able to recover the secret information. (A sketch of the exploited timing model is given below.)
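To make the exploited model concrete, the following is a hedged host-side sketch that counts the unique cache lines touched by a warp's 32 table lookups; the 128-byte line size (32 four-byte table entries per line) and the function name are our illustrative assumptions.

/* Count the unique cache lines requested by 32 table-lookup indices of
 * a warp; the kernel execution time grows linearly with this count.
 * Assumes 128-byte cache lines, i.e., 32 four-byte entries per line. */
static int unique_cache_lines(const unsigned char idx[32]) {
    int seen[8] = {0};          /* 256 entries / 32 per line = 8 lines */
    int unique = 0;
    for (int t = 0; t < 32; t++) {
        int line = idx[t] >> 5; /* which 128-byte line is touched */
        if (!seen[line]) { seen[line] = 1; unique++; }
    }
    return unique;
}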



Figure 2.13: Correlation attack result using filtered timing information for 32-block messages.

Extracting the secret information from larger messages (e.g., 1024 blocks) is critical, because larger messages better utilize the high throughput of the GPU. However, larger messages can be unfavorable to the attacker. Unlike threads within a warp, threads in different warps are not synchronized, so some warps may finish before others, and the longest warp execution time dominates the attacker's time measurement. Although the attacker does not know which 32-block encrypted message is dominant, she can divide a 1024-block message into 32 32-block messages and treat them as 32 traces sharing the same time value.

One of the 32 traces, produced by the dominant warp, carries the true timing; the other 31 may be wrong, since the warps that produced them finished their encryptions earlier than the dominant warp. The attacker therefore treats the other 31 traces as noise added to the calculation (a sketch of this trace expansion is shown below). We collected one million traces using the filtering method discussed above; the results are shown in Figure 2.14. Most key bytes are still recoverable, but key bytes with weaker correlation, such as k12, are completely buried. With more traces, k12 will also be recovered.
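The trace-expansion step can be sketched as follows; the structure layout and names are illustrative, not taken from the attack code.

#include <string.h>

typedef struct {
    unsigned char cipher[32][16]; /* the 32 ciphertext blocks of one warp */
    double time;                  /* kernel time, shared by all 32 blocks */
} WarpTrace;

/* Split one timed 1024-block sample into 32 warp-sized traces; only the
 * dominant warp's trace carries the true timing, the rest act as noise. */
static void expand_traces(const unsigned char big[1024][16], double t,
                          WarpTrace out[32]) {
    for (int w = 0; w < 32; w++) {
        memcpy(out[w].cipher, big[w * 32], 32 * 16);
        out[w].time = t;
    }
}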


Figure 2.14: Correlation attack result using filtered timing information for 1024-block messages.

Since we treat the other 31 traces as noise during the calculation, we expect the success rate to approach 100% at 15 million traces, as shown in Figure 2.15.



Figure 2.15: Predicted success rate for the 0th key byte using 1024-block data vs. 32-block data.

Although increasing the number of blocks in each message weakens the signal for each key byte, with a larger number of traces we can still recover all 16 key bytes.

2.4.5 Discussion

Moving from clean measurements to noisy measurements introduces a large amount of noise into our timing data, and the correlation values suffer accordingly. In many real-world situations, attackers cannot even obtain a timestamp on the server; they must time-stamp their own packets as the packets are sent and received over the network. Such timing information can be much less accurate than our noisy measurements. In the network setting, we observe a timing variance of 1.233e11 CPU cycles squared, compared to 1.464e8 for the noisy measurements and 3.20e6 for the clean measurements. As discussed in prior work [48], network noise can be filtered to make a remote timing attack possible.


2.5 Countermeasures

A large number of defense techniques have been proposed to prevent timing attacks on CPU platforms [19, 20, 29, 49, 50, 51, 52, 53]. Given the lack of study of side-channel vulnerabilities on GPU devices, there has been no prior work on GPU countermeasures. In this section, we discuss several potential mitigation methods. Our attack exploits knowledge of the deterministic behavior of load instructions on the SIMT architecture. One method to prevent the attack is to eliminate table lookup operations from the AES implementation, as suggested by Osvik et al. [5]. We could also map the lookup tables into the GPU register file, since the register file is large enough to hold a 256-byte Sbox table. Our attack is possible because attackers can map a table lookup index to a cache line, so the attack becomes infeasible if we randomize the mapping between table lookup indices and cache lines. A similar idea is presented in prior work [50], in which memory-to-cache mapping randomization is used on a CPU platform. With this technique, given 32 table lookup indices, attackers cannot map them to cache lines, and thus cannot calculate the number of unique cache line requests. One possible implementation, sketched below, is to randomize the entries of the sensitive data (T4) in memory and create a new index lookup table that maps an access index to the randomized index in memory. Without knowing the mapping in the index lookup table, attackers would not be able to map an index to a cache line.
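A minimal CUDA sketch of this idea follows; the names (perm, T4_rand, lookup_T4) are ours, and the code illustrates the scheme rather than an evaluated implementation.

/* T4 is stored in a permuted order, and a secret index table maps a
 * logical lookup index to its randomized location. Without knowing
 * perm, an attacker cannot map a lookup index to a cache line. */
__device__ unsigned char perm[256];     /* secret permutation, per run  */
__device__ unsigned int  T4_rand[256];  /* T4 entries in permuted order */

__device__ unsigned int lookup_T4(unsigned char i) {
    return T4_rand[perm[i]];
}

The permutation would need to be refreshed periodically so that an attacker cannot learn the mapping over time.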

2.6 Summary

On modern GPU architectures, the execution time of a kernel is linearly proportional to the number of unique cache line requests generated during the kernel execution. This property can be exploited to extract secret information such as encryption keys. In this chapter, we exploited this property on an Nvidia GPU platform and successfully recovered all 16 key bytes of AES-128. Although we performed the attacks on an Nvidia GPU platform, they can be carried out on other GPU platforms, given that SIMT execution and coalescing units commonly exist on GPUs.

Chapter 3

Information Leakage in Shared Memory Banks

3.1 Introduction

In Chapter 2, we identified information leakage in the memory coalescing unit of a GPU and derived a memory-based side-channel attack, using AES encryption as the target. In this chapter, we introduce a new class of timing side channels based on Shared Memory bank conflicts on the GPU, and we develop a differential timing attack that exploits this timing side channel. To demonstrate the attack, we again use an AES implementation (with T-tables stored in the Shared Memory unit) on a GPU and successfully recover all of the key bytes. The GPU on-chip Shared Memory is an important hardware unit for alleviating heavy traffic to the off-chip device memory. It is designed to store data that are shared and frequently accessed by many running threads. To support SIMT execution and deliver high memory throughput in modern GPUs, the Shared Memory is divided into multiple memory banks (versus a monolithic bank), allowing multiple concurrent paths into the Shared Memory. With memory requests for different memory banks serviced in parallel, the Shared Memory bandwidth is significantly increased. However, when multiple memory requests compete for the same bank, they must be serviced serially, as each memory bank provides a single access port. We refer to such a case of multiple accesses competing for a single Shared Memory bank port as a bank conflict. Additional requests that try to access data in the same memory bank are queued and delayed.


This scenario results in a detectable delay compared to multiple memory accesses that resolve to different banks with no bank conflicts. Beyond GPUs, modern high-performance CPUs (e.g., Intel's Sandy Bridge and ARM's Cortex-A) are also designed with multi-banked L1 and L2 caches. Yarom et al. [34] and Jiang et al. [54] investigate how sensitive information can be leaked when a cryptographic application runs on a CPU with multi-banked caches. A GPU generates a much more complex access pattern to the Shared Memory banks; we identify the memory bank conflict-based timing channel and exploit it for a successful timing attack. The contributions in this chapter include:

1. We identify a new memory resource that can leak the memory access pattern of an application.

2. We propose a differential timing attack methodology and successfully recover all AES key bytes.

3. We quantify the effectiveness of our attack methodology using the success rate as a metric.

4. We extend our timing analysis onto other Nvidia GPU architectures: Maxwell, Pascal, Turing, and Volta. We explore how non-blocking execution can hide timing leakage in the Shared Memory and be used to prevent our attack.

5. We propose a multi-key protection mechanism and evaluate its effectiveness in mitigating side-channel leakage and performance overhead.

This chapter is organized as follows: in Section 3.2, we provide background on the Advanced Encryption Standard (AES) algorithm, as well as the GPU memory hierarchy and execution model. In Section 3.3, we discuss our threat model. In Section 3.4, we explore the timing variation due to Shared Memory bank conflicts, i.e., we characterize the memory bank timing channel. In Section 3.5, we describe our differential timing attack targeting table-based cryptographic algorithms and attack an AES encryption running on an Nvidia Kepler GPU. In Section 3.5.5, we apply our attack in more realistic settings. In Section 3.6, we extend our timing analysis to other GPU architectures and explore how their non-blocking execution mode can hide timing leakage in the Shared Memory. In Section 3.7, we discuss feasible countermeasures to prevent the attack, focusing on a multi-key implementation of AES encryption. Finally, we summarize the chapter in Section 3.8.


3.2 Background

We begin by describing the AES implementation evaluated on the targeted GPU platform, as well as the memory hierarchy and execution model of Nvidia Kepler GPUs, a widely used and energy-efficient GPU microarchitecture [11].

3.2.1 AES Encryption

In this chapter, we evaluate the timing leakage vulnerability of a table-based cryptographic algorithm on a GPU. We use the same example as in Chapter 2, 128-bit ECB-mode AES encryption, for the attack demonstration. The proposed attack strategy also applies to other table-based cryptographic algorithms such as Blowfish [55]. The performance of AES is critical in the era of big data, where confidentiality is needed for storing and transmitting large amounts of data, so high data throughput is desired. Performing AES encryption on a GPU can deliver an order of magnitude higher throughput than on CPUs [56], since AES encryption is easily parallelized and GPUs can exploit high degrees of execution parallelism. To demonstrate the generality of the attack, we port the AES implementation of a standard and widely used library, OpenSSL 0.9.7, into CUDA code. Note that the ported implementation of AES is similar to the ones evaluated for performance in many other studies [56, 57]. We discuss alternative implementations of AES that are immune to our attack but incur performance degradation in Section 3.7. To port the implementation, we need to decide where to store the T-tables in the GPU memory hierarchy and how to assign encryption jobs to GPU threads. In Chapter 2 and our prior work [58], we stored the T-tables in the Global Memory unit, but that implementation is vulnerable to coalescing attacks. Since the T-tables are constant data shared by all threads, they are a good candidate for the Shared Memory unit. Multiple studies on GPU implementations of AES have demonstrated the advantages of storing T-tables in Shared Memory [59, 60, 56, 61, 57, 62, 63], and our work adopts this implementation. To assign encryption jobs to GPU threads, we transform the AES encryption procedure into a single GPU kernel, where each GPU thread independently encrypts one 16-byte block. A sketch of this organization is shown below.
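The following is a minimal sketch of this organization, not the exact evaluated kernel; the flat table array sT, the device function aes_encrypt_block, and the argument layout are our illustrative assumptions.

/* Each thread encrypts one 16-byte block; the T-tables are first staged
 * cooperatively into Shared Memory by the thread block. */
__global__ void aes_kernel(const unsigned int *in, unsigned int *out,
                           const unsigned int *round_keys,
                           const unsigned int *gT, int nblocks) {
    __shared__ unsigned int sT[5 * 256];      /* T0..T3 and T4 */
    for (int i = threadIdx.x; i < 5 * 256; i += blockDim.x)
        sT[i] = gT[i];                        /* cooperative load */
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nblocks)                        /* one block per thread */
        aes_encrypt_block(&in[4 * tid], &out[4 * tid], round_keys, sT);
}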

The AES algorithm is composed of nine rounds of SubByte, ShiftRows, MixColumn, and AddRoundKey operations, followed by a last round with only three operations (omitting MixColumn). For faster processing, the first three operations are integrated into T-table lookups in the first nine rounds. In the last round, a special T-table (T4) is referenced, followed by byte masking. Each encryption round requires one 16-byte round key. The ten round keys are generated by the key scheduler from one 16-byte user-specified master key; knowing any round key, an attacker can compute the original 16-byte master key. Our attack strategy targets the last-round key. A code snippet of the last round operations generating the first four bytes of ciphertext is shown in Listing 3.1.

Listing 3.1: AES Last Round Code Snippet

O0 = (T4[(In0 >> 24) & 0xff] & 0xff000000) ^
     (T4[(In1 >> 16) & 0xff] & 0x00ff0000) ^
     (T4[(In2 >>  8) & 0xff] & 0x0000ff00) ^
     (T4[(In3      ) & 0xff] & 0x000000ff) ^ k0;

Variable O0 holds the first four bytes of the 16-byte ciphertext. Each variable In0 to In3 contains four bytes of the input state for the last round. A selected byte of each variable indexes into the T-table to obtain a four-byte output, of which only one byte contributes to the final ciphertext. k0 is the first four bytes of the last round key. From the original algorithm, the last round can be simplified into byte-wise operations, as shown below:

c_j = SBox[s_i] ⊕ rk_j    (3.1)

where the input byte position i for the SBox operation differs from the output ciphertext byte position j due to the ShiftRows operation. Each byte of the last round input state, s_i, can be calculated once the corresponding cipher and key bytes are known:

s_i = SBox^{-1}[c_j ⊕ rk_j]    (3.2)
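In C, and assuming the standard AES S-box and inverse S-box tables are available as sbox and inv_sbox, the two relations read as follows; the function names are ours.

#include <stdint.h>

extern const uint8_t sbox[256], inv_sbox[256]; /* standard AES tables */

/* Equation (3.1): last round output byte from state byte and key byte. */
uint8_t last_round_byte(uint8_t s_i, uint8_t rk_j) {
    return sbox[s_i] ^ rk_j;
}

/* Equation (3.2): recover the state byte from a known ciphertext byte. */
uint8_t recover_state_byte(uint8_t c_j, uint8_t rk_j) {
    return inv_sbox[c_j ^ rk_j];
}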

3.2.2 Nvidia GPU Memory Hierarchy

In this chapter, we describe our attack on an Nvidia Kepler K40 GPU in detail, though we extend our analysis to other Nvidia GPUs with different architectures, demonstrating the broad applicability of our approach. The GPU devices used in this chapter are listed in Table 3.1. These Nvidia GPUs have similar memory hierarchies, except that some of them have a dedicated Shared Memory unit, whereas the Nvidia Kepler GPU does not. We describe the major differences in these memory architectures in Section 3.6 and discuss how the differences impact the effectiveness of our attack.


Architecture    Kepler        Maxwell     Pascal     Turing         Volta
Device Model    Tesla K40c    GTX 950m    TITAN X    GTX 1660 Ti    Tesla V100 PCIE

Table 3.1: List of tested Nvidia GPUs

In this section, we focus on the memory hierarchy, using the Nvidia Kepler GPU as an example, as shown in Figure 3.1.

Figure 3.1: Nvidia Kepler GPU Memory Hierarchy

On the Nvidia Kepler GPU, there is an off-chip DRAM memory (device memory) that is partitioned into global, texture, and constant memory regions. Data in those memories are shared among all threads running on all 15 Streaming Multiprocessors (SMXs). Each SMX (with 192 single-precision floating-point cores) also has L1, texture, and constant caches; data in those caches are private to the threads running on that SMX. In addition, each SMX has a Shared Memory, and only the block of threads that allocated specific data in the Shared Memory can access that data. Each GPU thread also owns an exclusive set of 255 registers to store the current thread state. On the Nvidia Kepler GPU, the Shared Memory and the L1 cache reside in the same physical memory storage, with a total size of 64 KB. The individual sizes of the Shared Memory and L1 cache are configurable; in our case, we allocate 48 KB as the Shared Memory and 16 KB as the L1 cache. Note that the size configuration affects neither the attack nor the results presented in this chapter. The Shared Memory is divided into 32 banks and has a configurable bank line size (annotated as bank size in the Nvidia documentation [11]): four bytes or eight bytes. Since using a bank size of eight bytes leads to fewer bank conflicts and improves an application's performance,

we set the bank size to eight bytes for faster AES encryption. The memory address breakdown for the Shared Memory is shown in Figure 3.2: the three least-significant bits form the bank offset, and the next five bits (bits 3-7) form the bank index, which selects the bank from which a line is retrieved for kernel computation.
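Both configuration choices can be requested from the host with standard CUDA runtime calls, as in the sketch below (error checking omitted); this reflects the configuration described above rather than our exact code.

#include <cuda_runtime.h>

/* Request the 48 KB Shared Memory / 16 KB L1 split and eight-byte
 * Shared Memory bank lines on the Kepler GPU. */
void configure_shared_memory(void) {
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}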

Figure 3.2: Memory address to Shared Memory bank mapping

When multiple memory requests address different Shared Memory banks (i.e., bits 3-7 are different), they can be serviced in a single GPU cycle, providing much higher memory bandwidth than that of a monolithic cache bank design. However, a bank conflict occurs whenever multiple memory requests access the same bank (with the same bank index, but different tag values). Thus, there will be a noticeable timing difference between memory requests with and without bank conflicts.

3.2.3 Single Instruction Multiple Threads Execution Model

With the SIMT execution model, one GPU instruction is executed by at least a warp of 32 threads, and each thread has its own set of registers. All threads within a warp must be synchronized at an instruction boundary: no thread in a warp can execute the next instruction until all threads complete the current one. For memory instructions, each thread generates a memory request, so a warp of threads generates 32 memory requests for one memory instruction. Under the SIMT model, the execution time of this memory instruction is determined by the Shared Memory bank that receives the highest number of bank conflicts (i.e., the largest number of requests resolving to the same bank). In other words, the execution time of a GPU memory access instruction is highly dependent on the memory addresses issued and on whether those addresses result in bank conflicts with other accesses. An attacker can exploit this dependency to recover the secret key of cryptographic operations running on the GPU.


3.3 Threat Model

Our threat model assumes co-residence of the adversary and the victim on one physical machine, and we use this threat model to evaluate our attack. However, we do not anticipate any issues with the attack working in a cloud environment. The threat model assumes that the adversary is a regular user without root-level privileges and that the underlying operating system is not compromised. The adversary can measure the execution time of a GPU encryption kernel directly or indirectly. For a direct measurement, the victim may expose the timestamps when a GPU kernel is launched and when it ends. For an indirect measurement, the adversary can use non-privileged APIs to query the status of the GPU and infer the start and stop timestamps of the GPU kernel; a similar technique is described by Naghibijouybari et al. [64]. For the purposes of our evaluation, we assume the victim exposes the timestamps when a GPU kernel is launched and when it finishes, providing direct measurements. The threat model also assumes that the adversary can observe the ciphertexts.

3.4 Bank Conflicts-Based Side-Channel Timing Channel

In this section, we conduct experiments to examine the impact of bank conflicts on GPU program execution time, i.e., to characterize the timing side channel. We develop a kernel that uses a warp of threads to issue loads to the Shared Memory. Depending on the address of the data that each thread accesses, some number of bank conflicts will occur, resulting in different execution times for the load operations. We perform the timing analysis on an Nvidia Kepler K40 GPU; all of the micro-benchmarks presented here are designed specifically for the microarchitecture of the Kepler memory system. Later, we apply the same timing analysis to other architectures (Maxwell, Pascal, Volta, and Turing), which feature a range of memory hierarchies that differ from the Kepler architecture. We develop a memory access pattern for a warp of threads that generates a specific number of bank conflicts, produced by selecting the address that each thread accesses. Using a high-resolution, cycle-accurate time-stamping mechanism, we study the impact of bank conflicts on the kernel execution time. We developed Microbenchmark 1, shown in Listing 3.2.

Listing 3.2: Microbenchmark 1

1 register uint32_t tmp, tmp2, offset = 64;
2 __shared__ uint32_t share_data[1024 * 4];


3 ...
4 int tid = blockDim.x * blockIdx.x + threadIdx.x;
5 tmp = clock();
6 tmp2 = share_data[tid * stride + 0 * offset];
7 tmp2 += share_data[tid * stride + 1 * offset];
8 ...
9 tmp2 += share_data[tid * stride + 39 * offset];
10 times[tid] = clock() - tmp;
11 in[tid] = tmp2;

The purpose of this microbenchmark is to run 32 concurrent threads in a warp, with each thread generating a sequence of memory accesses, and to measure the execution time of the warp. In Listing 3.2, the variable share_data points to a contiguous 16 KB region of Shared Memory, where each element of the array is one word (4 bytes). In Line 4, the thread ID is obtained. In Lines 6-9, each thread accesses 40 memory locations in sequence, with an offset of 64 words between two adjacent memory addresses (offset). Note that the memory address distance between two threads is the stride, which can be tuned to produce different numbers of bank conflicts among a warp of threads. Inspecting Listing 3.2 and Figure 3.2, two adjacent memory addresses within a thread have the same bank index and bank offset (64 words = 2^8 bytes), so all memory addresses requested by a single thread access the same bank. Each thread accesses a single memory region, and the distance between the memory regions accessed by different threads is one or more strides. By selecting the value of the stride, we can create bank conflicts among the threads in a warp. We run this kernel 320,000 times and collect 320,000 timing samples (10,000 timing samples for each stride value, ranging from 1 to 32). Based on our experiments, 10,000 timing samples are enough to produce a timing distribution, although these distributions can shift depending on the system load during the experiments. The timing distribution for these samples is shown in Figure 3.3. We observe only five distinct timing distributions for the 32 stride values. Clearly, some stride values have the same timing behavior, and we suspect that those stride values result in the same number of bank conflicts. We next calculate the number of bank conflicts for each stride value. Recall that the memory address breakdown of our testing platform is shown in Figure 3.2. Given a word index into the shared data array, we calculate the bank index by dropping the least-significant bit and then performing a modulo-32 operation, as described by the formula below:


idx_B = mod(idx_M >> 1, 32)    (3.3)

where idx_B is the bank index and idx_M is the array index. The right-shift operator, >>, drops the least-significant bit, and mod is the modulo operation. As an example, assume a stride value of 16. For a memory access instruction issued across a warp of 32 threads, we generate the following 32 memory indices: {0, 16, 32, 48, ..., 480, 496}. Using Equation (3.3), we obtain the following bank access indices for the warp: {0, 8, 16, 24, 0, 8, 16, 24, ..., 0, 8, 16, 24}. These requests target four banks, and each bank receives eight concurrent requests, i.e., eight bank conflicts are produced when the stride is 16. Similarly, we calculate the number of bank conflicts for each stride value in the range of 1 to 32 words; the strides fall into five groups, as shown in Figure 3.3, where each group corresponds to a different number of bank conflicts and associated average execution time. We also plot the average execution time of each group (for selected stride values) versus the number of Shared Memory bank conflicts in Figure 3.3, and we can easily identify a linear relationship. The slope of the line is 392 GPU cycles, with an offset of 1002 GPU cycles. Since we perform 40 sequential Shared Memory loads, this implies an average penalty of 9.8 GPU cycles per bank conflict, which is also the strength of the timing channel signal in the Shared Memory banks. Although the penalty for a GPU Shared Memory bank conflict is not as large as the CPU cache miss penalty (another well-studied cache side channel, with the difference between a hit and a miss around 100 cycles [6, 33, 9]), it can still be a source of information leakage, and countermeasures resistant to the original cache timing attacks may not work for the bank conflict timing channel. Next, we demonstrate the feasibility of exploiting this fine-grained timing channel for key retrieval through statistical methods. (A small helper that performs this bank-conflict calculation is sketched below.)
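The bank mapping of Equation (3.3) and the resulting conflict degree can be captured in a small host-side helper; the function name is ours.

/* Map the 32 word indices issued by a warp to banks via Equation (3.3)
 * and return the worst-case number of requests to any single bank
 * (1 means conflict-free; the stride-16 example above yields 8). */
static int warp_bank_conflicts(const unsigned idxM[32]) {
    int requests[32] = {0};
    int worst = 0;
    for (int t = 0; t < 32; t++) {
        unsigned idxB = (idxM[t] >> 1) % 32;   /* Equation (3.3) */
        if (++requests[idxB] > worst)
            worst = requests[idxB];
    }
    return worst;
}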

3.5 Differential Timing Attack

In this section, we devise a differential timing attack to exploit the timing channel in the Shared Memory banks. We start by attacking an AES algorithm, because its table lookup operations are key-dependent memory accesses. In our AES implementation, the lookup table is word-aligned, similar to the shared data array used in Microbenchmark 1. Therefore, we expect the execution time of one table lookup operation of a warp of threads to be linearly dependent on the number of bank conflicts generated by the threads.


Figure 3.3: The number of bank conflicts vs. the associated timing for 32 stride values.

The execution time of one entire encryption is likewise dependent on the number of bank conflicts created by the table lookup operations. Since the index of a table lookup operation is related to the round key, with the correct key guess we can predict the number of bank conflicts that occur during one round of AES encryption across a warp using Equation (3.3). Over many different blocks of plaintext, the correlation between the average encryption timing and the number of Shared Memory bank conflicts should be high for the correct key guess and tend to be lower for incorrect key guesses. This is the basic principle of a differential timing attack, similar to the traditional differential power attack (DPA) [1]. Next, we present the details of our attack methodology on AES. We examine the mapping between the AES lookup tables and the Shared Memory banks, collect data, and recover the last round AES encryption key.


3.5.1 Mapping Between the AES Lookup Tables and GPU Shared Memory Banks

As described in Section 3.2, since we attack the last round of the AES encryption, we only need to examine the mapping of the T4 lookup table in the Shared Memory. Note that attacking more rounds (more than three) becomes infeasible due to the algorithm-inherent statistical confusion and diffusion properties. There are 256 4-byte elements in the T4 lookup table. Equation (3.3) is used to calculate the Shared Memory bank index from a T-table lookup index, where idx_B is the bank index and idx_M is the T-table lookup index. Because the bank size is 8 bytes on our Nvidia Kepler K40 GPU, we apply the right-shift operator, >>, to drop the least-significant bit.

3.5.2 Collecting Data

The data collection procedure is similar to the experiments performed in Section 3.4, except that instead of issuing 40 memory load instructions, each thread performs an actual AES encryption on a random input data block. We record both the encryption time and the ciphertexts for a warp of 32 threads. Each data sample is composed of 32 16-byte ciphertexts and a timing value, in the following format:

[{C_0, C_1, ..., C_31}, t]

where each C_i is a 16-byte ciphertext produced by thread i, consisting of 16 bytes {c_0^i, c_1^i, ..., c_15^i}, and t is the total encryption time for the warp.

We consider the encryption time measured from the GPU side and from the CPU side, respectively. The encryption time measured on the GPU side contains far fewer noise sources than that measured on the CPU side, due to the non-deterministic data transfer time between the GPU and CPU, as well as other initialization procedures required to run a kernel on the GPU device. However, the CPU runs at a much higher frequency than the GPU, so the CPU-side encryption time offers finer-grained measurements. Moreover, measurements on the CPU side represent a more realistic scenario, as the adversary is a passive observer and normally does not have access to the GPU timer.

3.5.3 Calculating the Shared Memory Bank Index

For the last round of AES, with the output ciphertext known, the input state byte can be calculated using Equation (3.2). For a warp of 32 threads, 32 such table lookups run concurrently, and therefore we have:


{s_i^0, s_i^1, ..., s_i^31} = SBox^{-1}[{c_j^0, c_j^1, ..., c_j^31} ⊕ rk_j]    (3.4)

where c_j^0 is the cipher byte produced by thread 0, s_i^0 is the lookup table index for thread 0, and rk_j is the j-th last round key byte, which is common to all the threads. These lookup table indices, {s_i^0, s_i^1, ..., s_i^31}, are exactly the Shared Memory indices. We use Equation (3.3) to further calculate the bank indices used by all threads in the warp for this table lookup instruction, and then derive the number of bank conflicts.

Figure 3.4: Calculation from ciphertexts to number of bank conflicts

Consider the example shown in Figure 3.4: assume we have encrypted 32 16-byte plaintexts using 32 threads (i.e., a warp), obtained 32 16-byte ciphertexts, and are targeting the 0th last round key byte. First (A), we select the 0th cipher byte from each of the 32 ciphertexts. Second (B), we convert the selected cipher bytes into last round states using a guessed key byte value. Lastly (C), we calculate the accessed Shared Memory bank indices and the number of bank conflicts that occurred. Note that if the key guess is incorrect, the calculated number of bank conflicts will be incorrect and will not correlate with the observed timing. This calculation is sketched in code below.
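A hedged C sketch of steps A through C for one key-byte guess follows; inv_sbox is the standard AES inverse S-box (table omitted for brevity), and the remaining names are ours.

#include <stdint.h>

extern const uint8_t inv_sbox[256];   /* standard AES inverse S-box */

/* cipher_bytes holds the j-th ciphertext byte from each of the 32
 * threads (step A). Returns the worst-case number of requests to a
 * single bank under the given key guess. */
static int conflicts_for_guess(const uint8_t cipher_bytes[32],
                               uint8_t key_guess) {
    int requests[32] = {0}, worst = 0;
    for (int t = 0; t < 32; t++) {
        uint8_t s = inv_sbox[cipher_bytes[t] ^ key_guess]; /* step B, Eq. (3.2) */
        int bank = (s >> 1) % 32;                          /* step C, Eq. (3.3) */
        if (++requests[bank] > worst)
            worst = requests[bank];
    }
    return worst;
}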

3.5.4 Recovering Key Bytes

Using the collected data, we can launch a correlation timing attack. As shown in Listing 3.1, each T-table lookup in the last round of AES uses one byte of the 16-byte state, so each round key byte can be attacked independently. For each data sample we collected, we calculate the number of bank conflicts for the table lookup instruction that uses the j-th last round key byte, as in the example of Figure 3.4. For each guessed key byte value (ranging from 0 to 255), we can then calculate the correlation between the average timing and the number of bank conflicts, and use the correlation value to differentiate the correct key byte from the incorrect key guesses.

For the data collected, the number of bank conflicts among the 32 threads falls in the range [2, 4]. The power of a correlation timing attack lies in the linearity of the timing model: the total execution time should consist of a deterministic component, linearly dependent on the number of bank conflicts, plus an independent Gaussian random variable contributed by the other nine rounds. During an actual AES execution, the timing distribution does not conform to this ideal model, so a correlation timing attack may not be more effective than a differential timing attack, which considers only two values of the number of bank conflicts. We therefore adopt a differential timing attack approach and calculate two average timing values: one for the group of data samples that generate two bank conflicts, and one for the group that generates four bank conflicts. The Difference-of-Means (DoM) between these two groups should be about twice the bank conflict penalty, i.e., around 19 cycles. Thus, for each sample we collected, we first calculate the number of bank conflicts as shown in Figure 3.4; second, we classify its timing into one of the two groups based on the number of bank conflicts; finally, we compute the DoM between the two groups. If the correct key value is used, we should see a DoM of around 19 cycles; otherwise, the DoM should be close to zero. A sketch of this differential step is shown below.
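In code, the differential step might look like the following sketch, reusing conflicts_for_guess from above; the sample layout is illustrative.

/* Difference-of-Means between the four-conflict and two-conflict groups:
 * roughly 19 GPU cycles for the correct key guess, near zero otherwise. */
typedef struct {
    uint8_t cipher_bytes[32];  /* j-th ciphertext byte of each thread */
    double  time;              /* measured warp encryption time       */
} Sample;

static double difference_of_means(const Sample *smp, int n, uint8_t guess) {
    double sum2 = 0, sum4 = 0;
    int n2 = 0, n4 = 0;
    for (int i = 0; i < n; i++) {
        int c = conflicts_for_guess(smp[i].cipher_bytes, guess);
        if (c == 2)      { sum2 += smp[i].time; n2++; }
        else if (c == 4) { sum4 += smp[i].time; n4++; }
    }
    return (n4 ? sum4 / n4 : 0.0) - (n2 ? sum2 / n2 : 0.0);
}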

Figure 3.5: 15th key byte recovery using GPU timing information.


We first apply the attack method to the 15th key byte; the result is shown in Figure 3.5. The upper plot uses one hundred thousand samples, and the lower plot uses 1 million samples. The correct key byte value (198) is highlighted in red. In the upper plot, the DoM for the true key value is 15.99 GPU cycles, about 4 GPU cycles less than the predicted signal of 19.6 GPU cycles, while the DoMs for the other values are much smaller, between -2.3 and 2.3 GPU cycles. Increasing the sample size to 1 million leaves the DoM for the true value about the same, while the DoMs for wrong values shrink to a range between -0.8 and 0.8 GPU cycles. We apply this attack methodology to recover the other key bytes; the result is shown in Figure 3.6. All 16 true key byte values clearly stand out in the plots, which means we have successfully recovered all key bytes. Although the same attack runs on all key bytes, some key bytes, such as k0 and k1, show much smaller peak timing differences than others. We observe that key bytes used closer to the end of the encryption tend to have larger and more distinct timing differences, e.g., k15. We speculate that the reduced signal for k0 is caused by instruction-level parallelism. Although the architectural details of the Nvidia Kepler GPU are not public, we suspect that it can continue issuing independent instructions each cycle until all resources are consumed, thereby hiding the latency of Shared Memory bank conflicts behind the issue and execution of other independent instructions. However, if the GPU stalls due to Shared Memory bank conflicts, the penalty is exposed in the total execution time, and we observe a stronger signal (e.g., key byte 15). To verify this speculation, we slightly modify Microbenchmark 1, as shown in Listing 3.3, removing the accumulation instruction after each load instruction so that all the load instructions are independent of each other and can be scheduled in a non-blocking fashion.

Listing 3.3: Microbenchmark 2

...
tmp2 = share_data[tid * stride + 0 * offset];
tmp3 = share_data[tid * stride + 1 * offset];
tmp4 = share_data[tid * stride + 2 * offset];
...
tmp41 = share_data[tid * stride + 39 * offset];
times[tid] = clock() - tmp;
in[tid] = tmp2 + tmp3 + tmp4 + ... + tmp41;



Figure 3.6: Recovery of all 16 key bytes using 10 million samples.

We run Microbenchmark 2 10,000 times for each stride value in the range [1, 32] and collect the timing information. Calculating the average timing for each stride value against the number of bank conflicts, we obtain a linear relationship similar to that in Figure 3.3. However, the slope becomes 177 GPU cycles per 40 load instructions, so the per-conflict penalty drops from 9.8 GPU cycles for Microbenchmark 1 to 4.4 GPU cycles for Microbenchmark 2. In the modified kernel, each of the 40 load instructions loads data into a different register, and the additions are performed at the very end. The execution of Microbenchmark 2 is non-blocking, while Microbenchmark 1 executes in a blocking mode due to tight data dependencies. We show the SASS code (the assembly code for the GPU kernel) for the two microbenchmarks in Listing 3.4 and Listing 3.5, respectively. In Microbenchmark 1, there is a strong data dependence between instructions: the loaded data (Line 1 of Listing 3.4) is used by a later operation (Line 4) that is three instructions away from the load operation. This read-after-write (RAW) data dependency between Line 4 and Line 1 introduces blocking, and Instruction 4 cannot proceed until Instruction 1 is completed, exposing any delay that Instruction 1 experiences in the total execution time.

This is seen in Figure 3.7(a), where the total execution time is the execution time of Instructions 1 and 4. In Microbenchmark 2, we generate a sequence of independent loads, as shown in Listing 3.5, and the instructions can execute in a non-blocking fashion. Therefore, the delay caused by Instruction 1 can be hidden by the execution of later instructions, as seen in Figure 3.7(b), where the total execution time no longer depends on Instruction 1.

Listing 3.4: Original SASS Code for Microbenchmark 1

1 LDS R4, [R3+0x1b00];
2 IADD R7, R8, R5;
3 LDS R6, [R3+0x1c00];
4 IADD R7, R7, R4;

Listing 3.5: Modified SASS Code for Microbenchmark 2

1 LDS R15, [R38+0x1b00];
2 LDS R14, [R38+0x1c00];
3 LDS R13, [R38+0x1d00];

Figure 3.7: Blocking Mode vs. Non-Blocking Mode

In our AES implementation, each of the 16 lookup operations in the last round is independent of the others. Since k0 is used in the first lookup operation, its delay due to bank conflicts is obscured by the execution of the other independent lookup operations, and we see a weaker signal for key byte k0. By contrast, k15 is the last one processed, and its bank-conflict delay is fully exposed in the execution time, giving k15 the strongest signal. The results in Figure 3.6 show that the signals (the average penalty for one bank conflict) for the key bytes range from 0.8 to 8.9 GPU cycles, depending on how much of the conflict penalty is hidden by the non-blocking execution mode.


3.5.5 More Realistic Attack Scenarios

So far, we have demonstrated a successful attack using timing information taken from the GPU side, which contains much less noise than timing measured from the CPU side, and we have evaluated our attack when AES encryption runs with only 32 parallel threads. It is important to understand the scalability of the attack, i.e., how the number of traces needed for a successful attack grows as the number of threads increases. In this section, we evaluate the effectiveness of our attack using CPU timing information, and we evaluate the impact of measurement noise when the encryption runs with a larger number of threads. These changes more accurately reflect a typical execution environment for a timing side-channel attack on a GPU.

3.5.5.1 Using CPU Timing Information

To collect timing data on the CPU side, we record the kernel execution time using the x86 rdtscp instruction, as sketched below. Since the CPU runs at a higher frequency than the GPU, CPU timings have higher resolution (4.8 times higher), but the CPU timing information is also much noisier than the GPU timing information. Using the CPU timestamping mechanism, we record the time from when a GPU kernel is launched until it exits; this measurement inevitably includes extra timing noise from the kernel launch overhead, including memory transfers between the CPU and the GPU. With a GPU timestamping mechanism, by contrast, we can record timestamps after the kernel launch completes, capturing only when the AES encryption begins and ends inside the GPU kernel and avoiding the extra launch noise. We perform the same differential timing attack against our 32-thread implementation and can still recover all key bytes using 1 million samples, but with a weaker signal, as shown in Figure 3.8: the difference of means between the two bins for the correct key guess is around 94 CPU cycles (20 GPU cycles). We also quantify the effectiveness of our attack by introducing a metric, the success rate, which captures the probability that an attack succeeds given a number of traces. With the success rate metric, we can compare the effectiveness of the attack across different platforms and different side-channel attacks. Other metrics exist [65, 66, 67], but they quantify the information leakage of the side channel rather than the effectiveness of the attack. We performed a number of attacks to obtain the empirical success rates of recovering the 15th key byte using the CPU timing information and using the GPU timing information; the results are presented in Figure 3.9. The GPU timing data yields a stronger side-channel timing signal, so only 120,000 samples are needed to achieve a 100% success rate, while 900,000 samples are needed with the CPU timing.
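A minimal sketch of this CPU-side measurement follows, assuming an x86 host with the __rdtscp intrinsic; the kernel name, launch configuration, and arguments are placeholders.

#include <x86intrin.h>
#include <cuda_runtime.h>

/* Bracket the launch-to-exit window of the encryption kernel with the
 * CPU time-stamp counter. */
unsigned long long time_kernel(void) {
    unsigned aux;
    unsigned long long t0 = __rdtscp(&aux);
    aes_kernel<<<blocks, threads>>>(/* ... */);
    cudaDeviceSynchronize();   /* wait until the kernel exits */
    unsigned long long t1 = __rdtscp(&aux);
    return t1 - t0;            /* CPU cycles, including launch overhead */
}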



Figure 3.8: 15th key byte recovery using the CPU timing information.

3.5.5.2 Increasing the Number of Threads

Next, we evaluate the scalability of our attack using an 8192-thread AES implementation with 16 blocks of 512 threads (each block contains 16 warps), which is several times more than the maximum number (2880) of threads that can run in parallel on the Nvidia Kepler K40 GPU. In realistic scenarios, we want to keep the entire device busy with encryptions/decryptions, producing much higher throughput than using only 32 threads. Note that in such a real attack scenario, the attacker cannot manipulate the kernel code running on the GPU (they cannot easily insert timestamps before and after the encryption); they must rely on timing information measured on the CPU side. When a large number of threads are running, the measured execution time of the kernel is dominated by the slowest SMX (and the corresponding blocks running on it).



Figure 3.9: Success rate for the 15th key byte: CPU timing vs. GPU timing.

However, the details of the GPU scheduler are not public, so we do not know exactly how blocks and warps are distributed and scheduled across the GPU SMXs. We choose one warp to attack and use the entire kernel execution time, which represents a real black-box, passive attack scenario. We anticipate much higher noise under this attack model: the selected warp may not run on the slowest SMX, and even if it does, other warps compete for the same SMX resources, so attributing all variation in the kernel execution time to this warp is not accurate. This non-deterministic process adds a large amount of noise to our data samples, and other activities, such as saving a thread's register state, can also contribute to the timing noise. Using the same attack methodology described earlier, we collect 1 million samples of the 8192-thread AES encryption and apply our differential timing attack to recover the 15th key byte. The result is shown in Figure 3.10.

The timing difference for the true value of the 15th key byte is 67.8 CPU cycles, which barely stands out among the other, wrong key values. This indicates a high degree of noise in the data samples we collect, resulting in a significantly lower signal-to-noise ratio (SNR).


Figure 3.10: 15th key byte recovery using 1 million samples with full GPU capacity.

Through both scenarios, we demonstrate that our attack can withstand noise: using the success rate metric, we show that increasing the number of samples allows the attack to overcome a higher degree of noise.

3.6 Timing Analysis on Other Architectures

So far in this chapter, we have focused our analysis on the Kepler architecture. We now extend our timing analysis to other architectures: Maxwell, Pascal, Volta, and Turing. The memory hierarchies of Maxwell and Pascal are quite different from that of the Kepler architecture: both have a dedicated Shared Memory unit, whereas on Kepler the Shared Memory and L1 cache share the same physical unit.

On Volta and Turing, however, the Shared Memory and L1 cache share the same physical unit, similar to the Kepler memory hierarchy. All of these newer architectures have the same number of Shared Memory banks as Kepler but fix the bank size to 4 bytes. Based on our initial timing analysis on these four architectures, the Shared Memory bank conflict penalty is only 2 GPU cycles, significantly lower than the 9.8 GPU cycles found on Kepler. In theory, we should still be able to leverage the timing leakage as before. However, to our surprise, we could recover key bytes on both the Volta and Turing architectures but failed to recover any key bytes on either Maxwell or Pascal. We summarize the attack results in Table 3.2. We speculate that the non-blocking execution mode completely hides the latency due to Shared Memory bank conflicts, as discussed in Section 3.5.4, where we observed a significant reduction in the signal-to-noise ratio. Many of the intermediate states in the AES implementation (such as register spills) are stored in Global Memory. On Maxwell and Pascal, the Shared Memory and L1 cache no longer share the same physical memory space, so Global Memory and Shared Memory loads/stores can be processed concurrently, which is not the case on Kepler. Since the latency of an off-chip Global Memory access (load/store) is longer than that of an on-chip Shared Memory access, issuing a Global Memory access in parallel with a Shared Memory access completely hides the bank conflict penalty behind the Global Memory latency, regardless of the number of bank conflicts the Shared Memory access induces, as shown in Figure 3.11. On Volta and Turing, however, the Shared Memory and L1 are unified again, similar to Kepler, and therefore we were able to recover key bytes on both architectures.

Figure 3.11: Non-Blocking Mode in Maxwell and Pascal


To test our reasoning, we design several benchmarks that probe the effect of a Global Memory load (L1-cached) on the timing characteristics of a Shared Memory load. We first develop a kernel that launches a warp of threads, selecting the memory addresses so that the warp induces a pre-determined number of bank conflicts. The memory microarchitecture in all of the newer architectures has 32 Shared Memory banks, each 4 bytes wide, so accessing two memory addresses with the same values in address bits 2-7 causes a Shared Memory bank conflict. We provide the C code in Listing 3.6.

Listing 3.6: C Code for Generating Memory Addresses

for (j = 0; j < num_conflicts + 1; j++)
    input[j] = j * 128;
for (; j < warp_size; j++)
    input[j] = j * 4;

The first loop in Listing 3.6 generates memory addresses for some threads in a warp, resulting in num_conflicts bank conflicts. The second loop generates memory addresses for the remaining threads in the warp, which cause no bank conflicts. To determine the Shared Memory bank conflict penalty, we simply time a Shared Memory load instruction that uses this set of memory addresses, as shown in Listing 3.7.

Listing 3.7: Benchmark for Measuring the Bank Conflict Penalty

t0 = clock();
__threadfence_block();
shared_memory_array[input[tid]];
__threadfence_block();
t1 = clock();

shared_memory_array is a 256-entry, one-byte-wide array allocated in the Shared Memory. tid is the thread ID, and each thread uses its ID to specify an index into input. The __threadfence_block() function is a blocking instruction, which prevents threads from proceeding before all prior instructions have completed. We first run these benchmarks on the Pascal architecture and collect 10,000 timing samples for each number of bank conflicts in the range [0, 7]. We show the timing distribution for each bank conflict value in Figure 3.12(a). We include a linear regression line for the number of bank conflicts versus the mean of its timing distribution. As we can see in Figure 3.12(a), the slope of the regression line is

approximately 2 GPU cycles per conflict. We also calculate the correlation coefficient between the number of bank conflicts and the recorded time. The correlation coefficient is 0.95, which indicates a strong linear relationship between the number of bank conflicts and the execution time. However, if we insert a Global Memory load instruction before the Shared Memory load instruction (as shown in Listing 3.8), the linear relationship disappears (the correlation drops to 0.06), as shown in Figure 3.12(b). Although we have added a Global Memory load instruction, the average time of each distribution increases by only 12 GPU cycles.

Listing 3.8: Benchmark for Measuring the Bank Conflict Penalty with a Global Memory Load Inserted Before a Shared Memory Load

t0 = clock();
__threadfence_block();
global_memory_array[tid];
shared_memory_array[input[tid]];
__threadfence_block();
t1 = clock();

Furthermore, the effect is the same when the Global Memory load instruction is inserted after the Shared Memory load instruction, as shown in Figure 3.12(c). The execution time of the Global Memory load instruction dominates the total execution time, so regardless of the instruction order, as long as the two loads are issued in parallel, the execution time of the Global Memory load can completely hide the latency due to Shared Memory bank conflicts. By inserting a __threadfence_block() function call between the Global Memory and the Shared Memory load instructions, we can again reveal the bank conflict penalty, as seen in Figure 3.12(d). With the fence function call, the total execution time is simply the sum of the execution times of the two load instructions. We run these benchmarks on all five architectures: Kepler, Maxwell, Pascal, Volta, and Turing, and summarize the benchmark results in Table 3.2. The results show that a unified L1 and Shared Memory unit is more vulnerable to our attack, while separate L1 and Shared Memory units can hide the timing leakage due to a Shared Memory bank conflict. Our experimental results suggest that we could potentially exploit the non-blocking execution mode to seal the timing leakage in the Shared Memory while improving the overall performance of an application. We have observed that there are at most seven bank conflicts in our AES implementation. As shown in Figure 3.11, by replacing the Global Memory load instruction with three instructions of 5 or more cycles each, such as add or sub instructions, and issuing them together with the leaky Shared Memory load instruction, we could potentially hide the latency due to its bank conflicts.


Architecture   Unified L1 +     # Key Bytes   Penalty per Conflict   G+S Load   S+G Load   G+F+S Load
               Shared Memory    Recovered     (GPU Cycles)           Corr       Corr       Corr
Kepler         Yes              16            9.8                    0.95       0.95       0.94
Maxwell        No               0             2                      0.00       -0.01      0.98
Pascal         No               0             2                      0.06       0.05       0.99
Turing         Yes              16            2                      0.99       0.99       1
Volta          Yes              10            2                      1          1          1

Table 3.2: Benchmarking results. Number of key bytes recovered using 1 million samples.

3.7 Discussions and Countermeasures

Although the penalty for each bank conflict is small, we are still able to exploit this fine-grained timing channel to recover confidential information on the Kepler, Volta, and Turing architectures. We choose to attack the AES encryption algorithm to demonstrate the feasibility and scalability of our attack. The attack should apply to other table-based cryptographic algorithms, or even public-key ciphers, on GPUs. Although a large number of countermeasures have been proposed to prevent timing attacks on CPUs [26, 27, 19, 20, 23, 24, 25, 51, 29, 22, 30], none of them can defend against our attack, because our attack exploits a very different timing channel than the common cache timing channel [6, 33, 68, 32]. More recently, RCoal [28] was proposed to defeat the timing attack that uses the memory coalescing unit on a GPU [75], but that countermeasure addresses the timing leakage only in the memory coalescing unit. Hence, it does not apply to our attack. To thwart the attack described in this chapter, the countermeasure has to specifically target the Shared Memory banks. We could avoid using Shared Memory altogether to prevent the attack, but that would incur high performance degradation for some applications. However, avoiding Shared Memory bank conflicts can both improve an application's performance and mitigate a differential timing attack. This can be done by reducing the Shared Memory data usage such that the size of the data is no larger than 256 bytes (bank size × number of banks, 8 × 32). In this way, no bank conflicts would occur.


[Four-panel plot: execution time in GPU cycles versus number of bank conflicts (0-7). (a) Baseline, correlation coefficient 0.95; (b) Global+Shared, 0.06; (c) Shared+Global, 0.05; (d) Fencing, 0.9.]

Figure 3.12: Timing analysis of the Shared Memory load instruction

For the AES encryption algorithm, we can change the implementation to use the SBox version. This implementation uses only a 256-byte table, spanning all 32 banks, so there will be no bank conflicts among the 32 threads of a warp. However, an SBox-based implementation may not be as efficient as a T-table-based implementation, and it is not compatible with many existing cryptographic software libraries either. The security and performance of this implementation are demonstrated by Lin et al. [69]. As shown in the timing analysis on the Maxwell and Pascal architectures in Section 3.6, we can leverage the non-blocking execution mode to seal the timing leakage in Shared Memory. This is feasible because the penalty for each bank conflict is only 2 GPU cycles. However, it cannot be done easily on the Kepler architecture, as it would require the Kepler to support independent and parallel

instruction execution capable of hiding more than 40 GPU cycles of delay caused by bank conflicts in the Shared Memory. Another implementation technique, called scatter-gather, has been examined by Lin et al. [69] to prevent AES information leakage through the Shared Memory unit; it could also be applied to other table-based cryptographic algorithms. The technique modifies the table lookup procedure such that every Shared Memory access of a thread touches all Shared Memory banks, and therefore every Shared Memory access results in a constant number of bank conflicts. This modification ultimately comes with some performance overhead. In the following, we discuss a multi-key implementation as an alternative approach that can effectively prevent our attack and does not require modifying the original implementation.
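Before turning to the multi-key approach, we sketch the scatter-gather idea just described. This is our own illustrative rendering, not the exact code of Lin et al. [69]: every thread reads one word from each of the 32 banks and keeps, branchlessly, only the word it needs, so the bank conflict count is constant regardless of the secret-dependent index.

__device__ unsigned int sg_lookup(const unsigned int *table, unsigned int index)
{
    unsigned int row  = index & ~31u;  // fix the word row; sweep the bank bits
    unsigned int bank = index & 31u;   // bank = word index mod 32 (4-byte banks)
    unsigned int result = 0;
    for (unsigned int b = 0; b < 32; b++) {
        unsigned int v = table[row | b];                 // every thread hits bank b here
        result |= v & (0u - (unsigned int)(b == bank));  // keep only the needed word
    }
    return result;
}

Each iteration makes all 32 threads of a warp touch the same bank, so the per-lookup conflict count is fixed by construction, which uniformizes the timing at the cost of 32 loads per lookup.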

3.7.1 Multi-Key Implementation As Countermeasure

An alternative approach is to introduce multiple keys on the GPU. Using multiple keys during encryption has been deployed in commercial systems. By recovering one key, the adversary still cannot break the entire encryption system without knowing the other keys. However, leveraging multiple keys during encryption incurs a much higher cost for secure key storage and extra key management (mapping the keys to threads), and thus also incurs performance degradation. So far in all our experiments, all threads in our encryption kernel have used the same key. This assumption significantly simplifies the attack strategy, as the attacker needs to guess only one key byte value to compute the Shared Memory bank conflicts among all 32 threads. In this section, we evaluate the effectiveness of our attack against GPU AES implementations when multiple keys are used. Specifically, we evaluate our attack when two, three, and four different keys are used in a block of threads. We consider two scenarios. In the first scenario, the attacker knows the mapping between encryption keys and GPU threads, while in the second scenario, the attacker does not know the key mapping. Under both scenarios, we demonstrate that an implementation using a sufficient number of different keys is secure against our attack. All experiments are run on the Nvidia Kepler K40 GPU.

3.7.1.1 Encryption Key Mapping

Key mapping describes the way multiple keys are distributed to different blocks of input data. In our AES implementations, each GPU thread performs encryption on one block of data, and the block-to-GPU-thread mapping is fixed and known. Thus, our job is to associate keys with threads.


Note that we are working with symmetric block ciphers (e.g., AES), so the same key must be used to decrypt the ciphertext on the receiver side, which increases the complexity of key management between the sender and receiver. To map a key to a thread, we take the thread ID (within the block of threads) modulo the total number of keys, and use the result as an index into an array of keys. For example, when the implementation uses two different keys, threads with an even ID are assigned the first key and threads with an odd ID are assigned the second key.
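A minimal sketch of this modulo mapping follows; the names (aes_round_keys, key_bytes) and the flat key-array layout are our own assumptions for illustration.

// Each thread selects its key by thread ID modulo the number of keys.
__device__ const unsigned char *select_key(const unsigned char *aes_round_keys,
                                           int num_keys, int key_bytes)
{
    int tid = threadIdx.x;                       // thread ID within the block
    int key_idx = tid % num_keys;                // modulo key mapping
    return &aes_round_keys[key_idx * key_bytes]; // this thread's key material
}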

3.7.1.2 Unknown Key Mapping

In these experiments, we assume the adversary knows neither the key mapping nor the number of different keys used during encryption. He/she can only apply the attack methodology as before, as if all threads were using the same key. The result is shown in Figure 3.13.


Figure 3.13: 15th key byte recovery using 10 million samples with GPU timing information.

From Figure 3.13, we can see that, when two different keys are used, without modifying

the attack methodology, these two key values are both recovered in the 15th key byte. However, the DoMs (differences of means) for both key values are reduced from 17 GPU cycles (in the original single-key implementation) to 3 GPU cycles. When attacking the 2-key implementation, we assume all of the threads are using the same key. Thus, we use the wrong key value to compute the accessed bank indices for 16 out of the 32 threads. When the number of bank conflicts is computed using the 16 correctly-computed bank indices, it contributes to finding the correct key value; otherwise, the incorrect values just contribute more noise. Thus, we see a significant drop in the DoMs when attacking the 2-key implementation. As the total number of keys used in the implementation increases, the noise increases and the DoM value for the true key value decreases. When attacking the 4-key implementation, we cannot recover any key with 10 million samples, as shown in Figure 3.13. We can always increase the number of samples to compensate for the noise, but it becomes impossible to attack the 32-key implementation. Recall that the number of bank conflicts falls into the range [2, 4]. To correctly compute two bank conflicts, we need to know the memory accesses of at least three threads. However, in the 32-key implementation, each thread has its own key, and we can only obtain those memory accesses by correctly guessing the three key values assigned to those threads. By assuming all threads are using the same key, we cannot obtain the correct memory accesses. Therefore, we cannot correctly compute the number of bank conflicts in the 32-key implementation.

3.7.1.3 Known Key Mapping

In this experiment, we assume the adversary knows the total number of keys deployed in the implementation as well as the key mapping. This assumption allows us to guess multiple key byte values instead of one, so we can correctly compute the bank conflicts among all 32 threads, although the attack complexity increases as we have to guess multiple bytes. When attacking a 2-key implementation, we are able to retrieve the two key values with a DoM value of 16 GPU cycles using 1 million samples. The result is what we expected, since guessing two keys at the same time allows us to compute the correct number of bank conflicts, just as if we were attacking the 1-key implementation. However, guessing multiple key bytes at the same time increases the computational complexity exponentially. The complexity of attacking a 2-key implementation is 2^16. This approach becomes computationally infeasible once the number of different keys used in the implementation reaches 5. As we demonstrate in the two scenarios, when 32 threads are using 32 different keys, it becomes impossible for our attack to succeed. Thus, deploying multiple different keys in encryption

can help to prevent side-channel timing leakage.
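For reference, the joint-guess cost quoted above follows directly from guessing one byte per key: with k distinct keys, the attacker must test all combinations of k byte values, so (in a sketch of the arithmetic)

\mathrm{complexity}(k) = (2^{8})^{k} = 2^{8k},
\qquad k=2 \Rightarrow 2^{16}, \quad k=4 \Rightarrow 2^{32}, \quad k=5 \Rightarrow 2^{40}.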

3.7.1.4 Multi-Key Performance

While we show that multi-key implementations can resist our attack, they come with a performance penalty. When all threads are using a single key, the key can be loaded from memory in a single request and broadcast to all threads. When multiple keys are used, multiple memory requests are needed to load these keys. Thus, multi-key implementations incur some performance degradation. We evaluate the performance overhead of implementations varying the number of keys from 1 to 32. We use the performance of our 1-key implementation as the baseline and compare the others to it. The result is shown in Figure 3.14.

[Plot: performance degradation relative to the 1-key implementation (percent, 0-60) versus the number of different keys used (0-30).]

Figure 3.14: Performance degradation when using multiple keys.

As the number of keys increases, the performance decreases. For the 32-key implementation, we see a performance degradation of 58% on the Kepler architecture and 48% on the Volta

architecture. Although the multi-key implementation can be used to prevent our attack, we also need to consider the resulting performance impact.

3.8 Summary

In this chapter, we explore a new class of microarchitectural side-channel vulnerabilities in GPUs. Shared Memory banking is used to improve the performance of memory accesses, but this same feature introduces a timing channel. We have developed a novel differential timing attack to exploit bank conflict-based timing channels, and for evaluation, we successfully recovered the encryption key of several GPU AES implementations. We anticipate our attack is applicable to other table-based cryptographic algorithms, such as Blowfish. We investigate more realistic attack scenarios and quantify the effectiveness of our attack, considering attacks on Nvidia Kepler, Maxwell, Pascal, Volta, and Turing devices. We also evaluate the effectiveness of multi-key implementations as a countermeasure and their associated execution time overhead.

Chapter 4

Information Leakage in L1 Cache Banks

4.1 Introduction

In Chapter 3, we found information leakage in the Shared Memory unit due to bank conflicts. In this chapter, we exploit similar information leakage in the L1 cache on CPU platforms. We demonstrate a new cache bank timing attack against cryptographic software running on processors that have on-chip cache banks. As an example, we successfully attack 128-bit AES encryption by exploiting a very fine-grained L1 cache bank timing channel. Cache banks were introduced to address the high bandwidth demand for accessing the L1 cache in superscalar processors, as well as to reduce power consumption. They have also been adopted in embedded processors for mobile devices and in general-purpose GPUs. Rather than being a monolithic microarchitectural module, the L1 cache is composed of multiple cache banks, which allow multiple concurrent accesses to different banks at one time. This microarchitectural support for parallel cache accesses improves performance significantly. However, when two or more accesses target the same bank, a bank conflict arises and the accesses are processed in a serialized manner. The subtle timing difference between parallel and serial cache bank accesses can be exploited to leak sensitive information. Bernstein et al. [31] first mentioned the potential information leakage through the cache bank timing channel, but without a real exploit. Only recently were Yarom et al. [34] able to recover partial bits of each multiplier in a scatter-gather implementation of 4096-bit RSA encryption via the cache bank timing channel. However, due to the high amount of noise present in this timing channel, Yarom et al. rely on the Flush+Reload attack for aligning timing traces and synchronizing the trace collection procedure. In addition, they collect 1,000 timing traces of the same-ciphertext


RSA decryptions to average out the noise, which increases the attack complexity considerably. Unlike the public-key cipher RSA, symmetric block ciphers such as AES are very fast, which makes it challenging to synchronize the victim and spy processes. Relying on the Flush+Reload attack to capture timing traces of an AES encryption run seems impossible. We therefore design a timing trace capturing method very different from that of CacheBleed [34]. We make the spy process long enough that its span completely covers one AES run, and we only measure the execution time of the spy to infer cache bank conflicts. The chapter is organized as follows. In Section 4.2, we provide background on AES encryption and Intel cache architectures, as well as discuss existing related work. In Section 4.3, we describe the attack scenario and propose several new key recovery methods. In Section 4.4, we discuss countermeasures against known cache side-channel attacks and mitigations against our attacks. Finally, we summarize in Section 4.5.

4.2 Background

4.2.1 AES Encryption

Similar to the other two attacks presented in this dissertation, in this chapter we take 128-bit ECB-mode AES encryption as an example. We use the implementation from the latest version of OpenSSL, 1.1.0-pre6, operating on blocks of 16-byte data using a 16-byte secret key. The AES algorithm consists of ten rounds of SubByte, ShiftRows, MixColumn, and AddRoundKey operations, with the last round omitting the MixColumn operation. The OpenSSL implementation is based on four T-tables, where the first three operations are integrated into table lookups, followed by the AddRoundKey operation. Starting from the secret key, ten round keys are generated by the key scheduling process. Knowing any round key, one can compute the original 16-byte secret key. Thus, our attack targets retrieving the last-round key. The code snippet of the last-round operations generating the first four bytes of ciphertext is shown in Listing 4.1.

Listing 4.1: AES Last Round Code Snippet

s0 =
    (Te2[(t0 >> 24)       ] & 0xff000000) ^
    (Te3[(t1 >> 16) & 0xff] & 0x00ff0000) ^
    (Te0[(t2 >>  8) & 0xff] & 0x0000ff00) ^
    (Te1[(t3      ) & 0xff] & 0x000000ff) ^
    rk0;

The vector variable s0 holds the first four bytes of the 16-byte ciphertext. Each of the variables t0, t1, t2, and t3 contains 4 bytes of the output state after the first nine rounds of iterations. A particular byte of each variable is used to look up a corresponding T-table and obtain a four-byte output, of which only one byte contributes to the final ciphertext, because there is no MixColumn operation in the last round. rk0 is the first four bytes of the last-round key. From the original algorithm, the last-round operation can be simplified as follows:

c_j = SBox[s_i^9] ⊕ rk_j    (4.1)

where the input byte position (i) for the SBox lookup operation differs from the output ciphertext byte position (j), due to the ShiftRows operation. Each byte of the 9th-round output state, s_i^9, is the index for the T-table lookup, and can be calculated once the corresponding ciphertext and key bytes are known:

s_i^9 = SBox^{-1}[c_j ⊕ rk_j]    (4.2)
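Equation 4.2 is the workhorse of the key recovery that follows: given an observed ciphertext byte and a guess for the corresponding last-round key byte, it yields the table index the encryption must have used. A minimal sketch in C (inv_sbox denotes the standard AES inverse SBox, assumed available):

/* Sketch of Equation (4.2): recover the 9th-round state byte (the
 * T-table lookup index) from a ciphertext byte and a last-round
 * key-byte guess. */
extern const unsigned char inv_sbox[256];

unsigned char ninth_round_index(unsigned char c_j, unsigned char rk_guess)
{
    return inv_sbox[c_j ^ rk_guess];   /* s_i^9 = SBox^{-1}[c_j XOR rk_j] */
}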

4.2.2 Intel Cache Architecture

On-chip cache memory is widely used to keep up with the CPU computation speed. It exploits both spatial and temporal locality in program code and data to deliver high throughput. Modern CPUs also feature multiple levels of caches to balance storage size and access latency. On our target platform, an Intel Sandy Bridge CPU, the highest-level L1 cache is the fastest, closest to the core, and also the smallest (64 KB). Both the L1 and L2 caches are private to each core. The lowest-level L3 cache has 8 MB of storage and is shared among all cores. Each cache is organized into multiple addressable sets, with each set consisting of multiple cache lines (called ways) for associativity. The smallest unit in the cache is one fixed-size cache line of 64 bytes, which corresponds to one 64-byte-aligned data block in main memory. The ways within a cache set are not individually addressed, i.e., the set is fully associative. The cache set index consists of partial (lower) bits of the physical memory block address, with the remaining upper bits serving as the tag. During a memory access, the execution core tries to locate the needed cache line in the L1 cache. If the cache line is found there (an L1 cache hit), the data is fetched and the memory request is served. Otherwise (an L1 cache miss), the request is forwarded to the next lower level, L2, L3, or main memory, until the needed cache line is found. A cache miss at a different level of the cache hierarchy introduces a different latency, for example, 4 CPU cycles for an L1 cache miss (L3 cache hit) and


200 CPU cycles for an L3 cache miss (memory access). Thus, by measuring the time of a memory access instruction, we can determine whether the needed cache line is in the cache hierarchy and at which level. When a new cache line is brought from main memory or a lower-level cache to a higher-level cache, the destination cache has to evict one cache line from the cache set to accommodate the new cache line, according to a certain replacement policy. By manipulating the eviction behavior in the memory hierarchy, attackers can control the cache state. Before we dive into discussing existing cache timing attacks, we need to understand how T-tables are stored in the cache. On our target platform, the size of a cache line is 64 bytes. Each T-table element is 4 bytes, so each cache line can hold 16 T-table elements. There are 256 elements in each T-table, so each T-table is stored in 16 cache lines. To maximize spatial locality, these cache lines are stored in consecutive cache sets, as shown in Figure 4.1. Note that in each cache set, only one way is occupied by T-table entries.

Set 0   T0[0-15]      Set 16  T1[0-15]      Set 32  T2[0-15]      Set 48  T3[0-15]
Set 1   T0[16-31]     Set 17  T1[16-31]     Set 33  T2[16-31]     Set 49  T3[16-31]
...                   ...                   ...                   ...
Set 15  T0[240-255]   Set 31  T1[240-255]   Set 47  T2[240-255]   Set 63  T3[240-255]

Figure 4.1: T-tables Cache Layout

The goal of many cache timing attacks is to identify the cache set indices that AES used in its last-round table lookup operations. Once a cache set index is identified, the attacker can combine it with the corresponding ciphertext to recover the last-round key using Equation 4.1.

4.2.3 Cache Timing Attacks

In this section, we review related cache timing attacks that exploit the time variation due to resource contention or reuse [20] in the cache hierarchy. Prime+Probe Attacks: This attack exploits resource contention. Its goal is to identify whether a cache set has been used during the execution of the victim process. It involves two processes: victim and spy. The spy process first primes the entire cache by accessing its own dummy data (to fill all the ways in each cache set). Then it allows the victim process to run. After the victim process is

done, the spy probes (accesses and measures the time of) certain primed cache lines. If data used by the victim maps to the monitored cache set, one of the previously primed cache lines will have been evicted, and therefore the probing time will be longer than when the victim did not use any data that maps to the monitored cache set. Thus, the attack can reveal the cache set index being used by the victim. The attack has been demonstrated in both the L1 [5] and L3 caches [70] to successfully recover the AES encryption key. Cache Collision Attack [35]: This attack exploits resource reuse within the cryptographic process. The attacker only needs to record each encryption time and its associated ciphertext. If c_i ⊕ c_j == k_i ⊕ k_j, there is a cache hit for the later table lookup and therefore the execution time is shorter. By building a histogram of the collected timing samples based on the value of c_i ⊕ c_j, this attack can reveal the value of k_i ⊕ k_j.
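As a concrete illustration of the cache collision statistic (our own sketch, with hypothetical array names): bucket the measured encryption times by c_i ⊕ c_j for one byte pair and look for the bucket with the lowest mean, which indicates a collision and hence reveals k_i ⊕ k_j.

/* Sketch: average encryption time bucketed by c_i XOR c_j for a fixed
 * byte pair (i, j); cts[n] is the 16-byte ciphertext and times[n] the
 * measured time of the n-th encryption. */
void collision_profile(const unsigned char (*cts)[16], const double *times,
                       long n_samples, int i, int j, double mean_out[256])
{
    double sum[256] = {0};
    long   cnt[256] = {0};
    for (long n = 0; n < n_samples; n++) {
        int d = cts[n][i] ^ cts[n][j];   /* collision when d == k_i ^ k_j */
        sum[d] += times[n];
        cnt[d]++;
    }
    for (int d = 0; d < 256; d++)        /* lowest mean -> k_i ^ k_j candidate */
        mean_out[d] = cnt[d] ? sum[d] / cnt[d] : 0.0;
}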

4.2.4 Countermeasures against Cache Timing Attacks

To prevent cache timing attacks against AES encryption, Intel added the AES-NI extension to the x86 instruction set architecture, which provides hardware-assisted AES encryption and eliminates key-dependent memory accesses. However, other table-based algorithms, and microarchitectures without AES-NI (e.g., Nehalem), are still vulnerable to the attacks described above. Many countermeasures against cache timing attacks have been proposed in recent years. These countermeasures mainly implement one of three principles: cache partitioning, cache line pinning, and memory-to-cache mapping randomization. Partitioning & Pinning: The idea of cache partitioning and line pinning is to eliminate the cache contention effect by dedicating a small part of the cache to one security domain and restricting state manipulation of those resources from other security domains. There are system-level [33, 71] and architecture-level [19] approaches that completely prevent contention-based cache timing attacks, though they cannot prevent cache collision attacks. Randomization: Randomizing the mapping from memory address to cache set index can also prevent contention-based cache timing attacks and mitigate reuse-based ones. The randomization introduces uncertainty into the attacker's observations. Architectural countermeasures such as the prior work [20, 50] have been demonstrated to completely prevent contention-based cache timing attacks and significantly mitigate reuse-based ones. However, as we will show in the following sections, timing attacks that exploit cache bank conflicts are orthogonal to the above-mentioned existing countermeasures and still apply

effectively to the AES encryption, presenting a new realistic threat.

4.2.5 L1 Cache Bank and CacheBleed Attack

Modern CPUs further divide the L1 cache into cache banks to increase the cache bandwidth, in order to support high-performance features such as Hyper-Threading and out-of-order execution. In Intel L1 cache banks, each cache line is divided into multiple fixed, equal-size parts, distributed among the banks. Each cache bank can serve one memory request at a time. When two memory requests are issued to the same bank, one of them is served first and the other is stalled, which is called a cache bank conflict [72]. There is a noticeable timing difference when a memory access results in a cache bank conflict with another concurrent memory access. Intel does not disclose the number of cache banks in the Sandy Bridge architecture. However, the Intel manual [72] states that addresses differing in bits 2-5 do not experience a cache bank conflict, which suggests that these bits are used to index the cache banks. Through our experiments, we hypothesize that there are 16 cache banks in the Sandy Bridge architecture, and our attack assumes this. T-table Cache Bank Layout. Each entry in a cache bank is 4 bytes, which is equal to the size of one T-table entry. Thus, the content of one cache line (16 consecutive T-table entries) is distributed across the 16 cache banks, and each cache bank holds T-table entries with a stride of 16, as shown in Figure 4.2. The cache bank index comes from the original cache line offset, independent of the cache set index bits.
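Under this 16-bank hypothesis, the bank index is simply address bits 2-5; equivalently, for 4-byte T-table entries, it is the table index modulo 16. An illustrative helper (our own, not from the attack code):

/* Bank index under the assumed Sandy Bridge layout: 16 banks of 4 bytes,
 * selected by address bits 2-5. For a 4-byte-entry T-table this equals
 * (table index mod 16). */
#define CACHE_BANK(addr)    ((((unsigned long)(addr)) >> 2) & 0xF)
#define TABLE_ENTRY_BANK(i) ((i) & 0xF)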

         Bank 0    Bank 1    Bank 2    ...   Bank 15
Set 0    T0[0]     T0[1]     T0[2]     ...   T0[15]
Set 1    T0[16]    T0[17]    T0[18]    ...   T0[31]
...      ...       ...       ...       ...   ...
Set 63   T3[240]   T3[241]   T3[242]   ...   T3[255]

Figure 4.2: T-tables Cache Bank Layout


The goal of our attack is to reveal the cache bank index used in the table lookup operations of the last round. Each cache bank contains entries from all four T-tables. This presents a challenge for the attack: when the adversary monitors one cache bank (a column in Figure 4.2), he cannot attribute accesses to the monitored cache bank to particular operations, as each cache bank contains entries from all the tables. In previous cache timing attacks, the adversary monitors one cache set (a row), and can attribute the accesses to a few lookup operations, because each cache set belongs to only one specific table. As we will show later, we can still effectively retrieve the sensitive key byte value from the identified cache bank index. CacheBleed Attack: Yarom et al.'s recent CacheBleed attack [34] demonstrates the feasibility of exploiting cache bank conflicts and launches a successful attack on the OpenSSL implementation of RSA. In their attack, the attacker measures the execution time of accessing all the entries of a selected cache bank. If the victim process accesses the same cache bank, the attacker will notice some delay in the measured execution time. Since each 4096-bit RSA decryption is very long (it can take 10 ms), the adversary can continuously collect a sequence of timing samples to capture the cache bank access pattern of the decryption. Due to the high noise present in this timing channel, the authors collect 1,000 runs of each decryption and average them, with the aid of the Flush+Reload method to align each sequence. Challenges for Applying CacheBleed to AES. Unlike an RSA decryption, a 128-bit AES encryption is very fast (around 500 CPU cycles). Taking a sequence of timing samples to capture the cache bank access pattern of the encryption becomes almost infeasible. Moreover, the Flush+Reload attack for aligning each sequence of timing samples requires the assumption that the OpenSSL library is shared between the victim and spy. If countermeasures are applied, Flush+Reload no longer works, and neither does the attack built on top of it. We design our attack on AES quite differently from CacheBleed: it is simple and does not require the Flush+Reload technique. The spy process runs much longer than the encryption, such that the execution time of the spy reflects a penalty that is correlated with the number of cache bank conflicts due to the encryption. Thus, the execution time of the spy process tends to be lower when the table lookup operations in the last round do not use any entry in the monitored bank than when they do.


4.3 Cache Bank Timing

In this section, we will first discuss our threat model, and then analyze the cache bank timing channel, followed by three attack methods on AES encryption.

4.3.1 Threat Model

We assume both the spy and victim processes are running on the same machine. The attacker is allowed to run multiple instances of the spy process, each on a different physical core, so that all cores are covered by the attacker. Although we have no control over which core the victim process runs on, with one spy process monitoring each core, the victim process will be running on the same core as one spy process, but in a different hardware thread. In other words, these two processes share the same L1 cache. We also assume no library is shared between the spy and victim processes, and that the attacker can see every ciphertext produced by the victim process.

4.3.2 The Cache Bank Timing Channel

We first develop some test programs to analyze the cache bank timing channel. We write the spy process code similar to the CacheBleed attack [34], shown in Listing 4.2. The spy process accesses the 64 entries of one cache bank in sequence. The stride of the address offset is set so that each memory read instruction accesses a different cache set. This is different from the traditional Prime+Probe attack, where the spy process primes all the ways in one cache set; here only one way of each cache set in the L1 cache (one bank) is accessed by the spy. Because the time for accessing 64 entries is much shorter than one encryption, we repeat this code 20 times to prolong the spy process. The two rdtscp instructions are used to measure the execution time.

Listing 4.2: Spy Code

rdtscp
...
addl 0x000(%r8), %eax
addl 0x040(%r8), %ecx
...
addl 0xf80(%r8), %edx
addl 0xfc0(%r8), %edi
...
rdtscp

We run the same spy code on our target platform under four scenarios, where the victim process varies per scenario. The distribution of the measured spy process time under each scenario is shown in Figure 4.3, using 1 million timing samples.

Idle. Under this scenario, only the spy process is running on the core, without any victim process. The timing measurement shows the execution time distribution of the spy process without interference from other processes.

Pure Computations. Under this scenario, both the spy and victim processes are running on the same core but in two different hardware threads. The victim process performs computation involving only registers, with no memory accesses. The timing distribution moves slightly to the right of the Idle one, which shows the penalty due to some on-core execution units being shared between two hardware threads.

Encryption. The victim process runs regular AES encryption: generating plaintexts, performing encryptions, and saving the ciphertexts to a file. We see the timing distribution shift further to the right.

Pure Conflicts. Under this scenario, the victim process runs the same spy code, so the spy process experiences a cache bank conflict on every memory access. Therefore, the mean of the timing distribution is twice the mean of the Idle one. The variance of the distribution is also much higher, showing more uncertainty in cache bank conflicts. This timing distribution provides the upper bound for the timing measurements.

Figure 4.3 shows two clearly distinguishable timing distributions when the spy process runs alongside an encryption process versus when the victim process is idle or performing pure computations. In addition, from the Idle distribution and the Pure Conflicts distribution, we can see the strength of the observable timing channel is roughly 1 cycle per conflict (a 1,200-cycle difference for 1,280 cache bank conflicts). Thus, even though the attacker does not know which core the victim is running on, one of the spy processes will capture the cache bank conflicts due to the encryption in its timing measurements.

4.3.3 Attacking AES Encryption

In this section, we proceed to attack the AES encryption while the spy process monitors cache bank index 8. Recall from Section 4.2.5 that each cache line is divided into multiple parts, so the entries of all T-tables are distributed over the 16 cache banks in sequential order. Each bank contains 64 cache sets, covering a portion of every T-table.


Figure 4.3: Timing distributions of the spy process under the four scenarios

Since the spy process only monitors one cache bank, when the victim accesses any of the 64 T-table entries mapped to the monitored bank, the execution time of the spy process is extended by one cache bank conflict. Timing Channel Visibility. To better understand the visibility of the bank access pattern of the encryption in the spy timing measurement (the timing channel), we examine the relationship between the number of accesses by the victim to a cache bank and the spy process timing measurement. We extract all indices used in the 160 T-table lookup operations and convert them into cache bank access indices. Thus, we can calculate the frequency of each cache bank access. We average the timing measurements that have the same number of accesses to a bank, and we do this for all banks. The result is shown in Figure 4.4. For the non-monitored banks, the average time stays roughly the same as the number of accesses to those banks by the victim process increases. This indicates that even though the victim process accesses those banks, the accesses happen in parallel with the spy process accessing another bank, i.e., there is no cache bank conflict. However, for the monitored cache bank (index 8 here), as the number of accesses to it by the victim process increases, the average time of the spy process increases. The result shows that the timing measurement of the spy process is correlated with the number of cache bank conflicts caused by the encryption.


[Plot: average spy time (CPU cycles, ~1328-1340) versus number of cache bank accesses (9-18), for the monitored bank and the non-monitored banks.]

Figure 4.4: Average encryption time distribution of each bank

[Plot: PolyFit slope (CPU cycles, -0.4 to 0.8) versus bank index (0-15).]

Figure 4.5: PolyFit slopes for each bank


To quantify the visibility of the timing channel, we perform a polynomial fit (PolyFit) on our timing measurements for each bank and show the resulting slopes in Figure 4.5. The figure indicates that, on average, we see a penalty of approximately 0.6 CPU cycles per cache bank conflict due to the encryption.

4.3.3.1 Key Recovery - Difference of Mean Method

Since the timing measurement is correlated with the number of cache bank conflicts due to the encryption, the most effective method would be to predict the number of accesses to the monitored cache bank by the victim process under each key guess and calculate the correlation between the predictions and the timing measurements. However, with the complex confusion and diffusion properties of AES, it is impossible to predict the total number of memory accesses during the entire encryption without knowing all 16 key bytes. Therefore, we simply target the last round and treat activity in other rounds as noise. In the last round, there are 16 lookup operations on the four T-tables. For each individual table lookup operation, if it accesses the cache bank being monitored by the spy process (called an access), there is a bank conflict and the spy process execution time may be longer. Otherwise (a non-access), the spy process execution time is shorter. This is the principle of the simplest attack method, difference-of-mean. We attack the key bytes one by one by attacking each last-round table lookup operation. For a selected operation, we calculate the accessed cache bank index under a key guess (inv_sbox[K_guess^i ⊕ C^i] mod 16). Based on the bank index, we classify each timing sample into the access bin or the non-access bin. The value of K_guess^i that produces the largest difference between the average time of the access bin and that of the non-access bin is taken as the correct key byte value. However, the timing channel is very weak, i.e., the cache bank conflict penalty is only 0.6 CPU cycles, so a large number of samples is required to combat the noise. In our experiments, we can recover only some of the key bytes with more than 100 million samples.
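A minimal sketch of this difference-of-mean test for one key-byte guess follows; the array names and the guard against empty bins are our own, while the monitored bank index 8 and the index-mod-16 bank mapping come from the discussion above.

/* Difference of means for one key-byte guess: classify each timing
 * sample by whether the implied last-round table index falls in the
 * monitored bank (index 8), then subtract the bin averages. */
extern const unsigned char inv_sbox[256];

double dom_for_guess(const unsigned char *c_bytes, const double *times,
                     long n, unsigned char k_guess)
{
    double sum_acc = 0.0, sum_non = 0.0;
    long   n_acc = 0, n_non = 0;
    for (long s = 0; s < n; s++) {
        int bank = inv_sbox[c_bytes[s] ^ k_guess] & 0xF; /* index mod 16 */
        if (bank == 8) { sum_acc += times[s]; n_acc++; } /* access bin   */
        else           { sum_non += times[s]; n_non++; } /* non-access   */
    }
    if (n_acc == 0 || n_non == 0)
        return 0.0;
    return sum_acc / n_acc - sum_non / n_non; /* largest DoM wins */
}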

4.3.3.2 Key Recovery - Frequency Method

The problem with the difference-of-mean method is that the visibility of our timing channel is too small (the signal level of the timing side channel is low), and the difference in the average number of cache bank conflicts between the access and non-access bins is only one. We design an attack method, called the frequency method, to improve the effectiveness of the attack by increasing the average number of bank conflicts separating the access bin from the non-access bin.


Without loss of generality, we show the method for recovering the 16th key byte. Figure 4.6 shows the timing distribution of the spy process while the victim is encrypting different messages. We can envision it as a combination of many small timing distributions: one for no cache bank conflict, one for one conflict, one for two conflicts, and so on. For the attack, we can further envision it as consisting of two distributions, one for non-access (no cache bank conflict on the monitored bank) and the other for access (one or more bank conflicts happened). We would like to use the timing samples of the non-access distribution for the attack. Since there is no way to know the exact distributions, we can only use a threshold to divide the samples. Based on theoretical analysis and experiments, we split the timing samples at the 30% quantile into two portions, shown as the vertical line in Figure 4.6. We perform two processing steps on the lower portion. First, we identify a set of candidate cipher byte values that correspond to accesses to the monitored cache bank. Second, we use the candidate cipher byte values to retrieve the correct key byte value. Step 1 - Recover the monitored cipher values: The monitored cipher values are equivalent to the monitored table indices. As shown in Figure 4.2, the monitored cache bank 8 contains

T-table indices {8, 24, 40, ..., 248} of each table. Because c_i = SBox[s_j] ⊕ k_i, where s_j is the table index and k_i is a constant key byte, we can convert the monitored table indices to the monitored cipher values for the ith cipher byte. For example, for the 16th cipher byte, the monitored cipher values are {1, 7, 45, ..., 241, 245} when the true key byte value is 197. As we are using the lower portion of the timing distribution for the attack (the majority of which should be non-access samples), the monitored ciphertext byte values should appear less frequently there. We classify all the timing samples into 256 bins according to the value of the 16th cipher byte and count the number of timing samples in each bin. The result is shown in Figure 4.7 (the possible values of one byte are 0 to 255). We circle the true monitored cipher values for the 16th byte in the figure. Among the 16 cipher values with the lowest frequency, 15 are monitored cipher values. The reason we can identify the monitored cipher values is that there is a linear relationship between the number of times the victim accesses the monitored cache bank and the spy process timing measurement. By splitting the timing distribution at the 0.3 quantile, each encryption in the lower portion has at least 19 out of the total 160 table lookups that do not access the monitored cache bank, since the probability that 19 lookups all miss the monitored cache bank 8 is (15/16)^19 = 0.29. Across the 160 table lookups of an encryption, there are 10 accesses to each cache bank on average; by splitting the timing distribution at the 30% quantile, encryptions in the lower portion contain 9 fewer cache bank accesses to the monitored cache bank than this average. Thus, the frequency of the monitored cipher values is lower than that of other cipher values in the biased lower portion of the timing distribution.


[Histogram: frequency (×10^5, 0-2) versus spy process timing (CPU cycles, ~1260-1460), with the 0.3 quantile marked by a vertical line.]

Figure 4.6: The spy process timing distribution

[Plot: frequency (~800-1050) versus cipher byte value (0-255).]

Figure 4.7: Recovering monitored cipher values for the 16th byte

In fact, splitting the timing distribution at different quantiles yields results of different quality. It may seem that splitting at a smaller quantile would be better, but the number of samples in the lower portion also decreases as the splitting quantile becomes smaller. We perform experiments to measure the quality of the result using different splitting quantiles, and the result is shown in Figure 4.8. We define the quality of the result as the number of monitored cipher values that are among the 16 cipher values with the lowest frequency.
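Step 1 amounts to a simple conditional histogram; the sketch below is our own illustration, with the 0.3-quantile threshold assumed to be precomputed from the collected spy times.

/* Step 1 sketch: histogram of the 16th ciphertext byte over samples in
 * the lower portion of the timing distribution. The 16 least frequent
 * values become the candidate monitored cipher values. */
void step1_histogram(const unsigned char *c16, const double *times,
                     long n, double threshold /* 0.3 quantile */,
                     long hist[256])
{
    for (int v = 0; v < 256; v++)
        hist[v] = 0;
    for (long s = 0; s < n; s++)
        if (times[s] < threshold)  /* keep only the lower portion */
            hist[c16[s]]++;        /* bin by the cipher byte value */
}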

[Plot: number of monitored cipher values among the lowest 16 (1-7) versus splitting quantile (0-0.8).]

Figure 4.8: Number of monitored cipher values among the 16 lowest-frequency cipher values

As shown in Figure 4.8, the quality of the result is best in the middle range of the splitting quantile, and decreases toward both ends of the range [0, 1]. Step 2 - Recover the key byte value: The attacker simply uses the 16 cipher values with the lowest frequency, as shown in Figure 4.7. For each of those 16 cipher values, we generate a set of 16 key candidates based on k_i = SBox(s_j) ⊕ c_i and the known monitored table indices (s_j). If a cipher value is one of the true monitored cipher values, its set must contain the true key value. In the end, we have 16 sets of key candidates, each containing 16 key candidates. If those 16 cipher values were all true monitored cipher values, the true key value would

appear in every set. Thus, the true key value is the most frequently appearing one across those sets. We show the key recovery result for the 16th key byte in Figure 4.9, where the true key value appears 13 times. We apply this method and successfully recover all key bytes using 1 million samples. The robustness of this attack method lies in the fact that we do not need to recover all 16 monitored cipher values in order to recover the key: 5 true monitored cipher values are sufficient to recover the true key byte value.
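Step 2 is a straightforward voting procedure; the following is a hedged sketch (sbox denotes the standard AES SBox, and the 16 candidate cipher values come from Step 1).

/* Step 2 sketch: each candidate cipher value votes for the 16 key
 * candidates k = SBox[s] ^ c over the monitored table indices
 * s in {8, 24, ..., 248}; the most-voted key value is returned. */
extern const unsigned char sbox[256];

unsigned char step2_vote(const unsigned char cand_ciphers[16])
{
    int votes[256] = {0};
    for (int i = 0; i < 16; i++)
        for (int s = 8; s < 256; s += 16)      /* monitored table indices */
            votes[sbox[s] ^ cand_ciphers[i]]++;
    int best = 0;
    for (int k = 1; k < 256; k++)
        if (votes[k] > votes[best])
            best = k;
    return (unsigned char)best;
}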

[Plot: frequency (0-14) versus key value (0-255).]

Figure 4.9: 16th key byte recovery

Success Rate: We evaluate the effectiveness of our attack by computing its success rate in recovering one key byte given a number of samples. The result is shown in Figure 4.10, labeled Method 1. Each success rate is an average over 1,000 attack runs. With 150,000 samples, we reach approximately a 50% success rate, and with 500,000 samples, we easily achieve a 100% success rate.

4.3.3.3 Key Recovery - Non-Access Statistic

Although intuitive, the previous attack method requires two steps, which may introduce noise and make the attack less effective. In this section, we propose another attack that uses a statistic in just a single step, and we compare the two attacks.


[Plot: success rate (0-1) versus number of samples (0-10 ×10^5), for Method 1 and Method 2.]

Figure 4.10: 16th key byte success rate

Similar to the frequency method, we split the timing distribution at the 30% quantile and use the samples in the lower portion. We target the 16th byte and use a key guess to compute the corresponding last-round table lookup index, based on s_4^9 = inv_sbox[c_16 ⊕ k_16^guess], where s_4^9 is the table lookup index. If the table lookup index is among the monitored table indices, we increment the counter for that key guess by 1. We perform the same procedure for all possible key values (256 in total) for each encryption. Since the monitored cipher values appear less frequently than other cipher values, the counter for the correct key value should be the lowest one after we process all encryptions in the lower portion of the timing distribution. The result is shown in Figure 4.11. With 500,000 samples, we can clearly see the correct key value for the 16th key byte. We apply the same method to all other key bytes, and the result is shown in Figure 4.12. Success Rate: We also calculate the success rate for this attack and show it (Method 2) in Figure 4.10. We achieve a 50% success rate using 75,000 samples and a 100% success rate using 200,000 samples. The performance of this attack is twice as good as that of the frequency method.
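The one-step statistic can be sketched as follows (our own illustration; the monitored bank is index 8, so the monitored table indices are those congruent to 8 mod 16).

/* Non-access statistic sketch: for each lower-portion sample, every key
 * guess whose implied table index lands in the monitored bank gets its
 * counter incremented; the guess with the lowest counter wins. */
extern const unsigned char inv_sbox[256];

unsigned char non_access_attack(const unsigned char *c16, const double *times,
                                long n, double threshold /* 0.3 quantile */)
{
    long counter[256] = {0};
    for (long s = 0; s < n; s++) {
        if (times[s] >= threshold)
            continue;                              /* lower portion only */
        for (int k = 0; k < 256; k++)
            if ((inv_sbox[c16[s] ^ k] & 0xF) == 8) /* monitored bank 8   */
                counter[k]++;
    }
    int best = 0;
    for (int k = 1; k < 256; k++)
        if (counter[k] < counter[best])
            best = k;
    return (unsigned char)best;
}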

[Plot: frequency versus key value (0-255).]

Figure 4.11: 16th key byte recovery

4.4 Countermeasures

We have demonstrated the feasibility of attacking table-based AES encryption by exploiting the cache bank timing channel. Implementations using Intel AES-NI instructions are not susceptible to our attack; however, many existing platforms lack AES-NI instructions. Bit-slicing implementations can also avoid key-dependent memory accesses, and our attacks would not work on them either, but bit-slicing introduces high execution overhead. Existing countermeasures against cache timing attacks do not prevent our attack. The granularity of the cache partitioning, pinning, and randomization countermeasures is the cache line level, i.e., the cache set index of each horizontal row in Figure 4.2. These countermeasures can prevent sharing sensitive cache lines with malicious processes. However, our attack monitors a cache bank (a column). A cache bank will always contain entries from both the victim's data and the spy's data, so our attack works despite the previous countermeasures. Several countermeasures could be explored against our attack. One is to set a quota for accesses to each bank within a time period; once the quota is reached, further accesses to the cache bank by the process are delayed. Another is to prevent processes from different security domains from being scheduled onto the same core, so that they do not share the L1 cache hardware resource at all.


[Grid of plots, one per key byte: frequency (~6000-8000) versus key value (0-255).]

Figure 4.12: All key bytes recovery

4.5 Summary

In this chapter, we demonstrate the feasibility of attacking table-based AES encryption by exploiting a new cache bank timing channel. Although the timing signal from the cache bank side channel is very small, our attack methods still work and successfully recover the key. This is a realistic threat to any processor with cache banks, including embedded processors for mobile devices [73] and GPUs [11], since cache banking is a crucial feature for low power consumption and high performance.

Chapter 5

The Countermeasure - MemPoline

5.1 Introduction

As we have seen in Chapters 2, 3, and 4, the same algorithm implemented on different architectures can be vulnerable to different memory-based side-channel attacks. Protecting applications against different memory-based side-channel attacks is challenging and can be costly, thus calling for more general countermeasures that work across architectures. Hardware countermeasures modify the cache architecture and policies, and can be efficient [19, 20, 21, 29, 28]. However, they are invasive, require hardware redesign, and oftentimes only address a specific attack. Software countermeasures [23, 24, 25, 36] require no hardware modification and make changes at different levels of the software stack, e.g., the source code, binary code, compiler, or operating system. They are favorable for existing computer systems, with the potential to be general, portable, and compatible. The software implementation of an Oblivious RAM (ORAM) scheme shown in prior work [37] has been demonstrated to successfully mitigate cache side-channel attacks. The ORAM scheme [38, 39] was originally designed to hide a client's data access pattern to remote storage from an untrusted server by repeatedly shuffling and encrypting data. Raccoon [37] repurposes ORAM to prevent the memory access pattern from leaking through cache side channels. The Path-ORAM scheme [39] uses a small client-side private storage to store a position map for tracking the real locations of the data, and assumes the server cannot monitor the access pattern. However, in side-channel attacks, all access patterns can be monitored, and indexing into a position map is itself insecure against memory-based side-channel attacks. Instead of indexing, Raccoon [37], which focuses on control flow obfuscation and uses ORAM for storing data, streams in the position

map to look for the real data location, so that it provides a strong security guarantee. However, since it relies on ORAM for storing data, its memory access runtime is O(N) for N data elements, and the ORAM-related operations can incur more than 100x performance overhead. We propose a software countermeasure, MemPoline, to address the side-channel security issue of Path-ORAM [39] and the performance issue of both prior works [39, 37]. MemPoline adopts the ORAM idea of shuffling, but implements a much more efficient permutation scheme to provide a just-in-need security level for defending against memory-based side-channel attacks. Specifically, we use a parameter-based permutation function to shuffle the memory space progressively. We only need to keep the parameter value private (instead of a position map) to track the real, dynamic locations of data; a minimal sketch of this idea follows the contribution list below. Thus, in our scheme, the memory access runtime is O(1), significantly lower than the O(log(N)) of Path-ORAM [39] and the O(N) of Raccoon [37]. We apply our countermeasure MemPoline to both the T-table implementation of AES and the sliding-window implementation of RSA. We evaluate our countermeasure against various memory-based attacks, including Flush+Reload [6], Evict+Time [32], Cache Collision [35], and CacheBank attacks [34, 74] on CPUs, and the memory coalescing [75] and shared memory [76] attacks on GPUs. Results show that our countermeasure can effectively mitigate all these known memory-based side-channel attacks with significantly less performance degradation than other ORAM-based countermeasures. The contributions in this chapter include:

• We propose a novel, efficient, and effective technique to randomize a protected memory space at run time.

• Based on the technique, we propose a software countermeasure against memory-based side-channel attacks to obfuscate a program's memory access pattern.

• We apply our countermeasure to multiple ciphers on different platforms (CPUs and GPUs) and evaluate the resilience against many known memory-based side-channel attacks, both empirically and theoretically.
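The following is our own minimal illustration of the parameter-based permuted indexing mentioned above, not MemPoline's actual permutation function (which is defined in Section 5.4): a private parameter r acts as the key of a bijection over the index space, so a lookup remains O(1) and only r, rather than a position map, must be kept secret. XOR is used here purely as an example of such a keyed bijection.

/* Illustrative permuted access: idx ^ perm_param is a bijection over
 * [0, n) when n is a power of two, so each logical index maps to one
 * physical slot determined by the private parameter. */
static unsigned int perm_param;   /* private permutation parameter r */

unsigned int perm_read(const unsigned int *buf, unsigned int idx,
                       unsigned int n_mask /* n - 1, n a power of two */)
{
    return buf[(idx ^ perm_param) & n_mask];   /* O(1) permuted lookup */
}

Re-drawing r and moving elements accordingly, one step at a time, gives the progressive, epoch-based re-shuffling that the chapter develops.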

The rest of the chapter is organized as follows. In Section 5.2, we describe existing memory-based side-channel attacks and countermeasures. In Section 5.3, we present our threat model. In Section 5.4, we illustrate our approach, MemPoline, a sophisticated and efficient shuffling technique, which effectively resists memory-based side-channel attacks. In Section 5.5, we evaluate

both the security and performance impact of our countermeasure. The chapter is summarized in Section 5.6.

5.2 Background and Related Work

When the memory access footprint of an application depends on a secret (e.g., a key), side-channel leakage of the footprint can be exploited to retrieve the secret. In this section, we give background on the microarchitecture of the memory hierarchy. We discuss existing memory-based side-channel attacks and how they infer the memory access pattern from various side channels exploiting different resources. We classify countermeasures into different categories. We also describe two well-known cryptographic algorithms, AES and RSA, which will be our targets for applying the countermeasure.

5.2.1 Microarchitecture of the Memory Hierarchy

Computer systems rely on off-chip main memory for storage and CPU or GPU cores for computation. However, there is a speed gap between storage and computation. A cache, a critical on-chip fast memory, is deployed to reduce this gap by utilizing the spatial and temporal locality exhibited in program code and data. Modern CPUs are often equipped with multiple levels of caches to balance storage size and access latency. As caches store only a portion of memory content, a memory request can be served directly by the cache hierarchy in case of a cache hit, otherwise by the off-chip memory (a cache miss). The timing difference between a cache hit and a miss forms a timing channel that can be exploited by the adversary to leak secrets. The structure of a cache is similar to a two-dimensional table, with multiple sets (rows), each set consisting of multiple ways (columns). A cache line (a table cell) is the basic unit of data transfer between memory and cache, with a fixed size, normally 64 bytes in modern CPUs. Each cache line corresponds to one memory block. When the CPU requests data at a given memory address, the cache is looked up for the corresponding memory block: the middle field of the memory address locates the cache set (row) first, and the upper field is used as a tag to compare against all the cache lines in the set to identify a hit or a miss. With highly parallel computing resources such as GPUs and multi-threaded CPUs, modern computer architectures split some on-chip storage into multiple banks, allowing concurrent accesses to these banks to increase the data access bandwidth.

For example, in modern Intel processors, the L1 cache includes multiple banks, and each cache line is divided into multiple equal-sized parts distributed among the banks. The on-chip shared memory of many GPUs is also banked. With massive parallelism on GPUs, a microarchitectural structure, the memory coalescing unit, commonly exists on various GPUs. A memory coalescing unit groups concurrent global memory access requests (e.g., in a warp of 32 threads under the single-instruction-multiple-thread execution model on Nvidia Kepler) into distinct memory block transactions, so as to reduce memory traffic and improve performance. However, as shown in Chapter 2, it can also leak the memory access pattern of a running application.
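To make the address-to-cache mapping concrete, the following minimal sketch (our illustration, assuming a 64-byte cache line and a 64-set L1 cache, typical values rather than ones taken from this chapter) extracts the offset, set-index, and tag fields of an address:

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6  /* log2 of the 64-byte cache line size */
#define SET_BITS  6  /* log2 of a 64-set cache              */

int main(void) {
    uintptr_t addr = 0x7ffd12345678u;                              /* example address */
    uintptr_t offset = addr & ((1u << LINE_BITS) - 1);             /* byte within the line   */
    uintptr_t set = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);  /* selects the set (row)  */
    uintptr_t tag = addr >> (LINE_BITS + SET_BITS);                /* compared against ways  */
    printf("offset=%llx set=%llx tag=%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}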

5.2.2 Data Memory Access Footprint

Program data is stored in memory, and memory addresses are used to reference it. If the content-to-memory mapping is fixed and a secret determines which data is used, then by learning the memory access footprint through various side channels, the adversary can infer the secret. Different microarchitectural resources in the memory hierarchy use different portions/fields of the memory address to index themselves, for example, different levels of caches (L1, L2, and LLC) and cache banks. When the victim's access events on these different resources are observed to infer memory accesses, the retrieved memory access footprint also has different levels of granularity. We focus on memory-based side-channel attacks that exploit the sensitive data memory access footprint to retrieve the secret. Examples of sensitive data include the SBox tables of block ciphers such as AES, DES, and Blowfish, and the lookup table of multipliers in RSA. As these microarchitectural resources are shared, the adversary does not need root privilege to access them and can infer the victim's memory access footprint by creating contention on the resources. In view of this attack fundamental, countermeasures are proposed to prevent the adversary from learning the memory access footprint. In Figure 5.1, we classify typical existing memory-based side-channel attacks and countermeasures according to the level of mapping they leverage and address, respectively. Attack. Memory-based side-channel attacks can be classified into access-driven and time-driven. For a time-driven attack, the adversary observes the total execution time of the victim under different inputs and uses statistical methods over a large number of samples to infer the secret. For an access-driven attack, the adversary intentionally creates contention on certain shared resources with the victim to infer the victim's memory access footprint. It consists of three steps: 1. preset - the adversary sets the shared resource to a certain state; 2. execution - the victim runs; 3. measurement -

the adversary checks the state of the resource using timing information.

Figure 5.1: Overview of memory-based side-channel attacks and countermeasures. Countermeasures: 1 Raccoon, MemPoline; 2 Cloak; 3 CATalyst, StealthMem; 4 RFill, NoMo; 5 RCoal. Attacks: A Flush+Reload, Flush+Flush; B Prime+Probe, Evict+Time; C CacheBleed, etc.; D Coalescing Attack; E CacheCollision Attack; F Shared Memory Attack. Resources: GPU memory coalescing unit, GPU shared memory, L1 cache bank, L1 cache line, L3 cache line.

Figure 5.1 lists five vulnerable microarchitectural resources, three on CPUs - the L1 cache line, L3 cache line, and L1 cache bank - and two on GPUs - the memory coalescing unit and shared memory - together with the various attacks utilizing them. The GPU memory coalescing attack [75] and shared memory attack [76], Evict+Time [32], and CacheCollision [35] are time-driven. All other attacks, including Flush+Reload [6], Flush+Flush [9], Prime+Probe [32, 33], and CacheBleed [34], are access-driven. They differ in how they preset the shared resource and how they use timing information to infer the victim's data accesses. Countermeasure. Existing countermeasures are built on three principles to prevent information leakage: partitioning, pinning, and randomization. Partitioning techniques [29, 19, 23, 24, 25], including StealthMem [24] and NoMo [29], split a resource among multiple software entities (processes), so that one process does not share the same microarchitectural resource with another and no side channel can be formed. Pinning techniques [19, 22, 26, 27], including CATalyst [22] and Cloak [26], preload and lock one entity's security-sensitive data in the resource prior to computation, so that any key-dependent memory access to the locked data results in a constant access time. Randomization techniques, such as RFill [20], RCoal [28], and Raccoon [37], randomize the behavior of memory subsystem resources so that the adversary cannot correlate the memory access footprint with the content used in the computation. Hardware countermeasures [20, 28] randomize the mapping between the memory address and on-chip microarchitectural resources; for example, RFill [20] targets the L1 cache, and RCoal [28] targets the memory coalescing unit and randomizes its grouping behavior. Our approach, MemPoline, is in the same category as software ORAM [37, 39], which randomizes the content-to-memory-address mapping.
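As a concrete illustration of the preset and measurement steps, the sketch below shows the core probe of a Flush+Reload spy on x86 (our hedged example, not code from the attacks cited above); the target address and the hit threshold are assumptions that must be calibrated per machine:

#include <stdint.h>
#include <x86intrin.h>

#define HIT_THRESHOLD 120  /* assumed cycle threshold; calibrate per machine */

/* returns 1 if the victim touched `target` since the last flush */
static int probe(const void *target) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);        /* serialized timestamp          */
    *(volatile const char *)target;      /* measurement: reload the line  */
    uint64_t t1 = __rdtscp(&aux);
    _mm_clflush(target);                 /* preset: flush for next round  */
    return (t1 - t0) < HIT_THRESHOLD;    /* fast reload => cached => used */
}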


Compared to the prior ORAM work, Path-ORAM [39] and Raccoon [37], our countermeasure, MemPoline, is much more efficient while still achieving high security. We modify and relax the ORAM constraint for mitigating memory-based side-channel attacks. One key insight is that these memory-based attacks have limited bandwidth and resolution compared to probing the memory bus, so we do not need to shuffle the entire data set on every memory access. For example, an adversary monitoring a cache bank cannot monitor multiple cache banks at the same time and cannot differentiate data accessed within the same cache bank. Thus, the adversary needs many samples before the actual memory access pattern can be recovered. With the relaxed ORAM constraint, we achieve O(1) performance overhead per memory access while still preventing memory-based side-channel attacks. For performance comparison, we apply our countermeasure to the Histogram program, one of the performance benchmarking programs used in Raccoon, and observe over 7000x better performance than the ORAM used in Raccoon.

5.2.3 Vulnerable Ciphers

In this chapter, we apply our countermeasure to two vulnerable ciphers.

5.2.3.1 AES

AES is the standard encryption algorithm, and we use the T-table implementation in OpenSSL 1.0.2n for evaluation. It supports multiple key lengths and different modes of operation. The most common Electronic Codebook (ECB) mode AES with a 16-byte key consists of nine rounds of SubByte, ShiftRow, MixColumn, and AddRoundKey operations, and a last round without the MixColumn operation. In the T-table-based implementation, the last round can be described by $c_i = T_k[s_j] \oplus rk_i$, where $c_i$ is the $i$-th byte of the output ciphertext, $rk_i$ is the $i$-th byte of the last round key, $s_j$ is the $j$-th byte of the last round input state ($j$ differs from $i$ due to the ShiftRow operation), and $T_k$ is the corresponding publicly known T-table for $c_i$. The master key can be recovered once the last round key is known. Memory-based side-channel attacks can reverse-engineer the last round key by inferring the victim's memory access pattern to the publicly known T-tables, with $s_j$ inferred and $c_i$ known as the output.
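This key-recovery relation can be made concrete with a short voting sketch (ours, with hypothetical names): every $s_j$ value that the monitored cache line or bank may hold yields one last-round key guess by inverting the relation above, and the correct byte accumulates votes fastest over many samples.

#include <stdint.h>

/* c_i: observed ciphertext byte; sj_in_line: the s_j values held by the
 * monitored line/bank (n of them); sbox: the standard AES S-box;
 * candidates: vote counters for the 256 possible rk_i values. */
void vote_last_round_key(uint8_t c_i, const uint8_t *sj_in_line, int n,
                         const uint8_t sbox[256], unsigned candidates[256]) {
    for (int k = 0; k < n; k++) {
        uint8_t rk_guess = c_i ^ sbox[sj_in_line[k]];  /* invert c_i = sbox[s_j] XOR rk_i */
        candidates[rk_guess]++;  /* the correct rk_i stands out statistically */
    }
}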

5.2.3.2 RSA

RSA is an asymmetric cipher with two keys, one public and one private. The major computation is modular exponentiation, $r = b^e \bmod m$. In decryption, the exponent $e$ is

the private key. For the sliding-window implementation of the RSA algorithm in GnuPG-1.4.18, the exponent is broken into a series of zero and non-zero windows. The algorithm processes the windows one by one from the most significant one. For each exponent window, a squaring operation is performed first (a multiplication of two identical inputs). If the window exponent is non-zero, another multiplication routine is executed with a pre-calculated multiplier selected by the value of the current window. For an n-bit window, there are 2^(n-1) multiplier values, as only odd values are used in the conditional multiplications. Tracking which multiplier has been used leads to recovery of the window's exponent value, and the table of the 2^(n-1) multipliers is the vulnerable sensitive data.
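The window processing just described can be sketched as follows; this is a toy, fully self-contained stand-in with small fixed-width integers, not GnuPG's actual multi-precision code. The secret-indexed table lookup in the non-zero-window branch is the access an attacker tracks.

#include <stdint.h>
#include <stdio.h>

/* toy modular multiply; assumes m < 2^32 so the 64-bit product cannot overflow */
static uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t m) { return a * b % m; }

/* left-to-right sliding-window exponentiation, 3-bit windows over a 32-bit
 * exponent; only the 2^(3-1) = 4 odd multipliers b^1, b^3, b^5, b^7 are kept */
static uint64_t slide_exp(uint64_t b, uint32_t e, uint64_t m) {
    uint64_t tbl[4], b2 = mod_mul(b, b, m);
    tbl[0] = b % m;
    for (int k = 1; k < 4; k++) tbl[k] = mod_mul(tbl[k - 1], b2, m);
    uint64_t r = 1;
    for (int i = 31; i >= 0; ) {
        if (!((e >> i) & 1)) {                    /* zero window: square only */
            r = mod_mul(r, r, m);
            i--;
        } else {
            int l = (i >= 2) ? i - 2 : 0;
            while (!((e >> l) & 1)) l++;          /* window must end in a 1 bit (odd value) */
            uint32_t w = (e >> l) & ((1u << (i - l + 1)) - 1);
            for (int k = i; k >= l; k--) r = mod_mul(r, r, m);
            r = mod_mul(r, tbl[(w - 1) / 2], m);  /* secret-indexed lookup: the leak */
            i = l - 1;
        }
    }
    return r;
}

int main(void) {
    printf("%llu\n", (unsigned long long)slide_exp(7, 65537u, 1000000007ull));
    return 0;
}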

5.3 Threat Model

Our threat model assumes co-residence of the adversary and the victim on one physical machine. We use this threat model for both the attack implementations and the evaluation of our countermeasure, and we do not anticipate any issue for the countermeasure to work in a cloud environment. The adversarial goal is to recover the secret key of a cryptographic algorithm using memory-based side-channel attacks. The threat model assumes the adversary is a regular user without root privilege, and the underlying operating system is not compromised. The adversary cannot read or modify the victim's memory, but the victim's binary code is publicly known (the common case for ciphers). The adversary can interact with the victim application; for example, the adversary can provide messages for the victim to encrypt, receive the ciphertext, and time the encryption. However, no confidential information is disclosed through any direct channel, e.g., the secret key of the encryption algorithm or intermediate computation results. In this work, we focus on protecting secret-dependent data memory accesses. Instruction memory is also vulnerable to side-channel timing attacks, which we leave to future work.

5.4 Our Countermeasure - MemPoline

5.4.1 Design Overview

The high-level idea of our countermeasure, MemPoline, is to progressively change the organization of sensitive data in memory from one state to another, directed by an efficient parameter-based permutation function, so that it decorrelates the microarchitectural events the adversary

observes and the actual data used by the program. Here, sensitive data refers to data whose access pattern should be protected, rather than the data values themselves. To obfuscate memory accesses, the data layout in memory undergoes randomization through permutation. However, the frequency of permuting and the implementation method have a significant impact on both the security and the performance of the countermeasure. We implement permutation gradually through successive swaps instead of all at once - only bouncing the data to be accessed around before the access (load or store). Once the data layout reaches a permuted state, we update the parameter and continue migrating the layout to the next permuted state. This procedure slowly de-associates any memory address from the actual data content. Thus, the countermeasure provides a security level sufficient to defend against memory-based side-channel attacks with a significant performance gain over ORAM-based countermeasures. The insight enabling such efficient permutation is that the granularity of cache data that a memory-based side-channel attack can observe is limited, which can be leveraged to reduce the frequency of permuting to be just-in-need, lowering the performance degradation.

Figure 5.2: Actions in MemPoline

The countermeasure consists of two major actions at the user level: a one-time initialization and a subsequent swap on each data access (between the accessed data and another data unit selected by the random parameter), as shown in Figure 5.2. During initialization, the original data is permuted and copied into a dynamically allocated memory region (SMem). The permuted state is labeled by one parameter, a random number r, which is used for bookkeeping and for tracking the real memory address of each data access. For example, the data element pointed to by index i in the original data structure is referred to by a different index in the permuted state, j = f_perm(i, r) in SMem, where r is a random value and f_perm is an explicit permutation function. The memory access pattern in SMem can be obfuscated by changing the value of r. If the value of r were fixed, the memory access pattern would also be fixed. This would only

increase the attack complexity, as the adversary would need to recover the combination of r and the key value instead of just the key value; the side-channel information leakage would remain the same, and therefore so would the number of traces needed for a successful attack. On the other hand, if the value of r were updated every time a data element is accessed, the memory access pattern would be truly random. Such an updating frequency could provide the same security guarantee as ORAM [38, 39], but would also inherit its excessive performance degradation. Our countermeasure sets the frequency of changing the value of r at a level that balances security and performance, and implements permutation through successive swaps rather than a one-time action. This way, the security level needed to defend against memory-based side-channel attacks is attained with much better performance than ORAM. Next, we define the data structures of SMem in view of the memory hierarchy and set up auxiliary data structures. Then we illustrate the two actions of our countermeasure.

5.4.2 Define the Data Structures

SMem is a contiguous memory space allocated dynamically. We define its basic element for permutation as a limb, with size equal to that of a cache bank, commonly 4 bytes in modern processors. We thus treat SMem as a 4-byte-addressable memory space. Considering the cache mapping of SMem, we can view SMem as a two-dimensional table, where rows are cache lines, columns are banks, and each cell is a limb. As the observation granularity of memory-based side-channel timing attacks is either a cache line or a cache bank, when we move a limb around, both the row index and the column index should change, to increase the entropy of the memory access obfuscation. We divide limbs into multiple equal-sized groups, and permutations take place within each group independently. To prevent information leakage through monitoring cache lines or cache banks, groups should be uniformly distributed across rows and columns, i.e., each row (or column) should contain an equal number of limbs from each group. Figure 5.2 shows an example SMem, where the number of groups equals the number of columns, groups are formed diagonally, and the number of limbs in a group equals the number of rows. With this well-balanced grouping, when a limb moves around within its group as directed by the parameter-based permutation function, it can appear in any cache line or cache bank, obfuscating the memory access and therefore mitigating information leakage. Note that in modern computer systems, the cache line size is the same throughout the memory hierarchy: RAM, Last-Level-Cache (LLC), L2, L1, and even the memory coalescing unit. Therefore, we can mitigate information leakage at different

memory hierarchy levels simultaneously.

In SMem, the initialization sets each group in a permuted state, described by r1.

During program execution, as the permuted state gradually updates toward r2, the group is at any time in a mixed state: some limbs are in the r1 state and others in the r2 state. Once the entire group reaches the r2 state, r1 is obsolete and is updated with r2, and a new random number is generated for r2. Along the temporal horizon, we define the progression from a starting permuted state r1 to another permuted state r2 as an epoch. For a limb originally indexed by i, its location in SMem is f_perm(i, r1) if it is in the r1 state, and f_perm(i, r2) otherwise. To keep track of which permuted state a limb i is in, a bitmap is allocated during initialization and kept updated: when bitmap[f_perm(i, r1)] is 1, limb i is in the r1 permuted state; otherwise, it is in the r2 permuted state.
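A minimal sketch of this state tracking (our illustration; group handling and the swap itself are omitted, and all names are hypothetical):

#include <stdint.h>

typedef struct {
    uint32_t *limbs;   /* SMem, viewed as 4-byte limbs                     */
    uint8_t  *bitmap;  /* bitmap[j] == 1: the limb at j is in the r1 state */
    uint32_t  r1, r2;  /* current and next permutation parameters          */
} smem_t;

/* where does original index i currently live? */
static uint32_t locate(const smem_t *d, uint32_t i) {
    uint32_t j1 = i ^ d->r1;                  /* f_perm(i, r1)      */
    return d->bitmap[j1] ? j1 : (i ^ d->r2);  /* else f_perm(i, r2) */
}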

5.4.3 Initialization - Loading Original Sensitive Data

We load the original sensitive data into SMem for two reasons: compatibility and security. The original sensitive data in a vulnerable program may be statically or dynamically allocated. To make our countermeasure compatible with both situations, we load the original data into a dynamically allocated region, SMem; this only incurs overhead for statically allocated data. Dynamically allocated memory also has a security advantage. Given that the layout of the original data is publicly known, an adversary able to monitor the cache activities during initialization could track each permutation step and recover the final permuted state. If the original data is statically allocated and defined, its offset within the binary file is fixed, which in turn maps to a fixed cache region. In contrast, the address of dynamically allocated memory is determined at runtime, so identifying the cache region belonging to it requires the adversary to scan the entire cache and observe multiple related cache accesses. Using dynamically allocated memory thus protects the initial permutation from being monitored by the adversary. The original sensitive data in memory is byte addressable. For program data accesses, the unit can be multi-byte, and it should be aligned with the limb size (determined by the cache bank size). For example, for T-table based AES, the data unit size is four bytes, fitting in one limb; for an SBox-based implementation, the unit is one byte, and three bytes are padded to make one limb. Therefore, each data unit occupies one or more contiguous limbs. The overall size of SMem depends on the size of the original sensitive data. For a static

type of data, each data unit has the same length. For a dynamic type of data structure, each data unit may have variable length, such as an array of pointers pointing to memory regions of different sizes. To optimize the bookkeeping overhead, all data units occupy the same number of limbs in SMem; thus, the size of SMem is the number of data units multiplied by the size of the largest data unit. To map a data unit indexed by i to a location in SMem, we need its coordinate in SMem, i.e., the row and column, from which the group ID can be derived. Note that, unlike previous ORAM approaches, MemPoline does not rely on an auxiliary mapping table to determine a location for i, as such a mapping table is itself side-channel vulnerable. Instead, we develop functions that associate i with a memory address through private random numbers. For simplicity, we assume each data unit occupies one limb in SMem; we later extend the approach to general cases where a data unit occupies two or more limbs, e.g., the table of multipliers in the sliding-window implementation of RSA. We start by filling SMem row by row, in the same manner as a contiguous data structure is mapped to memory, shown as the white table in Figure 5.2, where the data unit index i directly translates to the limb memory address. In each cell, the number in the middle is the original data index and the number at the top-right corner is the SMem address. When permuting, the content moves around in SMem. For the example in Figure 5.2, the 32 limbs (eight rows and four columns) are divided into four diagonal groups. In each group, a specific random number r1 is chosen to perform the permutation. The permutation function is exclusive OR, satisfying i1 ⊕ r1 = j1 and i1 ⊕ r2 = j2; the contents at addresses j1 and j2 are swapped. For each group of eight limbs, as shown in Figure 5.2, four swaps are performed directly by its corresponding initial r1, after which the entire SMem is in the r1 permuted state. To handle the case where a data unit occupies multiple limbs, we treat the data unit i as a structure consisting of n limbs. The loading and initial permutation operations are still performed at the granularity of limbs, and one data access now translates into n limb accesses. After permutation, these limbs are scattered in SMem and are not necessarily contiguous; the individual limbs are located and gathered to form the data unit requested by the program execution.
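The initial permutation of one group can be sketched as below (our simplification: indices are positions within a single group's index space, the group size is a power of two, and r1 < size, so the diagonal row/column grouping of Figure 5.2 is abstracted away):

#include <stdint.h>

static void init_permute_group(uint32_t *group, uint32_t size, uint32_t r1) {
    for (uint32_t p = 0; p < size; p++) {
        uint32_t q = p ^ r1;
        if (p < q) {                /* visit each {p, p^r1} pair once */
            uint32_t t = group[p];  /* after the loop, the limb       */
            group[p] = group[q];    /* originally at index i sits at  */
            group[q] = t;           /* f_perm(i, r1) = i ^ r1         */
        }
    }
}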

5.4.4 Epochs of Permuting

After initialization, the program execution is accompanied by epochs of permutation of SMem, distributed across data accesses. For each data access, given the index in the original data


structure, we locate the limbs in SMem and move some data units from the r1 permuted state to r2. The procedure is shown in Listing 5.1.

Listing 5.1: Locating the original index i in SMem

mp_locate_and_swap(i):
    j1 = r1_index(i)
    j2 = r2_index(i)
    // 3rd argument == false: fake swap
    // 3rd argument == true:  real swap
    oblivious_swap(j1, j2, bitmap[j1] == 1)
    random_perm(group_index(i))
    j2 = r2_index(i)
    return address at j2

Locating Data Element. The data unit indexed by i in the original data structure exists in

SMem in one of two possible states: either in the r1 permuted state at j1 = r1_index(i) = f_perm(i, r1), or in the r2 permuted state at j2 = r2_index(i) = f_perm(i, r2), depending on the value of bitmap[j1]: bitmap[j1] = 1 indicates that i = f_perm^-1(j1, r1) is in the r1 permuted state, and bitmap[j1] = 0 indicates that i = f_perm^-1(j1, r2) is in the r2 permuted state.

Permuting. Once the data element is located, if bitmap[j1] is 1 we swap its content with the content at j2 = r2_index(i) in SMem; the swap disassociates cache events related to accessing j1 from i. If bitmap[j1] is 0, we perform a fake swap procedure to disguise the fact that i is at j2. This conditional swap is the oblivious_swap in Listing 5.1. In addition to swapping the accessed element, we perform one more random permutation pair by swapping j3 and j4 in the same group, as shown in Figure 5.2. This procedure, random_perm in Listing 5.1, serves two purposes: first, it guarantees that at least one data unit moves to the r2 permuted state per memory access; second, it adds noise to the memory access pattern. Specifically, we randomly select a data unit indexed by u such that bitmap[r1_index(u)] is 1. If there is no such index in the group, all data units in the group must be in the r2 permuted state; we then assign the value of r2 to r1 and generate a new random value for r2. Once u is selected, we swap the contents at j3 = r1_index(u) and j4 = r2_index(u).
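One way to realize oblivious_swap in constant time is a branchless masked swap, sketched below (our illustration, not necessarily the chapter's exact implementation): both locations are read and written whether the swap is real or fake, so the two cases produce the same access pattern.

#include <stdint.h>

static void oblivious_swap(uint32_t *a, uint32_t *b, int do_swap) {
    uint32_t mask = (uint32_t)-(uint32_t)(do_swap != 0);  /* all-ones or zero   */
    uint32_t d = (*a ^ *b) & mask;  /* zero when faking, so values persist      */
    *a ^= d;                        /* both limbs are touched in either case,   */
    *b ^= d;                        /* hiding whether the swap was real or fake */
}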


5.4.4.1 Parameter-Based Permutation Function

We use the xor function (⊕) as the parameter-based permutation function, moving two data elements from the r1 permuted state to the r2 permuted state at a time while leaving all other data elements untouched.

At the beginning of an epoch, all data units are in the permuted state r1. When an access request for data unit i1 arrives, we first identify its location in SMem as j1 = i1 ⊕ r1. As it is now being requested, it is time for it to be updated to the r2 permuted state and relocated to j2 = i1 ⊕ r2. The data unit currently residing at j2 is still in the r1 state, and its original index satisfies i2 ⊕ r1 = j2 = i1 ⊕ r2.

By swapping the contents at j1 and j2 in SMem, both data units i1 and i2 are moved to the r2 permuted state, located at i1 ⊕ r2 and i2 ⊕ r2, respectively. In the following, we prove why this swap implements the permutation without affecting other data units.

Let $r_1$ and $r_2$ be random numbers of the same size (bit length), and let $i_1$ and $i_2$ be indices into the original data structure $d$. Suppose $i_1$ and $i_2$ are located at $j_1 = i_1 \oplus r_1$ and $j_2 = i_2 \oplus r_1$ in SMem (denoted $D$), respectively. That is,

\[ D[i_1 \oplus r_1] = d[i_1], \qquad D[i_2 \oplus r_1] = d[i_2]. \]

With the swap operation, we move $i_1$ to $j_2 = i_1 \oplus r_2$ and $i_2$ to $j_1 = i_1 \oplus r_1$. Therefore,

\[ i_1 \oplus r_2 = i_2 \oplus r_1. \tag{5.1} \]

XORing both sides of Equation 5.1 with $(r_1 \oplus r_2)$, we have

\[ i_1 \oplus r_2 \oplus (r_1 \oplus r_2) = i_2 \oplus r_1 \oplus (r_1 \oplus r_2), \tag{5.2} \]
\[ i_1 \oplus r_1 = i_2 \oplus r_2. \tag{5.3} \]

After the swap operation,

\[ D[i_1 \oplus r_1] = d[i_2], \qquad D[i_2 \oplus r_1] = D[i_1 \oplus r_2] = d[i_1]. \]

By Equation 5.3, we have

\[ D[i_1 \oplus r_1] = D[i_2 \oplus r_2] = d[i_2], \]

i.e., $i_2$ now also resides at its $r_2$-state location.
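The property can also be checked by brute force; the toy program below (ours, for illustration only) verifies it over an 8-element group for all combinations of r1, r2, and i1:

#include <assert.h>
#include <stdint.h>

int main(void) {
    enum { N = 8 };  /* toy group of 8 limbs */
    for (uint32_t r1 = 0; r1 < N; r1++)
        for (uint32_t r2 = 0; r2 < N; r2++)
            for (uint32_t i1 = 0; i1 < N; i1++) {
                uint32_t D[N];
                for (uint32_t i = 0; i < N; i++)
                    D[i ^ r1] = i;                            /* r1 state: D[i^r1] = d[i] = i */
                uint32_t j1 = i1 ^ r1, j2 = i1 ^ r2;
                uint32_t t = D[j1]; D[j1] = D[j2]; D[j2] = t; /* the swap                     */
                uint32_t i2 = i1 ^ r1 ^ r2;                   /* partner index, from Eq. 5.3  */
                assert(D[i1 ^ r2] == i1);                     /* i1 is now in the r2 state    */
                assert(D[i2 ^ r2] == i2);                     /* and so is its partner i2     */
            }
    return 0;
}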


5.4.5 Security Analysis

Since our countermeasure uses a parameter-based permutation function, the range of the parameter value determines the total number of permuted states of SMem. If we changed the parameter value on every memory access, we could prove the security of SMem to be as strong as that of Path-ORAM [39] against memory-based side-channel attacks. When a victim performs a load/store on a data element indexed by i in the original data structure, an adversary can observe the corresponding cache line (or bank), line_j, being accessed. However, if the data element is remapped to a new random cache line line_k, observing line_k is statistically independent of observing line_j: line_k can be any cache line with uniform probability 1/L, where L is the number of cache lines, guaranteed by our balanced grouping. Thus, the adversary cannot associate the observed cache line line_k with the data element. If the layout of the sensitive data in memory is described by a parameter-directed permutation state, changing the parameter value on every data access means that all data elements are shuffled even though most of them are not used by that access; this operation would take O(N) time. Given the limited granularity of the side-channel information observed by the adversary, we in fact do not need to change the parameter value on every data access. For example, when one cache line contains multiple data elements, an access to any of them lets the adversary observe an access to the cache line, but the adversary cannot determine which data element was used. Memory-based side-channel attacks therefore require multiple observations to statistically identify the accessed data element; for instance, even the most accurate implementation of the Flush+Reload technique needs more than a few thousand observations to statistically identify the 16 T-table elements accessed in AES. As long as we can move all data elements from one permuted state to the next before they can be statistically identified, we hide the access pattern from leaking through the side channel. As shown in our empirical results, no data element is identifiable by any of the memory-based side-channel attacks we evaluated when our countermeasure is applied.

5.4.6 Operations Analysis

Table 5.1 gives an overview of all major operations in MemPoline. In the initialization step, a memory space is allocated and the original data is loaded into it. The data layout then progressively migrates from one permuted state to the next upon every memory access; this step incurs the major overhead. Locating a limb requires two extra memory reads of the bitmap. Every permute/swap operation requires three extra memory writes: two to update the data in SMem and one to update the bitmap. For all limbs within a group to migrate to the new permuted state, the number of bitmap-updating writes equals half of the group size. The bitmap access complexity is O(1), and since the data index i is protected, no information leaks when the bitmap is looked up.

User Action      Operation                                  Calling Frequency             Memory Accesses
Init             1. Allocate memory                         One time                      n writes
                 2. Move data to SMem with initial          One time                      n reads + n writes
                    permutation
Memory Access    1. Locate element                          Per access                    2 reads
(Read/Write)     2. Permute                                 Per access                    3 writes
                 3. Generate new random value               Per (group size)/2 accesses   (group size)/2 writes

Table 5.1: Operations in MemPoline

API                                                                       Description
struct mp* mp_init(uint16_t max_elm_size, uint16_t n_elm)                 allocate the data structure for SMem
void mp_save(uint16_t i, void* elm_ptr, uint16_t elm_len, struct mp* d)   store data element i to SMem
uint32_t* mp_locate_and_swap(uint16_t i, struct mp* d)                    locate and permute the data element i
void mp_free(struct mp* d)                                                free SMem

Table 5.2: APIs for MemPoline

5.4.7 Implementation - API

Application source code has to be changed to store data in SMem. MemPoline provides developers four simple APIs for initializing, loading, accessing (locating and swapping), and releasing SMem, as shown in Table 5.2. First, developers define and allocate SMem using mp_init. Second, developers copy the sensitive data structure to be protected, such as an SBox or multiplier lookup table, into the allocated memory space using mp_save. Developers locate data elements and perform swapping using mp_locate_and_swap. Finally, developers release the allocated memory space using mp_free. In this work, we apply these APIs to AES and RSA to protect the T-tables and the multiplier lookup table, respectively, and evaluate the security and performance impact of our approach.
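Putting the four APIs together, a typical usage looks like the sketch below (ours), here protecting a 256-entry table of 4-byte words; the prototypes follow Table 5.2, while the table name and index are illustrative:

#include <stdint.h>

struct mp;
struct mp *mp_init(uint16_t max_elm_size, uint16_t n_elm);
void mp_save(uint16_t i, void *elm_ptr, uint16_t elm_len, struct mp *d);
uint32_t *mp_locate_and_swap(uint16_t i, struct mp *d);
void mp_free(struct mp *d);

extern const uint32_t table[256];  /* sensitive data to protect */

void example(void) {
    struct mp *d = mp_init(sizeof(uint32_t), 256);        /* allocate SMem   */
    for (uint16_t i = 0; i < 256; i++)                    /* load + permute  */
        mp_save(i, (void *)&table[i], sizeof(uint32_t), d);
    uint32_t v = *mp_locate_and_swap(42, d);              /* shuffled lookup */
    (void)v;
    mp_free(d);                                           /* release SMem    */
}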


5.4.7.1 Source Code Transformation for AES

We add a constructor and a destructor to allocate and deallocate SMem using mp_init and mp_free, respectively. Because the T-tables are statically allocated, we copy their data into SMem inside the constructor. We also replace the original T-table lookup operation with an mp_locate_and_swap call, as shown in Listing 5.2, where Te0 is the original T-table and STe0 is of type struct mp and contains all the data of Te0. With the modified code, the assembly code size increases by 11.6%.

Listing 5.2: Transforming AES T-table lookup operation to secure one

Te0[(s0 >> 24)]                          // original T-table lookup
*mp_locate_and_swap((s0 >> 24), STe0)    // secure replacement
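The constructor/destructor wiring mentioned above could look like the following sketch (ours, using GCC's constructor attribute; STe0 follows the text and the mp_* prototypes follow Table 5.2, while the function names are illustrative):

#include <stdint.h>

struct mp;
struct mp *mp_init(uint16_t max_elm_size, uint16_t n_elm);
void mp_save(uint16_t i, void *elm_ptr, uint16_t elm_len, struct mp *d);
void mp_free(struct mp *d);

extern const uint32_t Te0[256];  /* OpenSSL's static T-table   */
static struct mp *STe0;          /* its protected copy in SMem */

__attribute__((constructor))
static void ste0_ctor(void) {    /* runs before main()         */
    STe0 = mp_init(sizeof(uint32_t), 256);
    for (uint16_t i = 0; i < 256; i++)
        mp_save(i, (void *)&Te0[i], sizeof(uint32_t), STe0);
}

__attribute__((destructor))
static void ste0_dtor(void) { mp_free(STe0); }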

5.4.7.2 Source Code Transformation for RSA - Sliding Window Implementation

Unlike AES, the multiplier lookup table is dynamically created, so no constructor or destructor is needed. Instead, we replace the allocation and initialization with mp_init, the loading of pre-computed multipliers with mp_save, the multiplier lookup operation with mp_locate_and_swap, and the deallocation with mp_free, as shown in Listing 5.3. With the modified code, the assembly code size increases by only 0.4%.

Listing 5.3: Transforming RSA to secure one

mpi_ptr_t b_2i3[SIZE_B_2I3];                                    // original
pdata *b2i3s = mp_init(sizeof(mpi_limb_t)*n_limbs, n_elems);    // replacement

MPN_COPY(b_2i3[i], rp, rsize);                                  // original
mp_save(i, rp, sizeof(mpi_limb_t)*rsize, b2i3s);                // replacement

base_u = b_2i3[e0 - 1];                                         // original
base_u = mp_locate_and_swap(e0 - 1, b2i3s);                     // replacement

mpi_free_limb_space(b_2i3[i]);                                  // original
mp_free(b2i3s);                                                 // replacement


5.5 Evaluation

In this section, we first perform a case study on AES with the countermeasure MemPoline applied. We evaluate both the security of the countermeasure against various memory-based side-channel attacks and its performance impact. We then study applying the countermeasure to RSA.

5.5.1 Experimental Setup

As our countermeasure is general, defending against various attacks on different platforms, we conduct experiments on both CPUs and GPUs. The CPU system is a workstation equipped with an Intel i7 Sandy Bridge CPU with three levels of caches (L1, L2, and L3, with sizes of 64KB, 256KB, and 8MB, respectively) and 16GB of DRAM; hyperthreading is enabled. We evaluate standard cipher implementations from two crypto libraries, OpenSSL 1.0.2n and GnuPG-1.4.18. We focus on AES key recovery from the OpenSSL implementation, and use RSA in GnuPG-1.4.18 to demonstrate that our countermeasure can easily be applied to other algorithms with security-sensitive data. The GPU platform is a server equipped with an Nvidia Kepler K40 GPU. We adopt the standard CUDA port of the OpenSSL AES implementation, as used in prior work [75, 77].

5.5.2 Security Evaluation of AES

We evaluate the security of our countermeasure by applying it to T-table based AES on both CPU and GPU platforms. Here, security refers to the side-channel resilience of MemPoline against various attacks, compared to the original unprotected ciphers. We expect MemPoline to address information leakage across different microarchitectural resources. Specifically, we have evaluated six memory-based side-channel attacks, targeting the L1 cache line, L3 cache line, and L1 cache bank on CPUs, and the memory coalescing and shared memory units on GPUs. We evaluate security at two levels, each with and without the countermeasure. First, we use the Kolmogorov-Smirnov null-test [78] to quantify the essential side-channel information leakage observable with each attack technique, from the evaluator's point of view - assuming the correct key is known. Second, we perform an empirical security evaluation by launching all these attacks and analyzing a large number of samples, from the attacker's point of view.


5.5.2.1 Essential Leakage Quantification

Memory-based side-channel attacks on AES monitor the access pattern to a portion (one cache line/bank) of the T-tables during the last round. For the original implementation, where the mapping of the T-tables to memory addresses and cache is fixed, adversaries know which values the monitored cache line/bank contains. When adversaries detect an access by the victim to the monitored cache line/bank in the last round, the resulting ciphertext must use one of the values, a set of $s_j$, in the monitored cache line/bank. With the ciphertext bytes $\{c_i \mid 0 \leq i \leq 15\}$ known to the adversary, there is information leakage about the last round key, $\{rk_i \mid 0 \leq i \leq 15\}$, through the relationship shown below:

\[ rk_i = c_i \oplus sbox[s_j] \tag{5.4} \]

Flush+Reload (F+R): This is an access-driven attack consisting of three steps. The state of the shared cache is first set by flushing one cache line from the cache. The victim, AES, then runs. Finally, the spy process reloads the flushed cache line and times it; a shorter reload time indicates AES has accessed the cache line. If there is information leakage at the L3 cache line, the attack can correctly classify the ciphertexts/samples that accessed the monitored cache line based on the observed reload timing. In other words, using the correct key byte we can compute all cache lines accessed by the 40 T-table lookup operations of one AES run, and classify the observed reload timings based on whether or not the monitored cache line was accessed by that run. If the two timing distributions are distinguishable, the attack observes the information leakage. We collect 100K samples and show the result in Figure 5.3, where the x-axis is the observed reload timing in CPU cycles and the y-axis is the cumulative distribution function (CDF). For the original implementation, shown in Figure 5.3(a), the access and non-access distributions are visually distinguishable. For the secure implementation with MemPoline applied, the two distributions are not distinguishable, as shown in Figure 5.3(b), meaning no information leakage is observed by the Flush+Reload attack when our countermeasure is applied. The distinguishability between two distributions can be measured by the Kolmogorov-Smirnov (KS) null-test [78]: if the null hypothesis test result, the p-value, is below a significance level (e.g., 0.05), the distributions are distinguishable. Using the stats package in Python, we compute the p-values for the non-secure and secure implementations against an F+R attack, which are 0 and 0.27, respectively, indicating no leakage in the secure implementation. We analyze the other known memory-based side-channel timing attacks and use the KS null-test for them as well. The attacks differ in type (access-driven vs. time-driven), observing granularity, and the distributions of timing observations being compared.
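For reference, the two-sample KS statistic used throughout this evaluation can be computed as in the sketch below (ours; the text's p-values were obtained with Python's stats package, and this C sketch only computes the statistic D, the maximum distance between the two empirical CDFs):

#include <math.h>
#include <stdlib.h>

static int cmp_dbl(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* two-sample Kolmogorov-Smirnov statistic over timing samples s1, s2 */
double ks_statistic(double *s1, size_t n1, double *s2, size_t n2) {
    qsort(s1, n1, sizeof *s1, cmp_dbl);
    qsort(s2, n2, sizeof *s2, cmp_dbl);
    size_t i = 0, j = 0;
    double d = 0.0;
    while (i < n1 && j < n2) {
        double x = (s1[i] <= s2[j]) ? s1[i] : s2[j];
        while (i < n1 && s1[i] == x) i++;   /* consume ties together so  */
        while (j < n2 && s2[j] == x) j++;   /* both CDFs step at value x */
        double diff = fabs((double)i / n1 - (double)j / n2);
        if (diff > d) d = diff;
    }
    return d;
}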


Figure 5.3: Information leakage under Flush+Reload - CDFs of the observed reload timing (CPU cycles) for access and non-access samples: (a) original (non-secure) implementation; (b) secure implementation.

Evict+Time (E+T): This attack is time-driven and can observe information leakage at both L1 and L3 cache lines. The adversary evicts one of the T-table cache lines with its own data and runs the victim (AES), measuring the victim's execution time. The two distributions are thus the measured encryption timings, split by whether or not the evicted cache line was accessed by the AES run. Cache Collision (CC): This attack is also time-driven and can observe information leakage at both L1 and L3 cache lines. It is based on the fact that if two different cipher bytes use the same cache line in the last round, the total execution time is statistically shorter. The distributions are the measured encryption times, split by whether or not the two cache lines accessed by two cipher bytes (e.g., the 2nd and 14th) in the last round are the same. L1 Cache Bank (CB): This attack is access-driven and has a different observing granularity: the cache bank. The measured time is the adversary's bank access time while the victim is running AES encryption. The distributions are the measured bank access times, split by whether or not AES accesses the same bank as the adversary in the last round. Memory Coalescing Unit Attack (CU): This attack is on a GPU, utilizing the on-chip memory coalescing unit that consolidates concurrent memory requests before sending them to the cache hierarchy. The attack can observe information leakage if there is a linear relationship between the number of coalesced cache lines and the total execution time. Because the KS null-test only takes two distributions, we run it on any two points of the linear relationship (corresponding to two numbers of coalesced cache lines), or on the two segments divided by the median number. Shared Memory Attack (SM): This attack is also on a GPU, utilizing the banked shared memory unit.


Similar to CU, the foundation of this attack is exploiting a linear relationship between the number of bank conflicts and the total execution time; we compute the KS p-value similarly by choosing any two distributions. We show the KS null-test p-values for these attacks on AES in Table 5.3. The p-values for the non-secure implementations are all close to zero (below the significance level), while those for the secure implementations are all above the significance level. The result demonstrates that our countermeasure MemPoline successfully obfuscates memory accesses, leaving no observable information leakage.

Attack   Implementation   p-value      Attack   Implementation   p-value
F+R      Secure           0.27         CB       Secure           0.53
         Non-Secure       0                     Non-Secure       0
E+T      Secure           0.96         CU       Secure           0.76
         Non-Secure       0                     Non-Secure       0
CC       Secure           0.48         SM       Secure           0.95
         Non-Secure       0.01                  Non-Secure       0

Table 5.3: Kolmogorov-Smirnov null-test p-values

5.5.2.2 Empirical Attacks

We perform attacks to recover the key. Given the leakage quantification results in Section 5.5.2.1, we expect that we cannot recover the key from the secure implementations, while the original implementations should be vulnerable. For all attacks on the secure implementations, we cannot recover the key even with 2^32 samples (about 256GB of timing and ciphertext data). Attack failure with this many samples demonstrates that the implementations with the countermeasure on are secure. For the F+R attack on the original non-secure implementation, we can reliably recover the key using fewer than 10K samples, as shown in Figure 5.4(a), which uses the appearance frequency of the correct key value as the distinguisher. For comparison across attack trials that use different numbers of samples, we normalize the appearance frequency of each key value by its mean value. Figure 5.4(b) shows that the attack does not work on the secure implementation even with 4 billion samples. We summarize the key recovery results for all attacks on the original implementations in Table 5.4, where the different effectiveness of the attacks reflects their observation granularity and attack resolution.


For example, F+R is more effective than E+T, CC, and CB on CPUs, while the GPU attacks (CU and SM) are much less effective, requiring millions of samples.

Attack   # Samples used for key recovery   Observing granularity
F+R      10K                               Cache Line (L1, L3)
E+T      130K                              Cache Line (L1, L3)
CC       1.5M                              Cache Line (L1)
CB       2M                                Cache Bank
CU       1M                                Cache Line (GPU)
SM       2M                                Cache Bank (GPU)

Table 5.4: AES key recovery on original implementations

Figure 5.4: Flush+Reload attack result - normalized appearance frequency of key guesses versus the number of samples: (a) non-secure; (b) secure.

Security When the r1 Parameter is Fixed. Our countermeasure features epochs of permuting, which involve random number regeneration and swaps, incurring overhead. If SMem were shuffled only once, i.e., with a single unknown r1, the performance would be much better; however, such a design leaves the implementation vulnerable to memory-based side-channel attacks.

In this section, we analyze the situation with only one r1. In this case, all data units stay in the r1 permuted state, and each data unit is tied to a unique combination of its original index and the r1 value. With the r1 value fixed, this unique combination eventually leaks through the side channel.


                            Algorithmic Mem Accesses   Permuting   Generating Random Value
RSA (1 Decryption)          6048                       8754        265
AES (100 Encryptions,       4000                       4000        456
per T-table)

Table 5.5: Summary of operation overhead for AES and RSA

We run the Flush+Reload attack to monitor one cache line in SMem with the r1 value fixed.

The attack result is shown in Figure 5.5. The adversary has to guess r1 in addition to the key value. Since the r1 value ranges from 0 to 255, the adversary can try all r1 values and recover the key, as shown in Figure 5.5(b); the attack does not succeed when the r1 guess is wrong, as shown in Figure 5.5(a). The attack complexity increases, but the effectiveness (the number of samples needed to recover the key) is similar to that in Figure 5.4(a).

Figure 5.5: Flush+Reload attack result when the r1 value is fixed - normalized appearance frequency of key guesses versus the number of samples: (a) r1 = 0; (b) r1 = 18.

5.5.2.3 Application to Other Algorithms

We also evaluate a patched sliding-window implementation of the RSA algorithm against the F+R attack. For the purpose of security evaluation (rather than attack), we share the dynamically allocated memory used by the multipliers with the adversary.


With this shared memory, we can use the F+R technique to monitor the usage of one multiplier (Multiplier 1). Otherwise, the attacker would need the Prime+Probe technique to monitor the multiplier's usage, which would still work but contains more noise than F+R [33]. We follow a victim model similar to prior work [33, 34]. Specifically, we repeatedly run the RSA decryption of a small message encrypted with a 3,072-bit ElGamal public key. The attack records the reload time of the monitored multiplier and the actual multiplier (calculated from the algorithm) accessed after every multiplication operation. If the attack can observe any leakage, it should be able to differentiate samples that access the monitored multiplier (one distribution) from those that do not (the other distribution) based on the observed reload time. We use the KS null-test [78] to verify the leakage. The p-values for the original implementation and the secure implementation are 0 and 0.77, respectively: once the countermeasure is applied, the two timing distributions are indistinguishable.

5.5.3 Performance Evaluation

Our countermeasure is at the software level and involves an initialization and run-time shuffling, incurring performance degradation. However, unlike other software-based countermeasures [23, 24, 25], which affect performance system-wide, the impact of our approach is limited to the patched application. The computation overhead strongly depends on the memory access pattern of the program. Besides the initialization step, the major source of runtime overhead is the mp_locate_and_swap function call, which performs two major actions: permuting limbs and generating a new random value. Table 5.5 summarizes how frequently these two actions are performed in AES and RSA. The calling frequency is determined by the number of algorithmic access requests to the sensitive data (T-tables for AES and multipliers for RSA), which translates into additional execution time. Function Runtime. We repeatedly run the mp_locate_and_swap function with random inputs; the function takes 669 CPU cycles on average. The limb-locating action takes 22 CPU cycles, and generating a new random value takes 78 CPU cycles. The permuting action consists of two operations, swap and random permute: the swap operation takes 22 cycles, and the random permute operation takes 567 cycles. Each memory access thus results in an overhead of 669 CPU cycles. Considering Amdahl's law, with other computation (without data accesses) and cache hits, the overall slowdown of the program can be much less significant.
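As a back-of-envelope illustration (ours, not a measurement from this section): let T be the baseline runtime in cycles and A the number of protected-table accesses, each paying roughly C = 669 extra cycles; then

\[ \text{slowdown} \;\approx\; \frac{T + A \cdot C}{T} \;=\; 1 + \frac{A \cdot C}{T}, \]

so access-light code such as the RSA routine (A small relative to T) stays near 1x, while access-heavy code such as T-table AES pays an order-of-magnitude penalty.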


RSA Runtime. We measure the runtime performance impact for the RSA algorithm, which has fewer memory accesses but heavy logical computation. We run the RSA decryption of a 1KB file 10,000 times. The mean execution time is 0.0190 seconds for the original code and 0.0197 seconds for the patched code - only a 4% performance degradation. The sliding-window implementation of RSA makes an insignificant number of accesses to the protected memory in comparison to its other computations. AES Runtime. We also measure the runtime overhead for AES by encrypting a 16MB file 10,000 times; we use a larger file because AES encryption is much faster than RSA. The mean execution time is 0.132 seconds for the original code and 1.584 seconds for the patched code, a 12x performance slowdown. Unlike RSA, memory accesses to the sensitive data constitute a major portion of the AES code; any additional operation on such inherent memory accesses introduces a significant penalty, especially since the T-table implementation of AES is highly efficient. Comparison to other work. Our performance is significantly better than that of other ORAM-based countermeasures. The countermeasure proposed by Maas et al. [79], which uses a hardware implementation of ORAM, imposes a 14.7x performance overhead. Raccoon [37] is a software-level countermeasure that adopts a software implementation of ORAM for storing data; in some of its benchmarks, it experiences more than 100x overhead due to the ORAM operations alone. For example, the Histogram program shows a 144x slowdown on 1K input data elements. Applying our countermeasure to the Histogram program, we observe a 1.4% slowdown with 1K input data elements.

5.6 Summary

Any application with secret-dependent memory accesses can be vulnerable to memory-based side-channel attacks. An ORAM scheme can completely hide the memory access footprint, as shown by the software ORAM-based countermeasure [37]; however, the ORAM-related operations can impose more than 100x performance overhead. Our countermeasure pursues just-in-need security against memory-based side-channel attacks with significantly better performance than other ORAM-based countermeasures. Our software-based countermeasure progressively shuffles data within a memory region and randomizes the secret-dependent data memory access footprint. We apply the countermeasure to the AES and RSA algorithms on both CPUs and GPUs. Both empirical and theoretical results show no information leakage when the countermeasure is enabled, under all known memory-based side-channel attacks.

We see a 12x performance slowdown for AES and a 4% performance slowdown for RSA.

Chapter 6

Conclusion

In Chapter 2, for the first time in the literature, we demonstrate how GPUs are also vulnerable to memory-based side-channel attacks. We analyze the memory coalescing unit in a GPU and identify a linear relationship between the number of unique cache lines accessed and the execution time of the global memory load instruction, which constitutes the timing leakage. We show that not only the cache structure can leak the memory access pattern, but also other memory resources, even with much subtler timing leakage. Ultimately, we demonstrate a full AES key recovery through the vulnerable memory coalescing unit. In Chapter 3, we identify another vulnerable memory resource on GPUs, the shared memory unit. Similarly, we showcase a full AES key recovery by exploiting the timing leakage, namely the linear relationship between the number of shared memory bank conflicts and the execution time of a shared memory load instruction. In Chapter 4, we revisit the cache structure. Instead of exploiting the timing difference between a cache miss and a hit, we demonstrate that the subtle timing leakage due to bank conflicts can also be exploited to leak applications' memory access patterns, and we derive several attack methodologies that utilize this timing leakage to break AES encryption. In Chapter 5, we propose a software-level countermeasure, MemPoline, to protect table-based cryptographic implementations against memory-based side-channel attacks on different platforms. Specifically, we adopt an efficient and effective parameter-based permutation function to shuffle the memory space distributively, and hence obfuscate the memory accesses of an application. We evaluate our countermeasure by applying it to both the T-table implementation of AES and the sliding-window implementation of RSA, and validate its resilience against known memory-based side-channel attacks on both CPU and GPU platforms. The performance impact is highly correlated with the frequency with which the patched application accesses the shuffled memory space.

Overall, we see a 12x performance slowdown for AES and a 4% performance slowdown for RSA. Memory-based side-channel attacks are increasingly powerful and are considered a serious cyber threat. They have been demonstrated to break the security guarantees of cryptographic systems and to violate confidentiality and users' privacy. In this dissertation, we demonstrate that not only the cache structure can be exploited, but also other memory resources on different platforms, with much subtler timing leakage, can be utilized to leak an application's confidential memory access pattern. We recognize the need for countermeasures that work across different platforms and are independent of the underlying architecture. We propose a software-level countermeasure that mitigates memory-based side-channel attacks by efficiently obfuscating secret-dependent memory accesses.

Bibliography

[1] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Annual Int. Cryptology Conference. Springer, 1999, pp. 388–397.

[2] S. B. Örs, E. Oswald, and B. Preneel, “Power-analysis attacks on an fpga – first experimental results,” in International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 2003, pp. 35–50.

[3] S. B. Ors, F. Gurkaynak, E. Oswald, and B. Preneel, “Power-analysis attack on an asic aes implementation,” in International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004., vol. 2. IEEE, 2004, pp. 546–552.

[4] C. Luo, Y. Fei, P. Luo, S. Mukherjee, and D. Kaeli, “Side-channel power analysis of a gpu aes implementation,” in 2015 33rd IEEE International Conference on Computer Design (ICCD), Oct 2015, pp. 281–288.

[5] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: the case of aes,” in The RSA Conference. Springer, 2006, pp. 1–20.

[6] Y. Yarom and K. Falkner, “Flush+reload: a high resolution, low noise, l3 cache side-channel attack,” in USENIX Security Symp., 2014, pp. 719–732.

[7] C. Percival, “Cache missing for fun and profit,” 2005.

[8] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard, “Armageddon: Cache attacks on mobile devices,” in USENIX Security Symp., 2016.

[9] D. Gruss, C. Maurice, K. Wagner, and S. Mangard, “Flush+flush: a fast and stealthy cache attack,” in Int. Conf. on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2016, pp. 279–299.


[10] D. Gruss, R. Spreitzer, and S. Mangard, “Cache template attacks: Automating attacks on inclusive last-level caches.” in USENIX Security Symposium, 2015, pp. 897–912.

[11] Nvidia, “Nvidia cuda toolkit v7.0 documentation,” 2015. [Online]. Available: http://docs.nvidia.com/cuda/index.html

[12] B. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes, 2012.

[13] K. Iwai, T. Kurokawa, and N. Nisikawa, “Aes encryption implementation on cuda gpu and its analysis,” in Int. Con. on Networking & Computing, 2010, pp. 209–214.

[14] S. Manavski et al., “Cuda compatible gpu as an efficient hardware accelerator for aes cryptography,” in IEEE Int. Conf. on Signal Processing & Communications, 2007, pp. 65–68.

[15] A. E. Cohen and K. K. Parhi, “Gpu accelerated elliptic curve cryptography in gf(2^m),” in IEEE Int. Midwest Symp. on Circuits & Systems, 2010, pp. 57–60.

[16] R. Szerwinski and T. Güneysu, “Exploiting the power of gpus for asymmetric cryptography,” in Cryptographic Hardware & Embedded Systems, 2008, pp. 79–99.

[17] D. Le, J. Chang, X. Gou, A. Zhang, and C. Lu, “Parallel aes algorithm for fast data encryption on gpu,” in Int. Conf. on Computer Engineering & Technology, vol. 6, 2010, pp. V6–1.

[18] A. di Biagio, A. Barenghi, G. Agosta, and G. Pelosi, “Design of a parallel aes for graphics hardware using the CUDA framework,” in IEEE Int. Symp. on Parallel Distributed Processing, May 2009, pp. 1–8.

[19] Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-based side channel attacks,” ACM SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 494–505, 2007.

[20] F. Liu and R. B. Lee, “Random fill cache architecture,” in IEEE/ACM Int. Symp. on Microarchitecture, 2014, pp. 203–215.

[21] F. Liu, H. Wu, K. Mai, and R. B. Lee, “Newcache: Secure cache architecture thwarting cache side-channel attacks,” IEEE Micro, vol. 36, no. 5, pp. 8–16, 2016.

[22] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee, "CATalyst: Defeating last-level cache side channel attacks in cloud computing," in IEEE Int. Symp. on High Performance Computer Architecture. IEEE, 2016, pp. 406–418.

[23] H. Raj, R. Nathuji, A. Singh, and P. England, “Resource management for isolation enhanced cloud services,” in Proceedings of the 2009 ACM workshop on Cloud computing security. ACM, 2009, pp. 77–84.

[24] T. Kim, M. Peinado, and G. Mainar-Ruiz, "StealthMem: System-level protection against cache-based side channel attacks in the cloud," in USENIX Security Symposium, 2012, pp. 189–204.

[25] Z. Zhou, M. K. Reiter, and Y. Zhang, “A software approach to defeating side channels in last-level caches,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 871–882.

[26] D. Gruss, J. Lettner, F. Schuster, O. Ohrimenko, I. Haller, and M. Costa, “Strong and efficient cache side-channel protection using hardware transactional memory,” in USENIX Security Symposium, 2017.

[27] S. Chen, F. Liu, Z. Mi, Y. Zhang, R. B. Lee, H. Chen, and X. Wang, “Leveraging hardware transactional memory for cache side-channel defenses,” in Proceedings of the 2018 on Asia Conference on Computer and Communications Security. ACM, 2018, pp. 601–608.

[28] G. Kadam, D. Zhang, and A. Jog, "RCoal: mitigating GPU timing attack via subwarp-based randomized coalescing techniques," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 156–167.

[29] L. Domnitser, A. Jaleel, J. Loew, N. Abu-Ghazaleh, and D. Ponomarev, “Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, no. 4, p. 35, 2012.

[30] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, "DAWG: A defense against cache timing attacks in speculative execution processors," in IEEE/ACM Int. Symp. on Microarchitecture (MICRO), 2018.

[31] D. J. Bernstein, “Cache-timing attacks on AES,” University of Illinois at Chicago, Tech. Rep., 2005.

[32] E. Tromer, D. A. Osvik, and A. Shamir, "Efficient cache attacks on AES, and countermeasures," Journal of Cryptology, vol. 23, no. 1, pp. 37–71, 2010.

[33] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are practical,” in IEEE Symp. on Security & Privacy, 2015.

[34] Y. Yarom, D. Genkin, and N. Heninger, “Cachebleed: A timing attack on OpenSSL constant time RSA,” in Cryptographic Hardware & Embedded Systems, Aug. 2016.

[35] J. Bonneau and I. Mironov, “Cache-collision timing attacks against AES,” in Cryptographic Hardware and Embedded Systems, 2006, pp. 201–215.

[36] E. Biham, "A fast new DES implementation in software," in International Workshop on Fast Software Encryption. Springer, 1997, pp. 260–272.

[37] A. Rane, C. Lin, and M. Tiwari, "Raccoon: Closing digital side-channels through obfuscated execution," in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 431–446.

[38] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious rams,” J. ACM, vol. 43, no. 3, pp. 431–473, May 1996. [Online]. Available: http://doi.acm.org/10.1145/233551.233553

[39] E. Stefanov, M. Van Dijk, E. Shi, C. Fletcher, L. Ren, X. Yu, and S. Devadas, "Path ORAM: an extremely simple oblivious RAM protocol," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 299–310.

[40] R. Di Pietro, F. Lombardi, and A. Villani, "CUDA leaks: information leakage in GPU architectures," arXiv preprint arXiv:1305.7383, 2013.

[41] M. J. Patterson, "Vulnerability analysis of GPU computing," Ph.D. dissertation, Iowa State University, 2013.

[42] J. Danisevskis, M. Piekarska, and J.-P. Seifert, "Dark side of the shader: Mobile GPU-aided malware delivery," in Information Security and Cryptology, 2014, pp. 483–495.

[43] Nvidia, "Whitepaper: NVIDIA's next generation CUDA compute architecture: Kepler GK110," 2015. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

[44] A. Lashgar, E. Salehi, and A. Baniasadi, "Understanding outstanding memory request handling resources in GPGPUs," in Proceedings of the Sixth International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2015), 2015.

[45] K. Pearson, “Notes on regression and inheritance in the case of two parents,” in Proc. the Royal Society of London, vol. 58, June 1895, pp. 240–242.

[46] D. Brumley and D. Boneh, “Remote timing attacks are practical,” in Proc. Int. USENIX Security Symp., 2003, pp. 1–1.

[47] Y. Fei, A. A. Ding, J. Lao, and L. Zhang, “A statistics-based fundamental model for side-channel attack analysis.” IACR Cryptology ePrint Archive, vol. 2014, p. 152, 2014.

[48] S. A. Crosby, D. S. Wallach, and R. H. Riedi, “Opportunities and limits of remote timing attacks,” ACM Transactions on Information & System Security, vol. 12, no. 3, p. 17, 2009.

[49] D. Page, “Partitioned cache architecture as a side-channel defense mechanism.” IACR Cryptol- ogy ePrint Archive, vol. 2005, p. 280, 2005.

[50] Z. Wang and R. B. Lee, “A novel cache architecture with enhanced performance and security,” in IEEE/ACM Int. Symp. on Microarchitecture, 2008, pp. 83–93.

[51] J. Kong, O. Acıiçmez, J.-P. Seifert, and H. Zhou, "Hardware-software integrated approaches to defend against software cache-based side channel attacks," in IEEE Int. Symp. on High Performance Computer Architecture, 2009, pp. 393–404.

[52] Y. Wang, A. Ferraiuolo, and G. E. Suh, "Timing channel protection for a shared memory controller," in IEEE Int. Symp. on High Performance Computer Architecture, 2014, pp. 225–236.

[53] Y. Wang and G. E. Suh, “Efficient timing channel protection for on-chip networks,” in IEEE/ACM Int. Symp. on Networks on Chip, 2012, pp. 142–151.

[54] Z. H. Jiang and Y. Fei, “A novel cache bank timing attack,” in 2017 IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), Nov 2017, pp. 139–146.

[55] B. Schneier, “Description of a new variable-length key, 64-bit block cipher (blowfish),” in International Workshop on Fast Software Encryption. Springer, 1993, pp. 191–204.

[56] N. Nishikawa, K. Iwai, and T. Kurokawa, "High-performance symmetric block ciphers on multicore CPU and GPUs," International Journal of Networking and Computing, vol. 2, no. 2, pp. 251–268, 2012.

[57] A. A. Abdelrahman, M. M. Fouad, H. Dahshan, and A. M. Mousa, "High performance CUDA AES implementation: A quantitative performance analysis approach," in 2017 Computing Conference. IEEE, 2017, pp. 1077–1085.

[58] E. Karimi, Z. H. Jiang, Y. Fei, and D. Kaeli, "A timing side-channel attack on a mobile GPU," in 2018 IEEE Int. Conf. on Computer Design (ICCD). IEEE, 2018, pp. 67–74.

[59] C. Mei, H. Jiang, and J. Jenness, "CUDA-based AES parallelization with fine-tuned GPU memory utilization," in 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), April 2010, pp. 1–7.

[60] D. A. Osvik, J. W. Bos, D. Stefan, and D. Canright, "Fast software AES encryption," in International Workshop on Fast Software Encryption. Springer, 2010, pp. 75–93.

[61] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, "Implementation and analysis of AES encryption on GPU," in 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. IEEE, 2012, pp. 843–848.

[62] N. Nishikawa, K. Iwai, H. Tanaka, and T. Kurokawa, "Throughput and power efficiency evaluation of block ciphers on Kepler and GCN GPUs using micro-benchmark analysis," IEICE Transactions on Information and Systems, vol. 97, no. 6, pp. 1506–1515, 2014.

[63] J. Gilger, J. Barnickel, and U. Meyer, "GPU-acceleration of block ciphers in the OpenSSL cryptographic library," in International Conference on Information Security. Springer, 2012, pp. 338–353.

[64] H. Naghibijouybari, A. Neupane, Z. Qian, and N. Abu-Ghazaleh, "Rendered insecure: GPU side channel attacks are practical," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018, pp. 2139–2153.

[65] H. Eldib, C. Wang, M. Taha, and P. Schaumont, "QMS: Evaluating the side-channel resistance of masked software from source code," in Proceedings of the 51st Annual Design Automation Conference, 2014, pp. 209:1–209:6.

[66] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, "Mutual information analysis," in Cryptographic Hardware and Embedded Systems, 2008, pp. 426–442.

[67] M. Rivain, “On the exact success rate of side channel analysis in the gaussian model,” in Selected Areas in Cryptography, 2009, pp. 165–183.

[68] B. Gülmezoğlu, M. S. Inci, G. Irazoqui, T. Eisenbarth, and B. Sunar, "A faster and more realistic Flush+Reload attack on AES," in International Workshop on Constructive Side-Channel Analysis and Secure Design. Springer, 2015, pp. 111–126.

[69] Z. Lin, U. Mathur, and H. Zhou, "Scatter-and-gather revisited: High-performance side-channel-resistant AES on GPUs," in Proceedings of the 12th Workshop on General Purpose Processing Using GPUs. ACM, 2019, pp. 2–11.

[70] G. Irazoqui, T. Eisenbarth, and B. Sunar, "S$A: A shared cache attack that works across cores and defies VM sandboxing – and its application to AES," in 2015 IEEE Symposium on Security and Privacy (SP). IEEE, 2015, pp. 591–604.

[71] Z. Zhou, M. K. Reiter, and Y. Zhang, "A software approach to defeating side channels in last-level caches," in Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security. ACM, 2016, pp. 871–882.

[72] Intel, "Intel 64 and IA-32 architectures optimization reference manual," 2016. [Online]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

[73] ARM, "Cortex-A15 MPCore technical reference manual," 2016. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438h/BABEFEFH.html

[74] Z. H. Jiang and Y. Fei, “A novel cache bank timing attack,” in Proceedings of the 36th International Conference on Computer-Aided Design. IEEE Press, 2017, pp. 139–146.

[75] Z. H. Jiang, Y. Fei, and D. Kaeli, “A complete key recovery timing attack on a gpu,” in IEEE Int. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[76] Z. H. Jiang, Y. Fei, and D. R. Kaeli, "A novel side-channel timing attack on GPUs," in ACM Great Lakes Symp. on VLSI. IEEE Press, 2017, pp. 167–172.

[77] E. Karimi, Z. H. Jiang, Y. Fei, and D. Kaeli, “A timing side-channel attack on a mobile gpu,” in 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct 2018, pp. 67–74.

[78] A. Kolmogorov, "Sulla determinazione empirica di una legge di distribuzione," Inst. Ital. Attuari, Giorn., vol. 4, pp. 83–91, 1933.

[79] M. Maas, E. Love, E. Stefanov, M. Tiwari, E. Shi, K. Asanovic, J. Kubiatowicz, and D. Song, "PHANTOM: Practical oblivious computation in a secure processor," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 311–324.
