Memory-Based Side-Channel Attacks and Countermeasures
A Dissertation Presented by
Zhen Hang Jiang
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Engineering
Northeastern University Boston, Massachusetts
July 2019

To my parents, wife, brother, and sister.
Contents
Acknowledgments v
Abstract of the Dissertation vi
1 Introduction 1
  1.1 Motivation 1
  1.2 Existing Memory-Based Side-Channel Attacks and Countermeasures 3
  1.3 Dissertation Overview 5
  1.4 Dissertation Contribution 5

2 Information Leakage in Memory Coalescing Unit 7
  2.1 Introduction 7
  2.2 Related Work 9
  2.3 Background 10
    2.3.1 GPU Memory Architecture 10
    2.3.2 AES GPU Implementation 11
  2.4 Correlation Timing Attack 14
    2.4.1 SIMT Architecture Leakage 15
    2.4.2 AES Encryption Leakage 19
    2.4.3 Correlation Timing Attack on GPU AES Implementation 24
    2.4.4 Attack on Highly Occupied GPU 31
    2.4.5 Discussion 34
  2.5 Countermeasures 35
  2.6 Summary 35

3 Information Leakage in Shared Memory Banks 36
  3.1 Introduction 36
  3.2 Background 38
    3.2.1 AES Encryption 38
    3.2.2 Nvidia GPU Memory Hierarchy 39
    3.2.3 Single Instruction Multiple Threads Execution Model 41
  3.3 Threat Model 42
  3.4 Cache Bank Conflicts-Based Side-Channel Timing Channel 42
  3.5 Differential Timing Attack 44
    3.5.1 Mapping Between the AES Lookup Tables and GPU Shared Memory Banks 46
    3.5.2 Collecting Data 46
    3.5.3 Calculating the Shared Memory Bank Index 46
    3.5.4 Recovering Key Bytes 47
    3.5.5 More Realistic Attack Scenarios 52
  3.6 Timing Analysis on Other Architectures 55
  3.7 Discussions and Countermeasures 59
    3.7.1 Multi-Key Implementation As Countermeasure 61
  3.8 Summary 65

4 Information Leakage in L1 Cache Banks 66
  4.1 Introduction 66
  4.2 Background 67
    4.2.1 AES Encryption 67
    4.2.2 Intel Cache Architecture 68
    4.2.3 Cache Timing Attacks 69
    4.2.4 Countermeasures against Cache Timing Attacks 70
    4.2.5 L1 Cache Bank and CacheBleed Attack 71
  4.3 Cache Bank Timing 73
    4.3.1 Threat Model 73
    4.3.2 The Cache Bank Timing Channel 73
    4.3.3 Attacking AES Encryption 74
  4.4 Countermeasures 83
  4.5 Summary 84

5 The Countermeasure - MemPoline 85
  5.1 Introduction 85
  5.2 Background and Related Work 87
    5.2.1 Microarchitecture of the Memory Hierarchy 87
    5.2.2 Data Memory Access Footprint 88
    5.2.3 Vulnerable Ciphers 90
  5.3 Threat Model 91
  5.4 Our Countermeasure - MemPoline 91
    5.4.1 Design Overview 91
    5.4.2 Define the Data Structures 93
    5.4.3 Initialization - Loading Original Sensitive Data 94
    5.4.4 Epochs of Permuting 95
    5.4.5 Security Analysis 98
    5.4.6 Operations Analysis 98
    5.4.7 Implementation - API 99
  5.5 Evaluation 101
    5.5.1 Experimental Setup 101
    5.5.2 Security Evaluation of AES 101
    5.5.3 Performance Evaluation 107
  5.6 Summary 108
6 Conclusion 110
Bibliography 112
Acknowledgments
I would like to express my deepest gratitude to my advisor, Professor Yunsi Fei, and my dissertation committee members, Professors David Kaeli, Adam Ding, and Thomas Wahl, for their invaluable advice and continual support throughout my PhD study at Northeastern University. Finally, my sincere appreciation goes to my wife, for her encouragement and being the consummate partner in all aspects of life, and my parents, brother, and sister, for their unconditional and constant love and support.
Abstract of the Dissertation
Memory-Based Side-Channel Attacks and Countermeasures
by Zhen Hang Jiang
Doctor of Philosophy in Computer Engineering
Northeastern University, July 2019
Dr. Yunsi Fei, Advisor
Recent years have seen various side-channel timing attacks demonstrated on both CPUs and GPUs, in diverse settings such as desktops, clouds, and mobile systems. These attacks observe events on shared resources in the memory hierarchy through timing information, infer the secret-dependent memory access pattern, and finally retrieve the secret through statistical analysis. We generalize these attacks as memory-based side-channel attacks. In this dissertation, we identify several side-channel vulnerabilities in memory resources on both GPU and CPU platforms and propose novel side-channel attacks that exploit these vulnerabilities for secret retrieval. Specifically, we examine the memory coalescing unit and the shared memory unit on GPU platforms, and the L1 cache banks on CPU platforms. These microarchitectural resources, indispensable for performance optimization, inadvertently leak an application's memory access pattern. We craft memory-based side-channel attacks to capture such leakage and exploit it to recover the entire 16-byte key of the Advanced Encryption Standard (AES). As memory-based side-channel attacks are very powerful and many common microarchitectural resources on various systems are vulnerable, defenses against them should be sought. Based on the insight that all existing memory-based side-channel attacks (including our proposed ones) exploit the fixed mapping between content and memory resources, we propose a novel
software countermeasure, MemPoline, against memory-based side-channel attacks. MemPoline hides the secret-dependent memory access pattern by moving sensitive data around randomly within a memory space. Although an adversary may still observe events on microarchitectural resources, the randomness prevents her from retrieving useful secret information. We implement efficient permutations directed by parameters, significantly lighter weight than the prior oblivious RAM technology, yet achieving similar security. The countermeasure only requires changes in the source code, and has the great advantages of being general (algorithm-agnostic), portable (independent of the underlying architecture), and compatible (a user-space approach that works with any operating system or hypervisor). The contributions of this dissertation include the identification of several new memory-based side channels on CPUs and GPUs, which are weaker than the traditional CPU cache side channel but reside on different microarchitectural resources and are therefore orthogonal to cache side-channel countermeasures. The proposed software countermeasure addresses the root cause of memory-based side-channel attacks and effectively protects cryptographic implementations on both CPUs and GPUs against all these memory-based attacks with a minimal performance impact.
Chapter 1
Introduction
This dissertation focuses on memory-based side-channel attacks, which exploit the memory access footprint inferred from observable microarchitectural events, and countermeasures that prevent these attacks. In this chapter, we start with motivations for further investigation of memory-based side-channel attacks beyond the existing work, and then give an overview of the attacks and countermeasures proposed in this dissertation. Finally, we summarize the contributions of this dissertation.
1.1 Motivation
Cryptography plays a crucial role in providing three fundamental security properties: confidentiality, integrity, and authenticity, through various cryptographic functions including encryption, hashing, signing, and authentication. Rather than relying on "security by obscurity," information security relies on only the keys being secret, while the algorithms and even the implementations are open and standardized. Hence, adequately protecting secret keys is critical in order to deliver the security guarantee. Since the very first successful key-recovery demonstration of Differential Power Analysis (DPA) [1] by Kocher et al., side-channel attacks have changed the notion of "security" for cryptographic algorithms despite their mathematically proven security. Various side channels, including the most common power consumption and electromagnetic (EM) emanation, have been leveraged to break cryptographic engines, such as the Advanced Encryption Standard (AES) and RSA, on many platforms, such as FPGAs [2], ASICs [3], and GPUs [4]. While this type of attack requires physical access to a targeted system to obtain the physical side-channel information, memory-based side-channel
attacks can be mounted remotely, presenting a serious cyber threat to cryptographic software, servers, and cloud services. Memory-based side-channel attacks, which exploit the memory access footprint inferred from observable microarchitectural events, have gained popularity in the side-channel security community and become a serious threat not only to cryptographic implementations but also to general software bearing secrets. For example, researchers have successfully demonstrated recovering a full encryption key [5, 6, 7] and logging keyboard events [8, 9, 10] using memory-based side-channel attacks. Most memory-based side-channel attacks target one particular memory resource, the cache, and exploit the significant difference between cache hit and cache miss access times.

With the introduction of programmable shader cores and high-level programming frameworks [11, 12], GPUs have been integrated into complex heterogeneous computer systems for accelerating applications. Given their ability to provide high throughput and efficiency, GPUs are now being leveraged to offload cryptographic workloads from CPUs [13, 14, 15, 16, 17, 18]. This move to the GPU allows cryptographic processing to achieve up to 28X higher throughput [13]. While an increasing number of security systems are deploying GPUs, the security of GPU execution has not been well studied. In this dissertation, we take the first step: we thoroughly analyze two memory resources on GPUs, the memory coalescing unit and the banked shared memory unit, discover side-channel timing leakage in both, and devise two memory-based side-channel attacks that successfully break 16-byte AES encryption on a GPU. Similar to the banked shared memory unit on GPUs, the L1 cache of modern complex processors is also banked, in order to achieve high bandwidth for superscalar processors and to reduce power consumption.
Rather than being a monolithic microarchitectural module, the L1 cache is composed of multiple cache banks, which allow multiple concurrent accesses to different banks at one time. However, when two or more accesses target the same bank, a bank conflict arises, and the accesses are processed in a serialized manner. The subtle timing difference between parallel and serial cache bank accesses can be exploited to leak sensitive information. Based on this timing difference, we design another memory-based side-channel attack to recover the 16-byte AES encryption key.

Despite numerous countermeasures [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], none of them can prevent all existing memory-based side-channel attacks. Protecting vulnerable applications against different memory-based side-channel attacks is challenging and can be costly, thus calling for more general countermeasures that work across architectures and applications. We propose a software countermeasure, MemPoline, to provide a just-in-need security level to defend against memory-based side-channel attacks. Specifically, we use a parameter-based permutation function to shuffle the memory space progressively. Results show that our countermeasure can effectively mitigate all known memory-based side-channel attacks with significantly low performance degradation.

                  Cache                   L1 Cache Bank        Shared Memory     Memory Coalescing
Timing Channel    Cache Miss/Hit          Bank Conflict        Bank Conflict     Coalescing
Platform          CPU                     CPU                  GPU               GPU
Attacks           Bernstein Timing        CacheBleed [34],     Jiang et al.      Jiang et al.
                  Attack [31],            Jiang et al.         [GLSVLSI 2017]    [HPCA 2016]
                  Prime+Probe [32, 33],   [ICCAD 2017]
                  Evict+Time [32],
                  Flush+Reload [6],
                  Flush+Flush [9], ...

Table 1.1: Memory-Based Side-Channel Attacks
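The bank-conflict channel in Table 1.1 can be made concrete with a toy timing model: accesses that land in distinct banks complete in parallel, while accesses colliding on a bank serialize. The bank count and bank width below are illustrative only, not those of any particular processor.

```python
# Toy timing model for banked-memory conflicts (illustrative parameters:
# 16 banks, 4 bytes per bank, low address bits select the bank).
from collections import Counter

NBANKS, WIDTH = 16, 4

def bank(addr):
    # which bank a byte address maps to
    return (addr // WIDTH) % NBANKS

def cycles(addrs):
    # serialized cycles = size of the largest group of same-bank accesses
    return max(Counter(bank(a) for a in addrs).values())

# two accesses to different banks finish in 1 cycle ...
assert cycles([0, 4]) == 1
# ... but two accesses that collide on bank 0 take 2 cycles
assert cycles([0, 64]) == 2
```

A victim whose bank indices depend on secret data thus modulates its own (or a co-resident spy's) execution time, which is the leakage the attacks in Chapters 3 and 4 measure.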
1.2 Existing Memory-Based Side-Channel Attacks and Countermeasures
Attack. The cache is a critical structure for performance: it reduces the speed gap between main memory and the computation (on CPU or GPU cores) by exploiting the spatial and temporal locality exhibited in program code and data. As caches store only a portion of the memory content, a memory request can be served directly by the cache in the case of a cache hit, and otherwise by the off-chip memory (a cache miss). The timing difference between a cache hit and a miss can be hundreds of cycles, and hence it forms a strong timing side channel that many memory-based side-channel attacks exploit. However, as the memory subsystem becomes more complex, many other memory resources are also vulnerable. We classify the existing memory-based side-channel attacks and our three proposed attacks according to the memory resource that each exploits, and present them in Table 1.1. In this dissertation, we identify and explore memory-based side channels other than the common and strong one that exploits the timing difference between a cache hit and a miss.

Memory-based side-channel attacks can be classified into access-driven and time-driven attacks.
For a time-driven attack [32, 35, 31], the adversary observes the total execution time of the victim under different inputs and uses statistical methods over a large number of samples to infer the secret. For an access-driven attack [32, 33, 34, 9], the adversary intentionally creates contention with the victim on certain shared resources to infer the victim's memory access footprint. Such an attack consists of three steps: 1. preset - the adversary sets the shared resource to a certain state; 2. execution - the victim runs; 3. measurement - the adversary checks the state of the resource using timing information.

Countermeasure. While the number of memory-based side-channel attacks continues to grow, various countermeasures have been proposed. Hardware-based countermeasures modify the cache architecture and policies, and can be efficient [19, 20, 21, 29, 28]. However, they are invasive, require hardware redesign, and oftentimes address only a specific attack. Software countermeasures [23, 24, 25, 36] require no hardware modification and make changes at different levels of the software stack, e.g., the source code, binary code, compiler, or operating system. They are favorable for existing computer systems, with the potential to be general, portable, and compatible. The software implementation of the Oblivious RAM (ORAM) scheme in prior work [37] has been demonstrated to successfully mitigate cache side-channel attacks. The ORAM scheme [38, 39] was originally designed to hide a client's data access pattern to remote storage from an untrusted server by repeatedly shuffling and encrypting data. Raccoon [37] repurposes ORAM to prevent memory access patterns from leaking through cache side channels. The Path-ORAM scheme [39] uses a small client-side private storage to store a position map for tracking the real locations of the data, and assumes the server cannot monitor the access pattern.
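The three-step access-driven procedure above (preset, execution, measurement) can be sketched with a toy cache model in the style of Prime+Probe. Everything here (the set count, line size, and direct-mapped organization) is illustrative, not taken from any attack cited above.

```python
# Toy model of an access-driven attack: a direct-mapped cache of NSETS
# sets with 64-byte lines; set ownership stands in for timing.
NSETS, LINE = 64, 64

def cache_set(addr):
    return (addr // LINE) % NSETS

def prime_probe(victim_addrs):
    # 1. preset: the spy fills every set with its own lines
    cache = {s: "spy" for s in range(NSETS)}
    # 2. execution: the victim runs and evicts the spy's lines
    for a in victim_addrs:
        cache[cache_set(a)] = "victim"
    # 3. measurement: the spy re-reads its lines; a slow (evicted) set
    #    reveals that the victim touched it
    return sorted(s for s in range(NSETS) if cache[s] != "spy")

# a victim whose access depends on a secret index leaks that index
secret = 13
touched = prime_probe([secret * LINE])
```

In a real attack, step 3 is a timed re-read of each set (slow means evicted); the dictionary here simply collapses that timing into ownership.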
However, in side-channel attacks, all access patterns can be monitored, and indexing into a position map is itself insecure against memory-based side-channel attacks. Instead of indexing, Raccoon [37], which focuses on control-flow obfuscation and uses ORAM for storing data, streams in the position map to look for the real data location, which provides a strong security guarantee. However, since it relies on ORAM for storing data, its memory access runtime is O(N) for N data elements, and the ORAM-related operations can incur more than 100x performance overhead. We propose a software countermeasure, MemPoline, to address the side-channel security issue of Path-ORAM [39] and the performance issue of both prior works [39, 37]. MemPoline adopts the ORAM idea of shuffling, but implements a much more efficient permutation scheme. In our scheme, the permutation is directed by a parameter; thus, we only need to keep the parameter value private (instead of a position map) to track the real, dynamic locations of data. For our countermeasure, the memory access runtime is O(1), significantly lower than the O(log(N)) of Path-ORAM [39] and the O(N) of Raccoon [37].
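As a rough illustration of the parameter-directed idea (an assumed simplification, not MemPoline's actual scheme, which permutes progressively across epochs and is detailed in Chapter 5), a single XOR parameter can stand in for a full position map: lookups stay O(1), and starting a new epoch just picks a fresh parameter and moves the data accordingly.

```python
# Sketch of a parameter-directed permutation over a table whose size is a
# power of two: location(i) = i XOR r, so only r must stay private.
import secrets

class PermutedTable:
    def __init__(self, data):
        self.n = len(data)          # assumed to be a power of two
        self.r = 0                  # the secret permutation parameter
        self.buf = list(data)

    def get(self, i):
        # O(1): one XOR recovers the current location of element i
        return self.buf[i ^ self.r]

    def repermute(self):
        # new epoch: pick a fresh parameter and physically move the data
        new_r = secrets.randbelow(self.n)
        self.buf = [self.buf[(loc ^ new_r) ^ self.r] for loc in range(self.n)]
        self.r = new_r
```

An adversary observing which locations are touched sees indices XORed with r, which changes every epoch, so the observed footprint decorrelates from the logical (secret-dependent) indices.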
1.3 Dissertation Overview
The dissertation consists of two major parts. The first part explores vulnerable memory resources other than the cache structure that leaks side-channel information via cache hits and misses; the second part proposes a software-based countermeasure to mitigate memory-based side-channel attacks for applications running on systems with such vulnerable memory resources.

In Chapters 2 and 3, we examine vulnerable memory resources on GPU platforms. Specifically, we thoroughly analyze two memory resources on GPUs: the memory coalescing unit in Chapter 2 and the banked shared memory unit in Chapter 3. We discover side-channel timing leakage in these two resources and devise two memory-based side-channel attacks that successfully break 16-byte AES encryption on various GPU platforms. In Chapter 4, we analyze the banked L1 cache of modern complex CPUs. We derive a memory-based side-channel attack that exploits the subtle timing difference between parallel and serial cache bank accesses to recover the 16-byte AES encryption key.

In Chapter 5, we propose a software countermeasure, MemPoline, to provide a just-in-need security level against memory-based side-channel attacks. Specifically, we use a parameter-based permutation function to shuffle the memory space progressively and obfuscate memory accesses. We apply MemPoline to both the T-table implementation of AES and the sliding-window implementation of RSA. We evaluate the countermeasure against various memory-based attacks on both CPU and GPU platforms. Results show that the countermeasure can effectively mitigate known memory-based side-channel attacks with significantly less performance degradation than other ORAM-based countermeasures. We conclude the dissertation in Chapter 6.
1.4 Dissertation Contribution
In this dissertation, we propose a number of new memory-based side-channel attacks and countermeasures. We thoroughly examine several microarchitectural units in terms of their timing leakage, reverse-engineer partial structure and behavior of these units, and identify their vulnerability to side-channel attacks. Our work significantly augments the awareness of side-channel security in broader computer architecture across various computing platforms. The contributions of the dissertation to the areas of computer architecture and side-channel security include:
1. Memory Coalescing Unit: We discover the very first memory resource on GPUs that can leak the memory access footprint of an application. We overcome the challenges for memory-based
side-channel attacks introduced by the GPU's parallel computing features and design an effective memory coalescing side-channel attack against AES encryption. This attack is time-driven, non-invasive, and non-interfering, and only measures the total execution time of the GPU under different data inputs. We demonstrate that even a slight timing difference can render memory resources on a GPU vulnerable to memory-based side-channel attacks.
2. Shared Memory Banks: We discover another memory resource on GPUs that can leak the memory access footprint of an application: the shared memory banks. We design another effective time-driven memory-based attack that exploits only the interaction among parallel threads through the shared memory banks. No prior work has investigated this memory resource on GPUs.
3. L1 Cache Bank: There is a very subtle timing side channel in the L1 cache banks of CPUs, caused by the small stalling delay due to conflicts between concurrent access requests to the same bank. We design an access-driven cache bank attack with a spy process and a concurrent victim process, supported by Hyper-Threading. Observing the total execution time of the spy process allows a malicious user to infer the memory access pattern of the victim process through their contention on the L1 cache banks. Since all existing countermeasures target cache side-channel attacks that rely on the cache miss penalty, none of them can prevent our new cache bank attack, as it is orthogonal to other cache attacks and yields a different side-channel granularity - cache bank versus cache line.
4. MemPoline: We propose a software-based countermeasure to mitigate existing memory-based side-channel attacks across different memory resources and platforms. The countermeasure is built on top of a novel efficient and effective technique to randomize a memory space at run- time so that it obfuscates a program’s memory access pattern. We apply the countermeasure to multiple ciphers on different platforms (CPUs and GPUs) and evaluate the resilience and demonstrate the countermeasure can defeat all known memory-based side-channel attacks, both empirically and theoretically.
Chapter 2
Information Leakage in Memory Coalescing Unit
2.1 Introduction
With the introduction of programmable shader cores and high-level programming frameworks [12, 11], GPUs have become fully programmable parallel computing devices. Compared to modern multi-core CPUs, a GPU can deliver significantly higher throughput by executing workloads in parallel over thousands of cores. As a result, GPUs have quickly become the accelerator of choice for a large number of applications, including physics simulation, biomedical analytics, and signal processing. Given their ability to provide high throughput and efficiency, GPUs are now being leveraged to offload cryptographic workloads from CPUs [13, 14, 15, 16, 17, 18]. This move to the GPU allows cryptographic processing to achieve up to 28X higher throughput [13]. While an increasing number of security systems are deploying GPUs, the security of GPU execution has not been well studied. Pietro et al. identified that information leakage can occur throughout the memory hierarchy due to the lack of memory-zeroing operations on a GPU [40]. Previous work has also identified vulnerabilities of GPUs using software methods [41, 42]. While there have been a large number of studies on side-channel security on other platforms, such as CPUs and FPGAs, little attention has been paid to the side-channel vulnerability of GPU devices. Timing attacks have been demonstrated to be one of the most powerful classes of side-channel attacks [31, 35, 5, 6, 33, 7]. Timing attacks exploit the relationship between input data and the time (i.e., number of cycles) the system takes to process/access the data. For example, in a cache
collision attack [35], the attacker exploits the difference in CPU cycles between serving a cache miss and a cache hit, and considers the cache locality produced by a unique input data set. There is no prior work evaluating timing attacks on GPUs. To the best of our knowledge, our work is the first to consider timing attacks deployed at the architecture level on a GPU.

The GPU's Single Instruction Multiple Threads (SIMT) execution model prevents us from simply applying prior CPU timing attack methods to GPUs. A GPU can perform multiple encryptions concurrently, and each encryption competes for hardware resources with other threads, providing the attacker with confusing timing information. Also, under SIMT, the attacker cannot time-stamp each encryption individually; the timing information the attacker obtains is dominated by the longest-running encryption. Given these challenges in GPU architectures, most existing timing attack methods become infeasible.

In this chapter, we demonstrate that information leakage can be extracted from execution on a SIMT-based GPU to fully recover the encryption secret key. Specifically, we first observe that the kernel execution time is linearly proportional to the number of unique cache line requests generated during a kernel execution. In the L1-cache memory controller of a GPU, memory requests are queued and processed in First-In-First-Out (FIFO) order, so the time to process all memory requests depends on their number. As AES encryption generates memory requests to load its S-box/T-table entries, the addresses of these requests depend on the input data and the encryption key. Thus, the execution time of an encryption kernel is correlated with the key. By leveraging this relationship, we can recover all 16 AES secret key bytes on an Nvidia Kepler GPU.
Although we demonstrate this attack on a specific Nvidia GPU, other GPUs exhibit the same exploitable leakage. We have set up the client-server infrastructure shown in Figure 2.1. In this setting, the attacker (client) sends messages to the victim (encryption server) over the internet; the server employs its GPU for encryption and sends the encrypted messages back to the attacker. For each message, the ciphertext is known to the attacker, as well as the timing information. If the measured timing data is clean (mostly attributable to the GPU kernel computation), we are able to recover all 16 key bytes using one million timing samples. In a more practical attack setting where CPU noise pollutes our timing data, we are still able to fully recover all the key bytes by collecting a larger number of samples and filtering out the noise. Our attack results show that modern SIMT-based GPU architectures are vulnerable to timing side-channel attacks.

The rest of the chapter is organized as follows. In Section 2.2, we discuss related work. In Section 2.3, we provide an overview of our target GPU memory architecture and our AES GPU
Figure 2.1: The attack environment

implementation. In Section 2.4, the architecture leakage model is first presented, followed by our attack method that exploits the leakage for complete AES key recovery. We discuss potential countermeasures in Section 2.5. Finally, the chapter is summarized in Section 2.6.
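The correlation principle outlined in this introduction can be illustrated with a toy Monte-Carlo sketch: kernel time is modeled as the number of distinct cache lines a batch of table lookups touches, plus noise, and only the correct key guess predicts that number. The substitution table, constants, and linear timing model below are all stand-ins, not the parameters of the real attack.

```python
# Toy correlation timing attack on one key byte. Assumed parameters:
# 128-byte cache lines, 4-byte table entries, 16 lookups per sample.
import random, statistics

LINE_BYTES, ENTRY = 128, 4
SBOX = list(range(256))
random.Random(1).shuffle(SBOX)          # stand-in substitution table

def unique_lines(key, block):
    # distinct cache lines touched by the lookups under a key guess
    return len({(SBOX[p ^ key] * ENTRY) // LINE_BYTES for p in block})

def corr(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

rng = random.Random(0)
TRUE_KEY = 0x5A
blocks = [[rng.randrange(256) for _ in range(16)] for _ in range(800)]
# simulated "kernel time": distinct-line count under the true key + noise
timings = [unique_lines(TRUE_KEY, b) + rng.gauss(0, 0.5) for b in blocks]

best = max(range(256),
           key=lambda g: corr(timings, [unique_lines(g, b) for b in blocks]))
```

Here `best` is the key guess whose predicted distinct-line counts correlate most strongly with the simulated timings; the real attack (Section 2.4) applies the same ranking per key byte to measured GPU kernel times.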
2.2 Related Work
Timing attacks utilize the relationship between data and the time a system takes to access/process the data. Multiple attacks have been demonstrated successfully by exploiting cache access latencies, which leak secrets through either cache contention or cache reuse [35, 5, 6]. In order to create cache contention or cache reuse, attackers need their own process (a spy process) coexisting with the targeted process (the victim process) on the same physical machine. This way, the spy process can evict or reuse cache contents created by the victim process to introduce different cache access latencies. We refer to this class of attacks as offensive attacks. Another kind, the non-offensive attack, has also been demonstrated successfully by Bernstein et al. [31]. Unlike offensive attacks, Bernstein's timing attack does not interfere with the victim process; it exploits the relationship between the time for an array lookup and the array index.

The attack strategy commonly deployed in CPU-based timing attacks consists of producing one block of ciphertext at a time and profiling the associated time to process that block. However, on a
GPU it would be highly inefficient to perform only one block encryption and produce one block of ciphertext at a time, given the GPU's massive computational resources. In a real-world scenario, the encryption workload contains multi-block messages, and for each data sample the GPU produces many blocks of ciphertext. The key difference is that the GPU scenario yields only a single timing value for the multiple blocks. Although many successful attack methods have been demonstrated on CPU platforms, these methods cannot be directly applied to the GPU platform due to the lack of accurate timing information and nondeterminism in thread scheduling. Our timing attack method targets a GPU and is non-offensive, like Bernstein's. We exploit the inherent parallelism of the GPU, as well as its memory behavior, in order to recover the secret key.
2.3 Background
In this section, we discuss the memory hierarchy and memory-handling features of Nvidia's Kepler GPU architecture [43]. Note that not all details of the Kepler memory system are publicly available; we leverage information that has been provided by Nvidia, as well as many details of the microarchitecture that we have been able to reverse engineer. We also describe the AES implementation we evaluated on the Kepler GPU and the configuration of the target hardware platform used in this work.
2.3.1 GPU Memory Architecture
2.3.1.1 Target Hardware Platform
Our encryption server is equipped with an Nvidia Tesla K40 GPU. The Kepler-family device includes 15 streaming multiprocessors (SMXs). Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units (LSU). In CUDA terminology, a group of 32 threads are called a warp. Each SMX has four warp schedulers and eight instruction dispatch units, which means four warps can be scheduled and executed concurrently, and each warp can issue two independent instructions in one GPU cycle [43].
2.3.1.2 GPU Memory Hierarchy
Kepler provides an integrated off-chip DRAM memory, called device memory. The CPU transfers data to and from the device memory before and after it launches the kernel. Global memory, texture memory, and constant memory reside in the device memory. Data residing in the
device memory is shared among all of the SMXs. Each SMX has L1/shared, texture, and constant caches, which are used to cache data from global memory, texture memory, and constant memory, respectively. These caches are placed very close to the physical cores, so they have much lower latency than the corresponding memories. Texture memory and its cache are optimized for spatial memory access patterns. Constant memory and its cache are designed for broadcasting a single constant value to all threads in a warp. Global memory, together with the L1 and L2 caches and the coalescing units (units that coalesce multiple global memory accesses from a warp into memory transactions), provides fast general-purpose memory accesses. The hierarchy of the L1/L2 caches and global memory is similar to that found on modern multi-core CPUs. The L1 and L2 caches on a GPU are much smaller than those found on a CPU; however, GPU caches have much higher bandwidth, which is needed to support a large number of cores.
2.3.1.3 Memory Request Handling
On a Kepler GPU, a global memory load instruction for a warp generates 32 memory requests if none of the threads is masked. All 32 memory requests are sent to the coalescing units, which reorder pending memory accesses, trying to reduce the memory requests down to a number of unique cache line requests. These cache line requests are issued to the L1-cache controller, one per cycle; this process is referred to as memory issue serialization [44]. If the requested data is present in the L1 cache, the data is loaded into the specified register and the cache line request is resolved in one GPU cycle by the LSU. On a miss, the request is queued in a Miss Status Holding Register (MSHR), one per cycle. If any incoming cache line request matches an outstanding cache line miss queued in the MSHR, the request is merged into a single MSHR entry. All requests queued in the MSHR are processed in FIFO order and are then forwarded to the next-level memory controllers (L2 or device memory). Upon receiving the requested data, the LSU loads the data into the register file and releases the MSHR entry, one per cycle; this process is referred to as writeback serialization [44].
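Under the assumed Kepler parameters above (32 threads per warp, 128-byte L1 cache lines), the coalescing unit's effect reduces to a distinct-line count, which is the quantity our leakage model ties to kernel execution time. A minimal sketch:

```python
# Sketch of coalescing: a warp's 32 byte addresses collapse to one cache
# line request per distinct 128-byte line (assumed Kepler line size).
WARP, LINE = 32, 128

def coalesced_requests(addrs):
    # number of unique cache line requests issued for the warp
    return len({a // LINE for a in addrs})

# 32 consecutive 4-byte loads fall in one line -> fully coalesced
assert coalesced_requests([4 * t for t in range(WARP)]) == 1
# 32 loads striding one full line each -> 32 separate requests
assert coalesced_requests([LINE * t for t in range(WARP)]) == 32
```

Because each unique line request occupies an issue slot (and, on a miss, an MSHR entry), a warp whose table-lookup addresses are data-dependent takes a data-dependent number of cycles, which is the leakage exploited in Section 2.4.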
2.3.2 AES GPU Implementation
In this chapter, we evaluate a 128-bit Electronic Codebook (ECB) mode AES encryption based on T-tables, which operates on blocks of 16 bytes of data, using a secret key of 16 bytes. The encryption implementation we use was ported from the OpenSSL 0.9.7 library to CUDA. We transformed an entire block encryption into a single GPU kernel, so that each thread in the GPU can
Figure 2.2: The GPU AES implementation used in this work.

process one block encryption independently, as shown in Figure 2.2. The key scheduling step expands the 16-byte secret key into 160 bytes of round keys for ten rounds of operation. In the initial round, the 16-byte plaintext is XORed with the first round key to generate the initial state. In the T-table version of AES, the SubByte, ShiftRow, and MixColumn operations are combined into lookups on four T-tables. Rounds 1-9 simply perform T-table lookups and add round keys. In the last round, a special T-table integrates only the SubByte with the ShiftRow and does not involve a MixColumn operation. Our attack focuses on the last round of AES, whose operations are shown in Equation (2.1), where each T4 table lookup returns a 4-byte value indexed by a one-byte value, c0-c15 are the output ciphertext bytes, and {t0, t1, ..., t15} are the input bytes to the last round.
c0 = T4[t3]_0 ⊕ k0        c8  = T4[t11]_0 ⊕ k8
c1 = T4[t6]_1 ⊕ k1        c9  = T4[t14]_1 ⊕ k9
c2 = T4[t9]_2 ⊕ k2        c10 = T4[t1]_2  ⊕ k10
c3 = T4[t12]_3 ⊕ k3       c11 = T4[t4]_3  ⊕ k11        (2.1)
c4 = T4[t7]_0 ⊕ k4        c12 = T4[t15]_0 ⊕ k12
c5 = T4[t10]_1 ⊕ k5       c13 = T4[t2]_1  ⊕ k13
c6 = T4[t13]_2 ⊕ k6       c14 = T4[t5]_2  ⊕ k14
c7 = T4[t0]_3 ⊕ k7        c15 = T4[t8]_3  ⊕ k15

Each generation of a ciphertext byte involves a table lookup (which returns a 4-byte value), byte positioning (taking one byte out of the four, done by byte masking and shifting), and a key addition. When implemented on Nvidia GPUs, the generation of each ciphertext byte is expressed in CUDA using load and store instructions, in addition to logic instructions. Although the order of table lookups for the cipher bytes is {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, the order of the CUDA load and store instructions for the cipher bytes may differ depending on how the program is compiled. For example, the CUDA compiler, nvcc, enables -O3 optimization by default, which reorganizes the CUDA instructions to avoid data dependency stalls and thus can hide some of the latency of memory access instructions. When optimization is disabled with the -O0 flag, the table lookup for each byte is translated directly into CUDA instructions and the two orders are the same. In this GPU-based AES implementation, one GPU thread performs AES encryption on one block of data. For a 32-block message, one warp of 32 threads can launch 32 encryptions in parallel. As the number of blocks per message increases, so does the GPU throughput.
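The byte-positioning step can be made concrete with a short Python sketch. This is an illustration of the mask-and-shift idea only, not the CUDA code evaluated in this chapter; the byte-lane convention (position 0 taken as the most significant byte) and the toy table are our assumptions, since the real layout depends on the implementation's endianness.

```python
def last_round_byte(T4, t, byte_pos, k):
    # One last-round output byte: T-table lookup (a 4-byte word), byte
    # positioning by mask-and-shift, then key addition. Position 0 as the
    # most significant byte is an assumption for this sketch.
    shift = (3 - byte_pos) * 8
    return ((T4[t] >> shift) & 0xFF) ^ k

# Toy table whose entry b repeats the byte b in all four byte lanes
# (for illustration only; the real T4 table is derived from the S-box).
T4 = [b * 0x01010101 for b in range(256)]
assert last_round_byte(T4, 0xAB, 2, 0x00) == 0xAB
assert last_round_byte(T4, 0xAB, 2, 0xFF) == 0xAB ^ 0xFF
```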
Our server system is dedicated to performing AES encryptions, so we have configured the GPU to achieve high throughput. As the constant cache stores data from constant memory that can be shared by threads in a warp, we use this space to store the round keys for AES encryption. All threads in a warp access the same round key at the same time, which allows the constant cache to broadcast the value to all of the threads in the executing warp. Although the T-tables are also constant values, they would not benefit as much from constant memory, because each thread generates different memory accesses and the constant cache would have to return them sequentially, wasting valuable resources. Therefore, we chose to place the T-tables in global memory, reducing the number of memory requests by leveraging the coalescing units. Also, the T-table data can be shared across SMXs through the L2 cache and across warps in the same SMX through the L1 caches.

The L1 cache and the shared memory share 64KB of physical memory. Nvidia allows developers to choose this division on a per-kernel basis from three options: 16KB shared memory and 48KB L1 cache, 32KB shared memory and 32KB L1 cache, or 48KB shared memory and 16KB L1 cache. For our AES encryption kernel, the threads in a warp do not share any data, so shared memory is not used during encryption. We therefore want to minimize the size of the shared memory and maximize the use of the L1 cache; the best configuration is 16KB shared memory and 48KB L1 cache.

During server initialization, the five T-tables are copied into global memory, and the round keys are copied into constant memory. Constant data remains in device memory until the application exits. During encryption, all memory requests to global memory access the L2 cache. However, by default, all global memory load/store operations bypass the L1 cache without being cached.
We found that enabling the L1 cache can increase encryption performance. In this work, we will always configure the server program with L1 caching enabled.
2.4 Correlation Timing Attack
We design a correlation timing attack that exploits the relationship between the kernel execution time and the number of unique cache line requests generated during kernel execution. The attack uses one ciphertext byte and one key byte guess to compute the number of unique cache line requests that would be generated for the targeted ciphertext byte during the table lookup. In the attack, we encrypt many messages and collect the timing information for each message (referred to as a trace, or data sample). We then correlate the calculated number of cache line requests with the timing samples. If we guessed the right key byte, we should expect to find a strong correlation
between the timing and the correct number of unique cache line requests; otherwise, if we guessed the wrong key byte, the resulting correlation should be low. In this section, we first explore the architecture leakage present on an Nvidia Kepler-family K40 device. To assess timing leakage vulnerabilities, we evaluate the success rates of correlation timing attacks using both clean and noisy measurements. For the clean measurements, attackers are able to measure the warp execution time within a kernel, so the sources of inaccuracy are limited to the GPU's internal hardware. With noisy measurements, attackers can only measure when a message is received and returned by the server; in this case, the noise sources also include processing on the server CPU, which introduces non-deterministic delay into our measurements. We consider the quality of the timing data we collect to better understand how noisy measurements can impact key recovery.
2.4.1 SIMT Architecture Leakage
With SIMT execution on a GPU, when a warp issues a load instruction, 32 memory requests are generated by the 32 threads and sent to the coalescing units (assuming all of the threads are active). These memory requests translate to unique cache line requests and are merged with existing cache line requests in the MSHR. The time taken to serve all 32 memory address requests from a warp is proportional to the number of unique cache line requests sent to the L1 cache controller, due to both memory issue serialization and writeback serialization. To determine whether there is a linear relationship between an SIMT load instruction's execution time and the number of unique cache lines accessed, we develop a test kernel, shown in Kernel 1. In this kernel, we measure the execution time for a warp of 32 threads to perform the load and store instructions. Each thread is assigned an index from the array indices, uses the index to load a 4-byte element from a large array A, and stores the element into the result variable. With SIMT execution of 32 threads, the array indices determine the total number of unique cache lines referenced during kernel execution, which essentially samples the data array A. For example, if all the indices are the same, Kernel 1 will only need one unique cache line. The array indices are created using Algorithm 2. Given a specified number of unique cache lines needed (e.g., 6), Algorithm 2 generates the first six indices with a stride of the cache line size, accessing six distinct cache lines, while the remaining 26 indices are all the same as the sixth one. With Algorithm 2, we can sweep the number of unique cache lines from 1 to 25 and generate corresponding indices arrays to use in Kernel 1.
15 CHAPTER 2. INFORMATION LEAKAGE IN MEMORY COALESCING UNIT
Kernel 1 The kernel to measure memory access time
  index ← indices[tid]
  time ← CLOCK()
  result[tid] ← A[index]
  time ← CLOCK() − time
Algorithm 2 Generating memory access indices that will result in a pre-set number of cache line accesses for Kernel 1
  numCacheLine ← userInput
  indices ← []
  curCacheLineIdx ← 0
  for i = 1:25 do
    indices[i] ← curCacheLineIdx ∗ stride
    if curCacheLineIdx < numCacheLine − 1 then
      curCacheLineIdx ← curCacheLineIdx + 1
    end if
  end for
  SHUFFLE(indices)
  return indices
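Under the assumption that the indices address a 4-byte element array, as in Kernel 1, the index generation of Algorithm 2 can be sketched in Python as follows; the helper name and parameter defaults are ours, not from the original code.

```python
import random

def make_indices(num_cache_lines, warp_size=32, line_size=64, elem_size=4):
    # The first num_cache_lines threads step one cache line apart; the
    # remaining threads reuse the last line, so the warp touches exactly
    # num_cache_lines distinct cache lines.
    stride = line_size // elem_size   # array elements per cache line
    indices, cur = [], 0
    for _ in range(warp_size):
        indices.append(cur * stride)
        if cur < num_cache_lines - 1:
            cur += 1
    random.shuffle(indices)           # the SHUFFLE step: randomize order
    return indices

idx = make_indices(6)
touched = {i * 4 // 64 for i in idx}  # map indices back to 64-byte lines
assert len(idx) == 32 and len(touched) == 6
```

Sweeping num_cache_lines from 1 to 25 and feeding the resulting arrays to Kernel 1 reproduces the experiment behind Figure 2.3.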
With GPU’s serialization memory handling, we expect the execution time to be linearly proportional to the number of unique cache line requests. In Figure 2.3, we plot timing data for memory accesses while varying the number of cache line requests, under three strides, 32, 64, and 128 bytes. We can see that they are linearly proportional, and that the slope of the linear lines indicates how much execution timeis consumed for each unique cache access. A similar result was also reported in prior work [44]. Lines for stride size 64 and 128 bytes are exactly the same, implying that the cache line size is actually 64 bytes. From the Nvidia online literature [11], the cache line size of the L1 data cache of Kepler is 128 bytes. We confirmed this with Nvidia. What we also learned is that there are microarchitetural features in the L1 cache that are responsible for this behavior. As a result, we have elected to use 64 bytes as the cache line size in our attacks on the K40 device.
Figure 2.3: Nvidia GPU: Timing for 1 to 25 cache line requests, under stride 32, 64, and 128 bytes.
The Pearson Correlation value [45] of the execution time and the number of unique cache lines is found to be around 0.96. This strong correlation value suggests that the execution time can
Figure 2.4: AMD GPU: Timing for 1 to 25 cache line requests.

leak information about the array indices used in Kernel 1. Not only does the Nvidia Kepler GPU exhibit this type of leakage; AMD GPUs show the same leakage. We performed the same timing analysis, running programs written in OpenCL on an AMD R9-290X GPU, with the stride set at 64 bytes (the AMD GPU L1 cache line size); the result is shown in Figure 2.4. We find its correlation value to be around 0.93. Since SIMT and memory coalescing are crucial features for high-performance GPUs, and disabling either would significantly degrade performance, this kind of correlation will persist across various GPUs, including the Nvidia and AMD GPUs we experimented with. We therefore expect to find this correlation on GPUs from other manufacturers as well. Inspecting both Figure 2.3 and Figure 2.4, it is clear that the execution time is directly proportional to the number of cache line requests on both families of GPUs. Given a fixed kernel, the key information (which determines the number of cache line requests) can therefore be leaked by the execution time of the kernel. This observation inspires us to carry out a correlation timing attack
on an AES implementation running on state-of-the-art GPUs.
2.4.2 AES Encryption Leakage
As shown in Figure 2.1, the attacker and victim computers are connected via a network. This setup is the same as the one described by Bernstein et al. [31], except that the encryption is performed on the GPU. The goal of the attacker is to recover the 16-byte secret key used by the encryption server, using the known ciphertext and the detailed timing information collected. Depending on what timing data is collected, noise in this setup can be minimized if the execution time of kernels is measured directly. However, the measurement noise (i.e., inaccuracies) produced in a more practical setting will not inhibit the attack: as shown by Brumley et al. [46], the attacker can simply collect a larger number of traces and average out this noise. Suppose that the attacker sends a 32-block message to the server, and the server launches a warp of 32 threads to encrypt the received data. After some time, the attacker receives the 32-block encrypted message, along with the timing information for the warp execution, which is stored as one timing trace (sample) as shown below:
{c^1_{0-15}, c^2_{0-15}, ..., c^32_{0-15}, T^1}
There are ten rounds in AES encryption, and each round performs 16 table lookups for each block of data. The index of each table lookup determines which cache line will be loaded. Thus, the entire encryption time depends on the indices of the 160 table lookups. We collected one million 32-block messages and their associated timings, and recorded all indices used for the 160 table lookups during each block encryption. From all the indices used in the warp, we computed the number of unique cache line requests. We plot the average execution time against the number of unique cache line requests in Figure 2.5, as well as the sample counts used to calculate the average time in Figure 2.6. Although the line in Figure 2.5 does not appear as linear as the one shown in Figure 2.3, it is clear that as the number of cache line requests increases, the average time also increases. The correlation between the number of cache line requests and the recorded execution time is 0.0596. In a real attack, it is impossible to compute all of the indices used during one encryption without knowing the entire 16-byte key, due to AES's strong cryptographic confusion and diffusion functions, and it is computationally infeasible to enumerate the entire key space (2^128 ≈ 3.4 × 10^38). However, in the last round, each lookup table index can be computed from one byte of the key and the
19 CHAPTER 2. INFORMATION LEAKAGE IN MEMORY COALESCING UNIT
Figure 2.5: The average recorded time versus the total number of cache line requests in a message encryption, with one million samples.

corresponding byte of ciphertext, independently of the other ciphertext bytes. Thus, we can examine how much leakage can be observed from one byte. From Equation 2.1, we can write each byte of ciphertext as follows (byte positioning is ignored for simplicity):
cj = T4[ti] ⊕ kj    (2.2)
Using an inverse lookup table, we can find the ith byte of the input state to the last round, ti, if we know the true round key, kj:
ti = T4^-1[cj ⊕ kj]    (2.3)
Given the GPU’s SIMT execution model, for a 32-block message, we have 32 threads
Figure 2.6: Sample counts versus the number of cache line requests, with one million samples.

running simultaneously, and therefore:
ti^1 = T4^-1[cj^1 ⊕ kj]
ti^2 = T4^-1[cj^2 ⊕ kj]
...
ti^32 = T4^-1[cj^32 ⊕ kj]
The values of the table lookup indices ti^1, ti^2, ..., ti^32 determine the number of unique cache lines that will be requested. Since each element in the T4 table is 4 bytes and the size of a cache line is 64 bytes, there are 16 T4 table elements in one cache line (assuming the T4 table is aligned in memory). Therefore, the memory access requests can be turned into cache line requests by dropping
the lowest 4 bits of ti^1, ti^2, ..., ti^32, and so we have the following cache line requests:
⟨ti^1⟩ = ti^1 >> 4
⟨ti^2⟩ = ti^2 >> 4
...
⟨ti^32⟩ = ti^32 >> 4
The number of unique cache lines is the number of unique values among ⟨ti^1⟩, ⟨ti^2⟩, ..., ⟨ti^32⟩. This process of calculating the number of unique cache lines accessed from ciphertext bytes is implemented in Algorithm 3.
Algorithm 3 Calculating the number of unique cache line requests in the last round for a given key byte guess.
  kj ← guess
  cache_line_cnt ← 0
  holder ← [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  %% comment: i means thread id
  for i = 0:31 do
    holder[T4^-1[cipher[i][j] ⊕ kj] >> 4]++
  end for
  for i = 0:15 do
    if holder[i] != 0 then
      cache_line_cnt++
    end if
  end for
  return cache_line_cnt
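Algorithm 3 translates directly into Python. The sketch below is illustrative only: the table used here is a toy stand-in (the real attack uses the inverse of the last-round T-table), and set-based counting replaces the 16-entry holder array.

```python
def count_unique_cache_lines(cipher_bytes, key_guess, inv_t4):
    # For each thread's ciphertext byte, undo the key addition and the
    # last-round lookup, then drop the low 4 bits of the table index:
    # 16 four-byte T4 entries share one 64-byte cache line.
    lines = {inv_t4[c ^ key_guess] >> 4 for c in cipher_bytes}
    return len(lines)

# Toy identity table for illustration only (a real attack inverts T4).
inv_t4 = list(range(256))
cipher = list(range(32))              # 32 threads' ciphertext bytes
assert count_unique_cache_lines(cipher, 0x00, inv_t4) == 2
assert count_unique_cache_lines([0x10] * 32, 0x00, inv_t4) == 1
```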
We generated one million 32-block messages and received one million encrypted messages, along with their associated timings. For each 32-block encrypted message, we used Algorithm 3 to calculate the number of unique cache line requests generated for the T4[t3] table lookups, assuming we know the value of k3. Figure 2.7 shows the timing distribution over the number of cache line requests. We find the Pearson Correlation value to be 0.0443. We also fit a line with a slope of 14 cycles and an offset of 26,503 cycles to the timing distribution, where the slope is taken as the signal of the leaking timing channel.
Figure 2.7: Timing distribution over the number of cache line requests, calculated for one million encrypted messages, using the true value of the 3rd key byte.
Since we use only one table lookup out of all 160 table lookups in one block encryption, we should expect the Pearson Correlation to be bounded by the previously calculated correlation value for all 160 table lookups in Figure 2.5. Although the correlation value becomes small, it is still significantly higher than the correlation value obtained when using the wrong value for the 3rd key byte. If we assume the 3rd key byte to be 0, we find its correlation to be 0.0012, which is 36.9 times lower than the correlation value calculated using the right key. Although the correlation value is small, the linear relationship between the number of unique cache line requests and the encryption execution time suggests that the encryption time is leaking information about the individual 9th-round cipher state. Since the 9th-round cipher state can be computed from the ciphertext (known to the attacker) and the key bytes, the encryption time ultimately leaks individual key bytes.
2.4.3 Correlation Timing Attack on GPU AES Implementation
As we can see from Equation (2.1), the table lookup for each ciphertext byte is independent of the others, and each key byte is used exclusively for its corresponding ciphertext byte. This allows us to attack one key byte at a time in a divide-and-conquer manner. For each possible value of the key byte, we use Algorithm 3 to calculate the number of cache line requests for each 32-block message. Since timing is linearly proportional to the number of cache line requests in the last round, we compute the Pearson Correlation of the timing versus the number of cache line requests. When we guess the correct key byte, we have the correct number of unique cache line requests, and the resulting correlation should be the highest; on the other hand, if we guess the wrong key byte, the resulting correlation should be low. Therefore, the key byte guess with the highest correlation value among all possible values should be the correct key. In this section, we test our Correlation Timing Attack on the targeted Nvidia Kepler GPU. All of the experiments discussed in this section use one million traces, which can be collected within 30 minutes.
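The per-key-byte search can be sketched as follows. This is a minimal Python model of the analysis stage under our own assumptions: a seeded random permutation stands in for the inverse T-table, and the synthetic "timings" are noise-free unique-line counts; with real, noisy measurements the structure is the same but far more traces are needed.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation; returns 0.0 for a constant series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / den if den else 0.0

def rank_key_byte(traces, model):
    # traces: list of (cipher_bytes, time) pairs. model(cipher_bytes, guess)
    # predicts the number of unique cache line requests for a key-byte guess
    # (Algorithm 3 plays this role in the real attack). The guess whose
    # predictions correlate best with the measured times wins.
    times = [t for _, t in traces]
    return max(range(256),
               key=lambda g: pearson([model(cb, g) for cb, _ in traces],
                                     times))

# Synthetic demonstration with a random permutation as the inverse table.
rng = random.Random(1)
perm = list(range(256))
rng.shuffle(perm)
model = lambda cb, g: len({perm[c ^ g] >> 4 for c in cb})
true_key = 0x3C
traces = []
for _ in range(300):
    cb = [rng.randrange(256) for _ in range(32)]
    traces.append((cb, model(cb, true_key)))
assert rank_key_byte(traces, model) == true_key
```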
2.4.3.1 Attack Using Clean Measurements
In this experiment, we first demonstrate the feasibility of our attack. Therefore, we minimize the noise by time-stamping when the kernel starts and ends as part of the AES kernel, which provides us with clean timing traces. The result of our attack is shown in Figure 2.8, where the correct value for each key byte is circled. The correct key bytes stand out in the plots when compared to the other 255 possible key values, meaning we have successfully recovered all 16 bytes of the last round key. We also analyze the success rate of k5 to determine the number of traces needed to reach different success probabilities (success rates). The result is shown in Figure 2.9, which includes both the measured and predicted success rates. Each point on the measured success rate curve is the average of 100 timing attack trials using different timing traces. The predicted success rate is calculated using the methodology presented by Fei et al. [47], in which the Signal to Noise Ratio (SNR) is obtained from real measurements (Figure 2.7) and used to predict the success of recovering the correct key value. The predicted success rate tracks our measured results closely, and since computing it takes less than 30 minutes while computing the measured success rate over 100 trials takes around 1,500 minutes, we will use the predicted success rate hereafter. From Figure 2.9, both the measured and predicted success rate
Figure 2.8: Correlation attack result under clean measurements for 32-block messages.

reaches 50% using as few as 20,000 traces. With 70,000 traces, the predicted success rate converges to 1. Since the success rate depends directly on the SNR value, we show the SNR value for each key byte in Table 2.1. As seen in Figure 2.8, the correlation value for the correct key varies across key bytes. Thus, the SNR value for each key byte also differs, as shown in Table 2.1, because there is a linear relationship between the SNR and the correlation when the correlation value is small [47]. Although the same lookup operation and the same attack analysis are applied to each key byte, some key bytes, such as k5, k8, k12, and k13, have much smaller SNR values than the others. This observation leads us to discover the effect of optimization during GPU compilation of
Figure 2.9: Success rate of k5 using clean measurements.
k0 0.01 k4 0.0399 k8 0.0034 k12 0.0050
k1 0.01 k5 0.0064 k9 0.0105 k13 0.0082
k2 0.0168 k6 0.0305 k10 0.0190 k14 0.0214
k3 0.0395 k7 0.0379 k11 0.0399 k15 0.0392
Table 2.1: The Signal to Noise Ratio (SNR) for each key byte.

the program on the timing attack. To explore this issue more deeply, we first examined how the server program is compiled for the GPU. We compiled our server program with the highest level of optimization (-O3, the default in nvcc), which reorganizes some of the CUDA instructions in the kernel to avoid data dependency stalls. Before the table lookups for c5, c8, c12, and c13, other stalled load and store instructions may be congesting the GPU hardware resources, creating variable wait times for those table lookups. We inspected the executable Shader ASSembly (SASS) code using the Nvidia
disassembler. The code is shown in Listing 2.1, which includes part of the last-round operation. Line
2a38 performs the table lookup for c5, and the value loaded is not used until line 2a78, where the GPU would stall if the requested data is not yet available. Also, there are multiple load and store instructions before line 2a38, which may congest the memory system and cause the duration of the load instruction for c5 to be nondeterministic. Thus, we see very little correlation and a low SNR for those key bytes. If we disable optimization during compilation, the CUDA instructions do not get reordered, each table lookup stalls on its data dependency, and the timing is more deterministic and predictable. This is shown in Listing 2.2: line 9ef0 is the load instruction for c5, and its requested data is needed immediately by the following instruction. The same happens for c4 at line 9fa0. Overall, the non-optimized program runs much slower, with many stalls: 120,000 GPU cycles vs. 27,000 GPU cycles for the optimized program. Without the optimization, the loads of individual ciphertext bytes do not interfere with each other, and thus we see a variance of 2,354 GPU cycles squared in the timing data vs. 128,000 GPU cycles squared with optimization. The resulting execution timings of the load instructions are directly proportional to the number of unique cache lines accessed. Therefore, by performing the same attack on the unoptimized server code, we would expect almost the same correlation value for each key byte, and the correlation is higher (around 0.06). The result is shown in Figure 2.10.
Listing 2.1: Optimized SASS Codes
/*2a00*/ ST.E.U8 [R6+0x3], R18;
/*2a08*/ ST.E.U8 [R6+0x2], R14;
/*2a10*/ IADD.X R13, RZ, c[0xe][0x24];
/*2a18*/ IMAD.U32.U32 R16.CC, R17, R0, c[0xe][0x20];
/*2a20*/ LD.E R11, [R10];
/*2a28*/ LD.E R2, [R2];
/*2a30*/ IMAD.U32.U32.HI.X R17, R17, R0, c[0xe][0x24];
/*2a38*/ LD.E R12, [R12];
/*2a40*/ LD.E.64 R14, [R8+0x8];
/*2a48*/ LD.E R16, [R16];
/*2a50*/ LOP.AND R19, R22, 0xff;
/*2a58*/ LOP.AND R22, R26, 0xff;
/*2a60*/ LOP32I.AND R3, R11, 0xff000000;
/*2a68*/ LOP32I.AND R2, R2, 0xff0000;
/*2a70*/ SHR.U32 R11, R28, 0x15;
/*2a78*/ LOP.AND R10, R12, 0xff00;
...
Listing 2.2: Non-Optimized SASS Codes
...
/*9ef0*/ LD.E.64 R4, [R6];
/*9ef8*/ LOP.AND R4, R4, 0xff00;
/*9f08*/ LOP.AND R5, R5, RZ;
/*9f10*/ LOP.XOR R4, R22, R4;
/*9f18*/ LOP.XOR R5, R23, R5;
/*9f20*/ MOV32I R6, 0x0;
/*9f28*/ MOV32I R7, 0x0;
/*9f30*/ MOV R6, R6;
/*9f38*/ MOV R7, R7;
/*9f48*/ LOP.AND R8, R38, 0xff;
/*9f50*/ LOP.AND R9, R39, RZ;
/*9f58*/ SHF.L.U64 R3, R8, 0x3, R9;
/*9f60*/ SHL R0, R8, 0x3;
/*9f68*/ MOV R8, R0;
/*9f70*/ MOV R9, R3;
/*9f78*/ IADD R6.CC, R6, R8;
/*9f88*/ IADD.X R7, R7, R9;
/*9f90*/ MOV R8, R6;
/*9f98*/ MOV R9, R7;
/*9fa0*/ LD.E.64 R6, [R8];
/*9fa8*/ LOP.AND R6, R6, 0xff;
...
Although we see much more consistency and higher correlation values in the attack result when the server code is not optimized, it is unlikely a high performance encryption engine would use unoptimized code. Therefore, we focus on using optimized server code to test our attack. While running the optimized server code, we are still able to recover all of the key bytes, and we have a better understanding of how optimization can begin to thwart timing attacks due to interference between loads.
Figure 2.10: No optimization: Correlation attack result for 32-block messages.
2.4.3.2 Attack Using Noisy Measurement
In practice, it is more common for the server CPU to time-stamp the incoming and outgoing messages than to time-stamp within GPU kernels. With the same number of traces (one million) but this different timing collection method, we are able to recover 10 out of the 16 key bytes, as shown in Figure 2.11. Changing from clean to noisy measurements, the variance in the timing data increases from 128 thousand to 5.8 million GPU cycles squared, which means a great deal of noise is introduced. The correlation exhibited for each key byte is reduced by more than 3X compared to our previous results that assumed more accurate timing measurements. Although the added noise in the timing hampers the attack slightly, it does not thwart the
Figure 2.11: Correlation attack result with noisy measurements for 32-block messages.

attack. With a large number of traces, we can achieve a 100% success rate at 3 million traces. The attacker can use filtering to further clean up the timing information and reduce the number of traces needed to achieve a 100% success rate. We applied a Percentile Filter, as described by Crosby et al. [48], to our timing data. For data filtering, the attacker sends the same 32-block message 100 times, obtaining 100 timing samples along with one 32-block encrypted message. Through experiments, we found that using the 40th-percentile time among the 100 timing samples produces the best attack result, as shown in Figure 2.13. By applying this simple noise reduction method, we are able to obtain even better results
Figure 2.12: Predicted success rate for the 0th key byte using filtered vs. unfiltered data.

than obtained using clean measurements. This is because even the clean measurements suffer from GPU-internal noise sources, such as uncertainty introduced by the warp scheduler. With noise filtering, most of these noise sources are filtered out, resulting in much cleaner timing information. The success rate (Filtered Success Rate) shown in Figure 2.12 converges to 1 at 40,000 traces, much earlier than the unfiltered scenario, which requires 3 million traces. The simple filtering method significantly improves the attack's effectiveness.
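The percentile filter itself is simple to sketch. Below is a minimal Python version using the nearest-rank convention, which is an implementation choice of ours; the sample data is synthetic and for illustration only.

```python
def percentile_filter(samples, pct=40):
    # Keep a low-percentile time from repeated measurements of the same
    # message: timing noise is mostly positive (delays), so a low percentile
    # approaches the noise-free execution time better than the mean does.
    ordered = sorted(samples)
    rank = max(0, int(len(ordered) * pct / 100) - 1)
    return ordered[rank]

# 100 repeats of one request, 10 of which hit large scheduling delays:
# the filtered value ignores the spikes that would drag a mean upward.
repeats = [1000] * 90 + [5000] * 10
assert percentile_filter(repeats) == 1000
assert percentile_filter(repeats) < sum(repeats) / len(repeats)
```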
2.4.4 Attack on Highly Occupied GPU
Our experimental results suggest that GPU architectures with SIMT processing and a coalescing unit produce a linear relationship between the number of unique cache line requests and the execution time. This relationship makes GPUs highly vulnerable to correlation timing attacks. Adding noise would confuse attackers, but it does not fully thwart a timing attack: by collecting a large number of traces, attackers are still able to recover the secret information.

Figure 2.13: Correlation attack result using filtered timing information for 32-block messages.

Our ability to extract the secret information from larger messages (e.g., 1024 blocks) is critical, because larger messages can better utilize the high throughput of the GPU. However, larger messages might be unfavorable to the attacker. Unlike threads within a warp, threads in different warps are not synchronized, which implies that some warps may finish before others. Therefore, the longest warp execution time dominates the time measurement that the attacker observes. Although the attacker does not know which 32-block encrypted message is dominant, she can divide a 1024-block message into 32 32-block messages and treat them as 32 traces sharing the same time value. One of the 32 traces, produced by the dominant warp, will have the true timing; the others may be wrong, since the 31 other warps that produce them finish their encryptions earlier than the dominant warp. Thus, the attacker can treat the other 31 traces as noise added to the calculation. We collected one million traces using the filtering method discussed above. The results are shown in Figure 2.14. Most key bytes are still recoverable, but key bytes with weaker correlation, such as k12, are completely buried. With more traces, k12 can also be recovered.
Figure 2.14: Correlation attack result using filtered timing information for 1024-block messages.
Since we treat the other 31 traces as noise in the calculation, we expect the success rate to approach 100% at 15 million traces, as shown in Figure 2.15. Although increasing the number of blocks in each message weakens the signal for each key byte, with a larger number of traces we can still recover all 16 key bytes.

Figure 2.15: Predicted success rate for the 0th key byte using 1024-block data vs. 32-block data.
2.4.5 Discussion
Moving from clean measurements to noisy measurements, a significant amount of noise is included in our timing data, and consequently the correlation values suffer. In many real-world situations, attackers would not even be able to obtain a timestamp on the server. Thus, attackers would have to timestamp their own packets as they are sent and received over the network. Such timing information can be much less accurate than the values in the noisy measurements. In the network setting, we observe the variance of the timing data to be 1.233e11 CPU cycles squared, compared to 1.464e8 in the noisy measurements and 3.20e6 in the clean measurements. As discussed in prior work [48], network noise can be filtered to make a remote timing attack possible.
2.5 Countermeasures
A large number of defense techniques have been proposed to thwart timing attacks on CPU platforms [19, 20, 29, 49, 50, 51, 52, 53]. Given the lack of study of side-channel vulnerabilities on GPU devices, there has been no prior work on GPU countermeasures. In this section, we discuss several potential mitigation methods. Our attack exploits knowledge of the deterministic behavior of load instructions on the SIMT architecture. One method to prevent the attack is to eliminate table lookup operations in the AES implementation, as suggested by Osvik et al. [5]. We could also map the lookup tables to the GPU register file, since the register file is large enough to hold a 256-byte SBox table. Our attack is possible because attackers are able to map a table lookup index to a cache line, so the attack would be infeasible if we randomize the mapping between a table lookup index and a cache line. A similar idea is presented in prior work [50], in which memory-to-cache mapping randomization is used on CPU platforms. With this technique, given 32 table lookup indices, attackers would not be able to map them to cache lines, and would thus not be able to calculate the number of unique cache line requests. One possible implementation would be to randomize the entries of the security-sensitive data (T4) in memory, and create a new index lookup table that maps an access index to its randomized index in memory. Without knowing the mapping in the index lookup table, attackers would not be able to map an index to a cache line.
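The index-randomization idea can be sketched as follows. This is an illustrative host-side sketch, not our implementation: all names (randomize_table, t4_lookup, idx_map) are hypothetical, and a real deployment would regenerate the permutation periodically inside the GPU.

```c
#include <stdint.h>
#include <stdlib.h>

#define TBL_SIZE 256

/* T4 entries stored at permuted positions, plus a secret index lookup
 * table translating each logical index to its randomized location. */
static uint32_t t4_rand[TBL_SIZE];  /* table entries, permuted layout */
static uint8_t  idx_map[TBL_SIZE];  /* logical index -> random index  */

void randomize_table(const uint32_t *t4) {
    uint8_t perm[TBL_SIZE];
    for (int i = 0; i < TBL_SIZE; i++) perm[i] = (uint8_t)i;
    for (int i = TBL_SIZE - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        uint8_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    for (int i = 0; i < TBL_SIZE; i++) {
        idx_map[i] = perm[i];
        t4_rand[perm[i]] = t4[i];
    }
}

/* A table lookup now goes through the indirection table, so the cache
 * line touched in t4_rand no longer reveals the logical index. */
uint32_t t4_lookup(uint8_t logical_idx) {
    return t4_rand[idx_map[logical_idx]];
}
```

For the countermeasure to hold, an attacker must not learn the permutation; this implies the shuffle should use a cryptographically strong random source rather than rand(), which is used here only to keep the sketch self-contained.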
2.6 Summary
The execution time of a kernel is linearly proportional to the number of unique cache line requests generated during kernel execution on modern GPU architectures. This property can be exploited to extract secret information such as encryption keys. In this chapter, we exploited this property on an Nvidia GPU platform and successfully recovered all 16 key bytes of AES-128. Although we performed the attacks on an Nvidia GPU, they can be carried out on other GPU platforms, given that SIMT execution and coalescing units commonly exist on GPUs.
Chapter 3
Information Leakage in Shared Memory Banks
3.1 Introduction
In Chapter 2, we identified the information leakage of the memory coalescing unit in a GPU, derived a memory-based side-channel attack, and used AES encryption as an example target. In this chapter, we introduce a new class of timing side channels based on Shared Memory bank conflicts on the GPU. We develop a differential timing attack that exploits this timing side channel. To demonstrate the attack, we again use an AES implementation (with T-tables stored in the Shared Memory unit) on a GPU, and successfully recover all of the key bytes. The GPU's on-chip Shared Memory is an important hardware unit for alleviating heavy traffic to the off-chip device memory. It is designed to store data that is shared and frequently accessed by many running threads. To support SIMT execution and deliver high memory throughput in modern GPUs, the Shared Memory is divided into multiple memory banks (versus a monolithic bank), allowing multiple concurrent paths to the Shared Memory. With multiple memory access requests for different memory banks being serviced in parallel, the Shared Memory bandwidth is significantly increased. However, when multiple memory requests compete for the same bank, they have to be serviced serially, as each memory bank provides a single access port. We refer to such a case of multiple accesses competing for a single Shared Memory bank port as a bank conflict. Additional requests that try to access data in the same memory bank will have their requests queued and delayed.
This scenario results in a detectable delay, as compared to multiple memory accesses that resolve to different banks with no bank conflicts. Not only GPUs but also modern high-performance CPUs (e.g., Intel's Sandy Bridge and ARM's Cortex-A) are designed with multi-banked L1 and L2 caches. Yarom et al. [34] and Jiang et al. [54] investigate how sensitive information can be leaked when a cryptographic application runs on a CPU with multi-banked caches. A GPU generates a much more complex access pattern to Shared Memory banks; we identify the memory bank conflict-based timing channel and exploit it for a successful timing attack. The contributions in this chapter include:
1. We identify a new memory resource that can leak the memory access pattern of an application.
2. We propose a differential timing attack methodology and successfully recover all AES key bytes.
3. We quantify the effectiveness of our attack methodology using the success rate as a metric.
4. We extend our timing analysis onto other Nvidia GPU architectures: Maxwell, Pascal, Turing, and Volta. We explore how non-blocking execution can hide timing leakage in the Shared Memory and be used to prevent our attack.
5. We propose a multi-key protection mechanism and evaluate its effectiveness in mitigating side-channel leakage and performance overhead.
This chapter is organized as follows: in Section 3.2, we provide background on the Advanced Encryption Standard (AES) algorithm, as well as the GPU memory hierarchy and execution model. In Section 3.3, we discuss our threat model. In Section 3.4, we explore timing variation due to Shared Memory bank conflicts, i.e., discovering the memory bank timing channel. In Section 3.5, we describe our differential timing attack targeting table-based cryptographic algorithms, and attack an AES encryption running on an Nvidia Kepler GPU. In Section 3.5.5, we apply our attack in more realistic settings. In Section 3.6, we extend our timing analysis to other GPU architectures, and explore how their non-blocking execution mode can hide timing leakage in the Shared Memory. In Section 3.7, we discuss feasible countermeasures to prevent the attack, and focus on a multi-key implementation of AES encryption. Finally, we summarize the chapter in Section 3.8.
3.2 Background
We begin by describing the AES implementation evaluated on the targeted GPU platform, as well as the memory hierarchy and execution model of Nvidia Kepler GPUs, a widely used and energy-efficient GPU microarchitecture [11].
3.2.1 AES Encryption
In this chapter, we evaluate the timing leakage vulnerability of a table-based cryptographic algorithm on a GPU. We use the same example, 128-bit ECB-mode AES encryption, as in Chapter 2 for the attack demonstration. The proposed attack strategy can also be applied to other table-based cryptographic algorithms such as Blowfish [55]. The performance of AES is critical in the era of big data, where confidentiality is needed for storing and transmitting large amounts of data. Thus, high data throughput is desired. Performing AES encryption on a GPU can deliver an order of magnitude higher throughput than on CPUs [56], since AES encryption can be easily parallelized and GPUs can exploit high degrees of execution parallelism. In order to demonstrate the generality of the attack, we port the implementation of AES from a standard and widely used library, the OpenSSL 0.9.7 library, into CUDA code. Note that the ported implementation of AES is similar to the ones evaluated for performance in many other studies [56, 57]. We discuss different implementations of AES that are immune to our attack but incur performance degradation in Section 3.7. To port the implementation, we need to decide where to store the T-tables in the GPU memory hierarchy and how to assign encryption jobs to GPU threads. In Chapter 2 and our prior work [58], we stored the T-tables in the Global Memory unit, but that implementation is vulnerable to coalescing attacks. Since the T-tables are constant data shared by all threads, they are a good candidate to store in the Shared Memory unit. Multiple studies on GPU implementations of AES have demonstrated the advantages of using the Shared Memory unit for storing T-tables [59, 60, 56, 61, 57, 62, 63], and our work adopts this implementation. To assign encryption jobs to GPU threads, we transform the AES encryption procedure into a single GPU kernel, where each GPU thread processes one block encryption independently. Each block consists of 16 bytes.
The AES algorithm is composed of nine rounds of SubBytes, ShiftRows, MixColumns, and AddRoundKey operations, followed by a last round with only three operations (omitting MixColumns). For faster processing, the first three operations are integrated into T-table lookups
in the first nine rounds. In the last round, a special T-table (T4) is referenced, followed by byte masking. Each encryption round requires one 16-byte round key. The ten round keys are generated by the key scheduler from one 16-byte user-specified master key. Knowing any round key, an attacker can compute the original 16-byte master key. Our attack strategy targets the last-round key. A code snippet of the last-round operations generating the first four bytes of ciphertext is shown in Listing 3.1.
Listing 3.1: AES Last Round Code Snippet
O0 = (T4[(In0 >> 24) & 0xff] & 0xff000000) ^
     (T4[(In1 >> 16) & 0xff] & 0x00ff0000) ^
     (T4[(In2 >>  8) & 0xff] & 0x0000ff00) ^
     (T4[(In3      ) & 0xff] & 0x000000ff) ^ k0;
Variable O0 is the first four bytes of the 16-byte ciphertext. Each variable, In0 to In3, contains 4 bytes of the input state for the last round. A selected byte of each variable is used to index into the T-table to obtain a four-byte output, of which only one byte contributes to the final ciphertext. k0 is the first four bytes of the last round key. From the original algorithm, the last round can be simplified to use byte-wise operations, as shown below:
c_j = SBox[s_i] ⊕ rk_j    (3.1)
where the input byte position, i, for the SBox operation differs from the output ciphertext byte position, j, due to the ShiftRows operation. Each byte of the last-round input state, s_i, can be calculated once the corresponding cipher and key bytes are known by:
s_i = SBox^{-1}[c_j ⊕ rk_j]    (3.2)
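The pair of equations above can be checked with a small sketch. A toy byte permutation (7*i + 13 mod 256) stands in for the real AES SBox so the example stays self-contained; the algebra of Equations (3.1) and (3.2) is identical.

```c
#include <stdint.h>

/* Toy stand-in for the AES SBox and its inverse: any byte permutation
 * illustrates the last-round relation. */
static uint8_t sbox[256], inv_sbox[256];

void init_toy_sbox(void) {
    for (int i = 0; i < 256; i++) sbox[i] = (uint8_t)(7 * i + 13);
    for (int i = 0; i < 256; i++) inv_sbox[sbox[i]] = (uint8_t)i;
}

/* Equation (3.1): c_j = SBox[s_i] XOR rk_j */
uint8_t last_round_byte(uint8_t s_i, uint8_t rk_j) {
    return (uint8_t)(sbox[s_i] ^ rk_j);
}

/* Equation (3.2): s_i = SBox^{-1}[c_j XOR rk_j] */
uint8_t recover_state_byte(uint8_t c_j, uint8_t rk_j) {
    return inv_sbox[c_j ^ rk_j];
}
```

Applying Equation (3.2) to the output of Equation (3.1) with the same key byte returns the original state byte, which is exactly what the attack exploits: a correct key-byte guess reconstructs the true last-round state.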
3.2.2 Nvidia GPU Memory Hierarchy
In this chapter, we describe our attack on an Nvidia Kepler K40 GPU in detail, though we extend our analysis onto other Nvidia GPUs with different architectures, demonstrating the broad application of our approach. The GPU devices used in this chapter are listed in Table 3.1. These Nvidia GPUs have a similar memory hierarchy, except some of them have a dedicated Shared Memory unit, whereas the Nvidia Kepler GPU does not. We will describe the major differences in these memory architectures in Section 3.6 and discuss how the differences can impact the effectiveness of
our attack. In this section, we will focus on the memory hierarchy, using the Nvidia Kepler GPU as an example, shown in Figure 3.1.

Architecture   Kepler       Maxwell    Pascal    Turing        Volta
Device         Tesla K40c   GTX 950m   TITAN X   GTX 1660 Ti   Tesla V100 PCIE

Table 3.1: List of tested Nvidia GPUs
Figure 3.1: Nvidia Kepler GPU Memory Hierarchy
On the Nvidia Kepler GPU, there is an off-chip DRAM memory (device memory), which is partitioned into global, texture, and constant memory regions. Data in those memories are shared among all threads running on all 15 Streaming Multiprocessors (SMXs). Each SMX (each with 192 single-precision floating point cores) also has L1, texture, and constant caches. Data in those caches are private to the threads running on the SMX. In addition, there is a Shared Memory for each SMX, and only the block of threads that allocated specific data in Shared Memory can access that data. Also, each GPU thread owns an exclusive set of 255 registers to store the current thread state. On the Nvidia Kepler GPU, the Shared Memory and the L1 cache reside in the same physical memory storage, with a total size of 64 KB. The individual sizes of the Shared Memory and L1 cache are configurable. In our case, we allocate 48 KB as the Shared Memory and 16 KB as the L1 cache. Note that the size configuration affects neither the attack nor the results presented in this chapter. The Shared Memory is divided into 32 banks, and it has a configurable bank line size (annotated as bank size in the Nvidia documentation [11]): four bytes or eight bytes. Since using a bank size of eight bytes will lead to fewer bank conflicts and improve an application's performance,
we set the bank size to eight bytes for faster AES encryption. The memory address breakdown for the Shared Memory is shown in Figure 3.2, where the three least-significant bits are used for the bank offset, and the next five bits (bits 3-7) are the bank index, used for selecting the bank from which a line is retrieved for kernel computation.
Figure 3.2: Memory address to Shared Memory bank mapping
When multiple memory requests address different Shared Memory banks (i.e., bits 3-7 are different), they can be serviced in a single GPU cycle, providing much higher memory bandwidth than that of a monolithic cache bank design. However, a bank conflict occurs whenever multiple memory requests access the same bank (with the same bank index, but different tag values). Thus, there will be a noticeable timing difference between memory requests with and without bank conflicts.
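The address breakdown of Figure 3.2 can be captured by two small helper functions. This is a sketch assuming byte addresses within the Shared Memory region and the 8-byte bank-size configuration; the helper names are ours.

```c
#include <stdint.h>

/* Bits 0-2 of the byte address select the offset within a bank line;
 * bits 3-7 select one of the 32 banks (Figure 3.2). */
unsigned bank_offset(uint32_t byte_addr) { return byte_addr & 0x7u; }
unsigned bank_index(uint32_t byte_addr)  { return (byte_addr >> 3) & 0x1fu; }
```

Two addresses conflict under this model exactly when bank_index agrees but the addresses fall on different bank lines.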
3.2.3 Single Instruction Multiple Threads Execution Model
With the SIMT execution model, one GPU instruction is executed by at least a warp of 32 threads, and each thread has its own set of registers. However, all threads within a warp must be synchronized at an instruction boundary, which means that no thread in a warp can execute the next instruction until all threads complete the current instruction. For memory instructions, each thread will generate a memory request. A warp of threads will generate 32 memory requests for one memory instruction. Under the SIMT model, the execution time of this memory instruction will be determined by the Shared Memory bank that receives the highest number of bank conflicts (i.e., the largest number of requests resolving to the same bank). In other words, the execution time of a GPU memory access instruction becomes highly dependent on the memory addresses issued and whether that address results in a bank conflict with other accesses. An attacker can exploit this dependency to recover the secret key of the cryptographic operations running on the GPU.
3.3 Threat Model
Our threat model includes co-residence of the adversary and victim on one physical machine. We use this threat model for the evaluation of our attack; however, we do not anticipate any issues with this attack working in a cloud environment. The threat model assumes that the adversary is a regular user without root-level privileges, and that the underlying operating system is not compromised. The adversary can measure the execution time of a GPU encryption kernel in a direct or indirect manner. For a direct measurement, the victim may expose the timestamps when a GPU kernel is launched and when it ends. For an indirect measurement, the adversary can use non-privileged APIs to query the status of the GPU and infer the start and stop timestamps of the GPU kernel; a similar technique is described by Naghibijouybari et al. [64]. For the purposes of our evaluation, we will assume the victim exposes the timestamps whenever a GPU kernel is launched and when it finishes, providing direct measurements. The threat model also assumes that the adversary can observe the ciphertexts.
3.4 Cache Bank Conflicts-Based Side-Channel Timing Channel
In this section, we conduct experiments to examine the impact of bank conflicts on GPU program execution time, i.e., discovering the timing side channel. We develop a kernel that uses a warp of threads to issue loads to Shared Memory. Depending on the address of the data that each thread accesses, some number of bank conflicts will occur, resulting in different execution times for the load operations. We perform timing analysis on an Nvidia Kepler K40 GPU. All of the micro-benchmarks presented are designed specifically for the microarchitecture of the Kepler memory system. Later, we show the same timing analysis on the other architectures, Maxwell, Pascal, Volta, and Turing, which feature a range of memory hierarchies that differ from the Kepler architecture. We develop a memory access pattern for a warp of threads to generate a specific number of bank conflicts, produced by selecting the address that each thread accesses. Using a high-resolution (cycle-accurate) time-stamping mechanism, we can study the impact of bank conflicts on the kernel execution time. We have developed Microbenchmark 1, which is shown in Listing 3.2.
Listing 3.2: Microbenchmark 1
1  register uint32_t tmp, tmp2, offset = 64;
2  __shared__ uint32_t share_data[1024 * 4];
3  ...
4  int tid = blockDim.x * blockIdx.x + threadIdx.x;
5  tmp = clock();
6  tmp2 = share_data[tid * stride + 0 * offset];
7  tmp2 += share_data[tid * stride + 1 * offset];
8  ...
9  tmp2 += share_data[tid * stride + 39 * offset];
10 times[tid] = clock() - tmp;
11 in[tid] = tmp2;
The purpose of the microbenchmark is to run 32 concurrent threads in a warp, with each thread generating a sequence of memory accesses, and to measure the execution time of the warp. In Listing 3.2, the variable share_data points to a contiguous 16 KB Shared Memory space, where each element of the array is one word (4 bytes). In Line 4, the thread ID is obtained. In Lines 6-9, each thread accesses 40 memory locations in sequence, with an offset of 64 words between two adjacent memory addresses (offset). The memory address distance between two threads is the stride, which can be tuned to produce a different number of bank conflicts among a warp of threads. Inspecting Listing 3.2 and Figure 3.2, we see that two adjacent memory addresses in a thread have the same bank index and bank offset (64 words = 2^8 bytes), and therefore all the memory addresses requested by a single thread access the same bank. Each thread accesses a single memory region, and the distance between memory regions accessed by different threads is a single stride or multiple strides. By selecting the value of the stride, we can create bank conflicts among threads in a warp. We run this kernel 320,000 times and collect 320,000 timing samples (10,000 timing samples for each stride value, ranging from 1 to 32). Based on our experiments, 10,000 timing samples are enough to produce a stable timing distribution, though the distributions can shift depending on the system load during the experiments. The timing distribution for these samples is shown in Figure 3.3. We observe only five distinct timing distributions for the 32 stride values. Clearly, some stride values have the same timing behavior, and we suspect that those stride values result in the same number of bank conflicts. We next calculate the number of bank conflicts for each stride value. Recall that for our testing platform, the memory address breakdown is shown in Figure 3.2.
Given a word index into the shared data array, we can calculate the bank index by dropping the least-significant bit and then performing a modulo-32 operation, as described in the formula below:
idxB = mod(idxM >> 1, 32) (3.3)
where idxB is the bank index and idxM is the array index. The right-shift operator, >>, drops the least-significant bit, and mod is the modulo operation. As an example, assuming a stride value of 16, a memory access instruction issued across a warp of 32 threads generates the following 32 memory indices: {0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496}. Using Equation (3.3), we have the following bank access indices for the warp: {0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24}. Thus, these requests target four banks, and each bank receives eight concurrent requests, i.e., eight bank conflicts are produced when the stride is 16. Similarly, we calculate the number of bank conflicts for each stride value in the range of 1 to 32 words; the stride values end up in five groups, as shown in Figure 3.3. Each group corresponds to a different number of bank conflicts and an associated average execution time. We also plot the average execution time for each group (for selected stride values) versus the number of Shared Memory bank conflicts in Figure 3.3, and we can easily identify a linear relationship. The slope of the line is 392 GPU cycles per additional conflict over the kernel, with an offset of 1002 GPU cycles. Since we perform 40 sequential Shared Memory loads, this implies that the average penalty per bank conflict is 9.8 GPU cycles, which is also the strength of the timing-channel signal in the Shared Memory banks. Although the penalty for a GPU Shared Memory bank conflict is not as large as the CPU cache-miss penalty (another well-studied cache side channel, with the difference between a hit and a miss around 100 cycles [6, 33, 9]), it can still be a source of information leakage. Countermeasures resistant to the original cache timing attacks may not work for the bank-conflict timing channel.
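The stride analysis above can be reproduced in software: map each of the 32 word indices generated by Microbenchmark 1 to a bank via Equation (3.3), and report the worst-case number of requests landing on one bank. This sketch implements the model described in the text (names are illustrative), not a hardware measurement.

```c
#define WARP_SIZE 32
#define NUM_BANKS 32

/* Equation (3.3): word index -> bank index. */
static unsigned word_bank(unsigned idx_m) {
    return (idx_m >> 1) % NUM_BANKS;
}

/* Largest number of warp requests that fall on a single bank for the
 * given word stride, i.e., the model's bank-conflict count. */
unsigned conflicts_for_stride(unsigned stride) {
    unsigned hist[NUM_BANKS] = {0}, max = 0;
    for (unsigned tid = 0; tid < WARP_SIZE; tid++)
        hist[word_bank(tid * stride)]++;
    for (unsigned b = 0; b < NUM_BANKS; b++)
        if (hist[b] > max) max = hist[b];
    return max;
}
```

For stride 16 this yields eight requests on each of four banks, matching the worked example; running it over strides 1 to 32 groups the strides by their conflict counts.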
Next, we will demonstrate the feasibility of exploiting this fine-grained timing channel for key retrieval through statistical methods.
3.5 Differential Timing Attack
In this section, we devise a differential timing analysis attack to exploit the timing channel in Shared Memory banks. We start by attacking an AES algorithm, because its table lookup operations are key-dependent memory accesses. In our AES implementation, the lookup table is word aligned, similar to the shared data array used in Microbenchmark 1. Therefore, we expect the execution time of one table lookup operation of a warp of threads to be linearly dependent on the number of bank
Figure 3.3: The number of bank conflicts vs. the associated timing for 32 stride values.

conflicts generated by the threads. The execution time of one entire encryption is also dependent on the number of bank conflicts created by the table lookup operations. Since the index for a table lookup operation is related to the round key, with the correct key guess we can predict the number of bank conflicts that will occur during one round of AES encryption across a warp, using Equation (3.3). By using many different blocks of plaintext, the correlation between the average encryption timing and the number of Shared Memory bank conflicts should be high for the correct key guess, and should tend to be lower for incorrect key guesses. This is the basic principle of a differential timing attack, similar to the traditional differential power attack (DPA) [1]. Next, we present the details of our attack methodology on AES. We specifically look at the mapping between the AES lookup tables and Shared Memory banks, describe how we collect data, and recover the last-round AES encryption key.
3.5.1 Mapping Between the AES Lookup Tables and GPU Shared Memory Banks
As described in Section 3.2, since we are attacking the last round of the AES encryption, we only need to examine the T4 lookup table mapping in the Shared Memory. Note that attacking deeper rounds (more than three) becomes infeasible due to the algorithm's inherent statistical confusion and diffusion properties. There are 256 four-byte elements in the T4 lookup table. Equation (3.3) is used to calculate the Shared Memory bank index from a T-table lookup index, where idxB is the bank index and idxM is the T-table lookup index. Because the bank size is 8 bytes on our Nvidia Kepler K40 GPU, we apply the right-shift operator, >>, to drop the least-significant bit.
3.5.2 Collecting Data
The data collection procedure is similar to the experiments we performed in Section 3.4. Instead of 40 memory load instructions, each thread performs an actual AES encryption using a random input data block. We record both the encryption time and the ciphertext for a warp of 32 threads. Each data sample is composed of 32 16-byte ciphertexts and a timing value, in the following format:
[{C_0, C_1, ..., C_31}, t]

where each C_i is the 16-byte ciphertext produced by thread i, consisting of 16 bytes {c_0^i, c_1^i, ..., c_15^i}, and t is the total encryption time for this warp.
We consider the encryption time measured from the GPU side and from the CPU side, respectively. The encryption time measured from the GPU side contains far fewer noise sources than that measured from the CPU side, because of the non-deterministic data transfer time between the GPU and CPU, as well as other required initialization procedures on the GPU device for running a kernel. However, the frequency of the CPU is much higher than that of the GPU; hence, the CPU-side timer provides finer-grained measurements of the encryption. Moreover, measurements on the CPU side represent a more realistic scenario, as the adversary is just a passive observer and normally does not have access to the GPU timer.
3.5.3 Calculating the Shared Memory Bank Index
For the last round of AES, with the output ciphertext known, the input state byte can be calculated using Equation (3.2). For a warp of 32 threads, 32 such table lookups are running
concurrently, and therefore we have:
{s_i^0, s_i^1, ..., s_i^31} = SBox^{-1}[{c_j^0, c_j^1, ..., c_j^31} ⊕ rk_j]    (3.4)

where c_j^0 is the cipher byte produced by thread 0, s_i^0 is the lookup table index for thread 0, and rk_j is the j-th last-round key byte, which is common to all the threads. These lookup table indices, {s_i^0, s_i^1, ..., s_i^31}, are exactly the Shared Memory indices. We use Equation (3.3) to further calculate the bank indices used by all threads in the warp for this table lookup instruction, and then derive the number of bank conflicts.
Figure 3.4: Calculation from ciphertexts to number of bank conflicts
Using the example shown in Figure 3.4, assume we have encrypted 32 16-byte plaintexts using 32 threads (i.e., a warp), obtained 32 16-byte ciphertexts, and aim to recover the 0th last-round key byte. First (A), we select the 0th cipher byte from each of the 32 ciphertexts. Second (B), we convert the selected cipher bytes to the last-round states using a guessed key byte value. Lastly (C), we calculate the accessed Shared Memory bank indices and the number of bank conflicts that occurred. Note that if the key guess is not correct, the calculated number of bank conflicts will be incorrect and will not correlate with the observed timing.
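Steps (A)-(C) can be sketched as follows. The 256-entry inverse SBox is supplied by the caller and omitted here; the function and parameter names are ours, not from our attack code.

```c
#include <stdint.h>

#define WARP_SIZE 32
#define NUM_BANKS 32

/* For one key-byte guess: invert each thread's cipher byte back to its
 * T4 lookup index (Equation (3.4)), map it to a bank (Equation (3.3)),
 * and count the worst-case conflicts across the warp. */
unsigned conflicts_for_guess(const uint8_t cipher_bytes[WARP_SIZE],
                             uint8_t key_guess,
                             const uint8_t inv_sbox[256]) {
    unsigned hist[NUM_BANKS] = {0}, max = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        uint8_t s = inv_sbox[cipher_bytes[t] ^ key_guess]; /* Eq (3.4) */
        hist[(s >> 1) % NUM_BANKS]++;                      /* Eq (3.3) */
    }
    for (unsigned b = 0; b < NUM_BANKS; b++)
        if (hist[b] > max) max = hist[b];
    return max;
}
```

Running this for all 256 guesses over many samples gives, for each guess, a predicted conflict count per sample that can then be compared against the observed timings.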
3.5.4 Recovering Key Bytes
Using the collected data, we can launch a correlation timing attack. As shown in Listing 3.1, for the last round of AES, each T-table lookup uses one byte of the 16-byte state, and therefore each round key byte can be attacked independently. For each data sample we collected, we calculate the number of bank conflicts for the table lookup instruction that uses the jth last-round key byte, as shown in the example in Figure 3.4. For each guessed key byte value (ranging from 0 to 255), we can calculate the correlation between the average timing and the number of bank conflicts, and use
the correlation value to differentiate the correct key byte from the incorrect key guesses. For the data collected, the number of bank conflicts among the 32 threads falls in the range [2, 4]. The power of a correlation timing attack lies in the linearity of the timing model, i.e., the total execution time should consist of a deterministic component, linearly dependent on the number of bank conflicts, plus an independent Gaussian random variable contributed by the other nine rounds. During an actual AES execution, the timing distribution does not conform to this ideal model, and therefore a correlation timing attack may not be more effective than a differential timing attack, which only considers two values of the number of bank conflicts. Thus, we adopt a differential timing attack approach and calculate two average timing values: one for the group of data samples that generate two bank conflicts, and one for the group that generates four bank conflicts. The Difference of Means (DoM) between these two groups should be about two times the bank-conflict penalty, i.e., around 19 cycles. Thus, for each sample we collected, we first calculate the number of bank conflicts as shown in Figure 3.4. Second, we classify its corresponding timing into one of the two groups based on the number of bank conflicts. Finally, we compute the DoM between the two groups. If the correct key value is used, we see a DoM value of around 19 cycles; otherwise, the DoM should be close to zero.
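The grouping and Difference-of-Means computation can be sketched as follows; this is a minimal version with illustrative names, where the conflict count per sample comes from the key-guess calculation described above.

```c
/* Partition the samples by their predicted number of bank conflicts
 * (two vs. four) and subtract the group means. For the correct key
 * guess the DoM should approach twice the per-conflict penalty. */
double difference_of_means(const double *time, const unsigned *conflicts,
                           int n) {
    double sum2 = 0.0, sum4 = 0.0;
    int n2 = 0, n4 = 0;
    for (int i = 0; i < n; i++) {
        if (conflicts[i] == 2) { sum2 += time[i]; n2++; }
        else if (conflicts[i] == 4) { sum4 += time[i]; n4++; }
    }
    if (n2 == 0 || n4 == 0) return 0.0;  /* guess yielded only one group */
    return sum4 / n4 - sum2 / n2;
}
```

The attacker evaluates this statistic for each of the 256 candidate key bytes and keeps the guess with the largest DoM.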