
Novel Side-Channel Attacks on Emerging Cryptographic Algorithms and Computing Systems

A Dissertation Presented by

Chao Luo

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University Boston, Massachusetts

December 2018

To my family.

Contents

List of Figures
List of Tables
List of Acronyms
Acknowledgments
Abstract of the Dissertation

1 Introduction
  1.1 Motivation
  1.2 Research Agenda

2 Side-Channel Analysis of XTS-AES
  2.1 Introduction and Motivation
  2.2 Preliminaries
    2.2.1 XTS-AES Algorithm
    2.2.2 Attack and Leakage Model
  2.3 Simple Power Analysis of Software Implementation on Microcontroller
  2.4 Horizontal Attack of Hardware Implementation: Analysis of Modular Multiplication
    2.4.1 Tweak Generation Leakage Analysis without Noise
    2.4.2 Improved Tweak Recovery
    2.4.3 Block Tweak Leakage Analysis with Noise
    2.4.4 Experimental Results
  2.5 Vertical Attack of Hardware Implementation: CPA on XTS-AES
  2.6 Countermeasures
  2.7 Summary

3 Side-channel Analysis of AES on GPU
  3.1 Introduction and Motivation
  3.2 Preliminaries
    3.2.1 GPU Basics
    3.2.2 AES and a CUDA implementation of AES
    3.2.3 Side-channel Attack and Typical Correlation Power analysis
    3.2.4 Attack Model
  3.3 Experimental Setup and Power Trace Acquisition
  3.4 Power Model Building
    3.4.1 Hamming Distance Based Power Leakage Extraction
    3.4.2 GPU's Power Leakage Model
  3.5 Discovery of GPU by Power Analysis Attacks
    3.5.1 Full Key Extraction
    3.5.2 A More Realistic Execution Environment
  3.6 Countermeasures
  3.7 Summary

4 Side-channel Analysis of RSA on GPU
  4.1 Introduction and Motivation
  4.2 Background: RSA and GPU Implementation
    4.2.1 Sliding Window Exponentiation
    4.2.2 Montgomery Multiplication
    4.2.3 GPU Parallelization of RSA
  4.3 The Timing Models of RSA on GPUs
    4.3.1 GPU Timing Model
    4.3.2 Timing Model Verification
  4.4 Correlation Timing Attack
    4.4.1 Attack CLNW
    4.4.2 Attack VLNW
  4.5 Success Rate Analysis
  4.6 Error Correction
  4.7 Experimental Results
  4.8 Countermeasures
  4.9 Summary

5 Side-channel Analysis of ECC on Embedded Systems
  5.1 Introduction and Motivation
  5.2 Preliminary
    5.2.1 ECC Background
    5.2.2 Side-Channel Countermeasures of micro-ecc
  5.3 Novel Simple Power Analysis
  5.4 Collision Attack
  5.5 Discussion
  5.6 Summary

6 Conclusion

Bibliography

List of Figures

2.1 Diagram of XTS-AES sector
2.2 Power difference for Tj[127] = 1 and 0
2.3 Operation of modular multiplication for different cases of {Tj[127], Tj+1[127]}
2.4 Hamming weights/distances of block tweaks
2.5 Probability distribution of number of possible values for the 7 least significant bits
2.6 Bit error rate of Bayesian test and ML-based test
2.7 Distribution of power difference ΔPj
2.8 Comparison of ΔHDj and ΔPj
2.9 Experiment results with Bayesian test
2.10 Correlation coefficient with T[0] and RT[0]
2.11 Power difference for Tj[127] = 1 and 0 with dummy XOR protection
2.12 Comparison of ΔHDj and ΔPj with dummy bit protection

3.1 Typical CUDA threads and blocks present in a single grid [1].
3.2 Block diagram of a TESLA C2070 streaming multiprocessor [1].
3.3 The round operation running as one thread.
3.4 The power measurement setup used in this work.
3.5 A sample power trace of our GPU running AES, with the DC signal subtracted.
3.6 Last round operation on registers for one state byte.
3.7 Distribution of confusion coefficient for one byte of the key for the GPU.
3.8 Distribution of the confusion coefficient without linearity.
3.9 Correlation between the power traces and the Hamming distances for all possible subkey byte values.
3.10 Our CPA attack results.
3.11 Success rate with different combinations of linear and nonlinear Hamming distances.
3.12 Empirical and theoretical success rates for 8, 16 and 32 blocks of plaintext.

4.1 Timing model verification
4.2 Operations on Mtemp with CLNW
4.3 Operations on Mtemp with VLNW.
4.4 Theoretic and empirical success rate
4.5 Sequence of correlation coefficients of a timing attack when an error happens
4.6 Correlation coefficients of attacking zero and nonzero windows.
4.7 Always reduce countermeasure
4.8 Random assignment countermeasure

5.1 Modular multiplication and addition
5.2 Simple power leakage from power and EM traces.
5.3 Correlation of power trace with sliding multiplication pattern
5.4 Count number of additions after modular multiplication
5.5 Ephemeral key candidate search
5.6 Power and EM trace collision

List of Tables

2.1 Threshold and BER for Different SNR
2.2 Complexity of Search Among Erroneous Bits

4.1 Attack result with error correction.

5.1 Attacks and Countermeasures

List of Acronyms

AES Advanced Encryption Standard. The Advanced Encryption Standard (AES), also known by its original name Rijndael, is a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology (NIST) in 2001.

CLNW Constant Length Nonzero Window. A sliding window algorithm of RSA, which partitions the private key into segments of windows. The nonzero window has a constant length of bits.

CPA Correlation Power Analysis.

DPA Differential Power Analysis.

ECC Elliptic-Curve Cryptography. ECC is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC requires smaller keys compared to non-EC cryptography (based on plain Galois fields) to provide equivalent security.

RSA Rivest–Shamir–Adleman. RSA is an algorithm used by modern computers to encrypt and decrypt messages. It is an asymmetric cryptographic algorithm. Asymmetric means that there are two different keys. This is also called public key cryptography, because one of the keys can be given to anyone. The other key must be kept private. The algorithm is based on the fact that finding the factors of a large composite number is difficult: when the integers are prime numbers, the problem is called prime factorization.

SPA Simple Power Analysis.

VLNW Variable Length Nonzero Window. A sliding window algorithm of RSA, which partitions the private key into segments of windows. The nonzero window has a variable length of bits.

XTS-AES Xor-encrypt-xor-based tweaked-codebook mode with ciphertext stealing AES. An AES mode designed for disk encryption and standardized on 2007-12-19 as IEEE P1619.

Acknowledgments

Thanks to my advisor, Professor Yunsi Fei, who guided and supported me through the years of my PhD study and research. Thanks to Professor David Kaeli for all the help with my research. Thanks to Professor Adam Ding for the help with the mathematics involved in my research. Thanks to Professor Pau Closas for his contribution to the statistical analysis. Thanks to Professor Aatmesh Shrivastava for proofreading.

Abstract of the Dissertation

Novel Side-Channel Attacks on Emerging Cryptographic Algorithms and Computing Systems

by

Chao Luo

Doctor of Philosophy in Computer Engineering

Northeastern University, December 2018

Dr. Yunsi Fei, Advisor

After more than 20 years of research and development, side-channel attacks continue to pose serious threats to various computing systems. When targeting cryptographic implementations to retrieve the secret, side-channel attacks exploit the peculiarities of the specific implementation, and achieve much better efficiency than brute-force attacks and traditional cryptanalysis, which attacks the weaknesses of the cryptographic algorithms themselves. Typical side channels include power consumption, electromagnetic emanation, and execution time. Given the inherent correlation between this side-channel information and the secret, statistical analysis can be employed to find the secret.

However, there are still many challenges for side-channel research, driven by two trends: new ciphers and emerging computing platforms. New ciphers or variants are being developed to provide a higher level of security or to be tailored to different applications. For example, XTS-AES (XEX-based tweaked-codebook mode with ciphertext stealing AES) is a security-hardened mode of AES for storage systems, which increases the algorithm complexity and hides more system-dependent parameters from users (attackers). Meanwhile, we see more emerging computing platforms, for general-purpose computing or specific algorithm acceleration. Graphics Processing Units (GPUs) have been used to run a range of cryptographic algorithms for higher performance. However, the security of GPUs when processing sensitive data, especially the highly relevant side-channel vulnerabilities, has received little attention and is largely unexplored. Moreover, GPUs differ distinctly from other computing platforms in terms of hardware structure and software programming model, making side-channel attacks on GPUs much more challenging.

In this dissertation, I propose several novel side-channel attacks, targeting new ciphers including XTS-AES and ECC, and also popular accelerators, namely GPUs. Some of our vulnerability analyses and security evaluations are the first of their kind, and we anticipate that they will pave the way for mitigations and lead to more active side-channel research. The contributions of this dissertation include:

• Evaluation of the security of XTS-AES. XTS-AES features two secret keys instead of one, and an additional tweak for each data block. These characteristics make the mode not only resistant against cryptanalysis attacks, but also more challenging for side-channel attacks. In this project, I comprehensively analyze the side-channel power leakage of various XTS-AES implementations and invent effective attacks. I first run a simple power analysis of a software implementation. For a hardware implementation on FPGA, I analyze the side-channel leakage of the particular modular multiplication in XTS-AES mode. In addition, I utilize the relationship between two consecutive block tweaks and propose a method to work around the masking of ciphertext by the tweak. These attacks are verified on an FPGA implementation of XTS-AES. The results show that XTS-AES is susceptible to side-channel power analysis attacks, and therefore dedicated protections are required for the security of XTS-AES in storage devices.

• Analysis of the power side-channel of GPU running AES. I propose and implement a side-channel power analysis methodology to extract all the last round key bytes of an AES implementation on an NVIDIA TESLA GPU. I first analyze the challenges of capturing GPU power traces due to the degree of concurrency and underlying architectural features of a GPU, and propose techniques to overcome these challenges. I then construct an appropriate power model for the GPU. I describe effective methods to process the GPU power traces and launch a correlation power attack on the processed data. I carefully consider the scalability of the attack with increasing degrees of parallelism, a key challenge on the GPU. Both our empirical and theoretical results show that parallel computing hardware systems such as a GPU are vulnerable to power analysis side-channel attacks, and need to be hardened against such threats.

• Analysis of the timing leakage of a public-key cipher, RSA, on GPU. I build a timing model to capture the parallel characteristics of an RSA public-key cipher as implemented on a GPU, and consider optimizations that include Montgomery multiplication and sliding window exponentiation. Our timing model considers the challenges of parallel execution, complications which do not occur in single-threaded computing platforms. I describe the first successful timing attack on RSA running on a GPU, extracting the private key of RSA. I also present an effective error detection and correction mechanism. The results demonstrate that GPU acceleration of RSA is vulnerable to side-channel timing attacks. Countermeasures to defend against such attacks are proposed to be incorporated into next-generation GPU implementations.

• Power analysis methods on a real-world ECC library, micro-ecc [2]. micro-ecc incorporates many countermeasures against power analysis attacks on ECC and is the state-of-the-art implementation. Still, I discover two new side channel leakages. Based on these leakages, I propose and evaluate attacks to exploit these two weaknesses, and demonstrate practical full-key recovery on both AVR and ARM embedded systems using either a single power trace or a single electromagnetic emanation trace. These attacks are also applicable to other IoT systems using micro-ecc, and represent information leakage threats to a wide class of security services that are built on top of ECC.

Chapter 1

Introduction

1.1 Motivation

Side-channel attacks were first proposed by Kocher in the seminal work [3] in 1996. They fundamentally changed the notion of cryptanalysis, as a side-channel attack utilizes information leakage obtained from the physical implementation of a cryptographic algorithm rather than just theoretically analyzing the algorithm's weakness. Typical side channels include power consumption, electromagnetic emanation (EM), and execution time. Depending on the side-channel leakage and properties of the system under attack, the analysis method could be simple power (EM) analysis, differential power (EM) analysis, or advanced statistical analysis such as Mutual Information Attack [4], Template Attack [5] and Timing Attack [6]. Side-channel attacks have been demonstrated to be powerful and effective against all the widely used ciphers, such as DES, AES and RSA, running on various computing systems, such as embedded systems, FPGAs and CPUs, if no dedicated protection is employed. However, the creation of new ciphers and the emergence of new computing platforms also present more challenges and opportunities for side-channel analysis.

New ciphers or variants are designed to increase the security level or to be tailored to specific applications. These ciphers are impacting our daily life significantly, with wide deployment of security engines in diverse applications and infrastructures. Their mathematical security has been proven by classic cryptanalysis, and they have been certified and standardized by government agencies. However, the side-channel vulnerabilities of these ciphers have not been fully evaluated yet. They also introduce new challenges for side-channel research, because the algorithms are much more complex. The existing side-channel attacks may not apply to them, and the algorithms have to be strictly scrutinized for potential side-channel leakage. My dissertation first targets XTS-AES, a security-hardened advanced mode of AES proposed by IEEE and approved by NIST as an encryption cipher for block-oriented storage systems. It enhances the security over existing AES modes in multiple ways. First, it uses two secret keys instead of one. The two keys are independent of each other, such that knowing only one of the keys would not break the system. Second, the input and output of AES encryption are masked by a logical-address-dependent tweak, which is generated by another AES encryption and modular multiplication. As a result, the internal state of the cipher is blinded to attackers. Since its approval by NIST, it has been widely used in hard disk drives, solid state drives and flash drives. It was also introduced to Windows 10's BitLocker encryption. With such broad usage in real systems, any side-channel vulnerability would be detrimental to system security. I thoroughly analyze the algorithmic characteristics of XTS-AES, discover unique side-channel power leakage of XTS-AES operations, and invent novel attacks, revealing the weakness of XTS-AES to side-channel analysis. As post-quantum cryptography is currently being developed for next-generation security engines, this line of research on new ciphers will provide guidelines for side-channel analysis of post-quantum cryptography.

Another emerging development for cryptographic implementations is that ciphers are migrating from conventional sequential computing platforms (CPUs) to parallel computing platforms for higher performance. The Graphics Processing Unit (GPU) has been employed for general-purpose computing in addition to traditional graphics rendering, including accelerating a range of cryptographic algorithms. Block ciphers are suitable for parallel computing with their independent data blocks and common operations on blocks. RSA, a computation-intensive public-key cipher, is notorious for its low performance, and acceleration is desired. There have been many crypto libraries porting both block ciphers and public-key ciphers onto GPUs, with the improvement of throughput reaching hundreds to thousands of times compared to the fastest CPUs. However, the research on GPU security, especially its side-channel vulnerability, is still in its infancy. Our exploration along this direction is the first of its kind. We realize that it is very challenging to analyze side-channel vulnerabilities of GPUs, because their characteristics differ significantly from those of other computing platforms, including CPUs and FPGAs. With the unique SIMT (Single Instruction Multiple Thread) model, many threads (processes) run concurrently on different data, introducing enormous noise in the physical side channels the attacker would monitor. In addition, there is temporal uncertainty of execution due to the non-deterministic and unknown scheduling of these threads on the underlying processing elements, while execution and measurement alignment is often a requirement for side-channel analysis. It is therefore a daunting task to perform effective side-channel analysis. In this dissertation, I target side-channel analysis of GPUs. I propose methods to overcome these challenges and design innovative methods to launch successful attacks on both AES and RSA implementations on GPU.


The last work of my PhD research is an evaluation of the side-channel (power analysis) vulnerability of a side-channel-resistant, real-world ECC library, micro-ecc. ECC, a relatively new public-key cipher, achieves much higher computation efficiency with a shorter key size compared to RSA. Many countermeasures developed against known side-channel vulnerabilities of ECC have been incorporated into micro-ecc, making it the state-of-the-art implementation of ECC. micro-ecc is being widely deployed in embedded systems, including the burgeoning Internet-of-Things (IoT) devices and systems. Despite the claimed side-channel resistance [2], I discovered two vulnerabilities of the implementation, and designed efficient attack methods to retrieve the secret key. Countermeasures to prevent these attacks are also proposed in this dissertation.

1.2 Research Agenda

With the aforementioned motivations, my PhD dissertation investigates several new ciphers/modes and the emerging computing platform for cryptographic implementation, the GPU. I investigate the following topics in the rest of the dissertation:

1. Power analysis of XTS-AES: I target an XTS-AES implementation on FPGA using power analysis. The novelty lies in the discovery of leakage from the modular multiplication of XTS-AES, which has not been investigated before. I also discover that the relationship between consecutive block tweaks can be leveraged to break the system. I design attack methods utilizing the new vulnerabilities to retrieve the full secret key.

2. Power analysis of AES on GPU: This is the first successful side-channel power analysis of a GPU. I derive a power model and a corresponding data processing method to overcome the challenges introduced by the GPU's parallel computing feature. Correlation power analysis is used to extract the secret key with the power model. The results show, for the first time, that GPUs are vulnerable to side-channel power analysis, and that side-channel countermeasures should be developed for GPUs as well.

3. Timing analysis of RSA on GPU: I target a popular RSA implementation on GPU with Montgomery multiplication and sliding window exponentiation. A hierarchical timing model is built to explicitly capture various complex interactions in a massively parallel computing platform. Based on this model, I design a successful correlation timing analysis to retrieve the private key in an iterative manner. An effective error correction algorithm is designed to detect and correct attack errors, achieving a significant improvement in success rate.


4. Power analysis of state-of-the-art ECC implementation: I target a real-world implementation of ECC, micro-ecc. Two vulnerabilities are newly discovered even though it has already been protected from common side-channel power attacks, including SPA and DPA. A simple power analysis and a collision attack are designed to exploit the two weaknesses, respectively, and both successfully recover the private key of ECC. I also propose effective countermeasures to prevent such attacks.

Chapter 2

Side-Channel Analysis of XTS-AES

2.1 Introduction and Motivation

XTS-AES (XEX-based tweaked-codebook mode with ciphertext stealing) [7], an AES (Advanced Encryption Standard) mode designed specifically for data protection on block-oriented storage devices, has been widely used on hard disk drives (HDD), solid-state disks (SSD) and flash cards. It protects the confidentiality of sensitive data even when adversaries have physical access to the device. The US National Institute of Standards and Technology (NIST) approved its usage in 2010 [8]. It adopts AES as its block cipher, but involves two phases of AES encryption with different keys. In the first phase, a series of block tweaks (128-bit data blocks) are generated through one AES encryption with a tweak key, followed by multiplication-by-2 in the finite field GF(2^128). The second phase is for data encryption, where every input and output block of AES is XORed with a distinct block tweak generated in the first phase. In order to reveal the plaintext of the data stored on a protected device, the attacker has to infer the block tweaks in addition to the data encryption key, all of which are unknown to adversaries.

The security of different modes of AES under side-channel analysis has been evaluated by many researchers. Jaffe described a differential power analysis (DPA) against the counter mode (CTR) without knowing the initial counter [9]. In [10], Jayasinghe et al. studied the common advanced modes, including Cipher Block Chaining (CBC), Cipher Feedback (CFB), Output Feedback (OFB) and CTR, under side-channel analysis in a quantitative way. Recently the security of the new XTS-AES mode has also attracted a lot of research attention. Unterluggauer et al. pointed out that the data encryption key can be extracted by one extra attack on the second-last round [11], in addition to the normal attack on the last round. However, the attack on the second-last round needs to deal with the MixColumn operation [12], which increases the attack complexity because four subkey bytes are involved. Side-channel information may also be leaked through modular multiplication in a finite field, and be used to launch attacks. In [13, 14], Belaïd et al. described a side-channel analysis of finite field multiplication. Their attack works when the secret operand of the multiplication is constant and the known operand changes. However, their attack is not applicable to XTS-AES, where the secret operand (the tweak) keeps changing through multiplication by the constant 2 in XTS-AES's modular multiplication.

2.2 Preliminaries

In this section, I first give a brief introduction of XTS-AES and its modular multiplication. I then describe the attack model and the leakage model of XTS-AES on sector-based storage drives.

2.2.1 XTS-AES Algorithm

XTS-AES mode is an instantiation of Rogaway's XEX (XOR Encrypt XOR) tweakable block cipher [15]. It provides stronger security than the common Electronic Codebook Mode (ECB), and can be parallelized for better performance [16] compared with CBC Mode. It protects data against ciphertext manipulation and cut-and-paste attacks [17]. The data to be encrypted is divided into equal-sized sectors containing multiple data blocks (e.g., each of size 128 bits for AES-128). The typical sector size of storage devices was 512 bytes before 2011 and has been 4K bytes since [18]. In this chapter, we consider a sector size of 4K bytes, which consists of 256 128-bit blocks.

Fig. 2.1 shows the process of an XTS-AES sector encryption. The initial 128-bit tweak, T0, is normally the logical sector address (i) where the data is stored, encrypted with a tweak key KeyT by an AES encryption. T0 then goes through a modular multiplication by α, a primitive element of GF(2^128) which is 2 here, to produce the next block tweak T1, and so on for further block tweaks. Each block tweak is applied to mask the plaintext (using XOR) before the data encryption, and to demask the output after the data encryption to generate the final ciphertext.

The operation of modular multiplication by α can be described as below. With each block tweak Tj represented as a vector of 128 bits (Tj[127], Tj[126], ..., Tj[0]), the relationship between Tj and Tj+1 is:

\[
T_{j+1}[i] =
\begin{cases}
T_j[(i-1)\,\%\,128] \oplus T_j[127] & \text{for } i \in \{1, 2, 7\} \\
T_j[(i-1)\,\%\,128] & \text{otherwise}
\end{cases}
\tag{2.1}
\]

Figure 2.1: Diagram of XTS-AES sector encryption

where ⊕ is bitwise XOR, and % is the integer modulo operation. That is, when Tj[127] = 0, the modular multiplication is a circular left shift by one bit; otherwise, it is a circular left shift by one bit followed by inverting the bits at indices 1, 2, and 7.
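As a minimal sketch of Equation (2.1), the multiplication by α can be written on a 128-bit integer in two equivalent ways: as a logical left shift with a conditional XOR of 0x87, and bit by bit following the equation. The function names and the integer representation (bit 0 is the least significant bit) are assumptions made for illustration only.

```python
MASK128 = (1 << 128) - 1  # keeps values within 128 bits


def mul_alpha(t: int) -> int:
    """Multiply a 128-bit tweak by alpha = 2: left shift, then XOR 0x87 if the MSB was 1."""
    msb = (t >> 127) & 1
    shifted = (t << 1) & MASK128  # logical left shift, bit 0 becomes 0
    return shifted ^ (0x87 if msb else 0)


def mul_alpha_bitwise(t: int) -> int:
    """The same operation written bit by bit, following Equation (2.1)."""
    bits = [(t >> i) & 1 for i in range(128)]
    out = 0
    for i in range(128):
        b = bits[(i - 1) % 128]   # circular left shift by one position
        if i in (1, 2, 7):
            b ^= bits[127]        # extra XOR with the old most significant bit
        out |= b << i
    return out
```

The constant 0x87 sets bits 0, 1, 2 and 7; bit 0 is covered in Equation (2.1) by the circular shift itself, since it receives the old bit 127.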

2.2.2 Attack and Leakage Model

In the attack model, we assume that the adversary only has knowledge of the stored ciphertext, Cj, which can be obtained physically from the memory storage device. This is also the threat model considered in the design of XTS-AES [7]. The plaintext and the block tweaks are never revealed. Although the logical sector address, i, can be observed by probing the memory bus [11], it can be easily hidden by adding an unknown offset to it or by other obfuscation methods. We consider it unknown for general attacks. The output of the data encryption, CCj, is unknown because of the unknown block tweak Tj.

For the leakage model, we assume the Hamming weight or Hamming distance of intermediate data is leaked with some Gaussian noise. The Hamming weight model is commonly used for attacking microprocessor implementations, and the Hamming distance model is more suitable for hardware implementations such as FPGA and ASIC.

The goal of our side-channel power analysis is to recover both the encryption and tweak keys. The block tweak is a key element here. By knowing the T0 of every sector associated with different i, the tweak key KeyT can be recovered by a classical CPA considering T0 as the ciphertext. For the recovery of the encryption key KeyE, the ciphertext on the disk is first demasked by the block tweak, and then CPA can be applied with the knowledge of the encryption output CC.
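The leakage model above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the leakage is taken to be exactly the Hamming weight or Hamming distance (unit scaling, no offset) plus zero-mean Gaussian noise, and the helper names and the noise level `sigma` are hypothetical, not taken from the experiments in this chapter.

```python
import random


def hamming_weight(x: int) -> int:
    """Number of bits set to 1 in x."""
    return bin(x).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return hamming_weight(a ^ b)


def leak(value: float, sigma: float, rng: random.Random) -> float:
    """One simulated power sample: ideal leakage plus Gaussian noise (assumed model)."""
    return value + rng.gauss(0.0, sigma)


rng = random.Random(0)               # fixed seed so the sketch is repeatable
prev_reg, new_reg = 0x0F, 0xF0       # hypothetical register contents

# Hamming weight model (typical for microprocessors).
hw_sample = leak(hamming_weight(new_reg), sigma=0.5, rng=rng)
# Hamming distance model (typical for FPGA/ASIC registers).
hd_sample = leak(hamming_distance(prev_reg, new_reg), sigma=0.5, rng=rng)
```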


2.3 Simple Power Analysis of Software Implementation on Microcontroller

In our prior work [19], we focused on attacks on hardware implementations. For a comprehensive power analysis of the XTS-AES algorithm, we also have to consider software implementations. In this section, we evaluate the side-channel vulnerability of software implementations and perform a simple power analysis of a microcontroller. The recommended software implementation of XTS-AES in [7] is vulnerable to simple power analysis, leading to the disclosure of tweak T0. When updating the (j+1)-th block's tweak Tj+1 from Tj as in Equation (2.1), there is a conditional extra XOR operation depending on the most significant bit of Tj. Algorithm 1 depicts the tweak updating process, where Tj(i) is the i-th byte of Tj, and there are 16 bytes in each tweak.

Algorithm 1 Tweak Updating
Input: Tj
Output: Tj+1 = Tj ⊗ α
1: Cin ← 0
2: for i = 0 to 15 do {16-byte left shifting}
3:   Cout ← (Tj(i) >> 7) & 1
4:   Tj+1(i) ← ((Tj(i) << 1) + Cin) & 0xFF
5:   Cin ← Cout
6: end for
7: if Cout == 1 then
8:   Tj+1(0) ← Tj+1(0) ^ 0x87
9: end if
10: return Tj+1

In Algorithm 1, the input Tj and output Tj+1 are two consecutive 128-bit tweaks. Lines 2-6 implement the 16-byte logical left shift (the rightmost bit is filled with zero), and the leftmost bit (MSB) is saved in Cout. The loop starts from the least significant byte, Tj(0), and iterates 16 times to cover all the bytes. For each byte, Cout extracts the most significant bit of that byte, and Cin is the MSB of the next lower byte. Lines 7-9 perform the conditional XOR: if the final Cout of tweak Tj is one, the generated tweak Tj+1 is XOR-ed with 0x87.
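For readers who want to experiment with Algorithm 1, a direct Python transcription might look as follows. It is a sketch, assuming the tweak is held as 16 bytes with byte 0 the least significant; the function name is hypothetical, and Python itself says nothing about the power behavior of a microcontroller implementation.

```python
def tweak_update(tj: bytes) -> bytes:
    """Transcription of Algorithm 1: compute T_{j+1} = T_j * alpha byte by byte."""
    assert len(tj) == 16
    out = bytearray(16)
    c_in = 0
    c_out = 0
    for i in range(16):                          # lines 2-6: 16-byte left shift
        c_out = (tj[i] >> 7) & 1                 # MSB of the current byte
        out[i] = ((tj[i] << 1) + c_in) & 0xFF    # shift in the carry from the lower byte
        c_in = c_out
    if c_out == 1:                               # lines 7-9: conditional XOR (the SPA leak)
        out[0] ^= 0x87
    return bytes(out)
```

For example, a tweak whose only set bit is bit 127 (byte 15 equal to 0x80) is mapped to a tweak whose byte 0 is 0x87 and whose other bytes are zero.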

If Tj[127] = 1, there will be an extra XOR operation at line 8. As a result, the power consumption of the XOR operation can be observed in the power trace when compared with a trace without the extra XOR operation: the power profile is longer and has a different shape. The most significant bit of Tj can thus be extracted by a simple power analysis, as shown in Fig. 2.2. The power traces are measured from an ATmega328p microcontroller running the XTS-AES encryption. As shown in Section 2.4, T0 can be recovered by knowing Tj[127] for j = 0, ..., 127.

There are multiple quick solutions to fix this vulnerability. One is to balance the branch by also performing an XOR operation when Tj[127] = 0, but with the constant 0x00 instead of 0x87. Another is to use two extra variables to hold the values of Tj+1(0) and Tj+1(0) ^ 0x87, and copy one of them to the final output Tj+1(0) according to Tj[127]; this incurs execution overhead and may be optimized away by the compiler. Both methods are effective for hiding the value of Tj[127] on the ATmega328p from simple power analysis, but may still be vulnerable to advanced side-channel attacks such as the template attack [20].

Figure 2.2: Power difference for Tj[127] = 1 and 0 (x-axis: sampling points; y-axis: normalized voltage; the trace for Tj[127] = 1 shows the extra XOR operation)
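The first fix mentioned above, always executing an XOR whose constant is 0x87 or 0x00, can be sketched as follows. This is a variation that removes the branch by deriving the constant arithmetically from the carry bit; the function name is hypothetical, and the code only illustrates the logic. Python gives no timing or power guarantees, and the operand of the XOR still depends on Tj[127], which is consistent with the remark that advanced attacks may remain possible.

```python
def tweak_update_balanced(tj: bytes) -> bytes:
    """Tweak update in which the final XOR is executed for both values of T_j[127]."""
    assert len(tj) == 16
    out = bytearray(16)
    c_in = 0
    c_out = 0
    for i in range(16):
        c_out = (tj[i] >> 7) & 1
        out[i] = ((tj[i] << 1) + c_in) & 0xFF
        c_in = c_out
    # -c_out is 0 or -1 (all ones), so the constant is 0x00 or 0x87.
    out[0] ^= (-c_out) & 0x87
    return bytes(out)
```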

2.4 Horizontal Attack of Hardware Implementation: Analysis of Modular Multiplication

For hardware implementations, simple power analysis is no longer effective. The common implementation of lines 7 to 9 in Algorithm 1 is a multiplexer selecting either the original Tj+1(0) or the XOR-ed result as the final value of Tj+1(0), based on the value of Tj[127]. For different values of Tj[127], there is no time difference, and the power difference is minimal.

As a result, we choose to attack the modular multiplication in XTS-AES to recover the block tweaks, where the operations proceed horizontally within a data sector encryption, as shown in Fig. 2.1, to generate block tweaks consecutively. We call such an attack a horizontal attack. We first ignore the effect of noise, and demonstrate how to get the value of T0 bit by bit. We then show how to recover the bit information with noise and correct errors when they occur.

2.4.1 Tweak Generation Leakage Analysis without Noise

We assume all the tweaks are stored in a register. Its Hamming weight and Hamming distance are HW(Tj) and HW(Tj+1 ⊕ Tj) for j ≥ 0. For the Hamming weight power leakage model, we can obtain the Hamming weights from simple power analysis (SPA) of a single side-channel power trace. If Tj[127] is 0, then according to (2.1), Tj+1 is Tj circularly left-shifted by one, so the Hamming weight is unchanged: HW(Tj+1) = HW(Tj). Otherwise, three bits are inverted in Tj+1, which changes HW(Tj+1) by either ±1 or ±3. We denote the Hamming weight difference between two consecutive block tweaks in a sector as:

ΔHWj+1 = HW(Tj+1) − HW(Tj) for j ≥ 0 (2.2)

It can be summarized as:

\[
\Delta HW_{j+1} =
\begin{cases}
0 & \text{for } T_j[127] = 0 \\
\Delta & \text{for } T_j[127] = 1,\ \Delta \in \{\pm 1, \pm 3\}
\end{cases}
\tag{2.3}
\]

By observing ΔHWj+1, we can tell the value of Tj[127]. After obtaining {T0[127], T1[127], ..., T127[127]}, the highest bits of 128 consecutive block tweaks within one sector, we can reconstruct T0 based on (2.1):

\[
\begin{cases}
T_0[n] = T_{127-n}[127] & \text{for } 127 \ge n \ge 7 \\
T_0[n] = T_{127-n}[127] \oplus T_0[121+n] & \text{for } 6 \ge n \ge 2 \\
T_0[1] = T_{126}[127] \oplus T_0[122] \oplus T_0[127] \\
T_0[0] = T_{127}[127] \oplus T_0[121] \oplus T_0[127] \oplus T_0[126]
\end{cases}
\tag{2.4}
\]
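The reconstruction in (2.4) can be checked with a small noiseless simulation. The Python sketch below is illustrative only: the function names are hypothetical, tweaks are held as 128-bit integers with bit 127 as the MSB, and the update rule assumes the standard XTS multiplication by x with feedback constant 0x87, which is taken here to be what Equation (2.1) describes.

```python
MASK128 = (1 << 128) - 1

def next_tweak(t):
    """One block-tweak update (assumed form of Eq. 2.1): shift left, XOR 0x87 if the MSB was 1."""
    msb = t >> 127
    t = (t << 1) & MASK128
    return t ^ 0x87 if msb else t

def hw(x):
    """Hamming weight of an integer."""
    return bin(x).count("1")

def msbs_from_hw(t0, n=128):
    """Noiseless HW-model SPA: T_j[127] = 1 exactly when HW(T_{j+1}) != HW(T_j)."""
    msbs, t = [], t0
    for _ in range(n):
        nxt = next_tweak(t)
        msbs.append(1 if hw(nxt) != hw(t) else 0)
        t = nxt
    return msbs

def recover_t0(m):
    """Rebuild T0 from m[j] = T_j[127], j = 0..127, following Eq. (2.4)."""
    bit = [0] * 128
    for n in range(127, 6, -1):          # 127 >= n >= 7
        bit[n] = m[127 - n]
    for n in range(6, 1, -1):            # 6 >= n >= 2
        bit[n] = m[127 - n] ^ bit[121 + n]
    bit[1] = m[126] ^ bit[122] ^ bit[127]
    bit[0] = m[127] ^ bit[121] ^ bit[127] ^ bit[126]
    return sum(b << i for i, b in enumerate(bit))
```

Calling `recover_t0(msbs_from_hw(t0))` on any 128-bit `t0` should return `t0` itself under these assumptions, since a nonzero Hamming weight change is always odd and therefore never cancels out.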

For the Hamming distance power leakage model, we define the differential of two consecutive Hamming distances in the register as:

\[
\begin{aligned}
\Delta HD_{j+2} &= HDT_{j+2} - HDT_{j+1} \\
&= HW(T_{j+2} \oplus T_{j+1}) - HW(T_{j+1} \oplus T_j)
\end{aligned}
\tag{2.5}
\]

We consider all four cases of (Tj[127], Tj+1[127]), and illustrate the composition of Tj, Tj+1, and Tj+2 in Fig. 2.3. We represent each bit of Tj as a box with an index number, and express Tj+1 and Tj+2 using the bits of Tj. A number with a bar over it denotes its inverse, and two bars denote the inverse of its inverse, which is the bit itself. We denote a set I = {0, 1, 2, ..., 127}, and I\A as the subset of I complementing the subset A.

[Figure 2.3 diagram: for each of the four cases, three rows of bit-index boxes showing Tj, Tj+1, and Tj+2 in terms of the bits of Tj.]

Figure 2.3: Operation of modular multiplication for different cases of {Tj[127], Tj+1[127]}

Under the case of (Tj[127], Tj+1[127]) = (0, 0), the bit values of all three tweaks do not change; only the bit positions in Tj+1 and Tj+2 change. The Hamming distances HDTj+1 and HDTj+2 are both equal to the sum of the XOR results of every bit with its previous bit. That is,

\[
HDT^{00}_{j+1} = HDT^{00}_{j+2} = \sum_{i \in I} T_j[i] \oplus T_j[(i+1)\,\%\,128]
\tag{2.6}
\]

\[
\Delta HD^{00}_{j+2} = HDT^{00}_{j+2} - HDT^{00}_{j+1} = 0
\tag{2.7}
\]

The superscript denotes the case of (Tj[127], Tj+1[127]).

For the case of (0, 1), three bits, Tj[127], Tj[0], and Tj[5], are flipped in Tj+2 at different positions, and therefore HDTj+1 and HDTj+2 differ in three places. We have

\[
\begin{aligned}
\Delta HD^{01}_{j+2} &= HDT^{01}_{j+2} - HDT^{01}_{j+1} \\
&= \overline{T_j[127]} \oplus T_j[0] + \overline{T_j[0]} \oplus T_j[1] + \overline{T_j[5]} \oplus T_j[6] \\
&\quad - T_j[127] \oplus T_j[0] - T_j[0] \oplus T_j[1] - T_j[5] \oplus T_j[6]
\end{aligned}
\tag{2.8}
\]

We know that $\overline{a} \oplus b - a \oplus b = a \oplus \overline{b} - a \oplus b$, with a value of 1 or −1. Therefore $\Delta HD^{01}_{j+2} \in \{-3, -1, 1, 3\}$ by enumerating all the possibilities. The case of (1, 0) is similar to (0, 1), with only the positions of the differences varied, so we also have $\Delta HD^{10}_{j+2} \in \{-3, -1, 1, 3\}$. For (1, 1), the same analysis gives $\Delta HD^{11}_{j+2} = 0$. In conclusion,

\[
\Delta HD_{j+2} =
\begin{cases}
0 & \text{for } T_j[127] = T_{j+1}[127] \\
\Delta & \text{otherwise},\ \Delta \in \{\pm 1, \pm 3\}
\end{cases}
\tag{2.9}
\]

Unlike the Hamming weight model, here SPA of the power trace only reveals the relationship between any two consecutive bits Tj[127] and Tj+1[127]. To reconstruct T0, we start with a guess of T0[127], and Tj[127] for j ≥ 1 can then be determined according to (2.9). T0 can then be reconstructed in the same fashion as in the Hamming weight leakage model by using (2.4), but with two candidates instead of one (one for T0[127] = 0, one for T0[127] = 1). The erroneous one can be eliminated through the verification method in the next section.
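The two-candidate recovery under the Hamming distance model can be sketched in a few lines. This is a noiseless illustration under stated assumptions, not the implementation used in the experiments: the helper names are hypothetical, and the tweak update is assumed to be the standard XTS multiplication by x (shift left, XOR 0x87 when the MSB was 1).

```python
MASK128 = (1 << 128) - 1

def next_tweak(t):
    """Assumed form of Eq. (2.1): multiply the tweak by x in GF(2^128)."""
    msb = t >> 127
    t = (t << 1) & MASK128
    return t ^ 0x87 if msb else t

def hw(x):
    return bin(x).count("1")

def hd_sequence(t0, n):
    """Return [HDT_1, ..., HDT_n], where HDT_{j+1} = HW(T_{j+1} xor T_j)."""
    out, t = [], t0
    for _ in range(n):
        nxt = next_tweak(t)
        out.append(hw(nxt ^ t))
        t = nxt
    return out

def msb_sequence(t0, n):
    """Ground truth [T_0[127], ..., T_{n-1}[127]], used only for checking."""
    out, t = [], t0
    for _ in range(n):
        out.append(t >> 127)
        t = next_tweak(t)
    return out

def msb_candidates(hds):
    """Apply Eq. (2.9): a nonzero HD differential means consecutive MSBs differ.

    Returns two MSB sequences, one for each guess of T_0[127].
    """
    cands = []
    for guess in (0, 1):
        msbs = [guess]
        for k in range(1, len(hds)):
            changed = 1 if hds[k] != hds[k - 1] else 0
            msbs.append(msbs[-1] ^ changed)
        cands.append(msbs)
    return cands
```

With 128 noiseless Hamming distances, one of the two returned sequences should equal the true MSB sequence; each sequence can then be fed to the reconstruction of (2.4).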

A simulation of HW(Tj) and HDTj+1 with a random T0 is shown in Fig. 2.4. If Tj[127] = 1, HW(Tj) and HDTj+1 are represented as ◦, otherwise as ∗. For the Hamming weight model, we see that if Tj[127] = 1, the next Hamming weight changes. For the Hamming distance model, if Tj[127] is not equal to the previous one, the next Hamming distance changes.

[Figure 2.4 plots: top panel, Hamming weight HW(Tj) versus block sequence j; bottom panel, Hamming distance HDTj versus block sequence j; points are marked by Tj[127] = 0 or 1.]

Figure 2.4: Hamming weights/distances of block tweaks

2.4.2 Improved Tweak Recovery

The method described in Section 2.4.1 requires 128 horizontal tweak generations to recover the MSBs of the tweaks first and then to derive individual tweaks. We take the Hamming weight model as an example for its simplicity, while the method also applies to the Hamming distance model.

For one modular multiplication that generates Tj+1 from Tj, if Tj[127] = 1, the Hamming weight difference Δ between Tj+1 and Tj depends only on Tj[6], Tj[1] and Tj[0], because only these three bits are toggled (they land in positions 7, 2 and 1 of Tj+1). If any one of the three bits is zero, it is toggled to one in Tj+1, contributing an increase of 1 to Δ; otherwise, it contributes a decrease of 1. The relationship can be expressed as:

\[
\Delta = \sum_{k \in \{0, 1, 6\}} (1 - 2\,T_j[k])
\tag{2.10}
\]

When Tj[127] = 0, we gain no extra information because all the bit values stay the same and Δ = 0.

Assuming a Hamming-weight based power model without noise, the differential of two consecutive Hamming weights, Δ, can be read directly from the power trace. The first Δ determines the MSB, T0[127], and also the relationship between the three bits if T0[127] is one, according to Equation (2.10). Similarly, the second Δ determines the second MSB, T0[126], and also the relationship between the {0, 1, 6} bits of T1, which consist of T0[127], T0[0] (possibly flipped) and T0[5] due to the circular shifting. At most, the seven MSBs of T0 generate seven equations over the seven LSBs of T0. The problem can be formulated as a classical Boolean satisfiability problem and solved by a SAT solver.

Intuitively, when finding the values of the 7 LSB variables with at most 7 equations (we learn nothing if an MSB is zero), the solution would not be unique. However, there is only a very small number of solutions (legal values) on average. To get the probability distribution of the number of solutions, we simulate the tweak generation with a randomly generated initial tweak, construct the appropriate equations using Equation (2.10) for j = 0, 1, ..., 6, and plug them into a SAT solver to get all possible solutions. The probability distribution of the number of solutions for the 7 least significant bits is shown in Fig. 2.5. The most likely outcome is 8 possible values. The probability that the number of possible values is at most 32 is 94.34%. This means that when we launch the attack with only 121 power traces instead of 128, we have to iterate over at most 32 possible solutions most of the time, which is trivial for a modern computer.
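Because only seven unknown bits are involved, the candidate set can also be enumerated by brute force instead of a SAT solver. The sketch below swaps in exhaustive search over the 128 possible LSB patterns; it is an illustration under assumptions (hypothetical function names, noiseless Hamming weight differences, and the standard XTS tweak update taken as Equation (2.1)), not the SAT formulation used in the text.

```python
MASK128 = (1 << 128) - 1

def next_tweak(t):
    """Assumed form of Eq. (2.1): shift left, XOR 0x87 if the MSB was 1."""
    msb = t >> 127
    t = (t << 1) & MASK128
    return t ^ 0x87 if msb else t

def hw(x):
    return bin(x).count("1")

def observe_deltas(t0, steps=7):
    """Noiseless Hamming weight differences HW(T_{j+1}) - HW(T_j) for j = 0..steps-1."""
    deltas, t = [], t0
    for _ in range(steps):
        nxt = next_tweak(t)
        deltas.append(hw(nxt) - hw(t))
        t = nxt
    return deltas

def lsb_candidates(upper, deltas):
    """Enumerate all values of T0[6..0] consistent with the observed deltas.

    upper: T0 with its 7 least significant bits cleared (the 121 recovered bits).
    """
    consistent = []
    for low in range(128):
        if observe_deltas(upper | low, len(deltas)) == deltas:
            consistent.append(low)
    return consistent
```

The true LSB pattern is always among the returned candidates. When the seven MSBs of T0 are all zero, every delta is zero and all 128 patterns remain, which is the worst case mentioned above.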

[Figure 2.5 plot: probability versus number of legal values.]

Figure 2.5: Probability distribution of number of possible values for the 7 least significant bits

2.4.3 Block Tweak Leakage Analysis with Noise

Here we target an FPGA implementation, for which the Hamming distance is a more suitable power model. The physical side-channel measurements (e.g., power consumption) Pj are noisy observations of the targeted Hamming distances, Pj = ε·HDTj + Nj, j ≥ 1, where ε is the unit power consumption for one switching and Nj is the noise, normally following a Gaussian distribution denoted by N(c, σ²). The side-channel signal-to-noise ratio (SNR) is defined as the ratio between the variance of the deterministic component of the power consumption, ε·HDTj, and the variance of the random noise, Nj. For the 128-bit Tj, SNR = 32ε²/σ², and the SNR can be obtained empirically. The attacker can find the noise variance σ² by repeating the encryption of a sector at a certain logical address (the same i, and therefore the same T0 and the same subsequent modular multiplications) and computing the variance of the power measurement at a fixed position j, whose variation is due only to the noise. The attacker can also find the variance of the power consumption across different j within one sector, which is ε²·Var(HDTj) + σ² under the assumption that HDTj and Nj are independent. For simplicity, we consider the side-channel information normalized by ε, so that

\[
P_j = HDT_j + N_j \quad \text{for } j \ge 1.
\tag{2.11}
\]

With noisy observations, the maximum likelihood principle [21] leads us to search for the T0 value that maximizes the likelihood of the observed data. Under the assumption of white Gaussian noise, for an observed trace (P1, ..., Pn) within one sector this becomes the least squares estimator that minimizes the mean squared error (MSE):

\[
MSE = \frac{1}{n} \sum_{j=1}^{n} (P_j - HDT_j^{*})^2,
\tag{2.12}
\]

where HDTj* (j ≥ 1) is the Hamming distance value corresponding to the Tj* produced by the modular multiplication of (2.1) with a guessed value T0*. Since (Pj − HDTj*)² = (HDTj − HDTj* + Nj)², for large n, MSE(T0*) ≈ σ² for T0* = T0, and MSE(T0*) ≈ 64 + σ² for T0* ≠ T0. Therefore, when σ² is small (i.e., the SNR is large), the least squares estimator can distinguish the correct tweak value T0. The attacker can always increase the SNR by attacking the averages of r traces (P̄1, ..., P̄n), whose noise N̄j has variance σ²/r (i.e., the SNR is increased r times to 32r/σ²) according to the Central Limit Theorem.

Finding the MSE minimum in (2.12), however, is prevented by the computational complexity of enumerating all 2^128 possible values for T0. Therefore, we first recover the relationship

\[
\delta_j = T_j[127] \oplus T_{j-1}[127]
\tag{2.13}
\]

from the noisy power observations, and then reconstruct T0 as shown in Section 2.4.1.
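The least squares criterion of (2.12) is straightforward to evaluate for a given guess. The following sketch simulates a normalized noisy trace as in (2.11) and scores candidate tweaks. It is a hedged illustration: the function names, the seeded pseudo-random generator, and the use of the standard XTS tweak update for Equation (2.1) are assumptions made for the example.

```python
import random

MASK128 = (1 << 128) - 1

def next_tweak(t):
    """Assumed form of Eq. (2.1): shift left, XOR 0x87 if the MSB was 1."""
    msb = t >> 127
    t = (t << 1) & MASK128
    return t ^ 0x87 if msb else t

def hd_sequence(t0, n):
    """Return [HDT_1, ..., HDT_n] for the tweak chain starting at t0."""
    out, t = [], t0
    for _ in range(n):
        nxt = next_tweak(t)
        out.append(bin(nxt ^ t).count("1"))
        t = nxt
    return out

def noisy_trace(t0, n, sigma2, rng):
    """Normalized observations P_j = HDT_j + N_j with N_j ~ N(0, sigma2)."""
    sigma = sigma2 ** 0.5
    return [h + rng.gauss(0.0, sigma) for h in hd_sequence(t0, n)]

def mse(trace, t0_guess):
    """Mean squared error of Eq. (2.12) for one guessed initial tweak."""
    model = hd_sequence(t0_guess, len(trace))
    return sum((p - h) ** 2 for p, h in zip(trace, model)) / len(trace)
```

In the attack, `mse` would only be evaluated on the small set of candidate tweaks produced by the relationship recovery, never over the full 2^128 space.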

Relationship δj Recovery Using Maximum Likelihood Based Test: From Equation (2.9), δj = 0 when ΔHDj+1 = 0 and δj = 1 when ΔHDj+1 ≠ 0. We have to infer whether ΔHDj+1 is zero from the differential of two consecutive noisy power measurements:

\[
\Delta P_{j+1} = P_{j+1} - P_j = \Delta HD_{j+1} + \Delta N_{j+1}
\tag{2.14}
\]

where ΔNj+1 = Nj+1 − Nj follows the distribution N(0, σ̃² = 2σ²), and ΔHDj+1 ∈ {−3, −1, 0, 1, 3} with the corresponding probabilities {1/16, 3/16, 1/2, 3/16, 1/16}. Hence the absolute value of ΔPj+1 tends to be smaller when ΔHDj+1 = 0 than when ΔHDj+1 ≠ 0. We can apply a threshold TH to the observed |ΔPj+1|: when it is below the threshold, we decide ΔHDj+1 = 0 and δj = 0. The bitwise error rate (BER) of such recovered δj is therefore:

\[
\begin{aligned}
BER = {} & \tfrac{3}{8}\left[\Phi_{\tilde\sigma}(TH - 1) - \Phi_{\tilde\sigma}(-TH - 1)\right] \\
& + \tfrac{1}{8}\left[\Phi_{\tilde\sigma}(TH - 3) - \Phi_{\tilde\sigma}(-TH - 3)\right] + \Phi_{\tilde\sigma}(-TH)
\end{aligned}
\tag{2.15}
\]

where Φσ(·) denotes the cumulative distribution function of N(0, σ²). We can find the best TH and the minimal error rate for a given SNR by numerically minimizing (2.15). The result is shown in Table 2.1. When the SNR is large (SNR > 100), the optimal threshold can be solved analytically as TH ≈ 1/2 + log_e(8/3)·σ̃².

Table 2.1: Threshold and BER for Different SNR

SNR   8      16     32     64     128    256    512
TH    3.03   2.26   1.74   1.33   0.98   0.74   0.62
BER   0.460  0.430  0.385  0.330  0.266  0.182  0.093

With probability (1 − BER)^127, the noisy attack recovers all 127 δj bits correctly, as the noiseless attack of Section 2.4.1 does. Thus, for the noisy attack to achieve the same result as the noiseless attack with 99% probability, we need BER = 0.00008, which corresponds to SNR = 3750. As mentioned above, by attacking the averages of r traces, the SNR increases r times. With a large enough r, the attacker can ensure the attack matches the noiseless attack. The two initial guesses of T0[127], together with (δ1, ..., δ127), result in two guessed T0 values, and the correct one should minimize the MSE in (2.12).
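The numerical minimization of (2.15) can be reproduced with the standard library alone. The sketch below uses a simple grid search rather than whatever optimizer was used for Table 2.1; the function names and grid resolution are assumptions. It relies on the normalized model, where σ² = 32/SNR and therefore σ̃² = 2σ² = 64/SNR.

```python
import math

def phi(x, sigma):
    """CDF of N(0, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def ber(th, snr):
    """Bitwise error rate of Eq. (2.15) for threshold th at the given SNR."""
    s = math.sqrt(64.0 / snr)  # sigma-tilde, the std of the noise differential
    return (3.0 / 8.0 * (phi(th - 1, s) - phi(-th - 1, s))
            + 1.0 / 8.0 * (phi(th - 3, s) - phi(-th - 3, s))
            + phi(-th, s))

def best_threshold(snr, step=0.001, th_max=5.0):
    """Grid search for the threshold that minimizes the BER."""
    grid = [k * step for k in range(1, int(th_max / step) + 1)]
    th = min(grid, key=lambda t: ber(t, snr))
    return th, ber(th, snr)

def analytic_threshold(snr):
    """High-SNR approximation TH = 1/2 + ln(8/3) * sigma-tilde^2."""
    return 0.5 + math.log(8.0 / 3.0) * 64.0 / snr
```

For SNR = 512, `best_threshold` lands close to the TH and BER entries in the last column of Table 2.1, and `analytic_threshold` gives nearly the same threshold.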

Relationship δj Recovery Using the Bayesian Test: In our prior work [19], we only presented the maximum likelihood attack method. In this work, we further improve the attack success rate by adopting the Bayesian test [22]. The idea of the Bayesian test is to decide between two hypotheses H1 and H2 from the observation of a random variable Y by comparing the conditional probabilities p(H1|Y = y) and p(H2|Y = y). We decide in favor of H1 when p(H1|Y = y) > p(H2|Y = y); otherwise we choose H2. Here the two hypotheses are H1: Δ = 0 and H2: Δ ≠ 0. The observation is the power measurements. From the r traces, we construct two power measurement vectors across the r traces at two time points, Pj = {Pj,1, ..., Pj,r} and Pj+1 = {Pj+1,1, ..., Pj+1,r}, for recovering δj, where Pj,k is the power of computing Tj during the k-th repetition. We assume the noise samples in the measurements are i.i.d. (independent and identically distributed) from a zero-mean Gaussian distribution with variance σ². The distributions of Pj and Pj+1 are

\[
P_j \sim \mathcal{N}(HDT_j \mathbf{1}, \sigma^2 I)
\tag{2.16}
\]

\[
P_{j+1} \sim \mathcal{N}((HDT_j + \Delta)\mathbf{1}, \sigma^2 I)
\tag{2.17}
\]

The recovery of δj is equivalent to testing the two hypotheses. According to Bayesian theory, the optimal test is given by the ratio

\[
\frac{p(\Delta = 0 \mid P_j, P_{j+1})}{p(\Delta \ne 0 \mid P_j, P_{j+1})} \underset{H_2}{\overset{H_1}{\gtrless}} 1
\tag{2.18}
\]

It can be expanded in terms of the likelihood and a priori distribution as

\[
\frac{p(P_j, P_{j+1} \mid \Delta = 0)\, p(\Delta = 0)}{p(P_j, P_{j+1} \mid \Delta \ne 0)\, p(\Delta \ne 0)} \underset{H_2}{\overset{H_1}{\gtrless}} 1
\tag{2.19}
\]

where we consider that p(Δ = 0) = p(Δ ≠ 0) = 0.5. From Equations (2.16) and (2.17), we derive

\[
p(P_j, P_{j+1} \mid \Delta = 0) = \prod_{n=1}^{2r} \mathcal{N}(P_n; HDT_j, \sigma^2)
\tag{2.20}
\]

where Pn is the n-th element of the concatenated vector {Pj, Pj+1}.

\[
p(P_j, P_{j+1} \mid \Delta \ne 0) = \sum_{k \in \{-3, -1, 1, 3\}} \frac{p(P_j, P_{j+1} \mid \Delta = k)}{p(\Delta \ne 0)}\, p(\Delta = k)
\tag{2.21}
\]

\[
p(P_j, P_{j+1} \mid \Delta = k) = \prod_{n=1}^{r} \mathcal{N}(P_n; HDT_j, \sigma^2) \prod_{n=1+r}^{2r} \mathcal{N}(P_n; HDT_j + k, \sigma^2)
\tag{2.22}
\]

In Equations (2.20) and (2.22), we assume the power measurements are i.i.d., such that the probability of {Pj, Pj+1} is the product of the probabilities of each Pn. In Equation (2.21), we calculate the probability as a weighted sum over the cases k = −3, −1, 1, 3. We also know p(Δ = −3) = p(Δ = 3) = 1/16 and p(Δ = −1) = p(Δ = 1) = 3/16. We get a final equation by plugging in the equations:

\[
\frac{p(\Delta = 0 \mid P_j, P_{j+1})}{p(\Delta \ne 0 \mid P_j, P_{j+1})} = \frac{p_0}{E_1 + E_2} \underset{H_2}{\overset{H_1}{\gtrless}} 1
\tag{2.23}
\]

\[
E_1 = 2 p_1 \exp\!\left(\frac{-n}{2\sigma^2}\right) \cosh\!\left(\frac{1}{\sigma^2} (P_{j+1} - HDT_j \mathbf{1})^{\top} \mathbf{1}\right)
\tag{2.24}
\]

\[
E_2 = 2 p_3 \exp\!\left(\frac{-9n}{2\sigma^2}\right) \cosh\!\left(\frac{3}{\sigma^2} (P_{j+1} - HDT_j \mathbf{1})^{\top} \mathbf{1}\right)
\tag{2.25}
\]

where p0 = 0.5, p1 = 3/16 and p3 = 1/16, and n in (2.24) and (2.25) is the number of measurements in Pj+1 (that is, r). Note that the decision is independent of Pj. In practice, the noise variance σ² can be accurately measured from the device, and HDTj can be estimated as the mean of Pj. By using this Bayesian approach, we avoid estimating the threshold TH and can reduce the error rate. The bit error rate of the Bayesian test and our previous ML-based test is shown in Fig. 2.6 for different SNRs. From the figure, we can see that the Bayesian test always has a lower bit error rate, especially when the equivalent SNR is low.

[Figure 2.6 plot: bit error rate versus equivalent SNR for the ML-based test and the Bayesian test.]

Figure 2.6: Bit error rate of Bayesian test and ML-based test

Trade-off between Search Complexity and Traces Used: We can lower the number of traces r needed by not insisting that all δj are recovered correctly, as in the noiseless attack, for both the maximum likelihood based (ML-based) test and the Bayesian test. Assuming there are up to m flipped (wrong) bits in (T0[127], δ1, ..., δ127), we search for the minimum MSE over the recovered T0* values. Searching among m bit flips (T0[127] and (m−1) δj bits) requires a data complexity of 2 × C(127, m−1) = 2 × 127!/((m−1)!(128−m)!), as listed in Table 2.2. If the highest allowable complexity is 2^64, then we can search among at most m = 16 bit flips. The correct tweak value is included in this search space if there are at most 15 incorrect δj bits. To ensure at most 15 errors with 99% probability, we need BER = 0.066, which corresponds to SNR = 650 for the ML-based test and SNR = 350 for the Bayesian test. Compared to the SNR = 3750 for the ML-based test and 1300 for the Bayesian test needed for correctly recovering all δj bits, this expanded search trades complexity for a reduction in SNR of about six and four times, respectively (corresponding to six and four times fewer traces).
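The Bayesian decision of Equations (2.23) to (2.25) can be sketched as follows. This is a minimal illustration under several assumptions: the function names are hypothetical, the count written n in (2.24) and (2.25) is taken to be the number of repetitions r, HDTj is estimated as the mean of Pj as suggested above, and the ratio is evaluated in the log domain because cosh overflows for the argument sizes that arise with hundreds of traces.

```python
import math
import random

P0, P1, P3 = 0.5, 3.0 / 16.0, 1.0 / 16.0  # priors for |Delta| = 0, 1, 3

def log_cosh(x):
    """Numerically stable log(cosh(x))."""
    x = abs(x)
    return x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)

def log_ratio(pj, pj1, sigma2):
    """log of p0 / (E1 + E2) from Eqs. (2.23)-(2.25)."""
    r = len(pj1)
    hd_est = sum(pj) / len(pj)            # estimate of HDT_j
    s = sum(p - hd_est for p in pj1)      # (P_{j+1} - HDT_j 1)^T 1
    log_e1 = math.log(2 * P1) - r / (2 * sigma2) + log_cosh(s / sigma2)
    log_e2 = math.log(2 * P3) - 9 * r / (2 * sigma2) + log_cosh(3 * s / sigma2)
    m = max(log_e1, log_e2)               # log-sum-exp of the two terms
    log_den = m + math.log(math.exp(log_e1 - m) + math.exp(log_e2 - m))
    return math.log(P0) - log_den

def decide_delta_bit(pj, pj1, sigma2):
    """Return 0 if the test favors Delta = 0 (H1), otherwise 1 (H2)."""
    return 0 if log_ratio(pj, pj1, sigma2) > 0 else 1

def simulate(hd, delta, r, sigma2, rng):
    """Draw r noisy measurements at two consecutive time points."""
    sd = math.sqrt(sigma2)
    pj = [hd + rng.gauss(0.0, sd) for _ in range(r)]
    pj1 = [hd + delta + rng.gauss(0.0, sd) for _ in range(r)]
    return pj, pj1
```

A typical call is `decide_delta_bit(*simulate(64.0, 1, 200, 1.0, random.Random(1)), 1.0)`, which compares the two measurement vectors without any hand-tuned threshold.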

Table 2.2: Complexity of Search Among Erroneous Bits

Bits               1     2     4      8      16
Complexity (log2)  1.0   8.0   19.4   37.5   64.5
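The search-space sizes can be computed directly from the binomial expression in the text. In the sketch below, `exact_flips` implements 2 × C(127, m−1) as written; `up_to_flips` sums over all smaller flip counts as well, which is an assumption about how a search over "up to m" flips would be budgeted. Both function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math

def exact_flips(m):
    """Candidates with exactly m flipped bits: the T0[127] guess plus (m - 1) delta bits."""
    return 2 * math.comb(127, m - 1)

def up_to_flips(m):
    """Candidates with at most m flipped bits (cumulative count, an assumed budgeting)."""
    return 2 * sum(math.comb(127, k) for k in range(m))

def log2_complexity(count):
    """Search complexity expressed in bits, rounded to one decimal."""
    return round(math.log2(count), 1)
```

For example, `log2_complexity(up_to_flips(4))` evaluates to 19.4 and `log2_complexity(up_to_flips(2))` to 8.0.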

2.4.4 Experimental Results

We implemented the XTS-AES algorithm on a SASEBO-GII (FPGA) board [23]. We first send the fixed sector address to the FPGA, which sets the unit tweak i. Then a sector of data (256 blocks) is sent to the FPGA for encryption. The first-phase encryption encrypts the unit tweak i with KeyT to generate the first block tweak T0. For the subsequent second-phase encryption, we use 128 consecutive modular multiplications to compute the corresponding block tweaks used by the data encryption. A trigger signal is associated with every block tweak generation and block encryption, allowing us to record one power trace for each data block in a unit. The first cycle of the power trace performs the modular multiplication. We pick the point of interest in the power trace where the variance is the largest in that cycle as Pj, and normalize it to fit Equation (2.11). To measure the variance of the noise, we keep both the sector address and data constant, repeat the encryption of the first two blocks multiple times, and calculate the variance at the point of interest as the estimate of the noise variance. On the FPGA, the noise variance is 3 (SNR = 10.7), and the distribution of ΔPj conditioned on ΔHDj is shown in Fig. 2.7. The distributions are distinct for different ΔHD, but the overlap is significant, especially for ΔHD ∈ {−1, 0, 1}. With such a small SNR (< 650), the attack cannot succeed even with the aforementioned expanded search. We then attack the average of 500 traces (which increases the SNR to above 5000).

For the ML-based test method, the averaged ΔPj and ΔHDj for 2 ≤ j ≤ 128 are shown in Fig. 2.8. For each j, they are very close to each other. With a threshold of 0.5 (two symmetrical lines at 0.5 and −0.5), the zero-valued and non-zero-valued ΔHDj can be determined correctly by comparing the observed |ΔPj| with the threshold. This essentially becomes a noiseless attack, as predicted in the previous analysis, and the true value of T0 is recovered.

For the Bayesian test method, we use Equation (2.23) to determine whether the Hamming distance has changed. If p(Δ = 0 | Pj, Pj+1) / p(Δ ≠ 0 | Pj, Pj+1) > 1, it means ΔHD = 0, and we set δj = 0; otherwise, δj = 1. The result is shown in Fig. 2.9 with 250 traces, half the number used for the ML-based test. This method is more robust than the ML-based test. First, there is no need to pick a threshold TH, whose value is critical for the ML-based attack. Second, the margin for deciding whether ΔHD is zero is much larger, yielding a higher success rate.

[Figure 2.7 plot: probability versus power difference, one curve for each ΔHD ∈ {−3, −1, 0, 1, 3}.]

Figure 2.7: Distribution of power difference ΔPj

[Figure 2.8 plot: ΔHDj, averaged ΔPj, and the threshold lines versus block sequence j.]

Figure 2.8: Comparison of ΔHDj and ΔPj

2.5 Vertical Attack of Hardware Implementation: CPA on XTS-AES

Without obtaining the T0 value from the first attack, we alternatively propose a variant of CPA on the second-phase AES encryption with KeyE. This attack requires a number of encryption runs, each with the same sector address and block address (therefore the same block tweak) but with different plaintexts; we call it a vertical attack as it involves multiple plaintexts at the same block. In this vertical attack, we first recover the tweaked last round key (i.e., the XOR result of a tweak and the key), and propose a new method to recover both the encryption key and the block tweak without attacking the second-last round.

[Figure 2.9 plot: Hamming distance differences and Bayesian test results versus block sequence j.]

Figure 2.9: Experiment results with Bayesian test

For simplicity, we fix the block tweak to T0 in the analysis. While the objective is to recover the last round key Rkey and the block tweak T0, the CPA first recovers the tweaked key Rkey ⊕ T0, denoted as RT0. We analyze the Hamming distance leakage model. Depending on the implementation, there are two kinds of Hamming distance leakage. If the implementation does not store the encryption output, CCj, in a register, but only stores the tweaked ciphertext, Cj, the Hamming distance is between the last round input state and the ciphertext Cj. A straightforward CPA will recover the tweaked last round key RT0 directly. If the implementation stores every AES round state, i.e., including CCj, then the last-round Hamming distance is between the input state and CCj, which has to be obtained from the known ciphertext Cj and T0. The CPA attack has to guess both RT0 and T0. We analyze the second case in detail.

Denote HDi as the attacked Hamming distance intermediate value related to Rkey[i], the i-th byte of the last round key. Then HDi can be expressed as follows.

\[
Sin_i = \mathrm{InvSubByte}(RT_0[j] \oplus C[j])
\tag{2.26}
\]

\[
Sout_i = C[i] \oplus T_0[i]
\tag{2.27}
\]

\[
HD_i = HW(Sin_i \oplus Sout_i),
\tag{2.28}
\]

where j = ShiftRow(i) is the byte position corresponding to the i-th byte after the ShiftRow operation. In addition to two known ciphertext bytes, C[i] and C[j], HDi involves two unknown bytes, RT0[j] and T0[i].

CPA finds the maximum correlation between a power measurement value and HDi over the 2^16 possible values of the combination of RT0[j] and T0[i], which will reveal both RT0[j] and T0[i].

However, due to the linear relationship between HDi and T0[i], usually only the RT0[j] value is recovered reliably, and the CPA always leaves multiple candidates for T0[i]. We launch this CPA on our SASEBO-GII implementation with 20K traces. Each power trace is measured with a different plaintext at the same sector address. The SNR for the second-phase data encryption is 0.128, with the 16-byte Hamming distance considered as the signal. Correlation coefficients are calculated with guesses of RT0 and T0. Fig. 2.10 shows the results of CPA on the first byte of the tweaked last round key. When T0[0] is fixed to the correct value, the correlation coefficient of the correct RT0[0] is clearly distinguished from other guesses. However, when we fix RT0[0] to the correct value and examine the correlation coefficients of different T0[0] guesses, they change in a zigzag way, and the correct value does not even yield the highest correlation. Based on this observation, the CPA attack can only identify the correct RT0[0] value with the maximum correlation.
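The hypothetical intermediate value of Equations (2.26) to (2.28) can be written out as below. This is a sketch under assumptions, not the measurement code: the names are hypothetical, the AES state is assumed to use the standard column-major layout (byte index = row + 4 × column), ShiftRow(i) is taken to be the position to which input byte i moves, and the S-box is generated from its algebraic definition rather than copied as a table.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def build_sboxes():
    """Generate the AES S-box and its inverse."""
    sbox = [0] * 256
    for a in range(256):
        inv = 0
        if a:
            inv = 1
            for _ in range(254):          # a^254 is the multiplicative inverse of a
                inv = gf_mul(inv, a)
        y = inv
        for s in range(1, 5):             # affine map: XOR with four left rotations
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        sbox[a] = y ^ 0x63
    inv_sbox = [0] * 256
    for a, v in enumerate(sbox):
        inv_sbox[v] = a
    return sbox, inv_sbox

SBOX, INV_SBOX = build_sboxes()

def shift_row_index(i):
    """Position j that state byte i occupies after ShiftRows (assumed column-major layout)."""
    row, col = i % 4, i // 4
    return row + 4 * ((col - row) % 4)

def hd_model(c, rt_guess, t_guess, i):
    """HD_i of Eq. (2.28) for ciphertext bytes c and guesses of RT0[j] and T0[i]."""
    j = shift_row_index(i)
    s_in = INV_SBOX[rt_guess ^ c[j]]      # Eq. (2.26)
    s_out = c[i] ^ t_guess                # Eq. (2.27)
    return bin(s_in ^ s_out).count("1")
```

In a CPA, `hd_model` would be evaluated for every trace and every one of the 2^16 guess pairs, and the resulting columns correlated against the measured power at the last-round register update.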

[Figure 2.10 plots: (a) correlation coefficient versus RT0[0] with T0[0] fixed to the correct value; (b) correlation coefficient versus T0[0] with RT0[0] fixed to the correct value. The correct value is marked in each panel.]

Figure 2.10: Correlation coefficient with T[0] and RT[0]

To further recover the encryption key and block tweak, we propose a solution that repeats the attack on the next data block, with the block tweak T1. The two CPAs then reveal RKey ⊕ T0 and RKey ⊕ T1, respectively. By XORing these two results, we recover δ = T0 ⊕ T1. To further retrieve the target tweak T0, we consider two methods. The first method is to treat this as a Boolean satisfiability problem [24], using the relationship T1 = T0 ⊗ 2. Considering every bit of T0 as a Boolean variable, we have 128 equations for 128 variables, and thus T0 can be solved by a SAT solver with a unique solution.

Alternatively, we propose an efficient algorithm to find T0 by utilizing the properties of modular multiplication. We start with two guesses, T0[127] equal to 0 or 1, and let T0^0 and T0^1 denote the two corresponding possible values for T0. As shown in Section 2.4.1, if T0[127] = 0, δ = T0 ⊕ (T0 ⋘ 1), where ⋘ denotes a circular left shift, so every bit of δ is equal to the XOR of two consecutive bits of T0, and T0 can be recovered bit by bit from T0[126] to T0[0]. If T0[127] = 1, δ = T0 ⊕ (T0 ⋘ 1) ⊕ 0x86; we can obtain T0 ⊕ (T0 ⋘ 1) from δ ⊕ 0x86, and recover all the bits of T0 similarly. We then use the last bit of δ to select the correct T0 value out of the two options, because δ[0] = T0^0[0] ⊕ T0^0[127] when T0[127] = 0. Overall, the first 127 bits of δ are used to derive the remaining 127 bits of T0, and the last bit is used as a redundancy check to filter out the incorrect guess.
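The bit-by-bit recovery of T0 from δ = T0 ⊕ T1 can be sketched as follows. The function names are hypothetical, tweaks are 128-bit integers with bit 127 as the MSB, and T1 is assumed to be produced by the standard XTS update (shift left, XOR 0x87 when the MSB was 1), which equals a circular left shift XORed with 0x86 in that case.

```python
MASK128 = (1 << 128) - 1

def next_tweak(t):
    """Assumed tweak update: T1 = T0 multiplied by x in GF(2^128)."""
    msb = t >> 127
    t = (t << 1) & MASK128
    return t ^ 0x87 if msb else t

def recover_t0_from_delta(delta):
    """Recover T0 from delta = T0 xor T1 by trying both guesses of T0[127]."""
    for guess in (0, 1):
        # Remove the feedback constant so that d = T0 xor rotl(T0, 1).
        d = delta ^ (0x86 if guess else 0)
        bits = [0] * 128
        bits[127] = guess
        # d[i] = T0[i] xor T0[i-1], so walk down from bit 127 to bit 1.
        for i in range(127, 0, -1):
            bits[i - 1] = ((d >> i) & 1) ^ bits[i]
        # Redundancy check on the remaining bit: d[0] = T0[0] xor T0[127].
        if (d & 1) == bits[0] ^ bits[127]:
            return sum(b << i for i, b in enumerate(bits))
    return None
```

Exactly one guess passes the redundancy check: the bits of T0 ⊕ (T0 ⋘ 1) always have even parity, while 0x86 has three set bits, so the wrong guess yields odd parity and is rejected.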

Both proposed attacks on XTS-AES recover the block tweak T0 at a fixed sector address. The horizontal attack targets the modular multiplication, collecting power traces within a sector for consecutive block tweak generations. Because the information is retrieved in a simple power analysis (SPA) manner, the SNR requirement is relatively high, but far fewer power traces are required. The vertical attack is launched on the encryption of multiple plaintexts (the first two data blocks in each sector). The CPA (differential) has a higher tolerance of noise, but more traces are needed to succeed.

2.6 Countermeasures

Both software and hardware implementations of XTS-AES are vulnerable to side-channel power analysis. Countermeasures against such attacks have to be designed for the specific implementation and attack. In this section, we propose countermeasures, and present leakage and security evaluation of the implementations.

For simple power analysis targeting the software implementation, we can modify the algorithm to balance the running time of updating the tweaks for different values of the MSB of T. A common solution is to introduce a dummy XOR operation if the MSB of T is 0. This protection adds one more logic operation to a tweak update when the most significant bit is zero. On average, it adds 0.5 logic operations per tweak update. Fig. 2.11 shows the power traces with the dummy XOR protection. The power difference for different Tj[127] is minimal and is caused only by noise. With this protection, the attacker can no longer extract the value of T directly from the power trace.

[Figure 2.11 plot: normalized voltage versus sampling points for Tj[127] = 1 and Tj[127] = 0.]

Figure 2.11: Power difference for Tj[127] = 1 and 0 with dummy XOR protection

For the horizontal attack on hardware implementations, the leakage comes from the change of Hamming weight and distance depending on the MSB of T. Hiding countermeasures can be designed to make the Hamming weight or distance always change regardless of the MSB of T. For example, at the algorithm level, a dummy bit can be added to T which flips when the MSB of T is 0 and stays unchanged otherwise during the updating of T. This protection in hardware increases the register width from 128 bits to 129 bits, consuming slightly more hardware resources. Also, because the extra bit flips when the MSB of T is 0, it increases the power consumption. Fig. 2.12 shows the attack with such protection. When ΔHDj = 0, the power difference ΔPj is close to either 1 or −1. As a result, the attacker can no longer tell the value of Tj[127], since the power difference always changes regardless of the value of Tj[127]. When ΔHDj ≠ 0, the protection has no effect on the Hamming weight and distance change, and we still see a Hamming distance change of ±3. Masking (XORing with a random number) can also obfuscate the dependency of the Hamming weight change on the MSB of T.

For the vertical attack, we only take advantage of the relationship between T0 and T1 on top of the classic correlation attack. Since we cannot change the relationship between T0 and T1 within XTS-AES, we have to rely on countermeasures for the underlying AES to defend against such an attack. Existing methods, including hiding and masking, will render AES resistant to power analysis attacks, and therefore XTS-AES resistant to the vertical attack.

[Figure 2.12 plot: ΔHDj, averaged ΔPj, and the threshold versus block sequence j.]

Figure 2.12: Comparison of ΔHDj and ΔPj with dummy bit protection

2.7 Summary

In this chapter, we explore vulnerabilities of the XTS-AES algorithm to side-channel power analysis. For the software implementation, we analyze its simple power analysis vulnerability arising from the conditional branch, and propose a simple fix. For the hardware implementation, we design two different attacks to retrieve the block tweaks and therefore the two encryption keys. Through power analysis of the modular multiplication, i.e., the horizontal attack, we can obtain the block tweak value T0 by SPA with error detection and correction. We apply two statistical test methods: the ML-based test and the Bayesian test. In our CPA, i.e., the vertical attack, we attack the first two data blocks of each sector and extract the tweaked key values, T0 ⊕ Rkey and T1 ⊕ Rkey. We then utilize the relationship between T0 and T1 to recover T0 and the round key Rkey with very low complexity. We also propose efficient countermeasures to mitigate these attacks. Our experimental results show that XTS-AES is vulnerable to side-channel analysis, and that countermeasures should be carefully designed and deployed to ensure its security.

Chapter 3

Side-channel Analysis of AES on GPU

3.1 Introduction and Motivation

Graphics Processing Units (GPUs), originally designed for 3-D graphics rendering, have evolved into high performance general purpose processors. Today, a GPU can provide significant performance advantages over traditional multi-core CPUs by executing workloads in parallel on hundreds to thousands of cores. This development has been spurred by the delivery of programmable shader cores and high-level programming languages [25], including CUDA and OpenCL. Since then, GPUs have been used to accelerate a wide range of applications [26], including signal processing, circuit simulation, molecular modeling and machine learning.

Motivated by the demands of efficient cryptographic computation over large amounts of data, GPUs are now being leveraged to accelerate a number of cryptographic algorithms. Before the introduction of CUDA and OpenCL, Cook et al. [27][28] made the first efforts to map an AES cipher to a fixed graphics pipeline using OpenGL. Using CUDA, Manavski [29] implemented AES on an NVIDIA G80 GPU, achieving a speedup as high as 5.9 times compared to the fastest CPU at the time. Iwai et al. achieved a throughput of approximately 35 Gbps (gigabits per second) on an NVIDIA GeForce GTX285 [30]. Li et al. [31] achieved the highest performance, around 60 Gbps throughput on an NVIDIA Tesla C2050 GPU, which runs up to 50 times faster than an Intel Core i7-920. More recent work accelerated asymmetric ciphers by exploiting the power of GPUs [32]. Gilger et al. [33] implemented multiple block ciphers in both CUDA and OpenCL. This provided an OpenSSL cryptographic engine solution that could easily accelerate common ciphers and thus reduce the development effort.

While the focus of prior work has been on accelerating cryptographic implementations by leveraging a GPU's computation power, there is little prior work that addresses the security of execution on a GPU. Di Pietro et al. [34] demonstrated that leakage of information can occur in a GPU's shared memory, global memory and registers by using standard CUDA instructions. Maurice et al. [35] recovered data of a previously executed GPU application in a virtualized environment. Lombardi et al. [36] described how a GPU-as-a-Service in the Cloud can be misused and lead to denial-of-service attacks and information leakage. However, side-channel vulnerabilities of GPUs have received limited attention in the research community. Meanwhile, cryptographic systems based on other platforms, including micro-controllers [37], smart cards [38], application-specific integrated circuits (ASICs) [39] and FPGA platforms [40, 41], have all been shown to be highly vulnerable to side-channel attacks.

We are the first to conduct research on side-channel analysis of GPUs. In our prior work [42], we presented the first power analysis of AES on a GPU, demonstrating the feasibility of an attack. Our group also launched the first timing attack on AES on a GPU [43]. Distinct from other computational platforms, the Single Instruction Multiple Thread (SIMT) model used on a GPU presents a range of challenges to side-channel analysis. During execution, each thread can be in a different phase of execution, generating some degree of randomness (i.e., timing uncertainties and misalignment of power traces). In addition, the complexity of the GPU hardware system makes it rather difficult to obtain clean and synchronized power traces, and the power consumption model is very complicated. To address these challenges, we develop effective methods to obtain clean power traces, and build a suitable side-channel power leakage model to guide a successful power analysis attack.
Our correlation power analysis (CPA) attack [42] demonstrates that AES-128 developed in CUDA on an NVIDIA C2070 GPU is susceptible to power analysis attacks.

3.2 Preliminaries

In this section we begin by describing the GPU hardware architecture and associated software programming model. Then we review the specific AES cipher considered in this work and its implementation in CUDA on an NVIDIA GPU. We also review the basics of correlation power analysis attacks, followed by the attack model we use in this work.

3.2.1 GPU Basics

CUDA is a parallel computing platform developed by NVIDIA, and it is also an application programming interface for their GPUs [44]. CUDA source code is divided into two components, host code and device code. The host code is run on the CPU (typically C/C++ code), and the device code is executed on the GPU utilizing a number of parallel threads. A group of threads runs the same kernel, with each thread processing different data. Threads are organized into blocks and blocks into a grid, as shown in Fig. 3.1. A thread is indexed both by its block id and thread id, which can be used to specify the data it works on. Thread scheduling is managed by the GPU according to the availability of hardware resources and the degree of parallelism inherent in the kernel.

Figure 3.1: Typical CUDA threads and blocks present in a single grid [1]. [Diagram: a grid of Block(0,0) through Block(2,1), with Block(1,1) expanded into Thread(0,0) through Thread(3,2).]

In terms of the hardware structure, a GPU consists of several Streaming Multiprocessors (SMs). Each SM can work as a completely independent processor, having its own register file, local cache, and control unit. In an SM, there are many Streaming Processors, or CUDA cores, the main computation units on which threads run in parallel. The SM also includes additional hardware resources, including a warp scheduler and dispatch unit, which are needed to control the flow of instructions. The configuration of the SMs and CUDA cores varies across GPU models. While our work here specifically targets an NVIDIA TESLA C2070, which has 14 SMs and 448 CUDA cores (32 per SM), our attacks can be successful on a large number of different GPUs. The structure of one SM of a TESLA GPU is shown in Fig. 3.2.

Figure 3.2: Block diagram of a TESLA C2070 streaming multiprocessor [1]. [Diagram: instruction cache, two warp schedulers with dispatch units, register file (32,768 x 32-bit), CUDA cores with LD/ST and SFU units, interconnect network, 64 KB shared memory / L1 cache, and uniform cache.]

During execution, thread blocks are dispatched to the SMs such that the threads in the same block can communicate through local shared memory. Within one thread block, 32 threads are grouped in a warp. A warp is the smallest schedulable program unit, and threads in one warp run in synchronization. If a warp in execution is stalled due to data dependencies or control hazards, the warp scheduler will dispatch another warp in order to make the best use of hardware resources.

In summary, the host code prepares the data and sets up the runtime environment for the kernel to run on the GPU. The programmer writes the device code, which explicitly divides the job into blocks and threads. The GPU scheduler decides when and where the blocks and threads will run in parallel, based on the available hardware resources, the data dependencies present in the code, and the presence of any memory conflicts. For a well-designed CUDA program, data dependencies and memory conflicts should be minimized, and the GPU should use all of its available resources and maximize parallel kernel execution.
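The indexing and grouping described above can be illustrated with a small Python sketch. It only mimics the arithmetic; it is not CUDA code. The function names and the one-dimensional block layout are assumptions made for illustration, while the warp size of 32 and the C2070 core counts are taken from the text.

```python
WARP_SIZE = 32      # threads per warp, as stated in the text
NUM_SMS = 14        # TESLA C2070, from the text
CORES_PER_SM = 32   # TESLA C2070, from the text


def global_thread_index(block_id, thread_id, threads_per_block):
    """Flat index of a thread in a one-dimensional grid (illustrative layout)."""
    if not 0 <= thread_id < threads_per_block:
        raise ValueError("thread_id out of range for this block")
    return block_id * threads_per_block + thread_id


def warp_of(thread_id):
    """Warp number of a thread inside its block."""
    return thread_id // WARP_SIZE


def warps_per_block(threads_per_block):
    """Number of warps needed to hold one block, rounded up."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    print(NUM_SMS * CORES_PER_SM)            # total CUDA cores on the card
    print(global_thread_index(2, 5, 256))    # thread 5 of block 2
    print(warp_of(70), warps_per_block(256))
```

A thread would typically use such a flat index to select the data element it works on; which warp is running at a given moment is decided by the hardware scheduler and is not visible in this sketch.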


3.2.2 AES and a CUDA implementation of AES

AES [45] is a block cipher algorithm announced as the encryption standard by the National Institute of Standards and Technology in 2001. It is a symmetric-key cipher that operates on a fixed-size block of data. AES consists of a variable number of rounds, depending on the key length. For key sizes of 128, 192 and 256 bits, AES has 10, 12 and 14 rounds, respectively. For AES-128, one block of data is organized as a 4x4 array of bytes, termed the state. Each round is a sequence of four operations: SubByte, ShiftRow, MixColumn, and AddRoundKey, except for the initial and last rounds. The initial round has only an AddRoundKey, and the last round omits the MixColumn. All the round keys are derived from a single initial key by the key schedule.

We implement an ECB (Electronic Code Book) mode AES-128 encryption using a CUDA kernel based on the reference implementation by Margara [46]. The T-table version of AES [45] is adopted, which is more efficient than the original byte-based SBox version. Our analysis is also applicable to other implementations with minor modifications. The three operations, SubByte, ShiftRow and MixColumn, are integrated into T-table lookups and XOR operations. Each thread is responsible for computing one column of a 16-byte AES state, so we need 4 threads to process one whole block of data. Note that the aforementioned GPU thread block is different from the 16-byte AES data block, which is iteratively updated in each round, transforming the plaintext input into the ciphertext output.

Fig. 3.3 shows the round operations for one column running as a single thread. The initial round is simply an XOR of the plaintext and the first round key. There are nine middle rounds for the 128-bit AES encryption. Each thread takes one diagonal of the state as its round input, and maps each byte into a 4-byte word through a T-table lookup. These four 4-byte words are XORed together with the corresponding 4-byte round key word, and the result is stored in a column of the output state. The last round has no MixColumn operation, so only one out of four bytes is kept after the T-table lookup, making it equivalent to an SBox lookup and a ShiftRow. AddRoundKey is then performed on the four remaining bytes.

To begin an AES encryption, the plaintext is first copied into the GPU global memory. Each thread loads its own data into local memory, based on its block id and thread id. After encryption is complete, the ciphertext in local memory is copied back into global memory, and then copied into CPU memory. In ECB mode, the encryption of each block of data is independent, and thus can be parallelized as much as possible, depending only on the size of the data and the available GPU resources.
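To make the T-table idea concrete, the following Python sketch generates the AES SBox from its definition and builds one T-table word from it. It is a minimal illustration, not the CUDA kernel used in this work: the helper names are hypothetical, and the byte order inside the word (2·S, S, S, 3·S from the most significant byte) is an assumption that matches the common first encryption table; the other three tables are byte rotations of it.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES SBox: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def t0_entry(x):
    """One 4-byte T-table word, assumed layout (2*S[x], S[x], S[x], 3*S[x])."""
    s = SBOX[x]
    return (gf_mul(2, s) << 24) | (s << 16) | (s << 8) | gf_mul(3, s)


def last_round_byte(x, key_byte):
    """Last round: keep only a plain SBox byte of the word, then add the key byte."""
    s_out = (t0_entry(x) >> 8) & 0xFF
    return s_out ^ key_byte
```

A middle round XORs four such words (one per input byte, each from its own rotated table) with the round key word, whereas the last round masks out all but one byte, which is why it reduces to an SBox lookup.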

Figure 3.3: The round operation running as one thread. [Diagram panels: initial round (plaintext XORed with the round key), middle rounds (T-table lookups XORed with the round key), and last round (T-table lookups XORed with the round key to give the ciphertext).]

3.2.3 Side-channel Attack and Typical Correlation Power analysis

A side-channel attack is a type of attack based on information gained from the physical implementation of a cryptosystem. Side-channel information can include power consumption, electromagnetic emanation, timing information, and even sound [47]. Because the leaked information depends on the secret key, an attacker can utilize correlation to recover the key with a complexity less than brute force. The attack can be as simple as Simple Power Analysis (SPA) [48], which uses only a single power trace and reads the key bits directly by inspecting the temporal power variation. It can also be as complicated as Mutual Information Analysis [4], which is based on information theory.

We use the Correlation Power Analysis (CPA) method to extract keys of AES-128. It is based on the correlation between the observed power information generated by the hardware and the power estimation calculated from a power model (which is a function of the key). To calculate the correlation, the attacker runs the cipher multiple times with different input plaintexts, and each run generates a power trace. For a block cipher, the processing of each byte is independent of the others, and therefore the attack can be conducted in a divide-and-conquer manner, retrieving the subkey bytes one by one. The power model estimates the deterministic part of the power consumption, e.g., the Hamming distance model for CMOS technology [49], which computes the number of logic changes (i.e., 0-to-1 or 1-to-0) based on the known plaintext (or ciphertext) and a guessed subkey byte value. Equipped with a large enough number of power traces, we compute the Pearson correlation coefficient between the trace data and the computed data for each guessed subkey value. If the subkey guess is right, the calculated correlation tends to be higher than when the subkey guess is wrong. For a subkey byte, one out of 256 possible values will be identified. For AES-128 with 16 bytes of key, the total number of subkey guesses is only 4096 (= $2^8 \times 16$), much lower than the complexity of $2^{128}$ for a brute force attack.
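The divide-and-conquer procedure can be sketched on simulated data. The Python example below is a toy under stated assumptions, not the attack on the GPU: the leakage is assumed to be one power value per run equal to HW(c XOR k) plus Gaussian noise, and all function names are hypothetical. It shows the Pearson correlation being computed for each of the 256 guesses of one subkey byte.

```python
import math
import random


def hamming_weight(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def simulate_traces(key_byte, n_traces, noise_sigma, rng):
    """Toy leakage (assumption): power = HW(c XOR k) + Gaussian noise."""
    data = [rng.randrange(256) for _ in range(n_traces)]
    power = [hamming_weight(c ^ key_byte) + rng.gauss(0, noise_sigma) for c in data]
    return data, power


def cpa_best_guess(data, power):
    """Return the subkey guess whose power model correlates best with the traces."""
    scores = []
    for guess in range(256):
        model = [hamming_weight(c ^ guess) for c in data]
        scores.append(pearson(model, power))
    return max(range(256), key=lambda g: scores[g])


def search_cost(n_key_bytes=16):
    """Number of subkey guesses when each byte is attacked separately."""
    return 256 * n_key_bytes
```

In this toy the simulated power grows with the model, so the correct guess has the largest positive correlation; with a voltage measurement that drops as power rises, the sign would be reversed.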

3.2.4 Attack Model

For the attack model, we make the following assumptions:

• We assume the attacker knows the encrypted ciphertext. For part of our analysis, we assume knowledge of the plaintext, where the size and value of the input message can be controlled.

• The attacker can obtain the power consumption of the GPU for each encryption. Power traces can be obtained via measurement or on-chip power sensors, locally or remotely.

3.3 Experimental Setup and Power Trace Acquisition

In our experiments, we consider a client-server computing platform, where a GPU is used to accelerate AES encryption. The TESLA C2070 GPU is hosted on the PCIE interface of a workstation running Ubuntu. Fig. 3.4 shows our system. In order to measure the power consumption of the GPU card, a 0.1Ω resistor is inserted in series with the ATX 12V GPU power supply. To keep the setup minimally invasive, the other parts of the board are left untouched. Since the output of the ATX power supply is almost constant at 12V where it connects to one end of the resistor, we only need to measure the voltage at the other end (using an oscilloscope) to get the voltage drop across the resistor.

Figure 3.4: The power measurement setup used in this work. [Diagram: the attacker exchanges plaintext and ciphertext with the host and GPU; the ATX power supply feeds the GPU through the 0.1 Ohm resistor, and a voltage probe delivers the power trace to the oscilloscope.]

The attacker sends plaintext to the server. Upon receiving the data file, the server copies it to the GPU memory for encryption. The ciphertext is generated on the GPU, and then returned to the attacker. During encryption, the oscilloscope records the power consumption for the attacker at a sampling frequency of 5GHz, while the GPU’s processor clock frequency is 1.15GHz. When the GPU is idle, it consumes little power, and the voltage the oscilloscope measures is close to the supply voltage, 12V. As the AES encryption starts, more power is drawn by the GPU, so the voltage drops. After the encryption is done, the voltage returns to its original level. Fig. 3.5 shows a sample power trace for our GPU running AES, with the 12V DC signal subtracted. We found that the voltage drops much more slowly than it rises. This may be because when the GPU starts the encryption, it gradually loads the data into memory, but ends by finishing all work in parallel. The power trace is also very noisy, and there seems to be no regular pattern corresponding to the AES round iterations.
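The relation between the measured voltage and the power drawn by the card follows from Ohm's law. The short Python sketch below works through it, assuming an ideal, constant 12V supply and a purely resistive 0.1Ω shunt; the function names are hypothetical.

```python
SUPPLY_VOLTAGE = 12.0    # volts, ATX rail (from the text)
SHUNT_RESISTANCE = 0.1   # ohms, series resistor (from the text)


def card_power(measured_voltage):
    """Power delivered to the card, given the voltage on the GPU side of the resistor."""
    drop = SUPPLY_VOLTAGE - measured_voltage   # voltage across the resistor
    current = drop / SHUNT_RESISTANCE          # Ohm's law
    return measured_voltage * current


def card_power_from_offset(offset):
    """Same quantity for a trace with the 12V DC level subtracted (offset <= 0)."""
    return card_power(SUPPLY_VOLTAGE + offset)
```

For example, an offset of -0.5V corresponds to a current of about 5A and a power of roughly 57.5W under these assumptions, which illustrates why a lower measured voltage means higher power consumption.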

Figure 3.5: A sample power trace of our GPU running AES, with the DC signal subtracted. [Plot: voltage (V), from about -0.54 to -0.36, versus time (s, ×10^-4).]

Power trace acquisition on a GPU differs substantially from the approaches used on MCUs, FPGAs and ASICs [37, 40, 39], for three reasons. First, the power trace of a GPU contains much more noise. Our measurement point on the ATX power supply is far away from the GPU silicon die power supply. On the GPU card, there are many DC-DC units converting the 12V voltage into the various other voltage values needed by the GPU, which introduces switching noise, so we need to filter out the desired power information by using large capacitors. The measured total power consumption of the GPU card also contains the power consumption of the cooling fan, off-chip memory, PCIE interface and many other auxiliary circuits. These unrelated sources of power consumption further contribute to the noise.

Second, since there is no GPIO (General-Purpose Input/Output) or dedicated pin on the GPU to provide a precise trigger signal that indicates the start or end of the encryption, the oscilloscope takes the rising edge of the power trace as the trigger signal. Because a power trace can be very noisy and its rising and falling edges are not as clean and uniform as we desire, it is challenging to consistently identify the beginning of an encryption in a trace. Therefore, the traces for different encryptions are not synchronized.

The last and most important issue is that the parallel computing behavior of the GPU may cause timing uncertainty in the power traces. The GPU scheduler may switch one warp out and bring in another at any time, and this behavior is not under the programmer’s control. Moreover, there are multiple streaming multiprocessors, each performing encryption concurrently and independently. These facts all pose significant challenges for GPU side-channel power analysis.

3.4 Power Model Building

Next, we build the power leakage model of the GPU for CPA, which will provide us with the power estimation formula PE(k), where k is the key candidate. The correlation between PE(k) and the actual power traces is then used to find the secret key.

3.4.1 Hamming Distance Based Power Leakage Extraction

The principle of a side-channel power analysis attack is that the power consumption of a cryptosystem is determined by key-dependent internal state switchings. The power consumption of a CMOS circuit consists of static and dynamic power [50]. The static power persists as long as the circuit is powered on, due to the leakage of reverse-biased pn junctions. The static power dissipation depends mainly on the temperature and working voltage, and less on the internal data. Hence the static power does not vary much and is treated as noise in the power model. The dynamic power is due to switching of voltages in circuit gate outputs (intermediate states). One part of this power is for charging and discharging the parasitic capacitance. Another part of the dynamic power is consumed by the short circuit formed by the PMOS and NMOS transistors while the output voltage changes. In a simplified model, the magnitude of the dynamic power consumption is linear in the number of changing bits (i.e., the Hamming distance) of the intermediate state.

Next we find the intermediate states of our AES GPU implementation that depend on the secret key and derive their Hamming distances. If any round key is retrieved, we can deduce the secret 128-bit AES key by reversing the key scheduling [45]. Hence we focus on finding leakage for each subkey byte in the last round. By disassembling the CUDA code, we find the related instructions are:

LOAD Rn [Rn]

AND Rn Rn 0X000000FF

XOR Rm Rm Rn

Figure 3.6: Last round operation on registers for one state byte. [Diagram: register values pass from the one-byte state, to the four-byte T-table word (load T-table), to a single byte (reserve only one byte), and finally, after the XOR with one byte of key, to one byte of ciphertext.]

Fig. 3.6 shows the corresponding operations on resources (registers). The GPU uses a 4-byte register Rn to hold one byte of the last round input state $s_{in}$ (i.e., the three most significant bytes of Rn are zero). Then the GPU loads the 4-byte T-table contents $T_a(s_{in})$ into the same register. Because there is no MixColumn operation in the last round, only one byte in the register is needed. Hence the other three bytes are ANDed with zeros, and the result $s_{out}$ remains in the register. The $s_{out}$ is in fact the SubByte and ShiftRow output corresponding to the input $s_{in}$. Then $s_{out}$ in register Rn is XORed with its corresponding last round subkey byte $k_1$ in register Rm to get the final cipher byte $c_1$ (also in Rm).

These three instructions involve two registers, and result in three transitions at three clock edges in the registers. Hence the three Hamming distances (two on register Rn and one on register Rm) can be determined as follows:

\[
h_1 = HW((0,0,0,s_{in}) \oplus T_a) = HW(2 \otimes s_{out}) + HW(3 \otimes s_{out}) + HW(s_{out}) + HW(s_{in} \oplus s_{out}), \tag{3.1}
\]

\[
h_2 = HW(T_a \oplus (0,0,0,s_{out})) = HW(2 \otimes s_{out}) + HW(3 \otimes s_{out}) + HW(s_{out}), \tag{3.2}
\]

\[
h_3 = HW(k_1 \oplus c_1) = HW(s_{out}), \tag{3.3}
\]

where $\oplus$ is XOR, and $\otimes$ denotes the multiplication in the field $GF(2^8)$.

Since the attacker knows the cipher byte $c_1$, he/she can calculate these Hamming distances from the cipher byte $c_1$ and a guessed subkey byte $k_1$ value, in reverse order. First, $s_{out} = k_1 \oplus c_1$ depends linearly on the subkey byte value. Then $s_{in}$ can be recovered by looking up the inverse of the SBox table, which is non-linear in the key. $T_a$ is the 4-byte T-table value looked up by $s_{in}$, which consists of four components, $2 \otimes s_{out}$, $3 \otimes s_{out}$, $s_{out}$, $s_{out}$, in different order according to the $s_{in}$ byte position. All three Hamming distances depend on the guessed key value.
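The reverse-order calculation can be written out as a short Python sketch. It follows Equations (3.1) to (3.3) directly; the SBox is generated from its definition so that the block is self-contained, and the function names are hypothetical rather than taken from our attack code.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES SBox: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value


def hw(x):
    """Hamming weight of an integer."""
    return bin(x).count("1")


def last_round_hamming_distances(c1, k1_guess):
    """Return (h1, h2, h3) of Equations (3.1)-(3.3) for one cipher byte and key guess."""
    s_out = c1 ^ k1_guess                 # linear in the guessed key byte
    s_in = INV_SBOX[s_out]                # non-linear step through the inverse SBox
    linear = hw(gf_mul(2, s_out)) + hw(gf_mul(3, s_out)) + hw(s_out)
    h1 = linear + hw(s_in ^ s_out)        # register Rn: state byte -> T-table word
    h2 = linear                           # register Rn: T-table word -> single byte
    h3 = hw(s_out)                        # register Rm: key byte -> cipher byte
    return h1, h2, h3
```

For instance, with $c_1$ = 0x63 and a guess of $k_1$ = 0x00, $s_{out}$ = 0x63 and $s_{in}$ = 0x00, which gives the three distances 16, 12 and 4.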

3.4.2 GPU’s Power Leakage Model

In previous work targeting CPA on FPGAs and CPUs, the attacker usually uses a power model based on nonlinear Hamming distances, such as $h_s = HW(s_{in} \oplus s_{out})$. Adopting such a power model is very effective when the power traces can be aligned and the $h_s$-specific operation occurs at a fixed time $t$ in all the power traces. The attacker just needs to analyze the power values at one time point. However, for our parallel computing GPU platform, there are many concurrent threads running. For the three Hamming distances in (3.1), (3.2) and (3.3) for each thread, the first is nonlinear, while the other two are linear in terms of key dependencies. Given the non-determinism of the hardware thread scheduler, the operations corresponding to different leakage Hamming distances may be executed at different times by each thread. The resulting power trace for the GPU contains multiple leakages at random time locations.

Assume there are $M$ threads running and the power consumption is measured at $N$ discrete times $t_1, ..., t_N$. For each thread $i$, there are $H$ Hamming distance leakages, $h_{i,1}, ..., h_{i,H}$. We denote the time of the operation corresponding to $h_{i,j}$ as $t_{i,j}$. Then the power consumption at time $t$ in one power trace is:

\[
P(t) = a \sum_{i=1}^{M} \sum_{j=1}^{H} I\{t = t_{i,j}\}\, h_{i,j} + R(t), \tag{3.4}
\]

where $I\{t = t_{i,j}\}$ is the indicator function (i.e., it is 1 when $t = t_{i,j}$, and 0 otherwise), $a$ is the unit power consumption for a single bit switching, and $R(t)$ is the noise at time $t$. The noise $R(t)$ includes all other unrelated power consumption (e.g., operations by other threads executing at the same time period as the measurement, and other unrelated concurrent operations in the same thread).

Since each thread’s power trace is misaligned with other threads’ traces, $t_{i,j}$ can be shifted randomly for different threads. Given the parallel computing behavior of the GPU, it becomes very difficult to identify the exact value of $t_{i,j}$, since threads can be executing different instructions at any time.

To retain the information of $h_{i,j}$ without the knowledge of $t_{i,j}$, we propose to sum up $P(t)$ over time $t$, similar to the sliding window DPA in [51]. Corresponding to the summation over the $N$ discrete times $t_1, ..., t_N$ of a power trace, the power model becomes the average power consumption of each power trace:

\begin{align}
P &= a \sum_{i=1}^{M} \sum_{j=1}^{H} h_{i,j} + R \tag{3.5} \\
  &= a\,PE + R \tag{3.6}
\end{align}

where $R$ is the summation of noise over time $t$, and the power estimation is

\[
PE = \sum_{i=1}^{M} \sum_{j=1}^{H} h_{i,j}. \tag{3.7}
\]
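The effect of summing over time can be checked with a small simulation. The Python sketch below is a noise-free toy, assuming $R(t) = 0$ and leak times drawn uniformly at random; the function names are hypothetical. It places each Hamming distance at a random sample as in Equation (3.4) and confirms that the time-summed power equals $a \cdot PE$ regardless of where the leakages fall.

```python
import random


def simulate_trace(leaks, n_samples, unit_power, rng):
    """Noise-free Equation (3.4): each leakage h lands at a random time index."""
    trace = [0.0] * n_samples
    for thread_leaks in leaks:
        for h in thread_leaks:
            trace[rng.randrange(n_samples)] += unit_power * h
    return trace


def power_estimate(leaks):
    """PE of Equation (3.7): sum of all Hamming distances over threads and leakages."""
    return sum(sum(thread_leaks) for thread_leaks in leaks)


if __name__ == "__main__":
    leaks = [[16, 12, 4], [10, 9, 3]]   # two threads, H = 3 leakages each (made-up values)
    for seed in (1, 2):
        trace = simulate_trace(leaks, 50, 0.5, random.Random(seed))
        print(sum(trace), 0.5 * power_estimate(leaks))
```

Two seeds give differently shaped traces but the same sum, which is the property the averaged power model relies on; in a real measurement the summed noise $R$ is added on top.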

To launch a CPA attack on a GPU using the average (total) power consumption model described above, the power traces are processed accordingly, and the Pearson correlation coefficient is derived between the predicted and measured values. The effectiveness of a CPA attack on a highly parallel computing platform has to be evaluated first. Our previous modeling work [52] showed that the success rate of CPA can be predicted by two factors: i) the physical side-channel signal-to-noise ratio and ii) the algorithmic confusion coefficients (a metric defined to capture the key distinguishability due to the algorithm and the select function). The noise level tends to be higher in the CPA attack on a GPU since the summation over time includes much more noise due to irrelevant operations in multiple threads. The confusion coefficients also differ in the GPU, as we are targeting the sum of three Hamming distances.

Figure 3.7: Distribution of confusion coefficient for one byte of the key for the GPU. [Histogram: frequency versus confusion coefficient of GPU.]

We next derive the confusion coefficients of AES on a GPU. From Fig. 3.6 and Equations (3.1), (3.2) and (3.3), the select function relating to one byte of ciphertext $c_1$ in one thread is:

\begin{align}
h_s &= h_1 + h_2 + h_3 \notag \\
    &= 3HW(s_{out}) + 2HW(2 \otimes s_{out}) + 2HW(3 \otimes s_{out}) + HW(s_{in} \oplus s_{out}) \tag{3.8}
\end{align}

where $s_{out} = c_1 \oplus k_1$. Compared to CPA on other non-parallel computing platforms, where the attack assumes $HW(s_{in} \oplus s_{out})$, the $h_s$ for a GPU contains the extra terms $3HW(s_{out}) + 2HW(2 \otimes s_{out}) + 2HW(3 \otimes s_{out})$. Note that for a single bit change in $k_1$, $HW(s_{out})$ only changes by one, and so do $HW(2 \otimes s_{out})$ and $HW(3 \otimes s_{out})$ in most cases (when the multiplication result does not overflow). These extra terms have a nearly linear relationship to the Hamming weight of the key. Hence, the distribution of confusion coefficients is more spread out on a GPU than on other computing platforms, leading to a less powerful CPA attack. The confusion coefficient [52] is defined as the variance of the difference between the power estimations calculated from the true key and one false key (i.e., a higher confusion coefficient means a larger distance between the true key and the false key, making them easier to distinguish):

\[
\kappa_i = E[(PE_i - PE)^2] \tag{3.9}
\]

where $PE_i$ is the power estimation of a false key $k_i$, $PE$ is the power estimation of the correct key, and $\kappa_i$ is its confusion coefficient.
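Equation (3.9) can be evaluated exhaustively for one key byte. The Python sketch below does so for the select function of Equation (3.8) and for the purely non-linear term, under the assumption that the cipher byte is uniformly distributed; the values are left unnormalized because the normalization used for the figures is not restated here, and the function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_inv_sbox():
    """Inverse AES SBox, derived from the SBox definition."""
    inv_sbox = [0] * 256
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        inv_sbox[s ^ 0x63] = x
    return inv_sbox


INV_SBOX = build_inv_sbox()


def hw(x):
    return bin(x).count("1")


def select_full(s_out):
    """Select function of Equation (3.8)."""
    return (3 * hw(s_out) + 2 * hw(gf_mul(2, s_out)) + 2 * hw(gf_mul(3, s_out))
            + hw(INV_SBOX[s_out] ^ s_out))


def select_nonlinear(s_out):
    """Only the non-linear term HW(s_in XOR s_out)."""
    return hw(INV_SBOX[s_out] ^ s_out)


def confusion_coefficients(select, true_key=0):
    """Unnormalized kappa_i of Equation (3.9), averaged over all 256 cipher bytes."""
    table = [select(s) for s in range(256)]
    kappas = []
    for false_key in range(256):
        total = sum((table[c1 ^ false_key] - table[c1 ^ true_key]) ** 2
                    for c1 in range(256))
        kappas.append(total / 256)
    return kappas
```

Calling `confusion_coefficients(select_full)` and `confusion_coefficients(select_nonlinear)` and comparing the spread of the two lists is one way to examine the effect of the linear terms discussed above.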

Figure 3.8: Distribution of the confusion coefficient without linearity. [Histogram: frequency versus confusion coefficient without linearity.]

Fig. 3.7 shows the distribution of the normalized confusion coefficients calculated for one subkey byte of the last round key for the GPU. Because of the linearity, the distribution is widely spread. The false keys possessing near-zero confusion coefficients could become ghost keys, which are hard to distinguish from the true key when working with high noise levels. Fig. 3.8 shows the distribution of the normalized confusion coefficients for a select function that excludes the linear terms and keeps only the non-linear term $HW(s_{in} \oplus s_{out})$, which is the normal situation for an FPGA or a CPU. The confusion coefficients are more concentrated at larger values, and therefore the true key is easier to identify.

3.5 Key Discovery of GPU by Power Analysis Attacks

We next discuss our power analysis of the targeted GPU-based AES implementation. We first extract the full AES key under the chosen-plaintext attack model, where the adversary can choose the size and content of the plaintext message. In this attack, the signal strength is boosted by redundant AES encryption instances of the same plaintext message, making full use of the GPU’s hardware resources to overcome the problem of high noise and the linearity in confusion coefficients. Then, we extend our analysis to the known-cipher attack model, where the attacker has no control of the input of plaintext, and predict the number of traces needed for specific attack success rates.


3.5.1 Full Key Extraction

To take advantage of the parallel computing structure of the GPU, we let each AES block encryption use $L = 4$ threads. We first set the message size to 8 blocks, requiring 32 concurrent threads, which we call an AES encryption instance. We consider the attack in a multi-threaded environment by executing multiple concurrent AES instances. Since the power consumption of one AES encryption is very small, we first use 768 instances of AES processing the same message to increase the power consumption for the measurement. That is, we use 768 × 4 × 8 = 24,576 threads for each power trace. Across different power traces, the plaintext messages vary and are generated independently of each other.

We first test the viability and correctness of our power model. We increase the number of power traces from 1000 to 100,000, in steps of 100 traces. For each selected number of traces, the power estimates for one selected subkey byte (the first byte of the last round key) are computed for all 256 possible values. The Pearson correlation coefficient for each subkey byte guess is computed using the power estimate according to Equation (3.7) and the average (sum) of the power points in each power trace, and is plotted in Fig. 3.9.

Figure 3.9: Correlation between the power traces and the Hamming distances for all possible subkey byte values. [Plot: correlation coefficient versus number of traces (×10^4); legend: correct subkey, wrong subkeys.]

As shown in Fig. 3.9, after 40,000 traces, the correct subkey clearly stands out, producing the largest (negative) correlation coefficient. The correlation coefficient is negative because we use the measured voltage to represent the power consumption: a lower voltage in fact means higher power consumption. In Section 3.5.2, we build a statistical model to estimate the success rate of getting the correct subkey for a specific number of traces. We also observe that, although the correlation coefficient of the correct subkey stands out, some false subkey values result in very close correlation coefficients, no matter how many traces we collect and analyze. This is due to the fact that the power model (3.7) depends heavily (and linearly) on the key value, as discussed at the end of Section 3.4.2. Therefore, the false subkey values with small Hamming distances from the true value are not as easily distinguishable in a GPU setting, especially when compared to the same values obtained on a CPU or FPGA computing platform.

Next, we run CPA on the power traces, extracting the last round key, byte by byte. Fig. 3.10 shows the attack results for 100,000 traces. We label the true subkey bytes with ‘*’ and the candidates with the lowest correlation with ‘◦’. In the figure, all the correct subkey byte values have the lowest correlation coefficients (i.e., the attacker can recover the exact last round key with control of the plaintext).

Figure 3.10: Our CPA attack results. [Sixteen panels, Byte0 through Byte15: correlation coefficient versus key value.]

The select function for our attack (shown in Equation (3.8)) includes all three Hamming distances in the last round which are key-dependent. In general, a non-linear select function would result in higher and more concentrated confusion coefficients, which would lead to a higher success rate under the same SNR (according to the success rate prediction model given in [52]). With three Hamming distances, adding them up would reduce the noise and increase the SNR, and would also result in different confusion coefficients as compared to using only one Hamming distance. Of the three Hamming distances $h_1$, $h_2$ and $h_3$, $h_1$ is nonlinear, while $h_2$ and $h_3$ are mostly linear in the subkey byte’s value. We generate seven different select functions based on different combinations of the three Hamming distances, calculate their corresponding confusion coefficients, derive SNRs from the measured traces, and plug them into the success rate prediction formula (detailed in the next section).

Fig. 3.11(a) shows that among the three Hamming distances, $h_1$ is the best select function due to its non-linear nature and higher signal level (which is slightly higher than that of $h_2$). $h_3$ is the worst select function, since it only provides a linear component. In addition, in Fig. 3.11(b), comparing the three groups of curves (all three Hamming distances included, two included, and only one, the highest $h_1$, included), we see that including more leakage Hamming distances always produces better results. For this reason, in the subsequent analysis, we always use all the Hamming distances ($h_1$, $h_2$ and $h_3$) for effective attacks.

Figure 3.11: Success rate with different combinations of linear and nonlinear Hamming distances. [Panels (a) and (b): success rate versus number of traces (×10^3); (a) compares h1, h2 and h3; (b) compares h1, h1+h2+h3, h1+h2, h1+h3 and h2+h3.]

3.5.2 A More Realistic Execution Environment

GPUs are powerful platforms, able to run thousands of concurrent threads. GPUs can be used effectively to accelerate AES encryption/decryption. In an actual AES implementation on a GPU, a large number of threads would be encrypting/decrypting different plaintext/ciphertext values

42 CHAPTER 3. SIDE-CHANNEL ANALYSIS OF AES ON GPU concurrently. A highly-tuned implementation of AES would try to utilize the full capacity of the GPU. Given this more realistic scenario, the attacker does not have control over the plaintext. We would like to understand how successful the attack will be, as a function of the number of power traces collected. To begin extending our attack model to a more realistic execution environment, we build off our previous work presented above [42], and leverage the success rate model for CPA proposed by Fei [52], to predict the number of traces needed to launch a successful attack under this situation. Mangard in [53] also proposed a model to estimate the trace number, but it failed to consider the effects of false key candidates (confusion coefficients), which has a major impact in our attack. Assume we have a B block plaintext message, where each block requires L (L=4 in our implementation) threads to encrypt. If BL is smaller than the maximum capacity of the GPU M, for simplicity, we assume the other M − BL threads are idle and do not contribute noise in our power measurement. Later, we will add the effect of idle threads into our analysis. We also assume that the noise generated by the BL threads is i.i.d., Gaussian distributed, with a zero mean. Then the standard deviation of the noise R in Equation (3.6) can be expressed as √ σB Bσ1 N = N (3.10) σB B where N is the noise standard deviation of the power trace from encrypting a block plaintext σ1 B message, and N is for =1. For the true key value, the power estimation becomes: B H B PE = hi,j (3.11) i=1 j=1 th th where hi,j is the j Hamming distance of one AES instance in the i message block (with the correct key byte), and there are H Hamming distances for each data block. 
For the $l$-th false key, its power estimation can be expressed as:

\[ PE_l^B = \sum_{i=1}^{B} \sum_{j=1}^{H} h_{i,j,l} \tag{3.12} \]

where $h_{i,j,l}$ is the $j$-th Hamming distance of one AES instance in the $i$-th block, for the $l$-th false key.

In previous work [52], Fei et al. described a model to calculate the theoretical success rate quantitatively for one key byte as

\[ SR = \Phi_{N_k-1}\left\{ \sqrt{n}\,\frac{|a|}{2\sigma_N}\,K^{-1/2}\kappa \right\} \tag{3.13} \]

where $\Phi_{N_k-1}$ is the cumulative distribution function of an $(N_k - 1)$-dimensional standard normal distribution, $N_k$ is the number of key candidates, $n$ is the number of traces, $a$ is the coefficient in Equation (3.6), and $\sigma_N$ is the standard deviation of the noise. $\kappa$ is the vector holding the confusion coefficients. For our highly parallel GPU environment, the confusion coefficient for the $l$-th wrong key is:

\[ \kappa_l^B = E\left[\,|PE^B - PE_l^B|^2\,\right] \tag{3.14} \]

The $(l, m)$-th element in the three-way confusion coefficient matrix $K$ is

\[ K_{l,m}^B = E\left[(PE^B - PE_l^B)(PE^B - PE_m^B)\right] \tag{3.15} \]

We add the superscript $B$ to emphasize that we are dealing with $B$ blocks of plaintext encryption on a GPU. We denote the difference between the power estimations of the true key and a false key as

\[ Q_l^B = PE^B - PE_l^B = \sum_{i=1}^{B} \sum_{j=1}^{H} (h_{i,j} - h_{i,j,l}) = \sum_{i=1}^{B} q_{i,l} \tag{3.16} \]

where $q_{i,l} = \sum_{j=1}^{H} (h_{i,j} - h_{i,j,l})$ is the difference of the Hamming distance of the $i$-th block between the correct key value and the $l$-th false key. Due to the diffusion property of AES, the mean of $q_{i,l}$ is 0, and $q_{i,l}$ and $q_{j,l}$ are independent if $i \neq j$, so we have

\[ \kappa_l^B = E\left[\,|PE^B - PE_l^B|^2\,\right] = E\left[\,|Q_l^B|^2\,\right] = B\sigma_{q,l}^2 \tag{3.17} \]

with $\sigma_{q,l}^2$ denoting the variance of $q_{i,l}$. The subscript $i$ is dropped because the variance is independent of $i$. For the three-way confusion coefficient matrix $K^B$, we denote

\[ R_{l,m} = E[q_{i,l}\,q_{i,m}] \tag{3.18} \]

Then we have

\[ K_{l,m}^B = E\left[(PE^B - PE_l^B)(PE^B - PE_m^B)\right] = E\left[Q_l^B Q_m^B\right] = E\left[\sum_{i=1}^{B} \sum_{j=1}^{B} q_{i,l}\,q_{j,m}\right] = B R_{l,m} \tag{3.19} \]

After plugging Equations (3.10), (3.17), and (3.19) into Equation (3.13), we get

\[ SR^B = \Phi_{N_k-1}\left\{ \sqrt{n}\,\frac{|a|}{2\sigma_N^B}\,(K^B)^{-1/2}\kappa^B \right\} = \Phi_{N_k-1}\left\{ \sqrt{n}\,\frac{|a|}{2\sigma_N^1}\,(K^1)^{-1/2}\kappa^1 \right\} \tag{3.20} \]
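The cancellation of $B$ in Equation (3.20) can be spelled out in one worked step. The block below is only a sketch that substitutes $\sigma_N^B = \sqrt{B}\,\sigma_N^1$ from Equation (3.10), $\kappa^B = B\kappa^1$ from Equation (3.17), and $K^B = B K^1$ from Equation (3.19), where $\kappa^1$ and $K^1$ denote the $B = 1$ quantities.

```latex
\begin{align*}
\frac{|a|}{2\sigma_N^B}\,(K^B)^{-1/2}\,\kappa^B
  &= \frac{|a|}{2\sqrt{B}\,\sigma_N^1}\,\bigl(B K^1\bigr)^{-1/2}\,\bigl(B\kappa^1\bigr) \\
  &= \frac{|a|}{2\sigma_N^1}\cdot\frac{B}{\sqrt{B}\,\sqrt{B}}\,(K^1)^{-1/2}\kappa^1 \\
  &= \frac{|a|}{2\sigma_N^1}\,(K^1)^{-1/2}\kappa^1 .
\end{align*}
```

The signal terms grow with $B$ at the same rate as the combined noise and confusion terms, so the argument of $\Phi_{N_k-1}$ does not depend on $B$.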

This shows that with the same number of traces, the success rate is independent of the number of plaintext blocks $B$.

To produce the values of $a$ and $\sigma_N^1$, and verify the correctness of Equations (3.13) and (3.20), we performed attacks with three different plaintext message sizes, with $B$ equal to 8, 16 and 32. Just as in the experiments in Section 3.5.1, we utilize $M = 24,576$ threads to increase the power consumption for better measurement resolution. However, by doing this, each AES instance will be replicated $M/BL$ times. As a result, we have:

\[ \sigma_N^B = \sqrt{\frac{M}{L}}\,\sigma_N^1 \tag{3.21} \]

\[ \kappa_l^B = \left(\frac{M}{BL}\right)^2 B\sigma_{q,l}^2 = \left(\frac{M^2}{BL^2}\right)\sigma_{q,l}^2 \tag{3.22} \]

\[ K_{l,m}^B = \left(\frac{M}{BL}\right)^2 B R_{l,m} = \left(\frac{M^2}{BL^2}\right) R_{l,m} \tag{3.23} \]

Then

\[ SR^B = \Phi_{N_k-1}\left\{ \sqrt{n}\,\frac{|a|}{2\sigma_N^1}\,\sqrt{\frac{M}{BL}}\,(K^1)^{-1/2}\kappa^1 \right\} \tag{3.24} \]

This suggests that when $M = BL$, there are no repeated AES encryption instances and Equation (3.24) reduces to Equation (3.20), as expected. When $M > BL$, we have a more typical scenario of how a GPU would leverage many threads for acceleration (while many threads are used, the GPU is not run at full capacity), so only $BL$ threads are running for computation and the remaining $M - BL$ threads are idle. Formula (3.20) provides the success rate for this attack. Note that formula (3.24) is for a reference attack, where there are $M/BL$ instances of the same computation, i.e., with much higher side-channel SNR. The reason for us to examine the reference attack, and run experiments with repeated computations, is to generate power measurements with sufficient resolution. We can then extrapolate the results of the reference attack and predict the number of traces needed for the real-case attack, which requires $M/BL$ times more power traces.

Fig. 3.12 shows both the empirical success rates for the reference attacks and the theoretical calculations generated by Equation (3.24), for $M = 24,576$ and $B$ equal to 8, 16, or 32. For each value of $B$, the two curves track each other very well. Taking $B = 8$ in Fig. 3.12 as an example, for a realistic attack with no repeated AES encryption instances, to obtain the same success rate, the number of traces would need to be 768 ($M/BL = 24,576/(4 \times 8)$) times that of the reference attack. For example, to achieve a 70% success rate, the number of traces needed for the reference attack is approximately 33,000, while the number needed for a realistic attack would be 25.3 million ($768 \times 33,000$). Furthermore, when $M > BL$, the unutilized threads may be used to do other computation, which will increase the noise level without contributing any side-channel signal, and thus we may need even more traces to obtain the same success rate.

[Figure 3.12 plot: success rate (0 to 1) versus number of traces (0 to 200, in units of $10^3$); curves: Empirical 8, Empirical 16, Empirical 32, Theoretical 8, Theoretical 16, Theoretical 32.]

Figure 3.12: Empirical and theoretical success rates for 8, 16 and 32 blocks of plaintext.
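The extrapolation from the reference attack to the realistic attack is plain arithmetic on $M$, $B$ and $L$. The following is a minimal sketch of that calculation; the function names are hypothetical and chosen only for illustration. For the $B = 8$ case, $M/BL = 24,576/(4 \times 8)$ evaluates to 768.

```python
def trace_scale_factor(total_threads, blocks, threads_per_block):
    """Return M / (B * L): how many times each AES instance is replicated
    in the reference attack, and hence the factor by which the realistic
    attack needs more traces for the same success rate."""
    return total_threads / (blocks * threads_per_block)


def realistic_trace_count(reference_traces, total_threads, blocks, threads_per_block):
    """Scale a trace count measured for the reference attack up to the
    realistic attack with no replicated AES instances."""
    return reference_traces * trace_scale_factor(
        total_threads, blocks, threads_per_block
    )


if __name__ == "__main__":
    M, L = 24576, 4  # values quoted in the text
    for B in (8, 16, 32):
        print(B, trace_scale_factor(M, B, L))
    # Reference attack at B = 8 needs roughly 33,000 traces for 70% success.
    print(realistic_trace_count(33000, M, 8, L))
```

Because the factor is inversely proportional to $B$, larger plaintext messages narrow the gap between the reference attack and the realistic one.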

3.6 Countermeasures

To defend AES on a GPU against side-channel power analysis, countermeasures should be applied. The common countermeasure of masking would also work on a GPU: since the intermediate values are randomized by the mask, the attacker can no longer correlate the power with the model. However, another typical countermeasure, random delay of instructions, is not effective on a GPU. One reason is that the instructions issued by different warps on a GPU are already randomized to some extent by the scheduler. Another reason is that the attacker can always average the entire trace to capture the leakage, as is done in this chapter. With sufficient key storage and adequate key management, users should avoid using the same secret key for all encryptions on the GPU. It is not necessary to use a different key for every data block; even a few keys on the GPU would significantly increase the attack complexity and render the attack infeasible. Another effective countermeasure is to initialize the registers with random values before writing to them. This introduces some performance degradation, but we can limit the random initialization to the sensitive registers, e.g., the last-round registers, to minimize the effect. Thanks to the high performance of the GPU, such a small overhead would be negligible.


3.7 Summary

In this chapter, we present side-channel power analysis on a GPU AES implementation. We describe a process to obtain power consumption measurements on an NVIDIA GPU. The various challenges of power analysis on a GPU are highlighted. To overcome these difficulties, we have proposed effective strategies to process the power traces for a successful correlation power analysis. The corresponding power model is built based on the CUDA PTX assembly code. We begin our analysis of the attack assuming control over the plaintext, and analyze its scalability as we increase the size of the plaintext. We find a linear relationship between the amount of plaintext and the number of traces needed, though the computational complexity grows exponentially. The attack results show that a GPU, a highly popular but very complex parallel computing device, is vulnerable to side-channel power analysis attacks.

Chapter 4

Side-channel Analysis of RSA on GPU

4.1 Introduction and Motivation

The RSA algorithm [54] is a public-key cipher widely used in digital signatures and Internet protocols, including the Secure Sockets Layer (SSL) and Transport Layer Security (TLS). RSA entails substantially higher computational complexity than symmetric ciphers. For scenarios where an Internet domain is handling a large number of SSL connections and generating digital signatures for a large number of files, the amount of RSA computation becomes a major performance bottleneck. With the advent of general-purpose GPUs, the performance of RSA has been improved significantly by exploiting parallel computing on a GPU [55, 32, 56, 57], leveraging the Single Instruction Multiple Thread (SIMT) model.

However, the security of RSA on a GPU has not received adequate attention. Side-channel vulnerability (i.e., private key retrieval through side-channel analysis) is a major concern for cryptosystems. Meanwhile, various side channels, including power, electromagnetic emanation, and timing, have been exploited against implementations of RSA on other computing platforms, such as CPUs, FPGAs, ASICs, and MCUs. Among these attacks, timing attacks have become a realistic cyber threat given that they are non-invasive and can be launched remotely. Many timing attacks on RSA deployed on CPUs are micro-architectural (cache-based) attacks [58, 59, 60, 61]. For example, in access-driven attacks such as PRIME+PROBE [58] and FLUSH+RELOAD [59], a spy process first sets the cache into a known state, the victim RSA process executes, and during the run the spy measures the timing of its memory accesses to infer the cache access pattern of the victim and recover the secret key.

We initially anticipated that applying such cache attacks on a GPU would be very challenging, if not impossible, due to the high degree of noise resulting from the massively parallel execution model of a GPU. These devices launch thousands of concurrent threads, so the victim would not be a single process, but a kernel that runs across threads on Streaming Multiprocessors (SMXes). The cache access pattern would be a complex mix of many copies of the same kernel, and therefore the spy cannot easily attribute a particular cache access to a specific victim thread. Each SMX has its own set of caches, so the only cache leakage possible is when the spy kernel and victim kernel are on the same SMX. However, the victim kernels may be thread-switched out of the SMX by the hardware scheduler, which attempts to maintain high utilization. This results in a significant amount of noise in the cache access pattern that the spy kernel observes.

An alternative timing attack is time-driven: it exploits key-dependent variations in computation time, and treats variations in memory accesses as key-independent noise. Such timing attacks heavily depend on the RSA algorithm and associated optimizations. There are two common optimizations used in RSA algorithms: sliding window exponentiation and the Chinese Remainder Theorem (CRT). Montgomery multiplication [62] is the basic operation that is common to many RSA implementations.

Kocher [63] first outlined the theoretical foundation for possible timing attacks on several cryptosystems. Dhem et al. [64] proposed a practical timing attack on RSA on a smart card in a bit-by-bit fashion. Their attack exploits the timing variability of Montgomery multiplication as the timing leakage, which is also adopted in this work. However, massive concurrent RSA executions on a GPU make this particular timing side channel much more obscure, presenting a challenge for our attack. Our GPU RSA implementation also employs sliding window exponentiation, which was not used in prior work [64], so our attack recovers the exponent window by window.
Tóth proposed another timing attack under the assumption that the total number of extra reductions is known to the adversary [65], which is not a practical attack scenario on a GPU. RSA encryptions using the Chinese Remainder Theorem (CRT), and protocols based on CRT-RSA, have also been attacked using timing leakage [66, 67, 68, 69, 70, 71]. Our preliminary analysis shows that there may be no difference in the attack strategy for CRT-RSA on a GPU versus a CPU. This work targets an RSA implementation that uses sliding window exponentiation.

There has been very limited prior work focused on timing attacks on a GPU [43, 72]. Our proposed work is distinctively different from the prior work. First, regarding the timing leakage, our work exploits variations in computation time, while prior work exploited timing differences related to memory access, one due to the on-chip coalescing unit [43], and the other due to bank conflicts in on-chip shared memory [72]. Second, our work targets a different cipher, RSA, while the prior work targets AES implementations. For AES (a block cipher), the data memory address pattern leaks key information. RSA is computationally intensive, and we anticipate that such memory-access-related timing channels are very weak. Instead we utilize a computational timing side channel for RSA on a GPU. When multiple RSA operations are running concurrently with hundreds of threads on a GPU, the interaction between these threads makes it hard to predict their behavior and extract side-channel timing information, reducing the GPU’s vulnerability to attack.

The main contributions of this work include: 1) a hierarchical timing model for an RSA implementation on a GPU with Montgomery multiplication and sliding window exponentiation, explicitly capturing various complex interactions in a massively parallel computing platform; 2) a successful correlation timing analysis attack based on the proposed timing model to extract the private key; 3) an effective error correction algorithm designed to detect and correct attack errors to improve the success rate; 4) a success rate analysis that predicts the success rate from the quality of the side-channel information obtained from the GPU.

4.2 Background: RSA and GPU Implementation

An RSA cipher uses a pair of keys, one public and one private. The public key $(n, e)$ is used for encrypting the plaintext message $M$, and the private key $(n, d)$ is used for decrypting the ciphertext $C$:

\[ M = C^d \bmod n \tag{4.1} \]

where $n$ is the modulus, and $e$ and $d$ are the public and private exponents, respectively. The values $n$, $d$, and $e$ must satisfy constraints that make encryption and decryption inverse operations of each other [54]. The key length (the bit length of $C$, $M$, $d$, and $n$) can be 512, 1024, 2048 or longer, providing different levels of security. Usually the public key exponent $e$ is chosen to be a relatively small number with a small Hamming weight, while the secret key exponent $d$ is much larger, and requires many more bits. Therefore, the decryption process using the private key is 20 to 60 times slower than the encryption process, demanding acceleration for higher throughput [57].
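As a concrete illustration of Equation (4.1), the sketch below runs textbook RSA on toy parameters. The primes, the exponent, and the function names are assumptions chosen only for readability; they are far too small to be secure and are not taken from the implementation studied here. The private exponent is derived from the standard relation $e \cdot d \equiv 1 \pmod{(p-1)(q-1)}$.

```python
# Toy textbook RSA (illustrative parameters only, not secure).
p, q = 61, 53                 # assumed small primes
n = p * q                     # modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # small public exponent with low Hamming weight
d = pow(e, -1, phi)           # private exponent; needs Python 3.8+


def encrypt(message):
    """C = M^e mod n, using the public key (n, e)."""
    return pow(message, e, n)


def decrypt(ciphertext):
    """M = C^d mod n, using the private key (n, d), as in Equation (4.1)."""
    return pow(ciphertext, d, n)


if __name__ == "__main__":
    c = encrypt(65)
    print(c, decrypt(c))
    # d has many more bits than e, which is why decryption is the slow path.
    print(e.bit_length(), d.bit_length())
```

Even at this toy scale the private exponent is noticeably longer than the public one, mirroring the asymmetry in cost described above.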

4.2.1 Sliding Window Exponentiation

To perform exponentiation of a large number, a simple binary method [73] performs squaring and conditional multiplications based on each bit of the exponent. One optimization is sliding-window exponentiation, where the exponent $d$ is decomposed into a series of zero and nonzero windows, $F_i$ of length $L(F_i)$, and the exponentiation is processed window by window. Using sliding window exponentiation, we can achieve a 21% improvement in performance on average [74].

There are two ways of constructing a sliding window: applying a Constant Length Nonzero Window (CLNW) [75] or a Variable Length Nonzero Window (VLNW), without incurring significant performance overhead [76]. For CLNW, the nonzero window length is fixed, while the zero window can have different lengths. For VLNW, both nonzero windows and zero windows have variable lengths. Assuming Jang et al.’s implementation [57], CLNW has nonzero window lengths of 5, 6 and 7 for RSA-512, RSA-1024 and RSA-2048, respectively. The following example illustrates the partitioning of an exponent, with $L = 4$, from right to left:

1001 1101 00000 1011 0111 000 0001 0 0111 000 1011

Note that the exponent starts from the most-significant nonzero window, and the value of each nonzero window must be odd, while the lengths of the zero windows vary, and any zero window occurs between two constant-length nonzero windows.

For VLNW, the partitioning between zero and nonzero windows is slightly more complex. The nonzero window length has an upper bound, $L_{max}$. A nonzero window switches to a zero window when the next $q$ bits are all zeros. The algorithm proceeds from the least significant bit to the most significant bit, and is in one of two states, ZW or NW, indicating a zero window and a nonzero window, respectively.

ZW: Check the next bit. If it is zero, stay in ZW; else go to NW.

NW: Check the next $\min(q, L_{max} - L_{cur})$ bits, where $L_{cur}$ is the current window length. If all are zero, switch to ZW. If not, include the bits in the current window. If $L_{cur} = L_{max}$ after including the new bits, check the incoming single bit: if zero, go to ZW; else go to a new NW.

The following example illustrates the partitioning of an exponent, with $L_{max} = 5$, $q = 2$:

101 11101 00 101 10111 000000 1 00 111 000 1011

Note that any zero window occurs between two nonzero windows, and there may be multiple nonzero windows in a sequence. The nonzero windows always have odd values. In Jang et al.’s implementation [57], $L_{max} = 4, 5, 6$ and $q = 3, 2, 2$ for RSA-512, RSA-1024 and RSA-2048, respectively.

The sliding window exponentiation method for decryption is described in Algorithm 2, regardless of how the windows are partitioned. There are $h$ windows of exponent $d$. To begin processing, the odd powers $C^w$ have to be pre-calculated once (Lines 1-4). The computation loop iterates over the windows, starting from the most significant one. During each window (iteration), a number of squarings equal to the bit length of the window are computed (Line 7), followed by a conditional multiplication (only if the window is nonzero) with the multiplicand determined by the window value, $F_i$ (Line 9).

Algorithm 2 Sliding Window Exponentiation

Input: $C$, $d$, $n$ ($d$ decomposed into zero and nonzero windows $F_i$ of length $L(F_i)$, for $i = 0, 1, 2, \ldots, h - 1$)
Output: $M = C^d \bmod n$
1: $C^2 = C \cdot C$
2: for $w = 3, 5, 7, \ldots, 2^L - 1$ do
3:     $C^w = C^{w-2} \cdot C^2 \bmod n$
4: end for
5: $M \leftarrow 1$
6: for $i = 0$ to $h - 1$ do
7:     $M \leftarrow M^{2^{L(F_i)}}$ // squaring
8:     if $F_i \neq 0$ then
9:         $M \leftarrow M \cdot C^{F_i} \bmod n$ // multiplication
10:     end if
11: end for
12: return $M$

4.2.2 Montgomery Multiplication

Montgomery’s algorithm [62] is an efficient algorithm for modular multiplication, the basic operation used in the RSA cipher. The multiplier and multiplicand are first transformed into the Montgomerized form before they go through the multiplication. The product is in Montgomerized form as well: $a \cdot b = \mathrm{MM}(a, b) = a \cdot b \cdot r^{-1} \bmod n$, where $r$ is a power of 2, and co-prime with both n and n

which makes it a squaring operation, the probability of the extra reduction is:

\[ \mathrm{Prob}(\text{extra reduction in } \mathrm{MM}(a, a)) = \frac{n}{3r} \tag{4.2} \]

which is a constant, determined by $n$ and $r$ only. If $a$ is random and $b$ is a constant, the probability becomes:

\[ \mathrm{Prob}(\text{extra reduction in } \mathrm{MM}(a, b)) = \frac{b}{2r} \tag{4.3} \]

i.e., it depends on the values of $b$ and $r$.
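The extra reduction that serves as the timing channel can be made visible with a small model. The sketch below assumes the standard textbook Montgomery reduction (REDC) with an odd modulus $n$ and $r = 2^k > n$; the function names and the flag returned alongside the product are illustrative assumptions, and the GPU implementation operates on multi-word operands instead of Python integers.

```python
def montgomery_multiply(a, b, n, r_bits):
    """Return (a * b * r^-1 mod n, extra_reduction) with r = 2**r_bits.

    Assumes n is odd, r > n, and 0 <= a, b < n. The boolean reports
    whether the final conditional subtraction (the extra reduction)
    was executed.
    """
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime = -1 (mod r); Python 3.8+
    t = a * b
    m = ((t % r) * n_prime) % r
    u = (t + m * n) >> r_bits           # exact division by r
    extra = u >= n
    if extra:
        u -= n                          # the data-dependent extra reduction
    return u, extra


def squaring_extra_reduction_rate(n, r_bits):
    """Fraction of all residues a in [0, n) whose Montgomery squaring
    MM(a, a) triggers the extra reduction (exhaustive, small n only)."""
    hits = sum(montgomery_multiply(a, a, n, r_bits)[1] for a in range(n))
    return hits / n


if __name__ == "__main__":
    n, k = 65521, 16                    # assumed toy modulus and r = 2**16
    print(montgomery_multiply(12345, 54321, n, k))
    # Compare the measured rate with the n/3r estimate of Equation (4.2).
    print(squaring_extra_reduction_rate(n, k), n / (3 * (1 << k)))
```

Counting the flag over many inputs gives an empirical rate that can be set against Equations (4.2) and (4.3) for a chosen modulus.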

4.2.3 GPU Parallelization of RSA

NVIDIA GPU hardware features many Streaming Multiprocessors (SMXes), each consisting of homogeneous cores and shared resources such as caches. In GPU programming, the workload is partitioned into a grid of blocks, with each block consisting of many threads. Threads from one block are only executed on the same SMX. On CUDA cores, threads of one block are grouped into warps of 32 threads, executing instructions in a synchronized fashion. Whenever the execution paths of threads in a warp diverge, only one path is executed at a time and the others have to wait. Control divergence negatively impacts program performance (i.e., the program runs slower), resulting in data-dependent time variations: the execution time depends on the message contents and the private key. Warps on the same SMX compete for the same limited hardware resources, such as CUDA cores, warp schedulers and shared memory. Warps on different SMXes run independently.

We adopt a parallel RSA implementation developed in CUDA, as described by Jang et al. [57, 77]. Note that the for loop iterations in Algorithm 2 (lines 6-11), which process the exponent windows, are executed sequentially, because each iteration depends on the result of the previous iteration. Parallelization is accomplished between messages and within messages. When processing a batch of ciphertext messages on a GPU, the messages are independent and decrypted in parallel, while using the same secret key exponent. In addition, the Montgomery multiplications associated with one message’s computations are also parallelized. Each message is represented by an array of $s$ 64-bit words, and the multiplications are carried out by $s$ threads, executed in cooperation. For example, to decrypt a single message in RSA-512, the $s$ value is 8, i.e., it requires 8 threads to decrypt. A batch of 32 messages requires 256 threads in total, which are divided into 2 blocks, with a typical block size of 128 threads.
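The thread and block counts in the RSA-512 example follow directly from the 64-bit word size and the 32-thread warp. The sketch below reproduces that arithmetic; the function name and the dictionary keys are hypothetical, and the default block size of 128 is simply the typical value quoted above.

```python
import math

WORD_BITS = 64   # each message is an array of 64-bit words
WARP_SIZE = 32   # threads per warp


def launch_geometry(key_bits, n_messages, block_size=128):
    """Derive thread, warp and block counts for a batch of RSA decryptions.

    One thread handles one 64-bit word of a message, so a message needs
    s = key_bits / 64 threads. Meaningful for key sizes up to 2048 bits,
    where s does not exceed the warp size.
    """
    threads_per_message = key_bits // WORD_BITS
    total_threads = threads_per_message * n_messages
    return {
        "threads_per_message": threads_per_message,
        "messages_per_warp": WARP_SIZE // threads_per_message,
        "total_threads": total_threads,
        "blocks": math.ceil(total_threads / block_size),
    }


if __name__ == "__main__":
    print(launch_geometry(512, 32))
    print(launch_geometry(1024, 64))
```

The `messages_per_warp` value matters for the timing model, because all messages sharing a warp advance in lockstep.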


4.3 The Timing Models of RSA on GPUs

Given the parallel execution environment on a GPU, it becomes more challenging to predict the execution time than for a single-threaded execution on a CPU. Program execution includes both concurrent computation and blocking execution (e.g., for divergence in a warp). The execution time is no longer simply the sum of the execution times of all instructions. In this section, we introduce a hierarchical timing model for RSA decryption on a GPU, which accounts for various interactions during execution, and is the basis for enabling our side-channel timing attack. Note that we choose the extra reductions in the Montgomery multiplication as the timing channel. The number of reductions depends on both the input message and each window of the secret exponent $d$. We start with a general linear timing model as follows:

\[ T = O + v \cdot R \tag{4.4} \]

where $R$ is the number of extra reductions in the Montgomery multiplications, $v$ is the unit execution time of one reduction, and $O$ represents other key-independent execution time factors that follow a distribution when the input message varies. For a sequential implementation, $R$ will simply be the total number of extra reductions performed across the windows. With multiple messages being decrypted in parallel by many threads on the GPU, there are complex interactions between the threads through competition for shared resources and the specifics of the GPU programming model. We build a hierarchical model of $R$ according to the interaction at different levels of parallelism on a GPU. We consider fine-grained thread (message) level interaction, as well as interactions at the warp level, the SMX level, and finally across the entire GPU.

4.3.1 GPU Timing Model

We begin by considering only a single message being decrypted by multiple threads (e.g., 16 for RSA-1024, which are within the same warp). Since threads in a warp are synchronized, we can still view the decryption for one message as sequential execution. For each window of $d$, we calculate the reduction number, as shown in lines 7-10 of Algorithm 2. The first for loop (lines 2-4) is ignored here because it is independent of the secret key $d$.

For the $i$-th window $F_i$ of $d$, the intermediate variable $M$ is first squared $L(F_i)$ times. If $F_i$ is nonzero, its length $L(F_i)$ is always $L$ when using the CLNW method. When $F_i$ is zero, its length can vary from 1 to the key length. Under the assumption that the bits of $d$ are independent and each is equally likely to be 0 or 1, the longer $F_i$ is, the smaller its probability of occurring. We assume the length of a zero window is no longer than 32 bits (probability of $2.33 \times 10^{-10}$). We use $r_{msg}(i)$ to record the number of reductions of the squarings for window $i$. The number of reductions performed for the entire decryption of a message, $R_{msg}$, is the sum of $r_{msg}(i)$ over $i$. Note we do not include the extra reductions of the multiplication operation (line 9 of Algorithm 2). The probability of an extra reduction in the multiplication depends on the value of $C^{F_i}$, as expressed in Equation (4.3), and there may be other windows with the same value of $F_i$. This kind of dependency on an extra reduction in multiple windows will bias the attack result [64], so an effective attack should not use the reductions associated with the multiplication.

Next, we consider running a warp using multiple messages in parallel. The number of messages in a warp depends on the key size. There are 2 messages in a warp for RSA-1024 and 1 for RSA-2048. Because threads in one warp are synchronized, the operations across different messages are also synchronized (i.e., all the messages are processing the same window of $d$ at the same time). For one squaring operation in a window, if one or more messages have an extra reduction, the whole warp has to wait for it to finish, and the warp is considered to have an extra reduction. Thus, the number of reductions in a warp for a window, $r_{warp}(i)$, is not simply the sum of the reductions used for squaring for all the messages.

To calculate $r_{warp}(i)$, we use a binary number, $b_{msg}(i, j)$, to record the extra reductions during execution, for the $j$-th message in the $i$-th exponent window. The length of $b_{msg}(i, j)$ is $L(F_i)$, because in each window there are $L(F_i)$ squaring operations, and each bit of $b_{msg}(i, j)$ corresponds to one squaring. If the squaring of a message results in an extra reduction, this bit is 1; otherwise, it is 0. For each squaring operation, all the messages in a warp are synchronized (i.e., as long as there is one message resulting in an extra reduction, the warp, and hence all of its messages, encounters a delay). Therefore, the $b_{msg}(i, j)$ are combined with a bit-wise OR over $j$, and the number of reductions in the warp, $r_{warp}(i)$, is the Hamming weight of the bit-wise OR result. The number of reductions across the warp for the entire decryption, $R_{warp}$, is again the sum of $r_{warp}(i)$ over $i$.

For one SMX with multiple warps running, there is no synchronization between warps. When the total number of threads (number of warps × 32) is equal to or smaller than the number of CUDA cores on the SMX, these warps will run mostly in parallel. However, the reduction operation involves shared memory accesses, which are processed in a serialized manner, and therefore all the warps are processed in a blocking way to some extent. In scenarios where the total thread number is much larger than the number of SMX CUDA cores, the warps have to be scheduled serially. Based on these considerations, we approximate the reduction number of one SMX, $R_{smx}$, as the sum of its warps’ reductions.

For the complete decryption with multiple SMXes, we assume the SMXes are running independently, and the time of one decryption is determined by the slowest SMX. The number of reductions associated with one decryption on the GPU is $R_{gpu}$, which is the maximum of all the SMX reductions, $R_{smx}$.
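The hierarchy of reduction counts (bit-wise OR within a warp, sum over warps on an SMX, maximum over SMXes) can be written down compactly. The sketch below is a minimal model under an assumed data layout: the nested-list structure and the function names are illustrative choices, with each $b_{msg}(i, j)$ stored as a Python integer whose bits mark the squarings that incurred an extra reduction.

```python
from functools import reduce


def warp_window_reductions(b_msgs):
    """r_warp(i): Hamming weight of the bit-wise OR of b_msg(i, j) over the
    messages j of one warp. A squaring stalls the warp if any message in
    the warp needs an extra reduction."""
    combined = reduce(lambda x, y: x | y, b_msgs, 0)
    return bin(combined).count("1")


def warp_reductions(warp):
    """R_warp: sum of r_warp(i) over the windows of one warp.
    `warp` is a list of windows, each a list of per-message integers."""
    return sum(warp_window_reductions(window) for window in warp)


def smx_reductions(smx):
    """R_smx: approximated as the sum of the reductions of its warps."""
    return sum(warp_reductions(warp) for warp in smx)


def gpu_reductions(gpu):
    """R_gpu: the slowest SMX determines the decryption time, so take the
    maximum of R_smx over all SMXes."""
    return max(smx_reductions(smx) for smx in gpu)


if __name__ == "__main__":
    # Two SMXes; the first runs two warps, the second runs one.
    gpu = [
        [
            [[0b1010, 0b0110], [0b1]],   # warp 0: two windows, two/one messages
            [[0b11]],                    # warp 1: one window
        ],
        [
            [[0b111]],                   # warp 0 on the second SMX
        ],
    ]
    print([smx_reductions(smx) for smx in gpu], gpu_reductions(gpu))
```

In an attack, the integers would come from simulating the squarings for each known ciphertext under a key guess, and $R_{gpu}$ would then be correlated with the measured time.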

4.3.2 Timing Model Verification

To verify our timing model, we run multiple experiments of RSA-1024 on an NVIDIA K40 GPU. We set the block size to 128, such that each block has 4 warps. We run experiments with CLNW and vary the number of messages per decryption over 1, 2, 8 and 64, which corresponds to 1 message, 1 warp, 1 block (on one SMX) and 8 blocks (on 8 SMXes), respectively. For each experiment, we run the decryption 10K times with random messages, record the execution time of each decryption using the CPU’s clock, and calculate the reduction number for the workload according to the timing model. Our results are shown in Figure 4.1, with the execution time normalized, and the correlation coefficients ($\rho$) shown at the top. The figure shows that we obtain a higher linear correlation between the measured execution time and the calculated reduction number for (a) and (b), when we have one message and one warp running, respectively. The points in (c) and (d) spread out more for multiple warps and multiple SMXes, due to the approximation of the reduction number for one SMX. We also characterize the timing channel for the VLNW implementation, and the correlation results are similar.

4.4 Correlation Timing Attack

In this section, we propose an attack method for the RSA GPU implementation, based on the timing models we built in Section 4.3. We assume the ciphertext message is random, but known to the attacker. For each decryption run, the adversary only records the total execution time. The secret key $d$ is extracted one window after another, starting from the most significant one, following the same order as decryption, as shown in Algorithm 2. For the $i$-th target window, the previous attack results are used to calculate the value of the intermediate variable $M$ at the beginning of the $i$-th iteration. To avoid confusion with $M$ in other steps in the algorithm, we denote it as $M_{temp}$. For the first window, $M_{temp}$ is 1.

[Figure 4.1 plots: normalized running time versus calculated reduction number for (a) the message, $\rho = 0.79643$; (b) the warp, $\rho = 0.66031$; (c) the SMX, $\rho = 0.5467$; (d) the GPU (reduction number in units of $10^4$), $\rho = 0.1828$.]

Figure 4.1: Timing model verification

In the attack, we make a guess of the target window. Based on the guess, we calculate the corresponding number of reductions. We compute the Pearson correlation coefficient between the number of reductions and the observed timing information. To determine the correct value for a nonzero window or length for a zero window, we iterate over all possible guesses for the target window, and pick the one with the largest correlation.
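The guess-selection step of the correlation timing attack can be sketched independently of how the reduction counts are produced. In the minimal example below, the function names and the dictionary-of-guesses layout are assumptions made for illustration; each candidate maps to the reduction counts it predicts for the known ciphertexts, and the candidate whose predictions correlate best with the measured times is chosen.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Returns 0.0 when either sequence has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_guess(predicted_reductions, timings):
    """Return the window guess whose predicted reduction counts have the
    largest correlation with the observed execution times.

    `predicted_reductions` maps each guess (a zero-window length or a
    nonzero-window value) to one predicted count per decryption."""
    return max(
        predicted_reductions,
        key=lambda guess: pearson(predicted_reductions[guess], timings),
    )


if __name__ == "__main__":
    # Synthetic, noise-free timings following T = O + v * R.
    true_counts = [3, 5, 2, 7, 4]
    timings = [100 + 2 * r for r in true_counts]
    guesses = {
        "A": [3, 5, 2, 7, 4],
        "B": [7, 2, 5, 3, 4],
        "C": [1, 1, 1, 1, 1],
    }
    print(best_guess(guesses, timings))
```

With real measurements the correlation for the correct guess is far below 1, which is why many decryptions are needed before the correct candidate separates from the others.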

4.4.1 Attack CLNW

We first describe the details of the attack on CLNW. To calculate the number of reductions for the $i$-th window for one message, we assume there are at least two windows after the target window. Because the window could be zero or nonzero, we divide our guesses into two groups, the lengths of zero windows and the values of nonzero windows, which differs from previous work [64] that only works bit by bit. If the window is zero, its length, $L_x$, varies, and we first tally the $L_x$ squarings of $M_{temp}$. In the sliding window algorithm, a zero window is always followed by a nonzero window. Therefore the next window must be nonzero with length $L$, resulting in another $L$ squarings in the next iteration, without any multiplication in between due to the zero window. From another perspective, the operations can also be considered as $L$ squaring operations, followed by $L_x$ squaring operations.

If the window is a nonzero window, we guess its value as $V_x$ ($V_x$ is odd). The following operations on $M_{temp}$ are $L$ squaring operations and one multiplication by $C^{V_x}$. Beyond the multiplication, we can still infer the operations by checking the next one or two windows. If the next window is nonzero, the operation beyond the multiplication is another $L$ squarings, because the length of the next nonzero window is $L$. If the next window is zero, the window after it is nonzero, and we have more than $L$ squaring operations, consisting of the squarings of the next zero window and the $L$ squarings of the nonzero window that follows it.

In conclusion, the first group of operations on $M_{temp}$ is $L$ squaring operations. After that, if the window is zero, the subsequent operations involve $L_x$ squarings; if the window is nonzero with a value of $V_x$, the subsequent operations are a multiplication by $C^{V_x}$ and $L$ squarings, as shown in Figure 4.2.

[Figure 4.2 diagram: across the $(i-1)$-th, $i$-th and $(i+1)$-th windows, $M_{temp}$ first undergoes $L$ squarings, followed either by $L_x$ squarings (zero-window guess) or by a multiplication by $C^{V_x}$ and $L$ squarings (nonzero-window guess).]

Figure 4.2: Operations on $M_{temp}$ with CLNW

When calculating the number of reductions, we exclude the first L squaring operations because they are common to all guesses. For a zero window, we record the extra reductions of the next Lx squarings in rzero(Lx,i,j) for each message, where i is the window index and j is the message index. For a nonzero window, we record the extra reductions for the next L squaring operations in rnonzero(Vx,i,j), where the value Vx is used to compute the multiplication by $C^{V_x}$. We then calculate the number of reductions for the warps, SMXes, and the entire GPU for the target window, using the timing model described in Section 4.3. Note that in our attack model we attack window by window, and the calculated number of reductions is only for the current window, whereas in Section 4.3.2 the linear correlations are for the sum of the number of reductions across all windows. Our attack therefore relies on the small linear correlation of a single window, which has a lower correlation coefficient than the correlation across all windows shown in Figure 4.1.
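The per-guess reduction counting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: values are assumed to be in Montgomery form, `m_temp` is assumed to be the intermediate value after the L squarings common to all guesses, and all function and parameter names are hypothetical.

```python
def mont_setup(n, r_bits):
    """Return n' = -n^{-1} mod R for R = 2**r_bits (n must be odd, Python 3.8+)."""
    R = 1 << r_bits
    return (-pow(n, -1, R)) % R

def mont_mul(a, b, n, r_bits, n_prime):
    """Montgomery product a*b*R^{-1} mod n, plus 1 if the extra reduction ran."""
    mask = (1 << r_bits) - 1
    t = a * b
    u = (t + ((t * n_prime) & mask) * n) >> r_bits
    if u >= n:
        return u - n, 1      # extra reduction (final subtraction) needed
    return u, 0

def reductions_zero_guess(m_temp, lx, n, r_bits, n_prime):
    """Extra reductions in the Lx squarings hypothesised for a zero window."""
    count = 0
    for _ in range(lx):
        m_temp, extra = mont_mul(m_temp, m_temp, n, r_bits, n_prime)
        count += extra
    return count

def reductions_nonzero_guess(m_temp, c_pow_vx, L, n, r_bits, n_prime):
    """Extra reductions in the L squarings that follow the multiplication by C^Vx.

    Assumption: only the squarings are counted, as in the text; the reduction
    of the multiplication itself is ignored here.
    """
    m_temp, _ = mont_mul(m_temp, c_pow_vx, n, r_bits, n_prime)
    count = 0
    for _ in range(L):
        m_temp, extra = mont_mul(m_temp, m_temp, n, r_bits, n_prime)
        count += extra
    return count
```

Calling these functions once per message and per guess yields the columns that are then correlated against the measured timings.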


4.4.2 Attack VLNW

Attacking VLNW is more challenging than attacking CLNW, because the length of the nonzero window is no longer constant. To accommodate this, we no longer attack zero and nonzero windows separately. Instead, we combine a nonzero window with its preceding zero window as one attack unit. If a nonzero window is preceded by another nonzero window, it forms an attack unit by itself. In the attack, we first target the length of one attack unit, Lx, which equals the length of the nonzero window plus the length of the preceding zero window, if it exists. Then we target the value of the attack unit, Vx, which is also the value of the nonzero window. By knowing the length and value of each attack unit, we can reconstruct the private key. Here we make another observation about the length of an attack unit: it is at least q + 1 bits long. The (q + 1)-bit unit only occurs when q zeros are followed by a single one bit. In all other cases, the attack unit is longer than q + 1 bits. During window partitioning (from right to left), if a NW switches to a ZW, there must have been q zeros, or the NW already has Lmax bits. We assume Lmax > q + 1.

[Figure 4.3 diagram: across the (i−1)th, ith and (i+1)th units, Mtemp undergoes Lx squarings, a multiplication by $C^{V_x}$, and then $L_{q+1}$ squarings.]

Figure 4.3: Operations on Mtemp with VLNW.

To attack the ith unit, we assume the previous i − 1 units are already known, so the intermediate value of Mtemp can be calculated at the beginning of processing the ith unit. We first guess the length of the unit, Lx, by squaring Mtemp Lx times and calculating the number of reductions for each message, as was done for CLNW. The value of Lx is determined by the correlation attack, similar to the zero window attack for CLNW. Next, we guess the value of the unit, Vx, by multiplying $M_{temp}^{L_x}$ (Mtemp after the Lx squarings) by $C^{V_x}$, and then squaring the result q + 1 times and calculating the number of reductions. Because the next attack unit has to be at least q + 1 bits long, we assume the computation would be correct if we have the correct guess of Vx. By knowing Lx and Vx of the unit, we can recover the zero and nonzero windows and the private key.
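The final reconstruction step can be sketched as follows, assuming each recovered attack unit is a pair (Lx, Vx) listed from the most significant unit to the least significant one, and that the exponent does not end in a trailing zero window (which would need separate handling). The function names are hypothetical, and plain modular arithmetic stands in for the Montgomery arithmetic of the real implementation.

```python
def exponent_from_units(units):
    """Concatenate attack units (Lx, Vx), most significant first, into an exponent.

    Each unit contributes Lx bits: Vx written in Lx bits, so any leading zeros
    correspond to the zero window preceding the nonzero window.
    """
    e = 0
    for lx, vx in units:
        if vx % 2 == 0 or vx >= (1 << lx):
            raise ValueError("Vx must be odd and fit in Lx bits")
        e = (e << lx) | vx
    return e

def replay_units(c, units, n):
    """Replay the operations implied by the units: Lx squarings, then multiply by C^Vx."""
    acc = 1
    for lx, vx in units:
        for _ in range(lx):
            acc = acc * acc % n
        acc = acc * pow(c, vx, n) % n
    return acc
```

As a sanity check, replaying the units on a base C gives the same result as a direct modular exponentiation with the reconstructed exponent.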

4.5 Success Rate Analysis

In this section, based on the timing model and the correlation attack method presented in Sections 4.3 and 4.4, respectively, we build a probability model to predict the success rate of the attack. Since the attack methods for CLNW and VLNW are similar, we only present an analysis for CLNW. We target our model at the case of single-warp execution, in which m messages are being encrypted in a single warp. Single-message execution is just a special case of one-warp execution with m = 1. For the multiple-warp and multiple-SMX cases, the prediction will be less accurate, considering the low correlation shown in Figures 4.1 (c) and (d). As a result, we only present the success rate analysis for single-warp execution here, and we only analyze the attack on CLNW described in Section 4.4.1. We rewrite the timing model of Equation (4.4) as

\[ T = C + O + v \cdot R \tag{4.5} \]

where R is the number of reductions for the target window, v is the execution time of one reduction, C is a constant, and O is the noise, which follows a Gaussian distribution $\mathcal{N}(0, \sigma)$. The attack methods for the nonzero and zero windows are different, and we derive the success rates of both. To simplify our analysis, we assume we know whether the window under attack is zero or nonzero. This is a valid assumption, because when the preceding window is zero, the current window under attack is nonzero.
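As a short worked step, the correlation that the attack exploits can be derived from Equation (4.5). The derivation below assumes that the noise O is independent of R, that σ denotes the standard deviation of O, that v > 0, and that $\sigma_R$ (a symbol introduced here for illustration) is the standard deviation of R across messages.

```latex
\begin{aligned}
\operatorname{Cov}(T, R) &= v\,\sigma_R^2, \qquad
\operatorname{Var}(T) = v^2 \sigma_R^2 + \sigma^2, \\
\rho(T, R) &= \frac{v\,\sigma_R^2}{\sigma_R \sqrt{v^2 \sigma_R^2 + \sigma^2}}
            = \frac{1}{\sqrt{1 + \left(\dfrac{\sigma}{v\,\sigma_R}\right)^2}} .
\end{aligned}
```

Under these assumptions, when v/σ is small the correlation is approximately $(v/\sigma)\,\sigma_R$, which is why the ratio v/σ governs how many traces are needed.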

We first consider attacking a nonzero window by guessing the value of Vx. Assuming the correct window value is Vx = h, the actual reduction number calculated from the timing model is R = Rh.

For other incorrect values of Vx, we index hi, with 0 ≤ i


where $\Phi_{N_k-1}$ is the cumulative distribution function of an $(N_k - 1)$-dimensional standard Gaussian distribution; N is the number of traces; v and σ are the same parameters as in Equation (4.5). K is a three-way confusion matrix, whose element at index (i, j) is defined as:

\[ K_{i,j} = E[(R_h - R_i)(R_h - R_j)] \tag{4.7} \]

κ is a confusion vector of size Nk − 1, whose ith element is defined as:

\[ \kappa_i = E[(R_h - R_i)^2] \tag{4.8} \]

To calculate K and κ, we need to know the distribution of R. With a nonzero window length of L, there will be L squarings, each of which may generate one extra reduction. Based on the analysis of the timing model for single-warp execution, one squaring step generates an extra reduction if any of the m messages' square computations has an extra reduction. We assume the message values are randomized, which means the probability of an extra reduction for one message is n/3r, as indicated by Equation (4.2). With n ≈ r, the probability is close to 1/3. If we assume messages are independent, then the probability of an extra reduction in the warp for one squaring is $1 - (1 - n/3r)^m$. There are L squarings, so R follows a binomial distribution with L trials and success probability $p = 1 - (1 - n/3r)^m$:

\[ R \sim B\left(L,\; 1 - \left(1 - \frac{n}{3r}\right)^m\right) \tag{4.9} \]

\[ P(R = k) = \binom{L}{k}\, p^k (1 - p)^{L-k} \tag{4.10} \]
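The two quantities above are easy to evaluate numerically. The following is a minimal sketch; the function names are hypothetical, and the ratio n/r is passed as a single argument `n_over_r` for convenience.

```python
from math import comb

def warp_extra_reduction_prob(n_over_r, m):
    """p = 1 - (1 - n/(3r))^m for m independent messages in one warp."""
    return 1.0 - (1.0 - n_over_r / 3.0) ** m

def reduction_pmf(L, p):
    """P(R = k) for k = 0..L under R ~ B(L, p)."""
    return [comb(L, k) * p ** k * (1 - p) ** (L - k) for k in range(L + 1)]
```

With n ≈ r, a single message (m = 1) gives p = 1/3, while a full warp of 32 messages gives p very close to 1, so almost every squaring step of a full warp incurs an extra reduction.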

For different guesses of Vx, the intermediate results are random. We assume Rh, Ri and Rj are independent. From probability theory,

\[ E[(R_h - R_i)(R_h - R_j)] = 0, \quad i \neq j \tag{4.11} \]

\[ E[(R_h - R_i)^2] = \sum_{s=0}^{k} \sum_{t=0}^{k} P(R_h = s)\, P(R_i = t)\, (s - t)^2 \tag{4.12} \]

As a result, K is the diagonal matrix $E[(R_h - R_i)^2] \times I$, and κ has Nk − 1 identical elements of $E[(R_h - R_i)^2]$.

From our experiments, we have found that v/σ = 0.061, 0.043, and 0.025 for RSA-512, RSA-1024 and RSA-2048, respectively. The success rates of attacking a nonzero window for RSA-512, RSA-1024 and RSA-2048 when running one warp are shown in Figure 4.4(a). Our experimental results show that, as the key length increases, the success rate decreases. This trend arises because, as the key size increases, there are more windows. Our attack targets one window at a time while using the total execution time, so the signal-to-noise ratio drops with larger key sizes. The predicted success rates track the empirical success rates well, with a slight overestimation of the effectiveness of the attacks.
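The expectation in Equation (4.12) can be evaluated directly. The sketch below assumes Rh and Ri are independent and both follow B(L, p), and it sums over all values that R can take (0 to L); the function names are hypothetical. Under these assumptions the double sum has the closed form 2Lp(1 − p), which gives a convenient cross-check.

```python
from math import comb

def binom_pmf(L, p):
    """P(R = k) for k = 0..L under R ~ B(L, p)."""
    return [comb(L, k) * p ** k * (1 - p) ** (L - k) for k in range(L + 1)]

def expected_sq_diff(L, p):
    """E[(Rh - Ri)^2] for independent Rh, Ri ~ B(L, p), by the double sum."""
    pmf = binom_pmf(L, p)
    return sum(pmf[s] * pmf[t] * (s - t) ** 2
               for s in range(L + 1) for t in range(L + 1))
```

For instance, with L = 5 and p = 1/3, both the double sum and 2Lp(1 − p) evaluate to 20/9.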

When attacking a zero window, the window length, Lx, can vary from 1 to the key length. The window length Lx is a random variable following a geometric distribution with parameter p = 0.5:

\[ P(L_x = k) = 0.5^{k} \tag{4.13} \]

Because the probability of a zero window longer than 32 is close to zero, we only make guesses between 1 and 32 during the attack. We assume the correct window length is Lh, with a value between 1 and 32, and the corresponding number of reductions is Rh. We index the other 31 guesses as Li, 0 ≤ i ≤ 30, and the corresponding number of reductions is Ri.

There is a difference from attacking the nonzero window. When a wrong guess of Vx is made, the calculated number of reductions is independent of the correct one. However, with a wrong guess of Lx, the calculated number of reductions is still partially correct: the computation is correct for the first min(Li, Lh) squarings, and only the remaining |Lh − Li| squarings contribute to the difference. As a result, the magnitude of Rh − Ri follows a binomial distribution with |Lh − Li| trials and success probability $p = 1 - (1 - n/3r)^m$. The success rate is dependent on Lh. The final success rate of attacking zero windows is a weighted average over the different window lengths:

\[ SR = \sum_{L_h=1}^{32} P(L_x = L_h)\, SR_{L_h} \tag{4.14} \]

Plugging the distribution of Rh − Ri into Equations (4.7) and (4.8), Ki,j and κi can be calculated for window length Lh, and the success rate $SR_{L_h}$ then follows from Equation (4.6). Note that the values of K and κ depend on the correct window length Lh. The success rates for attacking zero windows are plotted in Figure 4.4(b). The results show that the success rate of attacking zero windows is much lower than that of nonzero windows with the same number of traces, indicating that our ability to distinguish zero windows is much lower than that for nonzero windows. In the attack on a nonzero window, a wrong guess of the window value yields a totally incorrect number of reductions. For a zero window, if the guessed window length is close to the correct window length, the calculated number of reductions is also very close. As a result, it is much harder to differentiate the correct window length from those close to it.
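The weighted average over window lengths can be sketched as follows. The per-length success rates are assumed to be given (for example, computed from Equation (4.6)); the function name, the dictionary layout, and the geometric weights 0.5^Lh truncated at 32 are assumptions of this illustration.

```python
def zero_window_success_rate(sr_by_length, max_len=32):
    """Weighted average of SR_Lh with geometric weights P(Lx = Lh) = 0.5**Lh.

    sr_by_length: dict mapping each window length 1..max_len to its
                  success rate (hypothetical layout).
    """
    return sum(0.5 ** lh * sr_by_length[lh] for lh in range(1, max_len + 1))
```

Because short windows carry most of the probability mass, the overall success rate is dominated by how well lengths 1 to 3 can be distinguished.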

[Figure 4.4 plots: (a) Nonzero Window Attack; (b) Zero Window Attack. x-axis: Trace Number; y-axis: Success Rate; curves: RSA-512, RSA-1024 and RSA-2048, each empirical and theoretical.]

Figure 4.4: Theoretical and empirical success rates

4.6 Error Correction

The inherent drawback of the proposed attack, due to the iterative processing shown in Section 4.4, is that the attack on the target window is highly dependent on the attack results for the previous windows. If any of the previous guesses is wrong, the current attack may not succeed, because the calculated Mtemp and reduction numbers will be wrong. As a result, the correlation coefficients after the first error will be much lower than those before, as can be observed in Figure 4.5. Each point in Figure 4.5 is the maximum correlation coefficient of one window attack when we iterate over all possible values. This non-robust feature can be utilized to detect errors if there exist N consecutive window attacks with low correlation coefficients. In our attack, the length of a zero window is much harder to recover. Because the number of extra reductions is roughly linear in the bit length of the zero window, the ability to differentiate provided by the correlation attack is fairly small. For example, when guessing the length, the correlations of the wrong length guesses are very close to that of the correct one, as shown in Figure 4.6 (a) for the case of attacking a zero window of length 4. If a wrong guess ends up with the highest correlation, the error can only be detected in later windows. We define a parameter, distinguishability D, to indicate how likely an error is to happen. The value of D for one window is defined as the normalized difference between the largest correlation and the second largest correlation value. The lower D is, the more likely an error will occur. Based on these observations, we propose two error correction methods, a forward method and a backward method.
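The distinguishability metric can be sketched as follows. The text does not spell out the normalization, so dividing by the largest correlation is an assumption of this example, as is the function name.

```python
def distinguishability(correlations):
    """Normalized gap between the largest and second largest correlations.

    Assumption: the gap is normalized by the largest correlation; a
    non-positive maximum is treated as indistinguishable (D = 0).
    """
    top, second = sorted(correlations, reverse=True)[:2]
    if top <= 0:
        return 0.0
    return (top - second) / top
```

For example, correlations of 0.02, 0.05 and 0.04 give D = 0.2: the best guess leads the runner-up by 20% of its own value.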

[Figure 4.5 plot: correlation coefficient (y-axis, 0 to 0.04) versus window index (x-axis, 0 to 30), with the erroneous window attack marked.]

Figure 4.5: Sequence of correlation coefficients of a timing attack when an error happens

Forward error correction is triggered when the distinguishability D of a target window is below a threshold Dth. We then need to reconsider the guesses that did not produce the highest correlation. We proceed with tentative attacks on the next window multiple times, setting the target window's value to each of the other possible values, one after another. For each of those tentative attacks on the next window, we obtain a maximum correlation coefficient. To pick the right value for the target window, we choose the one resulting in the largest maximum correlation in the next window.

Backward error correction is triggered when there are N consecutive window attacks with correlation coefficients below Cth. The attack rolls back to the first window of the N consecutive windows. This time, we choose the value for this window with the next highest correlation, and then proceed. If this still results in an error and the window is attacked again, the next ranked guess is chosen. If all possible values for the window have been attempted and there is still an error, we move further backward to the previous window and reconsider other guesses.
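The two triggers can be sketched as follows. This is an illustrative outline only: `next_window_max_corr` is a hypothetical callable that runs a tentative attack on the next window for a given candidate value and returns its maximum correlation, and the full rollback bookkeeping of the backward method is omitted.

```python
def forward_pick(candidates, next_window_max_corr):
    """Forward correction: keep the candidate that makes the next window attack strongest."""
    return max(candidates, key=next_window_max_corr)

def backward_trigger(max_corrs, n_consecutive, c_th):
    """Return the index of the first window in the earliest run of
    n_consecutive windows whose maximum correlation is below c_th, or None."""
    run_start, run_len = None, 0
    for idx, corr in enumerate(max_corrs):
        if corr < c_th:
            if run_len == 0:
                run_start = idx
            run_len += 1
            if run_len == n_consecutive:
                return run_start   # roll back to this window
        else:
            run_len = 0
    return None
```

For example, with Cth = 0.01 and N = 3, the sequence 0.03, 0.03, 0.005, 0.004, 0.006, 0.03 triggers a rollback to window index 2.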

Parameter selection: In our error detection method, there are three parameters, Dth, N and Cth. Their values affect the effectiveness and convergence speed of the attack. If Dth is set too high, forward error correction will be launched too often (false positives) and the attack speed will be impacted. If it is too low, the chance of overlooking an error is high (false negatives); such an error is then only detected by backward error correction, which is much slower than the forward one. So a suitable Dth should balance the speed of the attack and the error detection rate. Similarly, both the error detection rate and the attack speed are affected by the value of N. The choice of Cth has a similar impact as N, but in the opposite direction: choosing a larger Cth is equivalent to choosing a smaller N. In our experiments, these values are chosen to maximize the success rate, with the attack speed in an acceptable range.

[Figure 4.6 plots: (a) correlation coefficient versus guessed length of zero window; (b) correlation coefficient versus guessed value of nonzero window.]

Figure 4.6: Correlation coefficients of attacking zero and nonzero windows.

4.7 Experimental Results

We implement the RSA cipher on an Nvidia K40 GPU, which has 15 SMXes and 192 CUDA cores on each SMX, hosted on a workstation running Ubuntu 14.04.05 LTS with an Intel E5-1603 processor. We record the timing information using the CPU clock, which runs at 2.8 GHz, instead of the more accurate GPU clock, because the latter is normally not accessible to the attacker. We run experiments with key lengths of 512, 1024 and 2048 bits. The number of threads launched in one decryption varies from 32 to 1024. For each experiment, 100K decryption runs are carried out with random input messages. The error correction parameters Dth, N and Cth are set to 0.2, 8 and 0.01, respectively. The time used for one attack ranges from around 15 minutes for RSA-512 with 32 threads to 8 hours for RSA-2048 with 1024 threads. The larger the key size and the number of threads, the longer the attack takes, because of the higher computational complexity and the larger number of error corrections needed. Table 4.1 shows the number of errors corrected by forward and backward error correction, and how many windows (least significant/rightmost) are incorrect after the attack. In the results, we see that the number of errors increases as the number of threads and the key size increase. Forward error correction works best with fewer threads and a smaller key size. Backward error correction works best with more threads and a larger key size. The errors left in our experiments are very few (0-3 windows, 18 bits at most), and the full key can be easily recovered using a lattice attack [79] or a brute-force attack.

Table 4.1: Attack result with error correction.

Threads#     32            512           1024
             FW  BW  E     FW  BW  E     FW  BW  E
RSA-512       0   0  1      4   2  3      3   2  1
RSA-1024      7   2  2     10   5  1      5  19  3
RSA-2048     12   1  0      7  16  0     24  13  0

FW/BW: forward/backward error corrections; E: errors left

4.8 Countermeasures

Several countermeasures have been presented that protect a CPU from side-channel timing attacks while running RSA. They attempt to eliminate data-dependent variations associated with the reductions [80, 81], or mask the messages and exponent with random numbers [63]. Such countermeasures are general, and are also applicable to GPU implementations. However, they introduce a performance degradation, which is a major concern for GPU acceleration. We evaluate an always-reduce countermeasure on the GPU, which increases the execution time by 4.5%. We also compare the Pearson correlation coefficient between the number of reductions and the execution time for one message, and show the result in Figure 4.7. The correlation is significantly reduced from 0.79, as shown in Figure 4.1 (a), to 0.06, rendering the timing leakage very small and hard to exploit.

Special countermeasures targeting GPU implementations can also be devised. For example, we can randomize the assignment of messages to threads and blocks such that the execution time cannot be correctly predicted by the attacker. With a randomized assignment countermeasure, the correlation of one block reduces from 0.54, as shown in Figure 4.1 (c), to -0.0073, as shown in Figure 4.8. We can also add more noise to the timing measurements with dummy random messages, at the cost of performance and throughput. If we insert one dummy random message for every 3 messages, the correlation between the number of reductions and the execution time of one warp also reduces from 0.66 to 0.12.
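The idea behind the always-reduce countermeasure can be sketched as follows. This is an illustration of the logic only, not the evaluated CUDA code: Python integers are not constant-time, and the function and parameter names are hypothetical. The final subtraction is computed for every input, and the result is chosen with an arithmetic select rather than a data-dependent branch.

```python
def mont_mul_always_reduce(a, b, n, r_bits, n_prime):
    """Montgomery product a*b*R^{-1} mod n with an unconditional final subtraction.

    n_prime is -n^{-1} mod R for R = 2**r_bits (hypothetical parameter names).
    """
    mask = (1 << r_bits) - 1
    t = a * b
    u = (t + ((t * n_prime) & mask) * n) >> r_bits
    reduced = u - n                  # always computed, whether or not it is needed
    needed = int(u >= n)             # 1 if the reduced value is the correct result
    return needed * reduced + (1 - needed) * u   # arithmetic select, no branch
```

The result is identical to an ordinary Montgomery multiplication; only the sequence of executed operations no longer depends on whether the reduction was required.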

[Figure 4.7 plots: normalized running time versus reduction number of the message; (a) correlation = 0.79, (b) correlation = 0.06.]

Figure 4.7: Always reduce countermeasure

[Figure 4.8 plots: normalized running time versus reduction number of the SMX; (a) correlation = 0.5467, (b) correlation = -0.0072622.]

Figure 4.8: Random assignment countermeasure

4.9 Summary

In this work, we evaluate the vulnerability of an RSA GPU implementation to side-channel timing analysis. We build timing models for a sliding-window RSA implementation with Montgomery multiplications. We consider a parallel implementation run on an NVIDIA GPU and report on the execution characteristics, capturing the impact of complex interactions among threads on the execution time. The attack methods are designed based on the timing model. We obtain attack results from NVIDIA K40 hardware for different key sizes and across different levels of parallelism. Our results show that an RSA implementation on a GPU is vulnerable to side-channel timing attacks, calling for both general and GPU-specific countermeasures.

Chapter 5

Side-channel Analysis of ECC on Embedded Systems

5.1 Introduction and Motivation

Elliptic Curve Cryptography (ECC), initially proposed by Koblitz [82] and Miller [83], is an efficient public-key cipher. Compared with other popular public-key ciphers (e.g., RSA), ECC features a shorter key length for the same level of security. For example, a 256-bit ECC cipher provides 128-bit security, equivalent to a 2048-bit RSA cipher [84]. The proliferation of embedded devices in Internet-of-Things (IoT) platforms requires efficient and low-power secure communications between edge devices and gateways/clouds. ECC has been widely adopted in IoT systems for authentication of communications, while RSA, which is much more costly to compute, remains the standard for desktops and servers.

Various side-channel power analysis attacks on ECC on embedded systems have been presented in recent years. Standard implementations of ECC are known to be vulnerable to both simple power analysis (SPA) and differential power analysis (DPA) [85]. SPA can be applied where the execution is unbalanced for key bit values 0 and 1. For balanced implementations, DPA can exploit data-dependent power consumption. State-of-the-art implementations adopt different algorithmic measures to defend against SPA and DPA, including double-and-always-add, a Montgomery ladder, regularized scalar multiplication algorithms [86, 87], and randomized scalar multiplication [85].

Template attacks are a more powerful class of side-channel attacks compared to SPA or DPA. A template attack requires the attacker to have full control of the target device to build power consumption templates. The first template attack on ECC [88] extracts several bits of the scalar, with the remaining bits recovered by a lattice attack [89]. The Online Template Attack [90] builds the template traces adaptively after a single target trace is obtained. This method was used on a real-world ECC implementation, mbedTLS [91]. Nascimento et al. [20] proposed another real-world ECC template attack targeting the conditional move of the μNaCl library. However, template attacks can be defeated by scalar and input point randomization. In addition, assuming a fully controllable device may not be practical.

To defeat DPA countermeasures that utilize scalar and input point randomization, horizontal and collision attacks [92, 93, 94, 95, 96, 97] can be developed that use a single trace to reveal the private key, where segments of the trace are compared to extract the scalar bits. The basic assumption is that two operations (usually multiplications) at different times should produce similar power profile segments if they use the same operands (i.e., a data collision). Many of the attacks have only been demonstrated in controlled environments or on simulator traces [93, 95], while others [94, 96, 97] have been demonstrated on specific ECC implementations. To the best of our knowledge, none of them has been applied to real-world ECC libraries.

Here I propose practical side-channel power attacks targeting a real-world ECC library, micro-ecc [2]. Micro-ecc is a compact and efficient ECDH and ECDSA library implemented in C. It supports the standard NIST curves secp160r1, secp192r1, secp224r1, secp256k1 and secp256r1. It is designed with considerations for resource-constrained devices: a small code footprint and no requirement for dynamic memory allocation. Different ECC libraries were compared in terms of runtime overhead, firmware size and energy consumption [98], and the results show that micro-ecc is well suited for embedded systems. It also serves as the basis for many more efficient ECC implementations on specific embedded systems [99, 100, 101]. It also claims to be resistant to known side-channel attacks. As a result, it is widely used on IoT systems [102, 103, 104, 105, 106], including Intel's IoT cryptographic library TinyCrypt [107].

However, after a close and careful inspection of the implementation, we have discovered a number of weaknesses which can lead to practical attacks that recover the full private key. First, micro-ecc is still vulnerable to simple power analysis because of the overflow of large numbers during modular additions and subtractions. Second, the underlying ECC algorithm presents a collision attack vulnerability which leaks scalar bit values directly. In this chapter, I describe how these two forms of leakage can be exploited to launch practical attacks on both AVR and ARM embedded systems. We also propose countermeasures to patch the library to prevent these attacks.


5.2 Preliminary

We begin with a brief introduction to elliptic curve cryptography, the algorithm used by micro-ecc, followed by several countermeasures it incorporates.

5.2.1 ECC Background

An elliptic curve defined over a finite prime field, Fp, for p > 3 can be described by the reduced Weierstraß equation:

\[ y^2 = x^3 + ax + b \tag{5.1} \]

All the points $(x, y) \in F_p^2$ satisfying Equation (5.1), together with the point at infinity $O$, form an additive Abelian group, denoted as E(Fp). The point addition, R = P + Q, is defined as:

\[ x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda(x_1 - x_3) - y_1 \tag{5.2} \]

where R = (x3, y3), P = (x1, y1), Q = (x2, y2), and $\lambda = (y_1 - y_2)/(x_1 - x_2)$ if P ≠ Q, otherwise $\lambda = (3x_1^2 + a)/(2y_1)$.
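The addition law above can be sketched in a few lines. This is a minimal affine implementation for illustration, with a hypothetical function name; the point at infinity is represented by `None`, and the small curve y² = x³ + 2x + 2 over F17 used in the usage note is a common textbook example, not one of the curves supported by micro-ecc.

```python
def ec_add(P, Q, a, p):
    """Affine point addition on y^2 = x^3 + a*x + b over F_p (Python 3.8+).

    None represents the point at infinity.
    """
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # doubling
    else:
        lam = (y1 - y2) * pow((x1 - x2) % p, -1, p) % p          # addition
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```

On the toy curve with a = 2 and p = 17, doubling G = (5, 1) gives (6, 3), and adding G to that result gives (10, 6).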

Because the inversion of a number over Fp is an expensive computation, the Jacobian projective coordinate system [108] is usually used, with an additional Z coordinate. A point in Jacobian coordinates, (X, Y, Z), is equivalent to $(X/Z^2, Y/Z^3)$ in affine coordinates (the original representation with two coordinates).

5.2.1.1 Elliptic Curve Algorithm

The Elliptic Curve Digital Signature Algorithm (ECDSA) generates a signature with a private key for a message digest. The generated signature can be verified using the corresponding public key to authenticate that it was indeed generated by the owner of the private key. With the public curve parameters G and n, which are the base point and the group order, respectively, the signing begins by generating a random integer scalar k as the ephemeral key. The scalar multiplication is carried out to generate a second point on the curve, P = k × G = (xp, yp), where xp must be nonzero. The signature consists of two parts. The first part is r = xp mod n, and the second part is $s = k^{-1} \times (z + r \times d) \bmod n$, where z is the message digest and d is the private key. Although the ephemeral key is randomly generated, it must still be kept private. The goal of a power analysis attack on ECC is to retrieve the ephemeral key, k, after which the private key can be calculated as:

\[ d = (s \times k - z) \times r^{-1} \bmod n \tag{5.3} \]
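Equation (5.3) translates directly into code. The sketch below uses a hypothetical function name and toy numbers chosen only to keep the arithmetic checkable by hand; in a real attack, r, s and z come from a signature and k from the side-channel analysis.

```python
def recover_private_key(r, s, z, k, n):
    """d = (s*k - z) * r^{-1} mod n, following Equation (5.3) (Python 3.8+)."""
    return (s * k - z) * pow(r, -1, n) % n

# Toy consistency check with a small prime group order (illustrative values only).
n, d, k, z, r = 19, 7, 10, 5, 3
s = pow(k, -1, n) * (z + r * d) % n      # signing equation for the second part
recovered = recover_private_key(r, s, z, k, n)
```

With these toy values, s evaluates to 14 and the recovered key equals the original d = 7.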


5.2.1.2 The Scalar Multiplication Algorithm of micro-ecc

Scalar multiplication is the major operation carried out during ECDSA, and it dominates the overall computation time. In micro-ecc, scalar multiplication is implemented using a Montgomery ladder with co-Z addition formulae [109, 110], as shown in Algorithm 3, and can defend against an SPA attack that targets unbalanced branching. In the algorithm, the input point P and output point Q are in affine coordinates, while the intermediate points R0 and R1 are in Jacobian coordinates with the same Z coordinate value. To reduce the memory footprint, the Z coordinate is not stored during the computation, so Rb is expressed as (Xb, Yb) (b = 0 or 1).

Algorithm 3 Montgomery ladder with (X,Y)-only co-Z addition

Input: P ∈ E(Fp), k = (k_{l−1}, ..., k_0) with k_{l−1} = 1
Output: Q = kP

1: (R1, R0) ← XYCZ-IDBL(P)
2: for i = l − 2 downto 1 do
3:     b ← k_i
4:     (R_b, R_{1−b}) ← XYCZ-ADDC(R_b, R_{1−b})
5:     (R_{1−b}, R_b) ← XYCZ-ADD(R_{1−b}, R_b)
6: end for
7: b ← k_0
8: (R_b, R_{1−b}) ← XYCZ-ADDC(R_b, R_{1−b})
9: λ ← FinalInvZ(R0, R1, P, b)
10: (R_{1−b}, R_b) ← XYCZ-ADD(R_{1−b}, R_b)
11: return (X0 λ^2, Y0 λ^3)
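The register bookkeeping of the ladder can be checked on a toy group. The sketch below replaces the elliptic curve by the integers modulo n under addition, so that "kP" is simply k·P mod n; the two stand-in functions only mimic what XYCZ-ADDC and XYCZ-ADD return (difference and sum, and unchanged first operand and sum), and the final Z inversion is omitted. All names are hypothetical, and the last bit is processed in the same loop as the others.

```python
def addc_stand_in(a, b, n):
    """Stand-in for XYCZ-ADDC on the toy group: returns (a - b, a + b)."""
    return (a - b) % n, (a + b) % n

def add_stand_in(a, b, n):
    """Stand-in for XYCZ-ADD on the toy group: returns (a, a + b)."""
    return a, (a + b) % n

def ladder(k, P, n):
    """Montgomery ladder following the assignment pattern of Algorithm 3."""
    if k < 2:
        raise ValueError("the scalar needs at least two bits")
    bits = bin(k)[2:]                  # bits[0] is the leading 1
    R = [P % n, (2 * P) % n]           # R[0] = P, R[1] = 2P after the initial doubling
    for ch in bits[1:]:
        b = int(ch)
        R[b], R[1 - b] = addc_stand_in(R[b], R[1 - b], n)
        R[1 - b], R[b] = add_stand_in(R[1 - b], R[b], n)
    return R[0]
```

Both bit values execute the same two operations with swapped arguments, and after every iteration R[1] − R[0] equals P, which is the invariant the ladder relies on.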

The function XYCZ-IDBL(P) computes the initial values of R1 and R0, which are equal to 2P and P in Jacobian coordinates with the same Z coordinate. Lines 2-6 process the scalar bit by bit, iteratively, for the intermediate l − 2 bits, as the highest bit k_{l−1} is always 1. Lines 7-10 process the LSB of the scalar, k_0. The X and Y coordinates of R0, X0 and Y0, are used to compute the final output as $(X_0\lambda^2, Y_0\lambda^3)$.

Each scalar bit processing involves two operations, 1) XYCZ-ADDC and 2) XYCZ-ADD, applied to the two points in different orders. Both include point additions. For (A′, B′) = XYCZ-ADDC(A, B), A′ = A − B and B′ = A + B, and the two output points have a new Z coordinate. For (A′, B′) = XYCZ-ADD(A, B), A′ = A and B′ = A + B, with a new Z coordinate. The computation steps of these two operations are important for our later analysis, and we present the details in Algorithms 4 and 5, respectively.

Algorithm 4 XYCZ-ADDC: (X,Y)-only co-Z conjugate addition

Input: R0(X0, Y0) and R1(X1, Y1) s.t. P = (X0, Y0, Z) and Q = (X1, Y1, Z) for some Z ∈ Fp, P, Q ∈ E(Fp)
Output: R0′(X0′, Y0′) and R1′(X1′, Y1′) s.t. P − Q = (X0′, Y0′, Z′) and P + Q = (X1′, Y1′, Z′) for some Z′ ∈ Fp, P, Q ∈ E(Fp)

1: A ← (X1 − X0)^2
2: B ← X0 A
3: C ← X1 A
4: D ← (Y1 − Y0)^2; F ← (Y1 + Y0)^2
5: E ← Y0 (C − B)
6: X1′ ← D − (B + C)
7: Y1′ ← (Y1 − Y0)(B − X1′) − E
8: X0′ ← F − (B + C)
9: Y0′ ← (Y0 + Y1)(X0′ − B) − E
10: return (X0′, Y0′), (X1′, Y1′)
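The steps of Algorithm 4, and of the closely related XYCZ-ADD in Algorithm 5, can be sketched as follows. This is an illustrative transcription, not micro-ecc's code. One fact used in the usage note is not stated in the text and is an assumption taken from the co-Z literature: the new common coordinate is Z′ = Z·(X1 − X0). The toy curve y² = x³ + 2x + 2 over F17 is used only for checking.

```python
def xycz_add(P0, P1, p):
    """(X,Y)-only co-Z addition: returns (P, P + Q) under the new common Z'."""
    (x0, y0), (x1, y1) = P0, P1
    A = (x1 - x0) ** 2 % p
    B = x0 * A % p
    C = x1 * A % p
    D = (y1 - y0) ** 2 % p
    E = y0 * (C - B) % p
    x1n = (D - (B + C)) % p
    y1n = ((y1 - y0) * (B - x1n) - E) % p
    return (B, E), (x1n, y1n)          # P is re-expressed as (B, E)

def xycz_addc(P0, P1, p):
    """(X,Y)-only co-Z conjugate addition: returns (P - Q, P + Q)."""
    (x0, y0), (x1, y1) = P0, P1
    A = (x1 - x0) ** 2 % p
    B = x0 * A % p
    C = x1 * A % p
    D = (y1 - y0) ** 2 % p
    F = (y1 + y0) ** 2 % p
    E = y0 * (C - B) % p
    x1n = (D - (B + C)) % p
    y1n = ((y1 - y0) * (B - x1n) - E) % p
    x0n = (F - (B + C)) % p
    y0n = ((y0 + y1) * (x0n - B) - E) % p
    return (x0n, y0n), (x1n, y1n)
```

On the toy curve with P = (6, 3), Q = (10, 6) and Z = 1, `xycz_add` returns (11, 5) and (8, 4); dividing by Z′² = 16 and Z′³ = 64 modulo 17 (with Z′ = 4) maps these back to P = (6, 3) and P + Q = (9, 16).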

5.2.2 Side-Channel Countermeasures of micro-ecc

Micro-ecc is claimed to be resilient to known side-channel attacks, and it includes several countermeasures against SPA that targets unbalanced branching and against DPA.

Montgomery Ladder. The Montgomery ladder presented in Algorithm 3 regularizes the scalar multiplication over each bit of the ephemeral key. In the original simple double-and-add method, the processing of each bit starts with a doubling operation, followed by a conditional addition depending on the key bit value. The unbalanced branch reveals the key bit directly in simple power analysis. In the Montgomery ladder implementation, by contrast, the operations for scalar bit values 0 and 1 are the same, with only the input and output arguments switched, as shown in Lines 4-5 of Algorithm 3. This feature removes the simple power leakage.

Scalar Regularization. In the scalar multiplication of ECC, the scalar could be the private key or an ephemeral key generated randomly on the fly, and may have several leading zeros. The normal computation starts with the most significant nonzero bit of the scalar, with the leading zeros skipped.

Algorithm 5 XYCZ-ADD: (X,Y)-only co-Z addition with update

Input: R0(X0, Y0) and R1(X1, Y1) s.t. P = (X0, Y0, Z) and Q = (X1, Y1, Z) for some Z ∈ Fp, P, Q ∈ E(Fp)
Output: R0′(X0′, Y0′) and R1′(X1′, Y1′) s.t. P = (X0′, Y0′, Z′) and P + Q = (X1′, Y1′, Z′) for some Z′ ∈ Fp, P, Q ∈ E(Fp)

1: A ← (X1 − X0)^2
2: B ← X0 A
3: C ← X1 A
4: D ← (Y1 − Y0)^2
5: E ← Y0 (C − B)
6: X1′ ← D − (B + C)
7: Y1′ ← (Y1 − Y0)(B − X1′) − E
8: X0′ ← B
9: Y0′ ← E
10: return (X0′, Y0′), (X1′, Y1′)

As a result, the number of leading zeros could be detected by an attacker from the length of the power trace. This leakage may not necessarily lead to the full recovery of the scalar, but it reduces the security strength of ECC. In micro-ecc, the scalar is regularized so that the resulting scalar is always one bit longer than the modulus n. Since the regularized scalar is equivalent to the original one for numerical computation, leaking this value also leads to full private key recovery.

Input Point Randomization. To prevent differential power analysis, the input point for ECDH (Elliptic Curve Diffie-Hellman) scalar multiplication is obfuscated with a random initial Z in Jacobian coordinates. When Z is set to a random number, the X coordinate is multiplied by $Z^2$ and the Y coordinate by $Z^3$, so that the intermediate variables in the later computation are also randomized. As a result, the attacker knows neither the input nor the intermediate variables in Jacobian coordinates, even if he/she knows the input point in affine coordinates. Across multiple runs of ECDH, Z will differ, which effectively defeats differential power analysis. ECDSA is already inherently resilient against DPA, since the scalar is an ephemeral key generated randomly every time. For power savings and performance, the micro-ecc library chooses not to randomize the input point for ECDSA computation.
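The input point randomization described above can be sketched as follows. This is an illustration of the coordinate transformation only, with hypothetical function names; it is not micro-ecc's code, and the toy modulus 17 in the usage note is chosen only for readability.

```python
import secrets

def randomize_point(x, y, p):
    """Lift an affine point to Jacobian coordinates with a random nonzero Z."""
    z = secrets.randbelow(p - 1) + 1                 # uniform in [1, p - 1]
    return (x * z * z % p, y * pow(z, 3, p) % p, z)

def to_affine(X, Y, Z, p):
    """Map Jacobian (X, Y, Z) back to affine (X/Z^2, Y/Z^3) (Python 3.8+)."""
    z_inv = pow(Z, -1, p)
    return (X * z_inv * z_inv % p, Y * pow(z_inv, 3, p) % p)
```

Every call to `randomize_point(5, 1, 17)` may return different X and Y values, yet `to_affine` always maps them back to (5, 1), which is why the intermediate values change from run to run while the result does not.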


5.3 Novel Simple Power Analysis

Even though the Montgomery ladder multiplication algorithm prevents the scalar bit values from leaking directly during a simple power analysis, a careless design of the modular addition and subtraction yields another leakage. In this section, we exploit this weakness and devise a new attack method that can retrieve the private key from a single power or EM trace.

Leakage analysis. Our first observation is that the ECDSA input point is not randomized. Its initial Z coordinate is always set to 1 for efficiency. This means that in scalar multiplication, all three coordinates (X, Y, Z) of the input point are known. By hypothesizing the scalar bits one by one, starting from the most significant one, the power leakage can be predicted, and statistical analysis can be conducted to recover each scalar bit. The second observation is that the modular additions and subtractions involved in Algorithms 4 and 5 are not protected. They are implemented as ordinary large-number additions (subtractions), with a conditional subtraction (addition) if an overflow (underflow) occurs. Because addition and subtraction are similar operations and take almost the same amount of time to execute, we refer to both of them as addition in the rest of this chapter. Assuming the input operands of a modular addition are random and uniformly distributed between 0 and q, the probability that the ordinary addition overflows is 0.5. An overflowing modular addition takes about twice as long as one without overflow. Most importantly, the occurrence of these extra additions depends on the scalar's value.

Leakage observation. We run the latest micro-ecc on an Arduino Uno board and an Arduino Zero board, which feature ATmega328p and ARM Cortex M0+ micro-controllers, respectively. The ATmega328p is an 8-bit processor running at 16MHz, while the ARM Cortex M0+ is a 32-bit processor running at 48MHz. We collect power traces from the ATmega328p and EM traces from the ARM Cortex M0+, using a LeCroy WaveRunner 640zi oscilloscope. We collected different kinds of traces for the two boards because of the board hookups available for power measurement. The traces are collected while running secp160r1; the other elliptic curves implemented by micro-ecc will give similar results, as the underlying implementations are identical. Figure 5.1 shows a sample power trace of the Arduino, where by visual inspection we can identify the modular multiplications and additions, since a modular multiplication takes much longer than a modular addition and presents a different power profile. This simple power leakage helps us partition the trace into segments belonging to the different bit processing steps of the scalar. From the code, we learn that there are 14 modular multiplications for each bit, except for the first and last bit. Between every two modular multiplications, the number of modular additions varies between 0 and 3.
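To make the source of the leakage concrete, the following is a minimal Python sketch of a modular addition with a conditional correction step, together with a quick empirical check of the overflow probability. The function names are hypothetical and Python integers stand in for the multi-word arithmetic of a real library, so this illustrates the behavior described above rather than the micro-ecc code itself.

```python
import random


def mod_add(a, b, q):
    """Return ((a + b) mod q, overflowed) for operands 0 <= a, b < q."""
    s = a + b                   # ordinary large-number addition
    if s >= q:                  # the sum passed the modulus
        return s - q, True      # extra correction step, visible in the trace
    return s, False


def overflow_rate(q, trials=20000, seed=1):
    """Estimate how often the correction step runs for uniform operands."""
    rng = random.Random(seed)
    hits = sum(
        mod_add(rng.randrange(q), rng.randrange(q), q)[1] for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    q = 2**160 - 2**31 - 1      # field prime of secp160r1
    print(mod_add(5, 9, 11))    # small worked example
    print(overflow_rate(q))     # close to 0.5 for uniform operands
```

In the attack, the operands are not random from the attacker's point of view: they are determined by the known input point and the scalar bits processed so far, which is why the pattern of correction steps depends on the scalar.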


[Figure 5.1 plot: (a) power trace of the ATmega328P; (b) EM trace of the ARM Cortex M0+. Both panels show normalized voltage versus sampling points, with the multiplication and addition segments labeled.]

Figure 5.1: Modular multiplication and addition

The most useful information observed from the trace is the number of extra overflow additions performed for modular addition, as shown in Figure 5.2 (a). The power trace of the ATmega328p shows a total of 5 additions between the two modular multiplications. This is surprising, since the source code indicates only three modular additions algorithmically. The difference between the observed implementation and the general algorithm in the number of operations is due to overflows: here there are 2 of them. Note that we cannot tell which modular addition incurs an extra operation. We can make a similar observation from the EM trace of the ARM Cortex M0+. The modular multiplications and additions can easily be differentiated, as shown in Figure 5.1 (b). To get the number of extra addition operations, we can directly count the number of EM peaks, as shown in Figure 5.2 (b).

Automatic Leakage Identification. Although the leakage can be identified in the power trace by manually and visually counting the modular additions between every two modular multiplications, doing so is laborious and error prone. Here we introduce a pattern recognition algorithm that identifies the leakage automatically and reliably. The algorithm first identifies the end position of each modular multiplication in the trace, which is also where the following modular additions begin. It then counts the modular additions after each of these positions and saves the results as a vector for later private key extraction.

[Figure 5.2 plot: (a) power trace of the AVR ATmega328P, with 5 additions marked; (b) EM trace of the ARM Cortex M0+, with 2 extra additions marked. Both panels show normalized voltage versus sampling points.]

Figure 5.2: Simple power leakage from power and EM traces.

We know that for each bit processing of the scalar multiplication, there are 14 modular multiplications, with the modular additions located between them. The goal of our automatic leakage identification algorithm is to detect the number of additions in every such gap. We first preprocess the power trace to partition it into segments, one per bit processing, and then repeat our identification algorithm on each segment. Next we focus on one bit-processing segment.

We measure the rough length of one modular multiplication, $l_m$, and keep the ending part of the first modular multiplication as the model pattern, e.g., the part of the trace from sampling point 0 to 10000 in Figure 5.2. We do not use the entire modular multiplication power profile as the pattern for two reasons: (1) the entire profile is too long, and using it as a pattern to recognize the others would take too much time; (2) we found that the beginning of the profile shows high noise and variation, and only the ending part demonstrates a discernible and reliable pattern. Next we perform pattern recognition on the one-bit processing segment, using this reference pattern for part of a modular multiplication. Assuming there are $n_m$ points in the reference pattern, we slide a window of $n_m$ points over the power segment and, at each position, calculate the Pearson correlation between the reference pattern and the points in the window. Figure 5.3 shows the result, where we identify 13 modular multiplications; their end indexes, labeled by X, have a much higher correlation value. Next we look at the points after these multiplication ends to count the number of additions.
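The sliding-window correlation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the `threshold` and `min_gap` parameters, and the synthetic trace in the usage example are all hypothetical, and pure Python is used only to keep the sketch self-contained (a vectorized numerical library would be the practical choice for traces with hundreds of thousands of points).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0                      # a flat window carries no shape information
    return cov / sqrt(vx * vy)


def sliding_correlation(trace, pattern):
    """Correlate the reference pattern with every window of the trace."""
    n = len(pattern)
    return [pearson(trace[i:i + n], pattern) for i in range(len(trace) - n + 1)]


def peaks_above(corr, threshold, min_gap):
    """Indexes whose correlation exceeds threshold, merged if closer than min_gap."""
    peaks = []
    for i, c in enumerate(corr):
        if c < threshold:
            continue
        if peaks and i - peaks[-1] < min_gap:
            if c > corr[peaks[-1]]:
                peaks[-1] = i           # keep the stronger of two nearby hits
            continue
        peaks.append(i)
    return peaks


if __name__ == "__main__":
    pattern = [0, 1, 3, 1, 0, -2]       # stand-in for the tail of a multiplication
    trace = [0.0] * 100
    for start in (10, 40, 70):          # plant three occurrences in a flat trace
        trace[start:start + len(pattern)] = pattern
    corr = sliding_correlation(trace, pattern)
    print(peaks_above(corr, threshold=0.99, min_gap=len(pattern)))
```

The same routine can be reused with a shorter reference pattern for the addition operation, in which case the number of returned peaks is the number of additions in the window.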

[Figure 5.3 plot: correlation coefficient versus sampling points.]

Figure 5.3: Correlation of power trace with sliding multiplication pattern

Similarly, we first obtain a reference power pattern for the addition operation, with $l_a$ points. Starting from each maximum index shown in Figure 5.3, we keep the power points within a range of $7 l_a$, because we know the maximum number of additions is 6 and we add one as a guard margin. We apply a moving average filter to the correlation between the chosen $l_a$ points and the reference addition power profile, and count the number of correlation peaks above a threshold, which is set to 0.2 in our experiment. Figure 5.4 shows the counting result after one modular multiplication, which is 3 additions here, indicating no overflow for this set of three modular additions.

[Figure 5.4 plot: correlation coefficient versus sampling points.]

Figure 5.4: Count number of additions after modular multiplication

Private Key Extraction. With the information leakage (the number of overflows) identified, we now present our method to extract the full private key of ECDSA. The first step is to obtain possible ephemeral keys from the leakage in a power or EM trace. Equipped with the numbers of extra additions between modular multiplications obtained from the trace, we run a simulation of ECDSA to find the ephemeral key candidates that produce the same extra-addition pattern. We may get multiple ephemeral key candidates. The second step is to compute private keys from the ephemeral key list using Equation (5.3), and select the one that yields the correct public key.

To find the ephemeral key candidates, we calculate them bit by bit. Assume we have already obtained a list of the number of extra additions between every two modular multiplications, using either the power or EM trace. Starting from the second bit, and excluding the last bit, the processing of each scalar bit needs 14 modular multiplications. For each bit processing, we therefore obtain a vector of 13 numbers, for all the intermediate $l - 2$ bits (excluding the first and last bit).

$A = \{a_{l-2,1}, \cdots, a_{l-2,13}, \cdots, a_{1,1}, \cdots, a_{1,13}\}$, where $l$ is the bit length of the ephemeral key $k$. Note that, because some consecutive modular multiplications have no modular additions between them, there are many zeros in the list $A$.

We know the first bit of the scalar, $k_{l-1}$, is always 1 because of the scalar regularization countermeasure, so we have one candidate for the first bit, $C = \{\{1\}\}$. Next, we try to get the value of the second bit, $k_{l-2}$. We first assign $k_{l-2}$ to 0, run a simulation of ECDSA in micro-ecc for the first two bits, and record the simulated numbers of extra additions for $k_{l-2}$ as $\{a'_{l-2,1}, \cdots, a'_{l-2,13}\}$. If $\{a'_{l-2,1}, \cdots, a'_{l-2,13}\}$ matches $\{a_{l-2,1}, \cdots, a_{l-2,13}\}$ in list $A$, we say $k_{l-2}$ could possibly be 0, and extend the candidate set to $C = \{\{1, 0\}\}$. However, $k_{l-2}$ can also be 1 if the same procedure, under the assumption $k_{l-2} = 1$, also yields a match. Then we have the candidate set $C = \{\{1, 0\}, \{1, 1\}\}$. After this step, the set contains at least one and at most two candidates.

Assume now that we have a set of $m$ candidates for $\{k_{l-1}, k_{l-2}, k_{l-3}, \cdots, k_{i+1}\}$, namely $\{\{1, k^1_{l-2}, k^1_{l-3}, \cdots, k^1_{i+1}\}, \{1, k^2_{l-2}, k^2_{l-3}, \cdots, k^2_{i+1}\}, \cdots, \{1, k^m_{l-2}, k^m_{l-3}, \cdots, k^m_{i+1}\}\}$, and that the simulation with any of these candidates generates the same numbers of extra additions as in list $A$ before $k_i$. Next we evaluate $k_i$. With one more key bit, whose possible values are 0 and 1, each old candidate is expanded into two new candidates, with 0 and 1 appended respectively. As a result, the number of candidates grows to $2m$. For each candidate, the simulation produces a new prediction of the extra additions, which is compared with $\{a_{i,1}, \cdots, a_{i,13}\}$. If they match, the candidate remains; otherwise, the candidate is removed.

[Figure 5.5 diagram: binary tree of candidates, with columns for the bit index ($k_{l-1}$ down to $k_1$), the candidate bit values, and the associated patterns $\{a_{i,1}, \cdots, a_{i,13}\}$.]

Figure 5.5: Ephemeral key candidate search
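The bit-by-bit candidate expansion and pruning can be sketched as follows. This is a minimal Python illustration, not the simulation used in our experiments: `toy_leak` is a hypothetical stand-in that maps the key bits processed so far to 13 extra-addition counts, whereas the real attack would run the ECDSA scalar multiplication on the known input point. The handling of the last bit $k_0$ is omitted.

```python
import hashlib


def toy_leak(prefix):
    """Hypothetical stand-in for the ECDSA simulation.

    Maps the key bits processed so far (most significant first) to a tuple of
    13 extra-addition counts in the range 0..3. It only mimics the property
    that the pattern depends on all bits processed so far.
    """
    digest = hashlib.sha256(bytes(prefix)).digest()
    return tuple(b % 4 for b in digest[:13])


def search_candidates(observed, simulate):
    """Return all bit prefixes whose simulated patterns match the observations.

    observed[j] is the 13-tuple measured while processing the (j+2)-th most
    significant scalar bit; simulate(prefix) predicts that tuple.
    """
    candidates = [(1,)]                      # regularization fixes the top bit to 1
    for pattern in observed:
        survivors = []
        for cand in candidates:
            for bit in (0, 1):               # expand each candidate into two
                extended = cand + (bit,)
                if simulate(extended) == pattern:
                    survivors.append(extended)   # keep only matching branches
        candidates = survivors
    return candidates


if __name__ == "__main__":
    secret = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0)
    # Patterns an attacker would extract from the trace, one per processed bit.
    observed = [toy_leak(secret[:i]) for i in range(2, len(secret) + 1)]
    print(search_candidates(observed, toy_leak))
```

Each level of the loop corresponds to one level of the tree in Figure 5.5; with 13 counts to match per bit, a wrong branch rarely survives, which keeps the candidate list short.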

The attack process can be modeled as a conditional traversal of a binary tree, as shown in Figure 5.5. Each level of the tree represents one scalar bit index and, except for the root, has a pattern of extra additions associated with it. Each node represents one candidate, up to the bit index of its level. Starting from the root, which is 1 for $k_{l-1}$, we check the two children of the root, (1, 0) and (1, 1), for the first two key bits, to see whether the patterns resulting from simulation match the pattern extracted from the power trace. Matching children are kept in the tree, and mismatches are removed. We repeat the process for the next bit, until we have the candidates for $k_1$. As we evaluate the key bits one by one, the candidate list is pruned and the search runs in polynomial time, since many candidates (tree nodes) are removed by the pattern matching. When completed, we have the list of candidates for $\{k_{l-2}, \cdots, k_1\}$. We set $k_{l-1}$ to 1 and $k_0$ to both 0 and 1 to get full guesses of the scalar. We then use the guesses to compute the private key using Equation (5.3) and further compute the corresponding public key. This eventually selects the correct scalar and the corresponding private key.

We expected to obtain multiple candidates for the ephemeral scalar, but in our experiments the search always yields only the correct key. This is due to having 13 extra-addition numbers to match for each bit. The high accuracy of our attack gives us a margin of error for noisy power measurements. If the power trace is noisier than those in Figure 5.1 and Figure 5.2, the numbers of extra additions between two modular multiplications extracted from the trace may contain errors. We can then relax the "matching" criterion during the guessing and pattern matching steps, i.e., instead of requiring an exact match, we accept a pattern within some short Euclidean distance.

Countermeasures. To prevent the simple power attack, we propose two countermeasures. The first countermeasure balances the modular additions to remove the leakage. A dummy extra addition can be added to the modular addition to hide the case when there is no overflow. In our experiment with this countermeasure implemented, the running time increases by 1.3%, and there is no longer any information leakage from the modular addition. The second countermeasure adds randomization to mask the intermediate values. Our simple power attack relies on simulating with known initial values to extract the ephemeral key, bit by bit. With the initial values randomized, the simulation no longer works. The existing countermeasure of Z coordinate randomization should therefore be turned on for ECDSA, even though it was initially designed to defeat differential power analysis. The overhead of the countermeasure is four modular multiplications to set the initial Jacobian coordinates. In our experiment with this countermeasure implemented, the running time increases by 0.16% (secp192r1), which is negligible. With the coordinates of the input point unknown, the search for the ephemeral key can no longer proceed.
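As an illustration of the first countermeasure in this section, the sketch below shows one possible way to balance a modular addition: the correction step is always executed, and the result is chosen with an arithmetic select instead of a branch. This is a hypothetical variant written for clarity, not the patch we applied to micro-ecc, and Python itself gives no timing guarantees; the point is only the operation sequence.

```python
def mod_add_balanced(a, b, q):
    """(a + b) mod q for 0 <= a, b < q, with the same operations on every call."""
    s = a + b                        # ordinary addition
    d = s - q                        # correction always executed, needed or not
    take_d = int(d >= 0)             # 1 when the plain sum passed the modulus
    # Arithmetic select: both values are always computed, one is discarded.
    return take_d * d + (1 - take_d) * s


if __name__ == "__main__":
    q = 251                          # toy modulus for illustration
    print(mod_add_balanced(200, 100, q), (200 + 100) % q)
    print(mod_add_balanced(20, 30, q), (20 + 30) % q)
```

With this structure, the number of large-number operations between two modular multiplications no longer depends on the operands, so counting additions in the trace yields the same value for every scalar.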

5.4 Collision Attack

Even when protected by the additional countermeasures proposed in Section 5.3, micro-ecc is still vulnerable to a collision attack. In this section, we present the source of the leakage, show how it can be used to retrieve the private key, and finally discuss countermeasures against this leakage.

Leakage Analysis. According to our analysis of the algorithm, within the operations for two consecutive bits of the ephemeral key, an identical instruction sequence with identical data can be executed twice under specific conditions. The condition that triggers this collision depends on the ephemeral key value, which makes it a serious information leakage. As a result, information about the ephemeral key can be obtained through collision analysis. With this extra information, the private key can be easily inferred.

In Algorithm 5, at Line 7 of XYCZ-ADD, a modular subtraction $B - X_1'$ is performed. The value of $B$ is actually $X_0'$, because at Line 8, $B$ is assigned to $X_0'$. $X_0'$ and $X_1'$ are part of the output, and will be the input of XYCZ-ADDC for the next bit processed. However, the order of the input points for the XYCZ-ADDC operation depends on the next scalar bit, as shown in Lines 3-4 of Algorithm 3. If the bit is zero, $X_0'$ is assigned to $X_0$ and $X_1'$ is assigned to $X_1$. Otherwise, $X_0'$ is assigned to $X_1$ and $X_1'$ is assigned to $X_0$. In Algorithm 4, at Line 1, a modular subtraction $X_1 - X_0$ is performed. So it is possible that the operation $B - X_1'$ is exactly the same as the operation $X_1 - X_0$ for the following bit.

Considering the bit processing for $k_i$ and $k_{i-1}$, there are only 4 possible combinations, $(k_i, k_{i-1}) \in \{(0, 0), (0, 1), (1, 1), (1, 0)\}$. We check the input positions of $R_0$ and $R_1$ in the two point-addition operations, XYCZ-ADDC and XYCZ-ADD, and also look into their associated operations. We observe that if two consecutive bits of the scalar are the same, then the operation $B - X_1'$ in XYCZ-ADD of the first bit is the same as $X_1 - X_0$ in XYCZ-ADDC of the second bit; otherwise, they are different operations, with the order of the operands switched. That is, there is an operation collision across two consecutive bit processings if $k_i = k_{i-1}$. We omit the proof due to page limits.

Leakage Observation. Here, we present a method to identify the collision. First, we identify the rough positions of $B - X_1'$ and $X_1 - X_0$ in the power and EM traces. As stated in Section 5.3, since we can identify the trace segments where modular multiplications and modular additions are performed, the positions of $B - X_1'$ and $X_1 - X_0$ can be easily deduced with the help of micro-ecc's source code. Second, we align $B - X_1'$ with the next $X_1 - X_0$, using cross correlation. Then we calculate the difference between the two aligned segments to see whether they are the same operation, as shown in Figure 5.6. For the power trace from the ATmega328P, the difference is almost 0 if the two operations are the same; otherwise, we see a significant signal in the differential trace. For the EM trace from the ARM Cortex M0+, we have similar results. The differential EM trace of the same operation has significantly fewer peaks than the differential trace obtained when the consecutive key bits differ.

[Figure 5.6 plot: (a) power difference trace of the AVR ATmega328P; (b) EM difference trace of the ARM Cortex M0+. Both panels show the normalized voltage difference versus sampling points for the cases $k_i = k_{i-1}$ and $k_i \neq k_{i-1}$.]

Figure 5.6: Power and EM trace collision

Automatic Leakage Identification. Here we reuse the pattern recognition algorithm presented in Section 5.3 to obtain the trace segments corresponding to $B - X_1'$ at Line 7 of Algorithm 5 and $X_1 - X_0$ at Line 1 of Algorithm 4. The pseudo-code shows that there are eight and six modular multiplications in Algorithm 4 and Algorithm 5, respectively. The operations $B - X_1'$ and $X_1 - X_0$ are modular additions interleaved with the multiplications. We use the closest identified multiplications to localize the power profiles of these two operations. Here we assume that the modular addition leakage has already been fixed by constant-time computation (each modular addition always includes the additional overflow operation). Once we localize the operations $B - X_1'$ and $X_1 - X_0$, we can use a correlation calculation to determine whether there is a collision between the two power profiles, deducing whether $k_i = k_{i-1}$.

Private Key Extraction. With the information obtained from our collision analysis, we can extract the private key without much effort. Assume we have a bit vector that represents the collisions from the leakage observation, expressed as $B = \{c_{l-2}, c_{l-3}, \ldots, c_2\}$. We set $c_{l-2}$ to 0 if a collision is found for the first two trace segments of $B - X_1'$ and $X_1 - X_0$; otherwise, it is set to 1. The remaining $c_i$ are set similarly from the later trace segments of $B - X_1'$ and $X_1 - X_0$. Through the leakage analysis, we know that $c_i = k_i \oplus k_{i-1}$, where $k$ is the ephemeral key. The first bit of the scalar, $k_{l-1}$, is known to be 1. With a guess of $k_{l-2}$, we can compute $k_{l-3}$ to $k_1$, one at a time. For the last bit $k_0$, we also have two guesses, 0 or 1. In total, we have only 4 candidates. The correct scalar can be determined by checking whether it yields the correct public key, just as we did in Section 5.3. The private key can then be computed from the correct ephemeral key.

Countermeasure              SPA   DPA   Matching SPA   Collision Attack
Regularized Multiplication   √     ×         ×                ×
Coordinate Randomization     ×     √         √                ×
Dummy Addition               ×     ×         √                ×
Result Reuse                 ×     ×         ×                √

Table 5.1: Attacks and Countermeasures

Countermeasure. The Z coordinate randomization protection will not prevent our collision attack, since the attack does not require knowledge of the values of the intermediate variables. The collision depends solely on the ephemeral key value. Therefore we propose a countermeasure strategy that avoids executing the same instructions with the same data. Since the result of the subtraction is squared in XYCZ-ADDC, it does not matter whether it is $X_0 - X_1$ or $X_1 - X_0$. Thus, we can save the subtraction result in XYCZ-ADD and reuse it in XYCZ-ADDC for the next bit. In our implementation of this countermeasure, the overhead is an increase in memory usage of one key-size value, to store the result of $X_0 - X_1$ or $X_1 - X_0$. There is no increase in running time. We actually remove one modular subtraction for each bit of the scalar, but this is negligible relative to the entire running time. During the attack, since there is only one $X_0 - X_1$ or $X_1 - X_0$ operation, no collision can be found.
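The collision-based key extraction in this section reduces to a short computation once the collision vector is known. The following Python sketch assumes the vector has already been extracted from the trace and encodes a collision as 0, so that each entry is the XOR of two neighboring scalar bits; the function name is hypothetical, and the final selection among the four candidates by checking the public key is not shown.

```python
def keys_from_collisions(c_bits):
    """Enumerate the four scalar candidates consistent with a collision vector.

    c_bits = [c_{l-2}, ..., c_2], where c_i = k_i XOR k_{i-1} and 0 means a
    collision was observed. Each candidate is a tuple (k_{l-1}, ..., k_0).
    """
    candidates = []
    for guess_second in (0, 1):              # k_{l-2} is unknown
        bits = [1, guess_second]             # k_{l-1} is fixed to 1
        for c in c_bits:
            bits.append(bits[-1] ^ c)        # k_{i-1} = k_i XOR c_i
        for guess_last in (0, 1):            # k_0 is unknown
            candidates.append(tuple(bits + [guess_last]))
    return candidates


if __name__ == "__main__":
    secret = (1, 0, 0, 1, 1, 1, 0, 1, 0)     # toy scalar, most significant bit first
    # Collision vector an attacker would read from the trace: c_{l-2} .. c_2.
    c_bits = [secret[p] ^ secret[p + 1] for p in range(1, len(secret) - 2)]
    for cand in keys_from_collisions(c_bits):
        print(cand, cand == secret)
```

Unlike the matching attack of Section 5.3, this reconstruction needs no simulation of the scalar multiplication, which is why randomizing the input coordinates does not stop it.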

5.5 Discussion

Here we summarize all the power analysis attacks and countermeasures on ECC in Table 5.1. Each column represents an attack, and each row represents a countermeasure and shows whether it works against each attack. The first two columns are classic SPA (utilizing unbalanced branches) and DPA. The third and fourth columns are our proposed attacks, "Matching SPA" and "Collision Attack", respectively. The countermeasures proposed in this chapter to prevent the matching SPA and the collision attack are labeled "Dummy Addition" and "Result Reuse". As no single countermeasure resists all of the attacks, all the countermeasures should be applied to fully protect micro-ecc.

5.6 Summary

In this chapter, we evaluate the side-channel vulnerability of micro-ecc, a state-of-the-art ECC library commonly used in IoT applications. Despite the countermeasures incorporated in micro-ecc to defend against classic simple and differential power analysis, we still discover several attack surfaces for compromising the underlying algorithm. We propose a practical simple power attack and a collision attack that extract the full private key on two embedded systems, an ATmega328p and an ARM Cortex M0+, using a power trace and an EM trace, respectively. We also present countermeasures that prevent these attacks with little overhead. We strongly recommend that future embedded systems using micro-ecc be patched with our proposed countermeasures to protect them from such attacks.

Chapter 6

Conclusion

In this dissertation, I presented my research approaches and results on novel side-channel analysis of emerging cryptographic algorithms and computing systems. The cryptographic algorithms include AES and its variant XTS-AES, RSA, and ECC. The computing systems range from FPGAs and micro-controllers to GPGPUs. All these algorithms, running on specific computing systems, are vulnerable to certain side-channel attacks. By capturing the properties of the algorithms and computing systems, I designed a series of attack methods that perform effective side-channel analysis for key retrieval. The analysis methods adopted include simple power analysis, correlation power analysis, timing attacks, and collision attacks.

In Chapter 2, I evaluate the vulnerabilities of the XTS-AES algorithm to side-channel power analysis. For a software implementation, I analyze its simple power analysis vulnerability, which arises from a conditional branch, and propose a straightforward fix. For a hardware implementation on FPGAs, I design two different attacks to retrieve the block tweaks and therefore the two encryption keys. Through power analysis of the modular multiplication, i.e., the horizontal attack, I obtain the block tweak value $T_0$ by SPA with error detection and correction. I apply two statistical test methods: an ML-based test and a Bayesian test. In the CPA, i.e., the vertical attack, I attack the first two data blocks of each sector and extract the tweaked key values, $T_0 \oplus R_{key}$ and $T_1 \oplus R_{key}$. I then utilize the relationship between $T_0$ and $T_1$ to recover $T_0$ and the round key $R_{key}$ with very low complexity.

In Chapter 3, I present side-channel power analysis of a GPU AES implementation. I describe a process to obtain power consumption measurements on an NVIDIA GPU, and highlight the challenges of power analysis on a GPU. To overcome these difficulties, I propose effective strategies to process the power traces for a successful correlation power analysis. The corresponding power model is built based on the CUDA PTX assembly code. I begin the analysis of the attack assuming control over the plaintext, and analyze its scalability as the size of the plaintext increases. I find a linear relationship between the amount of plaintext and the number of traces needed, though the computational complexity grows exponentially. The attack results show that a GPU, a highly popular but very complex and parallel computing device, is vulnerable to side-channel power analysis attacks.

In Chapter 4, I evaluate the vulnerability of an RSA GPU implementation to side-channel timing analysis. I build timing models for a sliding-window RSA implementation with Montgomery multiplications, considering a GPU's parallel execution characteristics. The attack methods are designed based on the timing model. I obtain attack results from K40 GPU hardware, using different key sizes and working across different levels of parallelism. The results show that an RSA implementation on a GPU is vulnerable to side-channel timing attacks, calling for both general and GPU-specific countermeasures.

In Chapter 5, I evaluate the side-channel vulnerability of micro-ecc, a state-of-the-art ECC library commonly used in IoT applications. Despite the countermeasures incorporated in micro-ecc to defend against classic simple and differential power analysis, I still discover several attack surfaces for compromising the underlying algorithm. I propose a practical simple power attack and a collision attack that extract the full private key on two embedded systems, an ATmega328p and an ARM Cortex M0+, using only one power trace and one EM trace, respectively. I also present countermeasures against such attacks that come with little overhead.

This dissertation has made several significant contributions to side-channel research, demonstrating exploitable side-channel vulnerabilities of new ciphers and emerging computing systems, whose applications in critical systems and infrastructures are increasing on a daily basis. With these vulnerabilities revealed, effective protections against the newly invented side-channel attacks, at various implementation levels, should be considered early in the design and implementation stage. In this dissertation, I also propose and implement many countermeasures to protect these new ciphers and new computing systems running cryptographic workloads. Many countermeasures come with a slight performance degradation, offering a trade-off between side-channel security and performance/cost. Throughout both attacks and countermeasures, accurate security/leakage evaluations are also conducted with our leakage metrics and analytic success rate models.

For future work on new ciphers and computing platforms, one direction is to study how to design ciphers and hardware that are resilient to side-channel attacks. In addition, the coming of quantum computing requires post-quantum cryptography. Designing a post-quantum cipher that is secure against both quantum computing and side-channel attacks will be challenging but necessary.

86 Bibliography

[1] N. Leischner, V. Osipov, and P. Sanders, “Nvidia fermi architecture white paper,” 2009. [Online]. Available: http://www.nvidia.com/content/pdf/fermi white papers/nvidia fermi compute architecture whitepaper.pdf

[2] K. MacKay, “micro-ecc: Ecdh and ecdsa for 8-bit, 32-bit, and 64-bit processors,” 2017, accessed: 2017-12. [Online]. Available: https://github.com/kmackay/micro-ecc

[3] P. C. Kocher, “Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems,” in Advances in Cryptology — CRYPTO ’96: 16th Annual International Cryptology Conference Santa Barbara, California, USA August 18–22, 1996 Proceedings, N. Koblitz, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 104–113. [Online]. Available: https://doi.org/10.1007/3-540-68697-5 9

[4] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, “Mutual information analysis,” in Crypto- graphic Hardware & Embedded Systems, 2008, pp. 426–442.

[5] S. Chari, J. R. Rao, and P. Rohatgi, “Template attacks,” in Proc. Int. Conf. on Cryptographic Hardware & Embedded Systems. Springer, 2002, pp. 13–28.

[6] J.-F. Dhem, F. Koeune, P.-A. Leroux, P. Mestre,´ J.-J. Quisquater, and J.-L. Willems, “A practical implementation of the timing attack,” in CARDIS’98, 2000.

[7] “IEEE standard for cryptographic protection of data on block-oriented storage devices,” IEEE Std 1619-2007, pp. c1–32, Apr. 2008.

[8] M. Dworkin, “Recommendation for block cipher modes of operation: The XTS-AES mode for confidentiality on storage devices,” in NIST Special Publication, SP 800-38E, 2010.

87 BIBLIOGRAPHY

[9] J. Jaffe, “A first-order DPA attack against AES in counter mode with unknown initial counter,” in Cryptographic Hardware & Embedded Systems, 2007.

[10] D. Jayasinghe, R. Ragel, J. A. Ambrose, A. Ignjatovic, and S. Parameswaran, “Advanced modes in AES: Are they safe from power analysis based side channel attacks?” in IEEE Int. Conf. on Computer Design, 2014, pp. 173–180.

[11] T. Unterluggauer and S. Mangard, “Exploiting the physical disparity: Side-channel attacks on memory encryption,” in Constructive Side-Channel Analysis & Secure Design, 2016.

[12] J. Daemen and V. Rijmen, “AES proposal: Rijndael,” 1999.

[13] S. Bela¨ıd, P.-A. Fouque, and B. Gerard,´ “Side-channel analysis of multiplications in GF (2128),” in Int. Conf. on the Theory & Application of Cryptology & Information Security, 2014, pp. 306–325.

[14] S. Bela¨ıd, J.-S. Coron, P.-A. Fouque, B. Gerard,´ J.-G. Kammerer, and E. Prouff, “Improved side-channel analysis of finite-field multiplication,” in Cryptographic Hardware & Embedded Systems, 2015, pp. 395–415.

[15] P. Rogaway, “Efficient instantiations of tweakable blockciphers and refinements to modes OCB and PMAC,” in Advances in Cryptology-ASIACRYPT, 2004, pp. 16–31.

[16] M. V. Ball, C. Guyot, J. P. Hughes, L. Martin, and L. C. Noll, “The XTS-AES disk encryption algorithm and the security of ciphertext stealing,” Cryptologia, vol. 36, no. 1, pp. 70–79, 2012.

[17] L. Martin, “XTS: A mode of AES for encrypting hard disks,” IEEE Security & Privacy, no. 3, pp. 68–69, 2010.

[18] Seagate, “Transition to Advanced Format 4K Sector Hard Drives,” http://www.seagate.com/ tech-insights/advanced-format-4k-sector-hard-drives-master-ti/, [Online; accessed 2-Sept- 2016].

[19] C. Luo, Y. Fei, and A. A. Ding, “Side-channel power analysis of xts-aes,” in Design, Automa- tion Test in Europe Conference Exhibition (DATE), 2017, March 2017, pp. 1330–1335.

[20] E. Nascimento, Ł. Chmielewski, D. Oswald, and P. Schwabe, “Attacking embedded ecc implementations through cmov side channels,” in International Conference on Selected Areas in Cryptography. Springer, 2016, pp. 99–119.

88 BIBLIOGRAPHY

[21] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Royal Society of London Philosophical Transactions Series A, vol. 231, pp. 289–337, 1933.

[22] H. Stark and J. W. Woods, Probability, Statistics, and Random Processes for Engineers. Pearson, 2012.

[23] T. Katashita, A. Satoh, T. Sugawara, N. Homma, and T. Aoki, “Development of side-channel attack standard evaluation environment,” in European Conf. on Circuit Theory & Design, Aug 2009, pp. 403–408.

[24] M. Renauld and F.-X. Standaert, “Algebraic side-channel attacks,” in Int. Conf. on Information Security & Cryptology, 2009, pp. 393–410.

[25] B. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2013.

[26] W.-m. Hwu, GPU Computing Gems Emerald Edition, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[27] D. L. Cook, J. Ioannidis, A. D. Keromytis, and J. Luck, “Cryptographics: Secret key cryptography using graphics cards,” in Topics in Cryptology–CT-RSA 2005. Springer, 2005, pp. 334–350.

[28] D. Cook and A. D. Keromytis, Cryptographics: exploiting graphics cards for security. Springer Science & Business Media, 2006, vol. 20.

[29] S. Manavski, “CUDA compatible GPU as an efficient hardware accelerator for AES cryptography,” in IEEE Int. Conf. on Signal Processing & Communications, Nov. 2007, pp. 65–68.

[30] K. Iwai, T. Kurokawa, and N. Nisikawa, “AES encryption implementation on CUDA GPU and its analysis,” in 2010 First International Conference on Networking and Computing, Nov. 2010, pp. 209–214.

[31] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and analysis of AES encryption on GPU,” in 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, June 2012, pp. 843–848.

[32] R. Szerwinski and T. Güneysu, “Exploiting the power of GPUs for asymmetric cryptography,” in Cryptographic Hardware and Embedded Systems. Springer, 2008, pp. 79–99.

[33] J. Gilger, J. Barnickel, and U. Meyer, “GPU-acceleration of block ciphers in the OpenSSL cryptographic library,” in Information Security. Springer, 2012, pp. 338–353.

[34] R. D. Pietro, F. Lombardi, and A. Villani, “CUDA leaks: A detailed hack for CUDA and a (partial) fix,” ACM Transactions on Embedded Computing Systems (TECS), vol. 15, no. 1, p. 15, 2016.

[35] C. Maurice, C. Neumann, O. Heen, and A. Francillon, “Confidentiality issues on a GPU in a virtualized environment,” in Financial Cryptography and Data Security. Springer, 2014, pp. 119–135.

[36] F. Lombardi and R. Di Pietro, “Towards a GPU cloud: Benefits and security issues,” in Continued Rise of the Cloud. Springer, 2014, pp. 3–22.

[37] A. Moradi and G. Hinterwälder, “Side-channel security analysis of ultra-low-power FRAM-based MCUs,” in Proc. Int. Workshop on Constructive Side-Channel Analysis & Secure Design, Mar. 2015.

[38] T. S. Messerges, E. A. Dabbish, and R. H. Sloan, “Power analysis attacks of modular exponentiation in smartcards,” in Cryptographic Hardware & Embedded Systems, 1999, pp. 144–157.

[39] S. B. Örs, F. Gürkaynak, E. Oswald, and B. Preneel, “Power-analysis attack on an ASIC AES implementation,” in Int. Conf. on Info. Tech.: Coding & Computing, vol. 2, Apr. 2004, pp. 546–552.

[40] P. Luo, Y. Fei, X. Fang, A. A. Ding, M. Leeser, and D. R. Kaeli, “Power analysis attack on hardware implementation of MAC-Keccak on FPGAs,” in Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), Dec. 2014, pp. 1–7.

[41] S. B. Örs, E. Oswald, and B. Preneel, “Power-analysis attacks on an FPGA – first experimental results,” in Cryptographic Hardware & Embedded Systems, 2003, pp. 35–50.

[42] C. Luo, Y. Fei, P. Luo, S. Mukherjee, and D. Kaeli, “Side-channel power analysis of a GPU AES implementation,” in IEEE Int. Conf. on Computer Design (ICCD). IEEE, Oct. 2015, pp. 281–288.

[43] Z. H. Jiang, Y. Fei, and D. Kaeli, “A complete key recovery timing attack on a GPU,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Mar. 2016, pp. 394–405.

[44] NVIDIA, “CUDA C Programming Guide,” 2015. [Online]. Available: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

[45] J. Daemen and V. Rijmen, “AES proposal: Rijndael,” 1998.

[46] P. Margara, “Engine-CUDA, a cryptographic engine for CUDA supported devices,” 2015. [Online]. Available: https://code.google.com/p/engine-cuda/

[47] D. Genkin, A. Shamir, and E. Tromer, “RSA key extraction via low-bandwidth acoustic cryptanalysis,” in Advances in Cryptology–CRYPTO 2014. Springer, 2014, pp. 444–461.

[48] P. Kocher, J. Jaffe, B. Jun, and P. Rohatgi, “Introduction to differential power analysis,” Journal of Cryptographic Engineering, vol. 1, no. 1, pp. 5–27, 2011.

[49] E. Brier, C. Clavier, and F. Olivier, “Correlation power analysis with a leakage model,” in Cryptographic Hardware & Embedded Systems, 2004, vol. 3156, pp. 16–29.

[50] M. R. Jan, C. Anantha, and N. Borivoje, “Digital integrated circuits: a design perspective,” 2003.

[51] C. Clavier, J.-S. Coron, and N. Dabbous, Differential Power Analysis in the Presence of Hardware Countermeasures. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 252–263. [Online]. Available: https://doi.org/10.1007/3-540-44499-8_20

[52] Y. Fei, A. A. Ding, J. Lao, and L. Zhang, “A statistics-based success rate model for DPA and CPA,” Journal of Cryptographic Engineering, vol. 5, no. 4, pp. 227–243, 2015.

[53] S. Mangard, Hardware Countermeasures against DPA – A Statistical Analysis of Their Effectiveness. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 222–235. [Online]. Available: https://doi.org/10.1007/978-3-540-24660-2_18

[54] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, Feb. 1978.

[55] A. Moss, D. Page, and N. P. Smart, “Toward acceleration of RSA using 3D graphics hardware,” in Cryptography and Coding, 2007.

[56] Y. Yang, Z. Guan, H. Sun, and Z. Chen, “Accelerating RSA with fine-grained parallelism using GPU,” in ISPEC, 2015.

[57] K. Jang, S. Han, S. Han, S. Moon, and K. Park, “SSLShader: Cheap SSL acceleration with commodity processors,” in NSDI’11. Berkeley, CA, USA: USENIX Association, 2011, pp. 1–14.

[58] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: The case of AES,” in CT-RSA, 2006.

[59] Y. Yarom and K. Falkner, “Flush+Reload: A high resolution, low noise, L3 cache side-channel attack,” in USENIX Security, vol. 2014, 2014.

[60] M. S. Inci, B. Gülmezoglu, G. I. Apecechea, T. Eisenbarth, and B. Sunar, “Seriously, get off my cloud! Cross-VM RSA key recovery in a public cloud,” IACR Cryptology ePrint Archive, vol. 2015, p. 898, 2015.

[61] Y. Yarom, D. Genkin, and N. Heninger, “CacheBleed: A timing attack on OpenSSL constant time RSA,” in CHES 2016, 2016.

[62] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.

[63] P. C. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in CRYPTO ’96, 1996.

[64] J.-F. Dhem, F. Koeune, P.-A. Leroux, P. Mestré, J.-J. Quisquater, and J.-L. Willems, “A practical implementation of the timing attack,” in CARDIS’98, 2000.

[65] R. Tóth, Z. Faigl, M. Szalay, and S. Imre, “An advanced timing attack scheme on RSA,” in Int. Telecommunications Network Strategy and Planning Symposium, vol. Supplement, Sept. 2008, pp. 1–9.

[66] D. Brumley and D. Boneh, “Remote timing attacks are practical,” in Proceedings of the 12th Conference on USENIX Security Symposium - Volume 12, ser. SSYM’03. Berkeley, CA, USA: USENIX Association, 2003, pp. 1–1. [Online]. Available: http://dl.acm.org/citation.cfm?id=1251353.1251354

[67] B. B. Brumley and N. Tuveri, “Remote timing attacks are still practical,” in ESORICS 2011, 2011.

[68] O. Aciiçmez, W. Schindler, and Ç. K. Koç, “Improving Brumley and Boneh timing attack on unprotected SSL implementations,” in Proc. ACM Conf. on Computer & Communications Security. New York, NY, USA: ACM, 2005.

[69] C. Arnaud and P.-A. Fouque, “Timing attack against protected RSA-CRT implementation used in PolarSSL,” in Cryptographers Track at the RSA Conference. Springer, 2013, pp. 18–33.

[70] W. Schindler, “A timing attack against RSA with the Chinese Remainder Theorem,” in CHES 2000, 2000.

[71] C. Chen, T. Wang, and J. Tian, “Improving timing attack on RSA-CRT via error detection and correction strategy,” Information Sciences, 2013.

[72] Z. H. Jiang, Y. Fei, and D. Kaeli, “A novel side-channel timing attack on GPUs,” in GLSVLSI ’17. New York, NY, USA: ACM, 2017, pp. 167–172.

[73] Ç. K. Koç, “High-speed RSA implementation,” RSA Laboratories, Tech. Rep., 1994.

[74] Ç. K. Koç, “Analysis of sliding window techniques for exponentiation,” Computers & Mathematics with Applications, vol. 30, no. 10, pp. 17–24, 1995.

[75] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Reading, Massachusetts: Addison-Wesley, 1998.

[76] H. Park, K. Park, and Y. Cho, “Analysis of the variable length nonzero window method for exponentiation,” Computers & Mathematics with Applications, 1999.

[77] K. Jang, S. Han, S. Han, and K. Park. (2015) libgpucrypto. [Online]. Available: https://github.com/lwakefield/libgpucrypto

[78] Y. Fei, A. A. Ding, J. Lao, and L. Zhang, “A statistics-based success rate model for DPA and CPA,” Journal of Cryptographic Engineering, vol. 5, no. 4, pp. 227–243, 2015.

[79] D. Coppersmith, “Small solutions to polynomial equations, and low exponent RSA vulnerabilities,” Journal of Cryptology, vol. 10, no. 4, pp. 233–260, 1997.

[80] C. D. Walter, “Montgomery exponentiation needs no final subtractions,” Electronics Letters, vol. 35, no. 21, pp. 1831–1832, Oct 1999.

[81] G. Hachez and J.-J. Quisquater, “Montgomery exponentiation with no final subtractions: Improved results,” in CHES 2000, 2000.

[82] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.

[83] V. S. Miller, “Use of elliptic curves in cryptography,” in Conference on the Theory and Application of Cryptographic Techniques. Springer, 1985, pp. 417–426.

[84] I. F. Blake, G. Seroussi, and N. P. Smart, Advances in elliptic curve cryptography. Cambridge University Press, 2005, vol. 317.

[85] J.-S. Coron, “Resistance against differential power analysis for elliptic curve cryptosystems,” in Proc. Int. Conf. on Cryptographic Hardware & Embedded Systems. Springer, 1999, pp. 725–725.

[86] E. Brier and M. Joye, “Weierstraß elliptic curves and side-channel attacks,” in International Workshop on Public Key Cryptography. Springer, 2002, pp. 335–345.

[87] B. Chevallier-Mames, M. Ciet, and M. Joye, “Low-cost solutions for preventing simple side-channel analysis: Side-channel atomicity,” IEEE Transactions on Computers, vol. 53, no. 6, pp. 760–768, 2004.

[88] M. Medwed and E. Oswald, “Template attacks on ECDSA,” in WISA, vol. 5379. Springer, 2008, pp. 14–27.

[89] P. Q. Nguyen and I. E. Shparlinski, “The insecurity of the elliptic curve digital signature algorithm with partially known nonces,” Designs, Codes and Cryptography, vol. 30, no. 2, pp. 201–217, 2003.

[90] L. Batina, Ł. Chmielewski, L. Papachristodoulou, P. Schwabe, and M. Tunstall, “Online template attacks,” in International Conference in Cryptology in India. Springer, 2014, pp. 21–36.

[91] M. Dugardin, L. Papachristodoulou, Z. Najm, L. Batina, J.-L. Danger, and S. Guilley, “Dismantling real-world ECC with horizontal and vertical template attacks,” in International Workshop on Constructive Side-Channel Analysis and Secure Design. Springer, 2016, pp. 88–108.

[92] K. Schramm, T. Wollinger, and C. Paar, “A new class of collision attacks and its application to DES,” in FSE, vol. 2887. Springer, 2003, pp. 206–222.

[93] P.-A. Fouque and F. Valette, “The doubling attack – why upwards is better than downwards,” in Proc. Int. Conf. on Cryptographic Hardware & Embedded Systems, vol. 2779. Springer, 2003, pp. 269–280.

[94] C. Clavier, B. Feix, G. Gagnerot, M. Roussellet, and V. Verneuil, “Horizontal correlation analysis on exponentiation,” in ICICS, vol. 6476. Springer, 2010, pp. 46–61.

[95] A. Bauer, E. Jaulmes, E. Prouff, J.-R. Reinhard, and J. Wild, “Horizontal collision correlation attack on elliptic curves,” Cryptography and Communications, vol. 7, no. 1, pp. 91–119, 2015.

[96] N. Hanley, H. Kim, and M. Tunstall, “Exploiting collisions in addition chain-based exponentiation algorithms using a single trace,” in CT-RSA, vol. 9048, 2015, pp. 431–448.

[97] E. Wenger, T. Korak, and M. Kirschbaum, “Analyzing side-channel leakage of RFID-suitable lightweight ECC hardware,” in International Workshop on Radio Frequency Identification: Security and Privacy Issues. Springer, 2013, pp. 128–144.

[98] M. Mössinger, B. Petschkuhn, J. Bauer, R. C. Staudemeyer, M. Wójcik, and H. C. Pöhls, “Towards quantifying the cost of a secure IoT: Overhead and energy consumption of ECC signatures on an ARM-based device,” in 2016 IEEE 17th International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM). IEEE, 2016, pp. 1–6.

[99] R. de Clercq, L. Uhsadel, A. V. Herrewege, and I. Verbauwhede, “Ultra low-power implementation of ECC on the ARM Cortex-M0+,” in 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), June 2014, pp. 1–6.

[100] Z. Liu, H. Seo, A. Castiglione, K. R. Choo, and H. Kim, “Memory-efficient implementation of elliptic curve cryptography for the Internet-of-Things,” IEEE Transactions on Dependable and Secure Computing, pp. 1–1, 2018.

[101] C. Franck, J. Großschädl, Y. L. Corre, and C. L. Tago, “Energy-scalable Montgomery-curve ECDH key exchange for ARM Cortex-M3 microcontrollers,” in 2018 6th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), Aug. 2018, pp. 231–236.

[102] J. Bauer, R. C. Staudemeyer, H. C. Pöhls, and A. Fragkiadakis, “ECDSA on things: IoT integrity protection in practise,” in Information and Communications Security. Springer, 2016, pp. 3–17.

[103] W. Shang, A. Afanasyev, and L. Zhang, “The design and implementation of the NDN protocol stack for RIOT-OS,” in 2016 IEEE Globecom Workshops (GC Wkshps). IEEE, 2016, pp. 1–6.

[104] P. Das, D. B. Roy, and D. Mukhopadhyay, “Secure public key hardware for IoT applications,” in 2016 IEEE 59th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2016, pp. 1–4.

[105] S. A. Panda, “Preventing man-in-the-middle attacks in near field communication by out-of-band key exchange,” Ph.D. dissertation, 2016.

[106] C. Huth, R. Guillaume, P. Duplys, K. Velmurugan, and T. Güneysu, “On the energy cost of channel based key agreement,” in Proceedings of the 6th International Workshop on Trustworthy Embedded Devices. ACM, 2016, pp. 31–41.

[107] Intel, “Tinycrypt,” 2017, accessed: 2017-12. [Online]. Available: https://github.com/01org/tinycrypt

[108] D. V. Chudnovsky and G. V. Chudnovsky, “Sequences of numbers generated by addition in formal groups and new primality and factorization tests,” Advances in Applied Mathematics, vol. 7, no. 4, pp. 385–434, 1986.

[109] R. R. Goundar, M. Joye, and A. Miyaji, “Co-Z addition formulæ and binary ladders on elliptic curves,” in Proc. Int. Conf. on Cryptographic Hardware & Embedded Systems, vol. 10. Springer, 2010, pp. 65–79.

[110] M. Rivain, “Fast and regular algorithms for scalar multiplication over elliptic curves,” IACR Cryptology ePrint Archive, vol. 2011, p. 338, 2011.
