
FRAMEWORK AND COUNTERMEASURES FOR CACHE AND POWER BASED ATTACKS

by

ANKITA ARORA

A THESIS

SUBMITTED IN ACCORDANCE WITH THE REQUIREMENTS

FOR THE DEGREE OF

MASTER OF ENGINEERING

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

THE UNIVERSITY OF NEW SOUTH WALES

MAY 2013

© Copyright by Ankita Arora 2013. All Rights Reserved.

Statement of Originality

‘I hereby declare that this submission is my own work and to the best of my knowledge contains no materials previously published or written by another person, nor material which, to a substantial extent, has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.

I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged’.

Ankita Arora May 2013

Copyright Statement

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use, in future works (such as articles or books), all or part of this thesis or dissertation.

I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation’.

Ankita Arora May 2013

Authenticity Statement

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format’.

Ankita Arora May 2013

‘This thesis is dedicated to my parents, Mr. R L Arora and Mrs. Gayatri Arora whose love and blessings brought my dream to reality. Jai Gurudev’.

Acknowledgements

‘Feel infinitely indebted for your body, for knowledge, for the things you have received, for your own life. Then you will bask in the abundance of the Creator’...H.H.Sri Sri Ravishankar

Words are not enough to express my gratitude towards the divine, H.H. Sri Sri Ravi Shankar, the founder of The Art of Living Foundation, for being with me all along. Heartiest thanks to my parents and siblings for supporting me in every phase of life. I am fortunate to be a part of Professor Sri Parameswaran’s research group. The vast experience and knowledge of Professor Sri and seniors led to the execution of new ideas, publishing papers and extending to further research. I am grateful to my co-supervisor Dr. Jude Angelo Ambrose for brainstorming sessions and extending a hand in experimentation. I am thankful to seniors Dr. Roshan Ragel and Dr. Jorgen Peddersen, and university colleagues Dr. Harris Javaid, Sumyatmin Su, Liang Tang, Shihab, Dr. Xin He, James, Darshana Jayasinghe and Tuo Li, for their help in understanding the tools and environment. I really appreciate the timely advice, knowledge and support from my work manager, Dr. Steve Avery; I could not have done without it. I am very thankful to seniors Dr. David Goodall and Michael De Nil for their encouragement, experience sharing and proofreading my papers/thesis despite an aggressive work schedule. The feedback was very important and useful. I am grateful to have Dr. Farhana Shahid as my role model for her bravery and strength. Her mature advice, love and blessings made me grow both academically and personally. I am thankful for the encouraging words and support from Jerastin Dubash, who always assured her presence at odd times of life. I will always remember the food feasts from Supriya Singla, Mrs. Bina Rach, Su Lyn and Ruchi Rach at the time of submissions. Thanks to Bushra for her blessings, prayers and supporting me in every possible way on this academic journey. I am thankful to Rajat Kulshrestha for moral support, keeping an undying smile and passing positive vibrations. Special thanks to all Art of Living teachers for leading towards enthusiasm, optimism and perseverance.

Fortunate and blessed with a beautiful family, I want to thank the angels of my life: mum, Mrs. Gayatri Arora, for uplifting me and being the strongest pillar; dad, Mr. Ramesh Arora, for working hard and exposing me to the secrets of success; elder sister, Mrs. Sanjeeta Bhatia, for being my best friend and showering unconditional love and care; brother-in-law, Mr. Vijay Bhatia, for help and support in reaching my goals; and younger sister, Mukta Arora, for her sacrifices, trusting my dreams and strengthening my spirits. Last but not least, my beloved nephew Sanchit Bhatia for bringing life to the family. It is a dream come true. Thanks Almighty.

Abstract

Advancements in technology, the need for automation and ease of manufacturability have made embedded systems ubiquitous. One of the preeminent challenges in embedded systems is maintaining the privacy of sensitive information being passed and keeping it secure. Security is provided by deploying state-of-the-art cryptographic algorithms to encrypt confidential data, which is then decrypted at the receiving end. Embedded systems are increasingly attacked by adversaries for financial gain, or to obtain personal information. Internal computations are often revealed by external manifestations such as processing time [1], power consumption [2], electromagnetic emission [3] and faults [4]. Such manifestations can be exploited by an adversary to obtain the secret keys of cryptographic algorithms, and the process of obtaining secret keys using this mechanism is called a Side Channel Attack (SCA).

SCAs [5, 6] are categorized based on the characteristics used for the attack. Two of the main side channel attacks are cache based attacks and power based attacks. Cache based side channel attacks are built using the cache behavior of the system when data is exchanged between the processor and the main memory. A cache is a smaller and faster memory placed between the processor and main memory that stores the information needed for computations in the processor, to reduce memory transaction time. Cache based attacks are further classified as time-driven attacks [7] and access-driven attacks [8]. Time-driven attacks use the execution time of the cryptographic algorithm in the processor, while access-driven attacks are performed when the adversary gets access to the data stored in the cache. Power based attacks are mounted by measuring power variations during the encryption/decryption of a cryptographic algorithm. A successful recovery of the secret key allows the adversary to fake identities and gain benefits. Power based attacks are classified into Simple (SPA) and Differential Power Analysis (DPA) attacks. In SPA [9], internal data is retrieved directly by analyzing the power magnitude, while in DPA [10], more advanced statistical analysis is performed to predict the secret key.

Several solutions exist to counter both cache based and power based side channel attacks. Cache attacks can be avoided by using architectural modifications [11, 12], time skewing [13], cache warming [13], use of the maximum cache line size [13], etc. The countermeasures used against power based attacks are masking [14], sense amplifier based logic [15], wave dynamic differential logic [16, 17, 18], dual rail circuits [19], etc. Existing techniques to counter cache based and power based attacks are either costly in terms of power and area or involve much complexity, and hence lack practicality.

In this thesis, the author has implemented a fast trace-driven cache attack, and incorporated this attack into a flexible framework containing an extensible processor. The processor used is Tensilica's Xtensa LX2, whose modifiable architecture allows changes to the cache architecture and instruction set and the addition of extra hardware. On this framework, the author implemented a hardware/software countermeasure and has shown that it is difficult to differentiate the cache misses for differing plaintexts. The processor with the countermeasure is 30% more energy efficient, 17% more power efficient and 15% faster when compared to the processor without the countermeasure. However, there is an area overhead of 7.6%.

To protect the system from power based side channel attacks, the author proposed a double width algorithmic balancing using a single core to obfuscate power variations, resulting in a DPA resistant system. The countermeasure only involves code/algorithmic modifications, and hence can be easily deployed in any embedded system with a 16 bit wide (or wider) processor. The DPA attack is demonstrated on the Double Width Single Core (DWSC) solution; the attack proved unsuccessful in finding the secret key. The instruction memory size overhead is only 16.6% and the data memory increases by 15.8%. The future extensions of the above two countermeasures involve merging both and improving them to combat both cache based and power based side channel attacks in one system.

Thesis Publications

• Ankita Arora, Roshan Ragel, Darshana Jayasinghe and Sri Parameswaran. A Hardware/Software Countermeasure and a Testing Framework for Cache Based Side Channel Attacks. ICESS, 2011.

• Ankita Arora, Jude Angelo Ambrose, Jorgen Peddersen and Sri Parameswaran. A Double-width Algorithmic Balancing to prevent Power Analysis Side Channel Attacks in AES. ISVLSI, 2013.

Contributions of this Thesis

• A simulation framework where the simpler version of Fournier’s cache attack [20] has been automated. Countermeasures can be implemented in the framework, to test the countermeasure’s efficacy.

• A hardware/software based countermeasure against cache based attacks which enhances security, power efficiency and performance in terms of processing time. This is an easily implementable system using commercial tools which are readily available.

• A novel design, the Double Width Single Core (DWSC) algorithmic balancing approach, is proposed against Differential Power Analysis (DPA) attacks. It can be easily deployed in most embedded applications because no architectural changes are needed and only software modifications are performed (a 32 bit processor is used).

List of Acronyms

ASIC Application Specific Integrated Circuit

AES Advanced Encryption Standard

CPU Central Processing Unit

CPA Correlation Power Analysis

CMOS Complementary Metal Oxide Semiconductor

DPA Differential Power Analysis

DES Data Encryption Standard

DWSC Double Width Single Core

DEMA Differential Electromagnetic Analysis

DDL Dynamic Differential Logic

DWDDL Differential Wave Dynamic Differential Logic

DNS Domain Name System

DRAM Dynamic Random Access Memory

DCache Data Cache

FIPS Federal Information Processing Standard

FCFM Feedback Current Flattening Module

FPGA Field Programmable Gate Array

GF Galois Field

IPSec Internet Protocol Security

ISMA Internet Streaming Media Alliance

ISA Instruction Set Architecture

IC Integrated Circuit

KS Key Scheduling

LSB Least Significant Bit

LRU Least Recently Used

MRI Magnetic Resonance Imaging

MDS Maximum Distance Separable

MSB Most Significant Bit

MSS Modifiable System Simulator

MPSoc Multi Processor System On Chip

NBS National Bureau of Standards

NIST National Institute of Standards and Technology

NFI Non-Functional Instruction

NED Normalized Energy Deviation

PDA Personal Digital Assistant

PCFM Pipeline Current Flattening Module

RSA Ron Rivest, Adi Shamir and Leonard Adleman

RAM Random Access Memory

ROM Read Only Memory

RISC Reduced Instruction Set Computer

SCA Side Channel Attack

SPA Simple Power Analysis

SSL Secure Socket Layer

SAFER Secure and Fast Encryption Routine

SNR Signal to Noise Ratio

SEMA Simple ElectroMagnetic Analysis

SDDL Simple Dynamic Differential Logic

SOC System On Chip

SABL Sense Amplifier Based Logic

3DES Triple Data Encryption Standard

TRNG True Random Number Generator

TIE Tensilica Instruction Extension

VPN Virtual Private Network

WDDL Wave Dynamic Differential Logic

Contents

Statement of Originality
Copyright Statement
Authenticity Statement
Acknowledgements
Abstract
Thesis Publications
Contributions of this Thesis
List of Acronyms
Table of Contents
List of Tables
List of Figures

1 Introduction

2 Literature Review
2.1 Side Channel Attacks
2.2 Encryption Algorithms
2.2.1 Data Encryption Standard
2.2.2 Advanced Encryption Standard
2.2.3 Vulnerability of AES
2.3 Power Based Attacks
2.3.1 Simple Power Analysis (SPA)
2.3.2 Differential Power Analysis (DPA)
2.4 Power Based Attacks: Countermeasures
2.4.1 Masking
2.4.2 Architectural Level Hardware Countermeasures
2.5 Cache Based Attacks
2.5.1 Processor Architecture
2.6 Classification of Cache Based Attacks
2.6.1 Time-Driven Attacks
2.6.2 Access-Driven Attacks
2.7 Cache Based Attacks: Countermeasures
2.7.1 Simple Countermeasures
2.7.2 Partition Locked cache (PLcache)
2.7.3 Security Issues of PL cache
2.7.4 Random Permutation cache (RPcache)
2.7.5 Security Issues of RP cache
2.8 Summary

3 Cache SCA Framework
3.0.1 The attack overview
3.0.2 System Overview
3.1 Experimental Setup
3.2 Results
3.2.1 Comparative Analysis
3.3 Assumptions
3.3.1 Implementation in a real system
3.4 Summary

4 Cache Attack Countermeasure
4.0.1 Methodology
4.0.1.1 Hardware Addition
4.0.1.2 Software Modification
4.1 Experimental Platform
4.2 Results
4.2.1 Security Analysis
4.2.2 Performance, Power, and Footprint
4.3 Summary

5 DWSC
5.1 Methodology
5.2 Software Modifications
5.3 Experimental Setup
5.4 Results
5.4.1 DPA Attack on original and Balanced AES
5.4.2 Comparative Analysis
5.5 Discussion
5.6 Summary

6 Conclusions
6.1 Future Work

Bibliography

List of Tables

3.1 Performance Comparison
4.1 Performance Comparison
5.1 Comparative Analysis

List of Figures

1.1 Growth in Embedded systems and PC Shipments indicated by Redmond [21]
1.2 Security Model from User’s point of view [22]
1.3 Placement of Cache in processor architecture
1.4 DPA Measurements (a) Setup (b) Board
2.1 Side Channel Attack (SCA) [23]
2.2 Data Encryption Standard (DES) Algorithm
2.3 Advanced Encryption Standard (AES) Algorithm
2.4 AddRoundKey Operation
2.5 SubBytes Substitution Operation
2.6 ShiftRows Operation
2.7 MixColumns Operation
2.8 Vulnerable SBox look-up Table
2.9 Cross sectional view of npn transistor [2]
2.10 Simple Power Analysis on entire DES [9]
2.11 Simple Power Analysis on entire DES with individual clock cycles [9]
2.12 Power Consumption of DES [24]
2.13 Voltage transitions recorded using HC05-based smart card [24]
2.14 Simple Power Analysis based on AES Key Scheduling [25]
2.15 Differential Power Analysis trace on DES [9]
2.16 Differential Power Analysis results [24]
2.17 Differential Power Analysis results [26]
2.18 Differential Power Analysis results [27]
2.19 SubBytes Substitution of AES [28]
2.20 Transformation from Boolean to Multiplicative mask and vice versa [28]
2.21 Regular and WDDL circuit [17]
2.22 Simple Dynamic Differential Logic [18]
2.23 Output Voltage of Single Ended vs. WDDL circuits [18]
2.24 Dual-rail protocol: (a) Random order of spacers (b) Alternative order of spacers [19]
2.25 Multiprocessor Balancing Technique: MUTE [29]
2.26 DPA plots at load (a) Typical Processor (b) MUTE-AES [29]
2.27 How cache fits in Processor Architecture
2.28 Structure of Cache
2.29 Cache Internal Structure
2.30 Number of data packets versus key combinations [30]
2.31 Cache based Attack results [31]
2.32 Access-driven cache based Attack [31]
2.33 Partition Locked cache architecture [11]
2.34 Random Permutation Cache Architecture [11]
3.1 AddRoundKey step of AES
3.2 Ideal Pattern
3.3 Framework
3.4 Experimental Setup
3.5 Ideal Pattern
3.6 Results
4.1 (a) Conventional Processor (b) Processor with extra hardware added using TIE instructions
4.2 (a) Original AES code (b) Modified AES code to add extra hardware through TIE
4.3 Tensilica Instruction Extension (TIE) language code
4.4 Experimental Setup
4.5 Comparison of SBox bytewise accesses (x axis) vs. cycle delays (y axis) for both processors (a) Conventional Processor (b) Processor with countermeasure
4.6 Plaintexts (x axis) vs. Data Cache (DCache) misses (y axis) on conventional processor (a) With 100 plaintexts (b) With 20 plaintexts
4.7 Plaintexts (x axis) vs. Data Cache (DCache) misses (y axis) on processor with countermeasure (a) With 100 plaintexts (b) With 20 plaintexts
5.1 Algorithmic Balancing
5.2 Methodology
5.3 AddRoundKey Operation
5.4 SubBytes Substitution Operation
5.5 ShiftRows Operation
5.6 MixColumns Operation
5.7 Key Scheduling Operation
5.8 The Experimental Setup
5.9 DPA Plot of original AES (a) At LW (Load) instruction (b) At SW (Store) instruction
5.10 DPA Plot of Balanced AES (a) At LW (Load) instruction (b) At SW (Store) instruction

‘Computing is no longer confined to your computer, it’s everywhere’. Otellini [32]

Chapter 1

Introduction

The computing industry has grown exponentially due to its innovative designs, faster connectivity, reliability, robustness and intuitive user interfaces. From mobile phones to notebooks, medical devices to industrial control systems, cars to computers, almost every device is controlled by an embedded computing device. Embedded systems are specialized computing devices encapsulated in larger systems to automate and provide desired functionality under particular constraints. Increasingly permeating our lives, embedded systems have become an indispensable part of them, whether it is shopping using a smart card, making financial transactions on an iPod, reading on a Kindle, heart-beat monitoring or imaging on a magnetic resonance imaging (MRI) system. As depicted in Figure 1.1 by Redmond in [21], the growth in the shipment of embedded systems far exceeds mainstream system shipments and is expected to continue in the near future.

Embedded systems are ubiquitous computing devices with the goal of enhancing larger systems anywhere, anytime. The development of an embedded system involves hardware design, algorithmic implementation and embedding application-specific software in the device. Depending on the specific application, customized systems are designed and embedded in the respective devices. For certain applications, users need to provide personal and confidential information to the system (e.g., booking tickets, financial transactions, shopping etc.).

Figure 1.1: Growth in Embedded systems and PC Shipments indicated by Redmond [21]

To protect sensitive information being passed through embedded systems, security features are added to such devices. Typically, the perception of security is to protect the sensitive information being passed through insecure channels, but the requirements change depending on the role in the development chain of the embedded system. For example, from a manufacturer’s perspective, the most important factor is the protection of the firmware being used, while the end user is concerned about the secrecy of the information being passed. Broadly, the security requirements of most embedded systems are authentication, data integrity, data confidentiality and protection against denial of service [33].

Figure 1.2: Security Model from User’s point of view [22]

The security model from an end user’s viewpoint is shown in Figure 1.2 [22]. The security requirements include ‘Basic Security Functions’ ensuring confidentiality, integrity and privacy of data; ‘User identification’ restricting access to authorized users; ‘Content Security’ protecting the rights of the content used; ‘Tamper resistance’ maintaining security even when the system is under attack; ‘Availability’ referring to protection against denial of service to the end user; ‘Secure Network Access’ referring to authorized usage of the system; and ‘Secure storage’ protecting information in storage devices [22].

The most common challenges in designing the security features of an embedded system are: maintaining functional integrity and secrecy of data, implementing security in a resource-constrained system, and insecure operating environments. Handheld devices such as cell phones, PDAs and networked sensors face the challenge of battery life and of accommodating security features in a small size and memory. Yet another challenge is the operation of embedded systems over insecure communication channels such as the Internet. It is not possible to test embedded software under all possible networking conditions, which makes the system and its data exchanges vulnerable to attacks and thefts. According to the Cyber Crime & Security Survey Report 2012 prepared by the Australian Government [34], 32% of cyber incidents are due to theft of mobile devices, 28% are worm infections, 16% are denial of service attacks, 18% are unauthorized access, 21% are Trojan malware and 17% are breaches of confidential information.

Common techniques used to secure the integrity and privacy of data and applications are cryptographic algorithms, classified as cryptographic hash functions, symmetric ciphers and asymmetric ciphers [35, 36].

• Cryptographic hash functions [35]: Cryptographic hash algorithms map data of variable length to fixed-length hash values. The technique is most commonly used for constructing hash tables to look up records using search keys, and to build caches for larger sets of data stored in bigger memory media. Such algorithms face the challenge of hash value collisions, where distinct records map to the same hash value. The most commonly used cryptographic hash functions are MD5 [37], SHA1 [38] and SHA2 [39].

• Symmetric Ciphers [36]: These algorithms encrypt the information (usually known as plaintext) and send the encrypted data (referred to as ciphertext) to the receiving end, where the data is decrypted using the same secret key. Since the data passes through insecure channels, it is important to implement an efficient and reliable encryption algorithm to protect the confidentiality of the information being sent. Examples of symmetric ciphers are Data Encryption Standard (DES) [40], Triple Data Encryption Standard (3DES) [40], Advanced Encryption Standard (AES) [41] and RC4 [42].

• Asymmetric Ciphers [36]: Public-key algorithms, also known as asymmetric ciphers, use a pair of keys where the public key (i.e., the one known to the world) is used to encode the information during transmission, while the private key at the receiving end is used to decode the original information. Since asymmetric ciphers are more mathematically intensive, it is common practice to use a combination of symmetric and asymmetric ciphers, where an asymmetric cipher encrypts the session key, which is then used to encrypt the data with a symmetric algorithm. Examples of asymmetric ciphers are RSA [43], Diffie-Hellman [44] etc.

Depending on various requirements, security solutions are being devised and new designs are developed using cryptographic algorithms, for example IPSec [45], SSL [46], VPN [47], Digital Certificates [48], Digital Rights Management [49], ISMA [50] etc. As more advanced and complex security features are added to embedded systems using cryptographic algorithms and other security protocols, more sophisticated attacks are being developed to gain access to the confidential information. Unauthorized access to a system results in leakage of personal information, leading to thefts and other losses. Security attacks can be classified as physical, logical and side channel attacks. Physical attacks are feasible when the attacker has direct (physical) access to the computing devices (e.g., depackaging [23], reverse engineering of ASICs [51] etc.). Logical attacks refer to software-based attacks where malicious code is run on the system to gain access to the information (e.g., access-driven cache attacks [8, 52, 53] etc.). The attacks based on the physical characteristics of the device to extract confidential information are known as Side Channel Attacks (e.g., fault attacks [23], frequency-based attacks [54], cache-memory attacks [55] and timing attacks [56]). The principle of Side Channel Attacks is the correlation of physical manifestations with internal computations. While the data is being encrypted and decrypted at the transmitting and receiving ends respectively, some information is leaked through side channels such as encryption time, frequency, power consumption, error messages, sound, cache memory transactions, light etc. The side channel information is recorded and used to recover the secret key using mathematical analysis to build Side Channel Attacks. Broadly, Side Channel Attacks can be classified as invasive, semi-invasive and non-invasive attacks. Invasive attacks are physical attacks, while semi-invasive attacks involve access to the device without making an electrical contact [23]. Non-invasive attacks are based on close observation of the device’s operation, such as measuring processing time [56], power consumption [9], electromagnetic radiation [57] and observing cache memory transactions with the CPU [55].

The focus of this Thesis is ‘Cache based Side Channel Attacks’ and ‘Power based Side Channel Attacks’. Cache based attacks exploit the data stored in cache during the processing of cryptographic algorithms (e.g., DES [40], AES [41] etc.) in the processor while Power based attacks are mounted using the power consumption as recorded while the encryption algorithm is running in a processor. An introductory explanation of both the attacks is given below.

Cache is the fastest and smallest memory in an embedded system, placed between the processor and main memory. During the processing of cryptographic algorithms in the system, the data used is loaded from or stored in main memory, which consumes a large amount of time. While the data is being fetched from or stored in main memory, it is also copied into the cache, which enables easy access and results in shorter memory transaction times. Figure 1.3(a) shows the conventional embedded system while Figure 1.3(b) shows the placement of the cache in the system. The data stored in the cache is chosen based on temporal locality or spatial locality. Temporal locality refers to locality in time, meaning the data being accessed by the current processes, while spatial locality is locality in space, referring to data close to that which is currently being accessed.

Figure 1.3: Placement of Cache in processor architecture

During the execution of an algorithm in the processor, if the data requested is available in the cache (due to spatial or temporal locality), it is brought to the processor in a shorter time, referred to as a ‘Cache Hit’, while if the data has to be copied from main memory, it takes a larger transaction time, referred to as a ‘Cache Miss’ [58]. As data gets stored in the cache (based on temporal or spatial locality) during the execution of the encryption process in an embedded system, it is vulnerable to side channel attacks. An adversary attacks the cache to get access to the confidential data, which is further used to recover the secret key. Cache based side channel attacks are classified as Time-driven attacks and Access-driven attacks. Time-driven attacks are cache attacks implemented using the encryption time (i.e., cache Hit or Miss). Kocher et al. first proposed the time-driven cache attack [7], showing the relation of memory accesses with the execution time. D.J. Bernstein [56] mounted a cache attack remotely, building logical relations with secret key elements leading to complete secret key recovery. A detailed explanation of Time-driven cache attacks, along with already implemented attacks, is provided in Chapter 2. Access-driven attacks are implemented when an adversary gets access to the data stored in the cache during encryption algorithm execution in an embedded system. Bertoni et al. [53], Kong et al. [31] and others devised access-driven cache attacks to recover the secret key. A detailed explanation of access-driven cache attacks with a literature review is provided in Chapter 2. To combat cache based side channel attacks, countermeasures have been devised by researchers (e.g., cache flushing, time skewing [13], architectural modifications [12, 11] etc.). Details of the countermeasures are given in Chapter 2.
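The hit/miss timing difference described above is directly observable in software. The following is a minimal sketch for an x86 machine with the rdtscp and clflush instructions (this is only an illustration of the principle; it is not the Xtensa-based setup used later in this thesis, and the hit/miss thresholds are platform dependent):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtscp and _mm_clflush (x86 only) */

    /* Time a single memory access in cycles. */
    static uint64_t time_access(volatile uint8_t *addr)
    {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*addr;                         /* the access being timed */
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }

    int main(void)
    {
        static uint8_t table[256];

        _mm_clflush(&table[0]);                    /* force a cache miss   */
        uint64_t miss = time_access(&table[0]);
        uint64_t hit  = time_access(&table[0]);    /* now cached: a hit    */

        printf("miss: %llu cycles, hit: %llu cycles\n",
               (unsigned long long)miss, (unsigned long long)hit);
        return 0;
    }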


Figure 1.4: DPA Measurements (a) Setup (b) Board

The second type of attack included in this thesis is the Power based attack, which is mounted by using the power consumption recorded while an encryption algorithm is being executed in the processor. The principle of power based attacks is the variation in power while data transits from 0 → 1 and 1 → 0 during the execution of a cryptographic algorithm in the processor. The power variations are recorded and undergo statistical analysis to recover the secret key. Depending on the analysis used to build the key, power based attacks are classified as Simple Power Analysis (SPA) [9, 24, 25] and Differential Power Analysis (DPA) [9]. A typical example of measuring power variations is shown in Figure 1.4, where (a) shows a cryptographic device executing the encryption algorithm along with an oscilloscope to record the power variations, and (b) shows the board with the chip under measurement along with the probe.

SPA is built by directly observing power variations across the system while the cryptographic algorithm is being executed in the processor. The principle of SPA is the direct mapping of the power waveform to the instructions during the execution of a cryptographic algorithm in an embedded system. Depending on the capabilities of the attacker and the available access to the system, the instructions are identified. After the determination of the instructions, the data being processed is estimated using the Hamming weight [10] (which counts the number of 1 bits, i.e., the more 1s, the higher the power consumption), resulting in recovery of the secret key. Kocher et al. [9] first recorded the corresponding power consumption and mounted an attack on a system while executing DES. A detailed explanation along with a literature review is provided in Chapter 2. DPA is a more sophisticated and accurate power based side channel attack. The attack is mounted by statistical analysis of power variations across the embedded system. Among the various steps of cryptographic algorithms (e.g., DES, AES etc.), an attack point is chosen and the power is recorded across the chosen point along with the corresponding bit value. The recorded power values are divided into two sets (i.e., for bit "0" and bit "1"). DPA is calculated by computing the difference in the averages of the two sets of power variations. The highest peak in the DPA plot exhibits the secret key. Kocher et al. [9] first proposed DPA on DES. An explanation of DPA attacks is provided in Chapter 2. Researchers have devised various techniques as countermeasures against SPA and DPA, including architectural modifications, introduction of new instructions etc. A literature survey of countermeasures is provided in Chapter 2.

Thesis Organization:

• Chapter 2 presents a literature survey of cache based side channel attacks, power based side channel attacks and their respective countermeasures. The chapter begins with the concepts of side channel attacks, followed by a stepwise explanation of the encryption algorithms (i.e., DES, AES) which are the basis for understanding cache based and power based attacks. The main sections cover power based side channel attacks (i.e., SPA and DPA) along with their countermeasures, and cache based attacks (i.e., time-driven and access-driven) along with their countermeasures.

• Chapter 3 demonstrates a framework for cache based side channel attacks. The author implemented a cache based attack and recovered the complete secret key using a commercial processor provided by ‘Tensilica’. The framework is used to test the efficacy of the countermeasure.

• Chapter 4 presents a countermeasure against cache based side channel attacks which introduces registers in the processor to hold the important information while bypassing the cache completely. The countermeasure is implemented on the cryptographic algorithm Advanced Encryption Standard (AES), where the new group of registers stores the values of the SBox look-up table and bypasses the cache at the most vulnerable point of AES execution. The countermeasure is tested using the cache based attack framework explained in Chapter 3.

• Chapter 5 presents ‘Double Width Single Core’ algorithmic balancing to prevent the power based side channel attack DPA. The technique comprises software modifications to the cryptographic algorithm AES to balance the power variations. The countermeasure is successfully verified against a DPA attack.

• Chapter 6 includes future extensions of the countermeasures to combat cache based side channel attacks and power based side channel attacks. It explains new ideas to extend the research and develop advanced techniques, such as ‘Architectural modifications’ and a combination of the two countermeasures explained in Chapters 4 and 5, to obtain a single chip solution.

‘A researcher cannot perform significant research without first understanding the literature in the field’. Boote and Beile, 2005

Chapter 2

Literature Review

2.1 Side Channel Attacks

Embedded systems have become an integral part of our lives and are widely used in consumer, commercial, industrial and military applications. In some applications (e.g., private communications, financial transactions etc.) users may seek secrecy. In order to maintain the privacy of the information being passed through such devices, cryptographic algorithms have been developed. These algorithms are ‘mathematical models’ devised to ensure confidentiality of the information being processed [44]. They offer security to the system in terms of: integrity (maintaining the data unaltered); privacy (ensuring no leakage of the data); authentication (ensuring the identity); and non-repudiation (proving that the message was really sent by the claimed sender) [59].

Cryptographic algorithms are designed such that the input data is combined with a secret key and undergoes an encryption process before being sent from one end. This complex, coded data is sent to the other end through communication channels, where it is decoded to retrieve the original information. Despite the fact that modern embedded systems are capable of encrypting information, it has been shown that they are prone to attack. Such attacks can reveal the secret key, making the system vulnerable. It is certainly difficult to reverse engineer the encoded data to recover the secret key, but an adversary can attack the

system, targeting weaknesses of the cryptographic algorithm’s implementation.


Figure 2.1: Side Channel Attack (SCA) [23]

As seen in Figure 2.1, the input data (i.e., plaintext) is encoded at one end using secret key Ka and is decoded at the receiving end using secret key Kb in a typical cryptographic algorithm. While the data undergoes the encryption or decryption process, some information is leaked through side channels (i.e., power, sound, error messages, faults, electromagnetic radiation, cache/memory transactions and others). This side channel information can be used to recover the secret key after careful statistical analysis [23, 9, 24, 25]. Such attacks, built using external characteristics of encryption algorithms, are known as Side Channel Attacks. The principle of a Side Channel Attack (SCA) is that the side channels are exploited to record intermediate data which is correlated with the input data and the secret key. SCAs can be classified based on the side channels used to build the attack (e.g., Cache Attack, Power Attack, Timing Attack, Electromagnetic Attack, Acoustic Attack, Fault Attack and others). The first SCA was reported by P. Wright (a scientist with GCHQ in 1986) in [23], when a microphone was placed near a cipher machine to pick up its click-sound, which was further used to deduce the core position of 2 or 3 rotors. This information helped to break the cipher, and the British intelligence agency MI5 could successfully spy on the Egyptian embassy’s communications.

Side Channel Attacks are more effective and easier to mount than conventional attacks based on mathematical analysis [23]. The ways to mount SCAs vary from acquiring physical access followed by reverse engineering, to remote attacks followed by statistical analysis of time or power values. To understand the basis of Side Channel Attacks, it is necessary to know the encryption algorithms involved, which are explained in the following section.

2.2 Encryption Algorithms

Cryptography is the science of encoding the data being transmitted over communication channels in order to protect against unauthorized access to the information [44]. It is widely used in embedded systems to maintain the confidentiality of messages being sent over untrusted channels. Typically, cryptographic schemes are categorized as Secret Key (or symmetric) cryptography, Public Key (or asymmetric) cryptography and cryptographic hash functions. Secret Key Cryptography and Public Key Cryptography use secret keys to encrypt and decrypt the data; Secret Key Cryptography uses a single key for both encryption and decryption, while Public Key Cryptography employs different keys for encryption and decryption [59]. In all these types of algorithms, the input data is called plaintext and the encrypted data is named ciphertext. Secret Key Cryptography can be further categorized into Stream ciphers and Block ciphers. Stream ciphers process the plaintext one bit at a time and include a feedback mechanism such that the key is constantly changing, while Block ciphers operate on one block at a time and use the same key to encrypt the whole block of data. Hence, for a specific plaintext, the ciphertext produced by a block cipher will be the same, while a stream cipher produces a different ciphertext for every iteration. Commonly used Block Ciphers are: Data Encryption Standard (DES) [40]; Triple Data Encryption Standard (3DES) [40]; AES (Advanced Encryption Standard) [41]; Blowfish [60]; Twofish [61]; [62]; MISTY1 [63]; SAFER+ (Secure And Fast Encryption Routine) [64]; KASUMI [65]; SEED [66]; ARIA [67]; [68]. The author used AES for this research (as it is one of the most commonly used Block ciphers) and the following sections focus on the steps involved and the vulnerable points which become the target of Side Channel Attacks. The DES implementation is described first since it is simpler to understand.

Figure 2.2: Data Encryption Standard (DES) Algorithm

2.2.1 Data Encryption Standard

Data Encryption Standard (DES) [40] was designed by IBM in the 1970s and adopted by the National Bureau of Standards (NBS), now the National Institute of Standards and Technology (NIST), as Federal Information Processing Standard 46 (FIPS 46-3) for commercial applications. DES is a symmetric encryption algorithm which has been widely used in embedded systems [59]. The algorithm involves fractioning the plaintext (input) into 64-bit (eight octet) blocks, followed by an initial permutation of the block, as shown in Figure 2.2. The block is further split into two parts: L0 (left) and R0 (right). The XOR of subkey K1 and the data in R0 is fed to the SBox look-up tables. The resulting data is then XORed with the data in L0 and the output is placed in R1, while the R0 part of the data is passed to L1. This process of XOR and swapping of data is repeated for 16 rounds. In the end, the data in R15 and L15 is passed through the inverse permutation, resulting in the ciphertext (output). The key size of DES is 56 bits, which makes it vulnerable to brute force attacks: with the advancement in computing power and speed, it is possible to try 2^56 key values and compare them with the encrypted values to find the correct one. One of the solutions to this attack is Triple DES [40], where the data is encrypted three times using the methodology of DES: the plaintext (input data) is encrypted sequentially using key1, followed by key2 and finally key3. But this approach requires a large amount of resources, which is the major disadvantage of 3DES. It is also possible to exploit other weaknesses of DES so that the key can be recovered using a smaller number of samples [8]. To overcome these weaknesses, the Advanced Encryption Standard (AES) was developed in the year 2000, which is considered more secure and reliable than DES.
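The Feistel structure described above can be sketched as follows (a minimal illustration only, assuming placeholder subkeys and a stand-in round function F; the real DES expansion, S-boxes and permutations are abstracted away):

    #include <stdint.h>

    /* Placeholder for the DES round function F (expansion, key mixing,
     * S-box substitution and permutation are all abstracted away here). */
    static uint32_t F(uint32_t right, uint64_t subkey)
    {
        return right ^ (uint32_t)subkey;   /* stand-in only, not real DES */
    }

    /* One 64-bit block split into halves L and R, processed by 16 rounds
     * of "XOR and swap" exactly as described in the text. */
    void des_feistel_skeleton(uint32_t *L, uint32_t *R, const uint64_t K[16])
    {
        for (int round = 0; round < 16; round++) {
            uint32_t newR = *L ^ F(*R, K[round]);   /* L XOR F(R, K)       */
            *L = *R;                                /* old R becomes new L */
            *R = newR;
        }
        /* The initial and final (inverse) permutations are omitted. */
    }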

2.2.2 Advanced Encryption Standard

Figure 2.3: Advanced Encryption Standard (AES) Algorithm

Advanced Encryption Standard (AES) [41] is an improved block cipher. It accepts a plaintext and a secret key to encrypt the data, which is sent as ciphertext across communication channels. The key size can be 128, 192 or 256 bits, which determines the number of encryption rounds as 10, 12 or 14 respectively. Figure 2.3 shows the steps followed in the AES algorithm, where the plaintext is encrypted using the secret key. AES operates on a 4x4 byte matrix, called the state, which forms the basic data structure. A brief explanation is given below, though the interested reader is referred to [69] for a more thorough explanation.

The AES algorithm involves the SubBytes Substitution, ShiftRows, MixColumns and AddRoundKey operations, as shown in Figure 2.3. A separate Key Scheduling function generates all the subkeys used for subsequent rounds of encryption. AddRoundKey is the first step of AES, followed by SubBytes Substitution, ShiftRows and MixColumns. All four steps are iterated a number of times (the number of rounds being dependent on the key size), except in the last round, in which MixColumns is omitted. Considering the number of secret key bits as 128, 10 encryption rounds are required for the AES algorithm.
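As a structural sketch of this round flow (assuming a 128-bit key and therefore 10 rounds; the four step functions are only declared here, not defined, so this shows the control flow rather than the thesis implementation):

    #include <stdint.h>

    typedef uint8_t state_t[16];                 /* the 4x4 byte AES state */

    /* The four AES steps, assumed to be implemented elsewhere. */
    void AddRoundKey(state_t s, const uint8_t roundkey[16]);
    void SubBytes(state_t s);
    void ShiftRows(state_t s);
    void MixColumns(state_t s);

    /* round_keys[0] is the original secret key; round_keys[1..10] come
     * from the Key Scheduling process. */
    void aes128_encrypt_block(state_t state, const uint8_t round_keys[11][16])
    {
        AddRoundKey(state, round_keys[0]);            /* initial AddRoundKey   */

        for (int round = 1; round <= 9; round++) {    /* rounds 1..9           */
            SubBytes(state);
            ShiftRows(state);
            MixColumns(state);
            AddRoundKey(state, round_keys[round]);
        }

        SubBytes(state);                              /* last round:           */
        ShiftRows(state);                             /* MixColumns is omitted */
        AddRoundKey(state, round_keys[10]);
    }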

Figure 2.4: AddRoundKey Operation

AddRoundKey: AddRoundKey is a simple XOR operation of the plaintext and the roundkeys (derived from the Key Scheduling process). For example, p[z] ⊕ k[z] = r[z], where p[z] is a plaintext byte, k[z] is a secret key byte, r[z] is the resultant and z = 0..15, as shown in Figure 2.4. In the first iteration, the original secret key is used, which is also fed to the Key Scheduling process. The resultant keys of the Key Scheduling process are fed into the later rounds of AddRoundKey respectively. The initial AddRoundKey in the first encryption round has been proven vulnerable to side channel attacks [26, 9]. SubBytes Substitution: SubBytes Substitution is a non-linear byte substitution in which each input byte is replaced by a value in a look-up table known as the SBox, as seen in Figure 2.5. The resultant of AddRoundKey acts as the index into the SBox (a fixed look-up table) whose corresponding value is passed to the ShiftRows operation. The SBox look-up table has a precalculated set of values derived from the multiplicative inverse in Rijndael's finite field followed by an affine transformation; the details are documented in [69].

Figure 2.5: SubBytes Substitution Operation

For example, SBox{r[z]} = s[z], where r[z] is the input (i.e., the resultant of the previous step, AddRoundKey), s[z] is the resultant of SubBytes Substitution and z = 0..15, as shown in Figure 2.5.
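For one byte position z, these first two steps can be sketched as below (a minimal illustration; the 256-entry SBOX table is assumed to be defined elsewhere with the standard AES values):

    #include <stdint.h>

    extern const uint8_t SBOX[256];   /* 0x63, 0x7C, 0x77, ..., 0x16; assumed defined elsewhere */

    /* AES-phase1 for a single byte: AddRoundKey followed by SubBytes. */
    uint8_t aes_phase1_byte(uint8_t p, uint8_t k)
    {
        uint8_t r = p ^ k;      /* AddRoundKey: r[z] = p[z] XOR k[z]             */
        uint8_t s = SBOX[r];    /* SubBytes: s[z] = SBox{r[z]} (table look-up)   */
        return s;               /* the index r is what later leaks via the cache */
    }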

Figure 2.6: ShiftRows Operation

ShiftRows: In ShiftRows, the rows are shifted cyclically with varying offsets. As shown in Figure 2.6, the first row stays unchanged, while the bytes of the second row are shifted by one position, the third row by two positions and the fourth row by three positions. For example, s[5] takes the place of s[4] in the second row after moving by one byte, s[10] is moved by two bytes and takes the place of s[8], and s[15] is moved by three offsets and placed as the first byte of the fourth row of the resulting array, as seen in Figure 2.6. The resulting array of 16 bytes is fed to the next step (i.e., MixColumns). MixColumns: In MixColumns, the columns are multiplied by a Maximum Distance Separable (MDS) matrix in a finite field. As seen in Figure 2.7, the input matrix b(x) is multiplied by a fixed matrix c(x), resulting in the single column matrix a(x). For example, a[0] = 02(b[0]) ⊕ 03(b[1]) ⊕ 01(b[2]) ⊕ 01(b[3]), where a[0] is the first byte of the resulting matrix, and b[0], b[1], b[2] and b[3] form a column of the input matrix multiplied by the MDS matrix.

Figure 2.7: MixColumns Operation

Multiplying by 01 leaves a byte unchanged; multiplying by 02 corresponds to a left shift by one bit (with a conditional XOR of the reduction constant 0x1B when the shifted-out bit is set); and multiplying by 03 is a multiplication by 02 followed by an XOR with the unchanged value. In general terms, every column is treated as a polynomial over GF(2^8) and is then multiplied modulo x^4 + 1 with a fixed polynomial c(x) = 0x03·x^3 + x^2 + x + 0x02 [69]. The output value of the MixColumns step is fed to AddRoundKey, which completes one encryption round, as seen in Figure 2.3. For a key size of 128 bits, resulting in 10 encryption rounds of AES, all four steps (i.e., SubBytes Substitution, ShiftRows, MixColumns and AddRoundKey) are iterated nine times, while the MixColumns operation is omitted in the last (i.e., 10th) round of encryption. Key Scheduling: The Key Scheduling process is used to derive the ten roundkeys, one for each encryption round. The original secret key undergoes shifting followed by SubBytes Substitution via the look-up table, SBox. The resultant is then XORed with the corresponding entry of the Rcon table, which is an exponentiation of 2 to a round-dependent value; the details are given in Rijndael's documentation [41]. This operation is performed in polynomial form in Rijndael's finite field. The initial AddRoundKey uses the input secret key, while the rest of the rounds use roundkeys derived from the Key Scheduling process.
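The GF(2^8) multiplications used in MixColumns above can be sketched with the common 'xtime' idiom (a minimal illustration, not the thesis code):

    #include <stdint.h>

    /* Multiply by 0x02 in GF(2^8): shift left one bit and, if the top bit
     * was set, XOR with the reduction constant 0x1B. */
    static uint8_t xtime(uint8_t x)
    {
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
    }

    /* Multiply by 0x03: multiply by 0x02, then XOR with the original value. */
    static uint8_t mul3(uint8_t x)
    {
        return (uint8_t)(xtime(x) ^ x);
    }

    /* First output byte of one MixColumns column, exactly as in the text:
     * a[0] = 02*b[0] XOR 03*b[1] XOR 01*b[2] XOR 01*b[3]. */
    uint8_t mixcolumns_first_byte(const uint8_t b[4])
    {
        return (uint8_t)(xtime(b[0]) ^ mul3(b[1]) ^ b[2] ^ b[3]);
    }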

2.2.3 Vulnerability of AES

Over a period of time, researchers have found that it is possible to recover the secret key in AES by using the vulnerable ‘first round’ [8] or ‘last round’ [31, 55]. The main area of interest is the first AddRoundKey and SubBytes Substitution operations in the first round of encryption. During the rest of the thesis, these two operations in the first round will be referred to as AES-phase1 for convenience.

Figure 2.8: Vulnerable SBox look-up Table

In AES-phase1, as shown in Figure 2.8, the plaintext (i.e., the input) is XORed with the respective bits of the secret key. The result of this XOR operation acts as the ‘table index’ into the SBox look-up table, and the corresponding values in the table are fetched for further rounds. At this stage, while these values are fetched from the SBox look-up table for the next round of encryption, information is leaked via various side channels: the memory accesses leak power consumption, and the fetched values are stored in the cache. It is this storage in the cache and the power leakage which result in the vulnerability. In terms of cache attacks, repeated access of an SBox look-up index will result in a hit. If a hit is detected, it means that a particular SBox look-up table element has already been accessed. This unique SBox look-up table element can be found through trace-driven, access-driven or time-driven attacks. Since the plaintext is known (being the input), the attacker can easily build a relation between the input plaintext and the SBox look-up table element to find the secret key. In terms of power attacks, the power dissipation is recorded and used to build equations between the secret key and the plaintext (input value). Further, statistical analysis of such equations can reveal the secret key using Simple Power Analysis (SPA) [9, 24, 25] or Differential Power Analysis (DPA) [9].
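The reason this relation is so easily built is the invertibility of the XOR: if an attacker learns which SBox index was accessed for a known plaintext byte, the key byte follows directly (a trivial sketch with illustrative names only):

    #include <stdint.h>

    /* Since r[z] = p[z] XOR k[z] is the S-box index, a leaked index and a
     * known plaintext byte give the key byte: k[z] = p[z] XOR r[z]. */
    uint8_t recover_key_byte(uint8_t plaintext_byte, uint8_t leaked_sbox_index)
    {
        return (uint8_t)(plaintext_byte ^ leaked_sbox_index);
    }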

Figure 2.9: Cross sectional view of npn transistor [2]

2.3 Power Based Attacks

The basic building block of an ASIC design is the transistor. Figure 2.9 shows a cross-sectional view of an npn transistor, which acts as a voltage-controlled switch. When charge is applied at the gate, electrons start flowing from drain to source across the substrate, as shown by the curved arrow in Figure 2.9. The motion of electrons emits electromagnetic radiation and consumes power. Both of these characteristics are externally observable using an electromagnetic probe or an oscilloscope. Hence, while an encryption algorithm is executed in an embedded system, the power dissipation of the transistors can be recorded and subjected to power analysis attacks to recover the secret key. As pointed out by Stefan Mangard in [10], power analysis attacks exploit the fact that the instantaneous power consumption of a cryptographic device depends on the data it processes and on the operation it performs. Such attacks are categorized as Simple Power Analysis (SPA) attacks and Differential Power Analysis (DPA) attacks.

2.3.1 Simple Power Analysis (SPA)

Simple Power Analysis (SPA) attacks are built on visual inspection of power traces across the output of a cryptographic device. In a smart card, the simplest attack is to mount a resistor between power (commonly named VDD) and ground (commonly named VSS). The voltage, current and power dissipation across the resistor can be observed using an oscilloscope. The power consumption is correlated to the internal computations and hence can be utilized to reveal the corresponding functions being performed in the smart card.

Figure 2.10: Simple Power Analysis on entire DES [9]

SPA was first proposed by Kocher et al. [9], who showed that power consumption can be used to interpret the implementation of an encryption algorithm, and that implementation knowledge became the basis of the attack. As seen in Figure 2.10, Kocher et al. [9] recorded power traces of DES during its 16 encryption rounds. It is apparent from Figure 2.10 that the 16 rounds follow the same power pattern, since they execute the same set of instructions.

Figure 2.11: Simple Power Analysis on entire DES with individual clock cycles [9]

Another technique Kocher et al. frequently used was trace-pair analysis, where the attacker compares two power traces and finds the differences. From these power variations, the attacker could derive a relation of the power traces to the internal state, leading to secret key recovery. DES implementations typically include conditional branching for generating new keys in the Key Scheduling process and in the permutations. Kocher et al. pointed out that branching is risky since it can be easily revealed in power traces. Figure 2.11 shows the power consumption for seven clock cycles at 3.5714 MHz. The upper trace shows higher power dissipation than the lower one, since a jump instruction is executed in the upper one. The variation can be clearly seen at clock cycle 6 of Figure 2.11.

Figure 2.12: Power Consumption of DES [24]

Messerges et al. [24] examined the power consumption of DES as implemented on a smart card and successfully mounted SPA. They used the Hamming Weight Power Model [10] to recover the 56 bits of the secret key. As seen in Figure 2.12, the power dissipation was recorded at a known set of instructions. Through characterization tests, the attacker could map the measured pulse heights to Hamming weights and substitute the values into the equation A·k = w to recover the key, where w is a vector of Hamming weights wi, k is the vector of key bits, and A is a 0-1 matrix such that Aij is 1 only if weight wi includes key bit kj [24].
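The Hamming weight and Hamming distance models referred to here can be sketched as follows (a minimal illustration: the modelled power is taken to be proportional to the number of 1 bits in a value, or to the number of bits that toggle between consecutive bus values):

    #include <stdint.h>

    /* Hamming weight: the number of 1 bits in a value. */
    unsigned hamming_weight(uint8_t v)
    {
        unsigned w = 0;
        while (v) {
            w += v & 1u;   /* count the lowest bit */
            v >>= 1;
        }
        return w;
    }

    /* Hamming distance: the number of bits that toggle between the
     * previous and the current value on the bus. */
    unsigned hamming_distance(uint8_t prev, uint8_t curr)
    {
        return hamming_weight((uint8_t)(prev ^ curr));
    }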

Figure 2.13: Voltage transitions recorded using HC05-based smart card [24]

Direct mapping of power traces to the execution of the encryption algorithm is the secret of SPA. Messerges et al. [24] plotted the power profiles at the LDA (load) instruction, as seen in Figure 2.13. They mounted an attack on a smart card with an 8-bit HC05-based microprocessor. While an 8-bit data byte is put on the bus to be transferred from memory to a register, the corresponding voltage pulses reveal the number of transitions, which is the Hamming distance [10] between the previous and current data values. Similarly, Hamming weights can be found, which effectively reduces the brute force search space from 2^56 to 2^38 keys. They concluded that the leakage of information from a microprocessor depends upon the circuit design (i.e., the type of operation being performed and the access of memory); accordingly, an adversary can decide on the attack strategy. Park et al. [70] examined the XOR instruction and recorded power traces for various Hamming weights. An 8-bit XOR instruction was analyzed and the power consumption of the respective Hamming weights (i.e., from 0x10 to 0xFF) was subtracted from the power value corresponding to 0x00 with Hamming weight 0. This results in a table of power dissipation values and their respective Hamming weights. In a real system, an observed power value can be mapped to this power table and the corresponding Hamming weight can be used for a power attack. Messerges, Dabbish and Sloan [71] listed the dangers of SPA and DPA on cryptographic algorithms. They acknowledged that Hamming weight information is useful in the case of shifting, as employed in the key scheduling of DES, but may not be sufficient for keys larger than that of DES (i.e., 56 bits). They claimed that the attacker needs to know the details of the algorithm as well as the memory addresses of the registers being accessed during data transfer. Mayer-Sommer [72] proved that it is not necessary to know the storage addresses in order to determine Hamming weights. Additionally, she showed that if the device is operated at low frequency and high supply voltage, there is no need to average out noise to obtain useful information related to the secret key. Mangard [25] mounted SPA utilizing the power leakage during Key Scheduling of AES [69]. He exploited the fact that most smart card processors leak power while generating roundkeys (the result of the Key Scheduling process), which can be used to reduce the search space of a brute force attack to reveal the secret key. The attack is based on 128-bit AES employed on an 8-bit smart card processor. To implement the attack, the power traces related to Key Scheduling are recorded and separated from the rest of the power consumption traces.

Figure 2.14: Simple Power Analysis based on AES Key Scheduling [25]

The most common way is to extract the useful information and divide it into two sets. One set contains power measurements from en/de-cryption of AES with different data blocks and the same key, while the second set is built by en/de-cryption of AES with the same data block and different keys. By examining the variance of the two sets of results, the parts related to the key and intermediate values of Key Scheduling can be found and the respective Hamming weights can be determined. This set of Hamming weights is compared with the list of possible keys found by the attacker, which is smaller than that of the brute force method. Further, Mangard has indicated dependencies of the roundkeys among each other. As shown in Figure 2.14, roundkey3 has been split into four parts that can be attacked individually. The five bytes of each part help in the calculation of various other roundkeys and intermediate results. Figure 2.14 shows such dependencies (i.e., the values of the parts marked by bullets can be found using the values marked by circles). For part1, part2 and part3, the bullets and circles are moved by modulo four accordingly. The notation used in Figure 2.14 is: B0, B1, B2, B3 are the bytes of the keywords Wi; Sx is the output of SBox(Bx); and R0 = xor(S1, Rcon_{i/4,0}) (Rcon is an exponentiation of 2 to a user-specified value [69]). R1, R2 and R3 are not mentioned since their values are not different from the corresponding S1, S2, S3 values. Comparing these values with the known Hamming weights, the attacker gets a list of possible keys. Bertoni et al. [53] recorded power profiles for the purpose of attack using cache misses in AES. SPA is the simplest attack but is not widely used due to its drawbacks. SPA needs more resources to mount (e.g., an oscilloscope and other measurement equipment) and the adversary needs better knowledge of the encryption algorithm. Moreover, noise and other imperfections might lead to incorrect results. A more powerful and accurate technique is the Differential Power Analysis (DPA) attack, which is elaborated in the following section.

2.3.2 Differential Power Analysis (DPA)

Differential Power Analysis (DPA) attacks are implemented by statistical analysis of the power consumption recorded at intermediate points of an encryption algorithm. A detailed explanation of DPA is given by Mangard et al. [10]. Step 1 of DPA is to choose an intermediate result of the encryption algorithm which is a function of the plaintext or ciphertext (i.e., input or output) and a part of the key. In the next step (i.e., Step 2), the attacker records power consumption corresponding to the intermediate point chosen in Step 1. If the plaintext is chosen, the adversary sets the trigger of the oscilloscope to the sending of the plaintext to the cryptographic device, while if the ciphertext is chosen, he sets the trigger to the sending of the ciphertext from the cryptographic device. Step 3 involves the calculation of hypothetical power values across all possible keys and plaintexts. Step 4 is the mapping of power traces to intermediate values (i.e., the data of Step 2 and Step 3). And the last step (i.e., Step 5) is the statistical analysis of the data using correlation coefficient, difference-of-means, template and other DPA techniques. Kocher et al. [9] proposed the first DPA on DES. The attack is implemented by collecting ciphertext values C1..m and power traces T1..m[1..k] for m DES encryptions, each trace containing k samples. For DPA, a selection function D(C, b, Ks) is considered, defined by computing the value of bit b, 0 ≤ b < 32, of the DES intermediate function L at the beginning of the 16th round for ciphertext C, where Ks, 0 ≤ Ks < 2^6, represents the six key bits entering the SBox corresponding to bit b. Their methodology is based on the difference-of-means, which involves the difference of average power values when D(C, b, Ks) is 1 versus 0. Hence, ∆D[j] is the difference, averaged over C1..m according to the value of the selection function D, of the power dissipation measurements at point j, as defined by the following equation:

∆D[j] = (∑_{i=1}^{m} D(Ci, b, Ks) Ti[j]) / (∑_{i=1}^{m} D(Ci, b, Ks)) − (∑_{i=1}^{m} (1 − D(Ci, b, Ks)) Ti[j]) / (∑_{i=1}^{m} (1 − D(Ci, b, Ks))) (2.1)

≈ 2 ((∑_{i=1}^{m} D(Ci, b, Ks) Ti[j]) / (∑_{i=1}^{m} D(Ci, b, Ks)) − (∑_{i=1}^{m} Ti[j]) / m) (2.2)

While calculating differential values of various key bits, the correct ones show higher correlation and hence result in significant spikes while other data values or interference

(i.e., noise) will result in values close to zero. Hence, for incorrect values of Ks,

lim_{m→∞} ∆D[j] ≈ 0 (2.3)

Figure 2.15: Differential Power Analysis trace on DES. [9]

The power traces of DES as recorded by Kocher et al. are shown in Figure 2.15, where the first trace shows the average power consumption over the entire DES, the second one is a differential trace corresponding to the correct key value, while the lower two correspond to incorrect key guesses. The calculation is done over 1000 samples (i.e., m = 1000).

A significant peak can be seen for the correct key value of Ks. Chari et al. [73] demonstrated DPA on the AES candidate Twofish [74] and showed power analysis vulnerabilities of the Rijndael [75], SAFER+ [64], CRYPTON [76], DEAL [77], [78], Mars [79], Magenta [80], [81], RC6 [82], LOKI-97 [83], [84], FROG [85], DFC [86] and CAST-256 [87] algorithms.

While Kocher et al. [9] showed simulated results of DPA, Messerges et al. [24] mounted the attack on a smart card with a software implementation of DES. To implement the attack, power traces of the last three rounds of DES were recorded and characterized. The intermediate stage of encryption chosen for the attack is a function of the secret key and ciphertext, D, as shown by the following equation:

D(C1,C6,K16) = C1 ⊕ SBOX1(C6 ⊕ K16) (2.4)

where C1 is one bit of the ciphertext output CTOi, which is XORed with the first output bit of SBox1; C6 represents the six bits of CTOi that are XORed with six bits of the last round subkey; and K16 is the last round's subkey which is fed into SBox1, the result being obtained by looking up the index in the look-up table (i.e., SBox1).

As the second step, the encryption algorithm was executed with N plaintexts PTIi and a discrete power signal Si[j] was recorded along with the respective ciphertext output CTOi, where i is the index corresponding to the PTIi that produced Si[j] and j represents the time of the sample. Si[j] is the sampled version of the power dissipation for the portion of the algorithm chosen for the attack. The recorded power values are divided into two sets based on the partition function D(.):

S0 = {Si[j] | D(.) = 0} (2.5)

S1 = {Si[j] | D(.) = 1} (2.6)

The values in each set are averaged,

A0[j] = (1/|S0|) ∑_{Si[j] ∈ S0} Si[j] (2.7)

A1[j] = (1/|S1|) ∑_{Si[j] ∈ S1} Si[j] (2.8)

where |S0| + |S1| = N and a discrete time DPA bias signal T[j] is obtained by subtracting the two averages,

T [j] = A0[j] − A1[j] (2.9)

The function D is calculated at an intermediate point in the algorithm and any change in the dependencies of D (i.e., secret key bits or ciphertext) results in varying power values. The amount of power consumption depends on whether D is 0 or 1. The difference of the expected power values is ε if the instruction executing the bit manipulation of the D function occurs at time j′ (i.e., when j = j′),

E{Si[j]|D(.) = 0} − E{Si[j]|D(.) = 1} = ε (2.10)

At other times, when instructions not varying D are executed, the power value is independent of secret key bits and ciphertext; hence when j ≠ j′,

E{Si[j]|D(.) = 0} = E{Si[j]|D(.) = 1} = E{Si[j]}

E{Si[j]|D(.) = 0} − E{Si[j]|D(.) = 1} = 0 (2.11)

With a larger number of plaintexts (i.e., N), equation 2.9 converges to the following:

lim T [j] = E{Si[j]|D(.) = 0} − E{Si[j]|D(.) = 1} (2.12) N→∞

Authors in [88] and [89] suggested that the DPA bias signal is highest at ε and converges to 0 at times other than j′, while Messerges [24] proved that T[j] does not always converge to zero due to small statistical biases in SBox outputs. The function D is calculated using six bits of the subkey K16; the attacker can try all possible 2^6 values, build a new partition and obtain a new bias signal T[j]. For the correct value of K16, the DPA bias signal will show significant spikes where D is being manipulated. Hence, from six bits of subkey, an attacker can determine the resulting SBox output at the last round of DES. Following the same steps for the seven other SBoxes, 48 bits of the secret key can be identified and the remaining eight can be found using a brute force method.
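The partition-and-average computation described above can be summarised by the following minimal C sketch of the difference-of-means step (a hedged illustration of equations 2.5-2.9, not the authors' code); d[i] holds the value of the selection function D for sample i under one key guess, and s_j[i] the power measured at sample point j.

    #include <stddef.h>

    /* Difference-of-means DPA bias T[j] for one key guess and one sample
     * point j: partition the measured powers with D(.) and subtract the
     * averages of the two resulting sets. */
    static double dpa_bias(const double *s_j, const int *d, size_t n)
    {
        double sum0 = 0.0, sum1 = 0.0;
        size_t cnt0 = 0, cnt1 = 0;

        for (size_t i = 0; i < n; i++) {
            if (d[i]) { sum1 += s_j[i]; cnt1++; }
            else      { sum0 += s_j[i]; cnt0++; }
        }
        if (cnt0 == 0 || cnt1 == 0)
            return 0.0;                     /* degenerate partition */
        return sum0 / cnt0 - sum1 / cnt1;   /* T[j] = A0[j] - A1[j] */
    }

Repeating this for every candidate subkey and every sample point j, the correct guess is the one producing the largest spike in |T[j]|.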

Figure 2.16: Differential Power Analysis results [24]

Figure 2.16 shows the results of the attack executed on a smart card by Messerges et al. in [24]. The DPA bias signal of the correct key guess is shown alongside those of incorrect key guesses, and it is clear that the signal has its maximum value for the correct key guess. A total of 1300 plaintexts were used to build the plot, where the DPA bias is 6.5 mV for the correct key guess and about half that size for incorrect key guesses. A similar methodology can be followed for the other seven SBoxes of DES to get the complete 48-bit portion of the secret key. Other DPA attacks based on one bit of SBox output are analyzed in [90]. Ambrose et al. analyzed various instructions and investigated DPA on AES. They gave a stepwise description of DPA in [26], as shown in Algorithm 1. The point of attack is chosen as the initial AddRoundKey step, whose values are dependent on the secret key and plaintext. Ambrose et al. chose input[3] and key[3] for the attack, where input[3] is the fourth byte of the input plaintext and key[3] is the fourth key byte.

Input: Power Values P
Output: DPA bias
S0 = 0; S1 = 0; CNT0 = 0; CNT1 = 0;
for (Keyj = 0 to 255) do
    Simulate the AES to get SBox output bit biti
    Determine corresponding power Pi
    if (biti = 0) then
        S0 = Pi + S0; CNT0 = CNT0 + 1;
    end if
    if (biti = 1) then
        S1 = Pi + S1; CNT1 = CNT1 + 1;
    end if
    DPA biasj = |(S0/CNT0) − (S1/CNT1)|;
end for
Algorithm 1: Steps followed for Differential Power Analysis attack [26]

As seen in Algorithm 1, for all possible values of a key byte (i.e., 2^8 = 256), the plaintext is varied from 0 to 255 and the LSB of the SBox output is recorded. If the LSB is

0, the respective power value is added to set S0, while in the case of 1, the power value is added to set S1. Further, the DPA bias signal is calculated as the difference of the averages of the two sets

(i.e., S0 and S1).

[Plot: DPA values (Watts) versus Key Guess]

Figure 2.17: Differential Power Analysis results [26]

To implement the attack, the lw (load) instruction was chosen and power profiles were recorded at the memory stage (i.e., when the output of the FT3 SBox look-up table is loaded from memory to CPU registers to be used in the encryption algorithm). Using the recorded power values, the DPA bias signal is calculated for all possible key guesses, as shown in Figure 2.17. A significant peak can be seen at the correct key value (i.e., 14) in Figure 2.17. Ambrose et al. proved that the correct key could be successfully revealed at the XOR function of the first round of AES, at the load instruction, and with the mean of the power values over the entire SBox look-up operation of AES. However, there was no significant peak at the store instruction. Hence, the vulnerability of instructions to DPA depends on the architecture of the processor. Figure 2.17 shows the DPA attack for one byte of the secret key, while the other bytes can be recovered using the same methodology. The input samples needed for 128 key bits are 16 ∗ 2^8.

Figure 2.18: Differential Power Analysis results [27]

Ors et al. [27] were the first to mount DPA on an ASIC implementation of AES, fastcore. This attack was on a 128-bit AES following the steps mentioned in Section 2.2.2, with a change in the order of ShiftRows and SubBytes Substitution for efficiency of the hardware implementation. The attack was based on the Correlation Coefficient methodology. At first, an N × 1 matrix M1 was computed with the number of bit transitions of the eight MSBs after the initial AddRoundKey operation. The AddRoundKey operation was executed for N random plaintexts and a random secret key. As a second set of data, a matrix M4 of size N × 2^L is drawn so that each column is the prediction of the bit changes for a particular guess of the L attacked key bits. The correlation coefficient, c, between M1 and M4 is calculated using the following equation,

ci = C(M1(1 : N, 1),M4(1 : N, i)) (2.13)

where i = 0, ..., 2^L − 1 and M4(1:N, i) is the i-th column vector of matrix M4.

The correct value corresponding to L key bits shows the highest correlation as shown in Figure 2.18. The attack was implemented by averaging the measured power in order to avoid the impact of noise in recorded power values.
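The correlation step of equation 2.13 can be sketched as below (an illustrative implementation, not the authors' code): pred holds one column of M4, i.e. the predicted bit transitions for one key guess, and meas holds the measured power values of M1.

    #include <math.h>
    #include <stddef.h>

    /* Pearson correlation coefficient between predicted transition counts
     * and measured power values; assumes neither vector is constant. */
    static double correlation(const double *pred, const double *meas, size_t n)
    {
        double mp = 0.0, mm = 0.0;
        for (size_t i = 0; i < n; i++) { mp += pred[i]; mm += meas[i]; }
        mp /= (double)n; mm /= (double)n;

        double num = 0.0, dp = 0.0, dm = 0.0;
        for (size_t i = 0; i < n; i++) {
            double a = pred[i] - mp, b = meas[i] - mm;
            num += a * b;
            dp  += a * a;
            dm  += b * b;
        }
        return num / sqrt(dp * dm);
    }

The key guess whose column yields the highest correlation is taken as the value of the L attacked key bits.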

One of the practical challenges in power measurements is the inclusion of noise, which misleads the adversary in computing the DPA bias signal for power analysis attacks. The common solution to overcome the noise problem is having a higher Signal to Noise Ratio (SNR). Messerges et al. [24] examined the effectiveness of Multiple Bit DPA attacks by mounting an attack on an HC05-based smart card and found that 8-bit DPA results in 79.5 mV of DPA bias, as compared to 38.5 mV and 9.3 mV for 4-bit and 1-bit DPA respectively. Coron et al. [91] also demonstrated 4-bit DPA on DES. In fact, Coron et al.'s method uses 16 categories as compared to Messerges [24], who used two categories. More sophisticated DPA attacks are higher order attacks, as introduced by Kocher et al. [9], while Messerges et al. [92] implemented the attacks on a smart card. Kocher et al. observed that if the cryptographic algorithm leaks Hamming weight information, the absolute difference function used for preprocessing leads to successful results. Waddle and Wagner [93] came up with two higher-order attacks: Zero-Offset 2DPA, for situations when the power correlation points of two bits are at the same respective times, and FFT 2DPA, when the attacker has no information about the correlation.

Brier et al. [94] demonstrated the Correlation Power Analysis (CPA) attack based on the Hamming distance model [10]. The authors investigated the reason for the appearance of 'ghost peaks' (i.e., significant peaks at wrong key guesses) during a DPA attack, which mislead the adversary in making the right decision. The reasons for 'ghost peaks' could be:

(a) data handled along with the algorithm could be partially correlated with the target bit; (b) the distributions of an SBox output bit for two different values are deterministic and could be partially correlated; and (c) the wrong assumption of DPA that the word bits carried along with the targeted bit are uniformly distributed and independent of the targeted bit. Brier et al. [94] concluded that the countermeasures against DPA are equally efficient for CPA.

2.4 Power Based Attacks: Countermeasures

Power analysis attacks are the most powerful side channel attacks and can be practically mounted on most embedded systems. They are a serious threat and need significant efforts to build efficient countermeasures. Researchers have advised preventive measures in terms of software and hardware modifications embedded in the encryption algorithms. These measures are broadly categorized as Masking and Architectural Level Hardware Countermeasures. This section compiles the findings of researchers on countermeasures against power based side channel attacks.

2.4.1 Masking

Masking techniques involve computation in encryption or decryption with random values to obfuscate the secret information from power profiles. The aim is to make the power traces independent of the intermediate values in the encryption algorithm. As stated by Stefan Mangard [10], "A masked intermediate value vm is an intermediate value v that is concealed by a random value m; vm = v ∗ m. The attacker does not know the random number". The masking operation used could be boolean (using the XOR function), arithmetic (using subtraction/addition modulo 2^n, where n is the word length) or both. Boolean operations suit the linear parts best, while for non-linear parts, both boolean and arithmetic masking can be used. Messerges et al. [14] investigated various cryptographic implementations such as Twofish [74],

BooleanToArithmetic(x′, rx) {
    randomly select: C = 0 or C = −1
    B = C ⊕ rx;    /∗ B = rx or B = r̄x ∗/
    A = B ⊕ x′;    /∗ A = x or A = x̄ ∗/
    A = A − B;     /∗ A = x − rx or A = x̄ − r̄x ∗/
    A = A + C;     /∗ A = x − rx or A = x̄ − r̄x − 1 ∗/
    A = A ⊕ C;     /∗ A = x − rx ∗/
    return(A);
}
Algorithm 2: Boolean to Arithmetic; Messerges et al. [14]

RC6 [82], Rijndael [75], Serpent [81] and Mars [79] and defined a masking model for the functions used in these algorithms. The authors defined boolean and arithmetic operations as:

Boolean mask: x′ = x ⊕ rx (2.14)

Arithmetic mask: x′ = (x − rx) mod 2^n (2.15)

where x′ is the masked value, x is the unmasked value and rx is the random mask. For simple operations like bitwise boolean operations or permutations, boolean masking can be applied, while for additions and polynomial multiplications, arithmetic masking is more appropriate. Boolean masks can be transformed to arithmetic ones as seen in Algorithm 2. The authors experimented with the random masking scheme on a 32-bit ARM processor with the above-mentioned AES finalists. Comparing cycle counts, RAM/ROM consumption and security cost, Rijndael [75] and RC6 [82] proved to be best in terms of performance and storage overheads.
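As a concrete illustration of boolean masking through a linear operation (a minimal sketch under the model of equation 2.14, not code from [14]): since XOR is linear, a round key can be added to the masked value and the mask removed afterwards without ever exposing the unmasked intermediate.

    #include <stdint.h>

    /* Boolean masking through an XOR-based (linear) operation:
     * (x ^ m) ^ k = (x ^ k) ^ m, so the computation can proceed on the
     * masked value and the mask can be removed at the end. */
    static uint8_t masked_xor_with_key(uint8_t x, uint8_t k, uint8_t m)
    {
        uint8_t xm = x ^ m;   /* masked input; x itself is never processed */
        uint8_t ym = xm ^ k;  /* operate on the masked value               */
        return ym ^ m;        /* unmask: equals x ^ k                      */
    }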

Coron and Goubin [91] pointed out that the 'boolean to arithmetic' conversion (and vice versa) as shown by Messerges [14] is not secure. The authors explained theoretically that a 2-bit DPA attack can be mounted on the 'boolean to arithmetic' and 'arithmetic to boolean' algorithms. Later, Coron et al. [95] and Goubin [96] proposed a secure methodology for the 'boolean to arithmetic' transformation and vice versa. The 'duplication method' was independently suggested by Goubin and Patarin [97] and Chari et al. [98]. A secret sharing scheme is used to split the data into several parts which are computed on separately and recombined in the final rounds. As an example, for a masked intermediate value vm = v ⊕ m, the intermediate v is known by two shares, vm (i.e., the masked value) and m (i.e., the random mask) [10]. Hence, the intermediate v can be computed using vm and m. Cryptographic algorithms use both linear and non-linear functions which need to be masked. As seen in Section 2.2.2, the AES SBox is a non-linear function whose value is the multiplicative inverse in a finite field, f(x) = x^{-1} [10], which is compatible with multiplicative masking. Akkar and Goubin [28] introduced the 'transform masking method' as an efficient way to switch between boolean and multiplicative masking. However, Goubin and Akkar [99] noticed that multiplicative masking is vulnerable to DPA attacks and demonstrated second order DPA attacks on the 'duplication method' [97] and the 'transformation table method' [96]. The authors suggested a 'unique masking method' for DES which involves modifying the SBoxes by generating a random α, performing a permutation P^{-1} on it and XORing the value P^{-1}(α) into a table, claiming that their methodology provides protection against high order DPA. Akkar and Giraud [28] were the first to propose masking of the AES SBox look-up table. Figure 2.19 shows the SubBytes Substitution step of AES, which has an SBox look-up table calculated by a multiplicative inversion and an affine transformation.


Figure 2.19: SubBytes Substitution of AES [28]

The transformation from boolean to multiplicative mask and vice versa is shown in

Figure 2.20. The algorithm starts from a boolean masking of the input data Aij with a random mask Xij, which is subjected to a multiplicative mask to perform the inversion. Assuming Yij is an 8-bit random mask other than zero and ⊗ denotes multiplication in GF(2^8) using the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1 as modulus, the steps followed are: (1) the boolean-masked value is multiplied with Yij; (2) XOR with Xij ⊗ Yij; (3) inversion in GF(2^8); then, to switch back to a boolean mask: (4) XOR with Xij ⊗ Yij^{-1}; (5) multiply with Yij. The authors claimed that at any stage of the transformation, the intermediary values are not related to the input Aij and hence can combat DPA.


Figure 2.20: Transformation from Boolean to Multiplicative mask and vice versa [28]

Trichina et al. [100] gave a simplified version of Akkar and Giraud's transformation method [28] in which the algorithm need not re-establish the unmasked value after each round; rather, this can be done at the end. It needs no extra inversion in GF(2^8), but only two extra multiplications and a squaring operation. A common problem in multiplicative masking is that when the input is zero, the inverse is zero as well. Hence, if an attacker can read the data before (i.e., Ai,j ⊗ Xi,j) and after (i.e., (Ai,j ⊗ Xi,j)^{-1}) the inversion, which is zero, information on Ai,j can be revealed. This fact can lead to first order DPA as if no masking were applied. The authors further advised re-computing the SBox look-up table T′ such that T′[Ai,j ⊗ Xi,j ⊕ Xi,j^2] = T[Ai,j ⊗ Xi,j]. Algorithm 3 shows a generalized algorithm to compute T′ from a given table T and a random value X. Since the numbers in X are read in random order, the resulting values are protected during re-computation.

Input: table T; random X = (x_7, ..., x_1, x_0)
Output: table T′ such that T′[b ⊕ X] = T[b] for b = 0..255
T′ := T
For every x_i from (x_7, ..., x_0) in random order do:
    If x_i = 1 then
        (1) split T′ into blocks, each block containing 2^i subsequent elements of T′;
        (2) swap pairwise the j-th and (j+1)-st blocks;
        (3) assign the result to T′;
Return T′
Algorithm 3: Look-up table re-computation by Trichina et al. [100]
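For clarity, the following is a minimal C sketch of the general masked-table idea that Algorithm 3 realises through block swapping; it builds the recomputed table directly and is not Trichina et al.'s exact procedure. The names m_in and m_out are illustrative input and output masks.

    #include <stdint.h>

    /* Recompute a 256-entry look-up table so that masked indices return
     * masked outputs: T2[x ^ m_in] = T[x] ^ m_out. With m_out = 0 this
     * matches the relation T'[b XOR X] = T[b] used in Algorithm 3. */
    static void recompute_masked_table(const uint8_t T[256], uint8_t T2[256],
                                       uint8_t m_in, uint8_t m_out)
    {
        for (int x = 0; x < 256; x++)
            T2[x ^ m_in] = T[x] ^ m_out;
    }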

The new SBox look-up table provides more security but may not be cost effective due to the extra hardware needed in the implementation. Schramm and Paar [101] also concentrated on techniques to mask the SBox look-up table effectively, focusing on higher-order DPA. The authors [101] proposed to compute a masked SBox look-up table based on another masked SBox look-up table. Considering the demands of security solutions in terms of memory requirements and efficiency, Golic and Tymen [102] proposed a practical masking scheme in which GF(256) is randomly embedded in a larger algebraic structure such that the zero value is mapped to a set of values and all the operations are compatible with GF(256). Blomer et al. [103] investigated the available masking techniques (i.e., additive [14], the transformation masking method [28] and the combinational logic design of masked AES [104]) and devised a new square-and-multiply approach to masking the intermediate values. The aim of this technique is to perfectly mask the non-linear function of AES (i.e., SubBytes Substitution), which cannot be masked by simple arithmetic and boolean operations.

The authors assumed a True Random Number Generator (TRNG) whose outputs are not accessible to the adversary. The SubBytes Substitution contains a multiplicative inverse function INV(x) where

INV(x) = x^{-1}, if x ∈ F_256^× (2.16)

INV(x) = 0, if x = 0 (2.17)

The methodology starts with an additively masked value u + r, resulting in INV(u) + r′, where r, r′ are uniformly distributed random masks. INV(x) is calculated as x^254 using a square-and-multiply methodology or an addition chain. In addition, Blomer et al. [103] concluded that their methodology is better in terms of performance and cost compared to the method in [104], since the latter is susceptible to DPA if the input byte is zero, while the former re-uses the already designed multipliers and adders for randomization. Oswald et al. [105] proposed a simpler multiplicative inverse approach where the AES SBox look-up operation is reduced to GF(4). Another efficient scheme is presented by Oswald and Schramm [106], where the inversion of a non-zero element is computed in a finite field using a logarithm and exponentiation with the negated logarithm. A DPA was performed using an 8-bit Reduced Instruction Set Computer (RISC) architecture, proving the effectiveness of the solution. Trichina et al. [107] presented a technique based on masking logic cells (i.e., the AND gate), while Golic and Menicocci [108] proposed an improved version with a shorter critical path. A common problem that arises in such practical solutions is glitching, which is considered by Fischer and Gammel [109] with the assumption that the masked input value of a cell should arrive around the same time as the corresponding mask. This problem is overcome by the solution proposed by Chen Zhou in [110], considering both early propagation and glitches.
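The inversion INV(x) = x^254 mentioned above can be illustrated with the following unmasked C sketch (for exposition only; the masked versions discussed in [103] interleave the masks with these operations). It uses the AES irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1.

    #include <stdint.h>

    /* Multiplication in GF(2^8) modulo the AES polynomial (0x1B). */
    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++) {
            if (b & 1)
                p ^= a;
            uint8_t hi = a & 0x80;
            a <<= 1;
            if (hi)
                a ^= 0x1B;
            b >>= 1;
        }
        return p;
    }

    /* INV(x) = x^254 computed by square-and-multiply; INV(0) = 0 as in
     * equation 2.17. */
    static uint8_t gf_inv(uint8_t x)
    {
        uint8_t result = 1, base = x;
        for (int e = 254; e; e >>= 1) {
            if (e & 1)
                result = gf_mul(result, base);
            base = gf_mul(base, base);
        }
        return (x == 0) ? 0 : result;
    }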

2.4.2 Architectural Level Hardware Countermeasures

Digital designs resulting in constant power consumption are another secure countermeasure against power analysis attacks. Sense Amplifier Based Logic (SABL) [15] utilizes Dynamic Differential Logic (DDL) circuits to maintain one switching event per cycle, independent of the sequence and input. Through the combination of the differential and dynamic parts of the design, the circuit makes the four transitions (i.e., 0 → 0, 0 → 1, 1 → 0 and 1 → 1) equal to first order. In addition, the output capacitance needs to be balanced to have similar charging times, resulting in constant power dissipation. The authors of [15] designed the substitution box (S9-box) of the KASUMI [65] algorithm (the cryptographic algorithm in the 3G cellular standard [111]) in 0.18um CMOS technology, resulting in a SABL counterpart less than twice the size of the regular S9-box. To compare power consumption, the Normalized Energy Deviation was computed as NED = [Max(energy/cycle) − Min(energy/cycle)]/Max(energy/cycle), which ranges between 0 and 1. The smaller the number, the more measurements and the more accurate equipment are needed to mount the attack. The authors proved that the NED of the SABL design is reduced by 80%, though at increased size and cost.

Tiri et al. [16, 17, 18] were the first to propose a top-down synchronous design that pursues constant power dissipation. The constant power dissipation was achieved by using the logic style 'wave dynamic differential logic' (WDDL) and a 'differential routing' technique. The technique involves special WDDL gates with a parallel combination of two positive complementary gates, one producing true outputs from true inputs and the other false outputs from false inputs. It has precharge and evaluation phases at the positive and negative clock levels respectively. During the precharge phase, the inputs are all set to zero, turning the outputs to zero and passing this on to the next stage. Hence a zero-wave is propagated, giving the name WDDL. In the evaluation phase, a differential output is produced.

Another requirement is that the library, being dual-rail, should have a 100% switching factor. Figure 2.21 shows the AOI32 gates for the regular and WDDL circuit designs, where the WDDL version contains complementary gates combined with inverters. To get constant power consumption, the load capacitance needs to be the same across all nodes. The capacitance is composed of the intrinsic capacitance of the gates (i.e., the output capacitance of the driver and the input capacitance of the load) and the interconnect capacitive load. The intrinsic capacitances are taken care of in the circuit-level design. However, the interconnect capacitances are balanced using a differential pair of routing.

Figure 2.21: Regular and WDDL circuit [17]

It ensures that the same length of wires is used, offering similar resistive and capacitive loads. Tiri et al. elaborated a System on Chip (SoC) design flow with two additional steps over the usual RTL-to-GDS flow: 'cell substitution' and 'interconnect decomposition'. The 'cell substitution' step replaces all regular cells by their WDDL counterparts with differential outputs and pins for differential routing. 'Interconnect decomposition' duplicates and translates the fat wires in the design, as seen in Figure 2.21. A prototype IC was designed with identical coprocessors, one with regular gates and routing and the other with WDDL gates and differential routing, followed by a correlation analysis based DPA [112], showing significant resistance to DPA attacks. The secure coprocessor with the embedded AES algorithm did not reveal five of the secret key bytes, while the other 11 key bytes were found with an average of 255,000 measurements. On the other hand, with the insecure coprocessor, 320 samples could recover the complete secret key.

Figure 2.22: Simple Dynamic Differential Logic [18]

Tiri et al. [18] introduced Sense Amplifier Based Logic (SABL), which maintains a constant power dissipation across the cryptographic device. The technique is based on two basic principles [113]: the first is Dynamic Differential Logic (DDL), which means one switching activity per clock cycle, making the power independent of the input values and the sequence of instructions; the second is a balanced capacitance, leading to the same amount of charge and discharge for every switching activity. Figure 2.22 shows a Simple Dynamic Differential Logic (SDDL) gate, consisting of two complementary gates combined with AND gates to add a precharge signal. It follows De Morgan's principle that true inputs produce true outputs and vice versa, but it does not guarantee only one switching event per clock cycle. As shown in Figure 2.22, both differential outputs of the XOR gate have a switching event even when only one of the differential inputs has a switching event. As opposed to SDDL, WDDL [18], which is designed using AND and OR gates, solves this problem, so that when the inputs are precharged to zero, the outputs switch to zero automatically without being forced. Further, Tiri et al. elaborated on Divided WDDL (DWDDL), where a WDDL circuit is divided into single-ended modules which can be derived from one another by inverting the inputs. In terms of layout, one of the modules can be placed and routed, while the second module can be designed by duplicating it and replacing AND gates by OR gates. Thus DWDDL results in balanced capacitance and avoids the use of differential routing.
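The gate construction can be modelled behaviourally as follows (a C sketch of the dual-rail logic only, not a description of Tiri et al.'s cell library): each signal is a (true, false) pair, (0,0) being the precharge spacer, and the WDDL AND/OR gates are built from positive AND and OR functions so that an all-(0,0) input wave produces an all-(0,0) output wave.

    #include <stdint.h>

    /* Dual-rail signal: (t, f); (0,0) is the precharge value,
     * (1,0) encodes logic 1 and (0,1) encodes logic 0. */
    typedef struct { uint8_t t, f; } dual_rail;

    /* WDDL AND gate: positive AND for the true rail, positive OR for the
     * false rail, so precharged inputs propagate the zero wave. */
    static dual_rail wddl_and(dual_rail a, dual_rail b)
    {
        dual_rail y = { (uint8_t)(a.t & b.t), (uint8_t)(a.f | b.f) };
        return y;
    }

    /* WDDL OR gate: the dual construction. */
    static dual_rail wddl_or(dual_rail a, dual_rail b)
    {
        dual_rail y = { (uint8_t)(a.t | b.t), (uint8_t)(a.f & b.f) };
        return y;
    }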

Figure 2.23: Output Voltage of Single Ended vs. WDDL circuits [18]

The authors of [18] compared the single-ended and WDDL circuits on an FPGA using a Xilinx Virtex-II Development kit and an HP 54542C oscilloscope. Figure 2.23 shows the output voltage, wherein the single-ended circuit has many glitches while WDDL has exactly one transition as expected, which shows that their solution is DPA resistant.

Figure 2.24: Dual-rail protocol: (a) Random order of spacers (b) Alternative order of spacers [19]

Sokolov et al. [19] proposed dual-rail circuits with two spacers, where the switching follows the pattern: spacer → code word → spacer → code word. The spacers can be arranged in random or alternative order, as seen in Figure 2.24(a) and (b) respectively. The advantage of the alternative order is that all bits are switched in each clock cycle. This bit transition ensures the energy balance between clock cycle operations. As compared to single-rail circuits, in which one rail switches up and down (i.e., the same gate always switches), in dual-rail circuits both rails switch from the all-zeros state to the all-ones state through the intermediate states of the code word. Hence, all gates are operational, switching in pairs, which makes the system more DPA resistant. The authors devised a software tool named the 'Verimap design kit', which converts a single-rail circuit to a dual-rail circuit adopting either a self-timed or a clocked architecture. The authors of [19] simulated AES designs with single-rail and transformed dual-rail circuits. In terms of switching activity, the single-rail circuits show a minimum activity of zero for no inputs and a maximum 48% higher than the average value, while in the case of dual-rail circuits, the switching activity is maintained constant. Similarly, comparing the power profiles, single-rail circuits clearly showed significant peaks at clocking 'in and out', while dual-rail circuits show a similar pattern throughout without showing any correlation to clocking. However, there is extra hardware overhead: combinational logic increased by 102%-127%, the number of wires by 117%-145% and flip-flops by 228%-289% due to the additional circuitry added for dual-rail operation and alternate spacers. A hardware current flattening architecture called PAAR is proposed in [114] to internally flatten the current at the instruction level. The idea is to reduce the data and program dependent current variations to combat power attacks. Considering the attacker's point of observation [112] as the core power supply, the current consumption is flattened by adding non-functional instructions (NFIs). An NFI neither changes the functionality nor adds extra registers. NFIs for the SC140 processor are NOP, OR dn,dn and AND #0,dn. The aim of using NFI instructions is to increase the execution time of the blocks which produce high currents. At the same time, NFIs are also used to increase the current consumption in the portions of the program where the current values are low. The authors introduced the real-time architecture PAAR, which is composed of two blocks: the Feedback Current Flattening Module (FCFM) and the Pipeline Current Flattening Module (PCFM). The PCFM is responsible for inserting non-functional instructions, while the FCFM is used to measure the instantaneous current and send two feedback signals to the PCFM. The aim is to keep the current consumption variations between two programmable limits. The authors simulated PAAR with the SC140 processor and found that their technique reduced the peak-to-peak current by 78%. However, there is an increase in energy consumption of 74% and added hardware cost as well. Ambrose et al. [115] proposed a hardware and software approach to design a DPA resistant encryption algorithm. The technique involves injecting random code at random places in the most vulnerable step of AES (i.e., SubBytes Substitution, explained in section 2.2.3).
A flag instruction was added which enabled the randomization at the beginning of the step, and disabling is combined with an instruction in the Instruction Set Architecture (ISA) of the target architecture. The overhead in terms of average area is 1.98%, average runtime is 29.8% and average energy increase is 19.8%. Ambrose et al. [115] claimed that the technique can be applied against electromagnetic attacks (e.g., SEMA and DEMA) as well. Non-deterministic processors [116, 117] schedule and execute instructions out-of-order, hence eliminating the correlation of the original program with the actual execution. The instruction level parallelism is limited to particular applications, restricting the level of security in non-deterministic executions.

[Diagram: AES round structure (AddRoundKey, SubBytes, ShiftRows, MixColumns) with key scheduling, as modified for MUTE]

Figure 2.25: Multiprocessor Balancing Technique: MUTE [29]

Ambrose et al. [29] offered a multiprocessor balancing technique, MUTE, where two complementary AES encryption algorithms are executed in parallel to balance the power variations due to signal transitions. Figure 2.25 shows the algorithmic modifications, mainly in the Key Scheduling and SubBytes Substitution steps of AES. In order to generate inverted subkeys, an inverted secret key is fed in, which looks up a transposed SBox look-up table. Similarly, the SBox look-up table in SubBytes Substitution is inverted and transposed to get an inverted output at each step. The idea is to run the typical and inverted AES in parallel in two respective cores, which is achieved by executing the same instructions with complemented data values. Since the two cores are encrypting at the same time, some synchronization is needed, which is ensured by using a signature detection unit, FUNIT.
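The balancing principle can be illustrated with a small, hypothetical C example (not the MUTE implementation): for any byte v, HW(v) + HW(~v) = 8, so if the second core always processes the complemented data, the combined number of set bits (and hence, ideally, the combined switching activity) is data independent.

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming weight of a byte. */
    static int hw(uint8_t v)
    {
        int c = 0;
        while (v) { c += v & 1; v >>= 1; }
        return c;
    }

    int main(void)
    {
        /* hypothetical intermediate values on the main core */
        uint8_t vals[4] = { 0x3A, 0x00, 0xFF, 0x5C };

        for (int i = 0; i < 4; i++) {
            uint8_t v  = vals[i];
            uint8_t vc = (uint8_t)~v;   /* value processed by the second core */
            printf("v=0x%02X HW(v)=%d HW(~v)=%d sum=%d\n",
                   v, hw(v), hw(vc), hw(v) + hw(vc));   /* sum is always 8 */
        }
        return 0;
    }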


Figure 2.26: DPA plots at load (a) Typical Processor (b) MUTE-AES [29]

Ambrose et al. [29] employed two COREs with individual data and instruction memories, the FUNIT and three additional registers to: start the balancing; stop the balancing; and hold the process in case of an external interrupt. Figure 2.26 shows the DPA plots of a typical processor versus the processor with two cores (i.e., MUTE) at the load instruction. The typical processor DPA plot in Figure 2.26(a) shows a significant peak at the correct key value, while the balanced processor (i.e., with MUTE) obfuscates the correlation and does not produce a prominent peak, as shown in Figure 2.26(b). DPA attacks executed (a) at the XOR instruction in the AddRoundKey step of AES and (b) with the power average during the SBox access in the SubBytes Substitution step of AES showed the same pattern of results (i.e., a prominent peak at the correct key value for the typical processor and no prominent peak with MUTE). The new architecture results in a 0.42% performance overhead, while area, dynamic power and leakage power are doubled. Furthermore, the lock-step implementation is also complicated and needs accurate implementation for complete balancing.

[Diagram: processor, cache and main memory, with cache hit time Th and miss time Tm]

Figure 2.27: How cache fits in Processor Architecture

2.5 Cache Based Attacks

2.5.1 Processor Architecture

With the advancement of technology, processor speed has increased exponentially, yet the improvement of memory access speed has lagged. In order to improve processor performance and increase computational speed, a smaller and faster memory known as a cache is placed between the CPU and main memory. A cache is memory storage which is primarily used to save the time consumed in retrieving data and instructions from the main memory, as needed by the processor while executing code. In the early 1960s, caches first appeared in research machines and later in that decade, in production machines [58]. Figure 2.27 represents how the cache fits between the main memory and the processor. A cache stores data based on temporal locality (i.e., data accessed once by the processor is stored in the cache and is available for the next access) and spatial locality (i.e., data adjacent to the referenced data is also stored in the cache, since it is expected to be requested as the program progresses). The organization of memory in a processor architecture is such that the fastest and smallest memory is placed closer to the processor to have faster transactions, while the slower and larger memory is placed at the next level. The minimum unit of information at any level of memory is called a block [118]. There are restrictions on the placement of blocks depending on the cache organization. A cache can be organized as direct-mapped, set-associative or fully-associative. To understand the three kinds of cache organization, it is important to be familiar with the structure of a cache line.

Valid | Tag   | Content
true  | 0b001 | 0x20 0x21 0x22 0x23 (data of address 0b001)
false | ..... | (empty line)
true  | 0b000 | 0x00 0x01 0x02 0x03 (data of address 0b000)
false | ..... | (empty line)

Figure 2.28: Structure of Cache

Figure 2.29: Cache Internal Structure

As shown in Figure 2.28, the cache line has three main parts: the 'cache valid flag' ("Valid" in Figure 2.28), the 'cache tag' ("Tag" in Figure 2.28) and the 'cache data' ("Content" in Figure 2.28) [118]. The 'cache valid flag' is set if the cache line holds valid information. If this flag is unset, the respective cache line is not checked for the data. The 'cache tag' is derived from the memory address. If the "Valid" flag is set, the "Tag" is compared with the requested memory address. To speed up processing, the tags are compared in parallel. The 'cache data' contains the data of the respective memory addresses. The internal structure of direct mapped, set associative mapped and fully associative mapped caches is shown in Figure 2.29. In the case of a direct mapped cache, there is one dedicated place per block, where cache line = (block address) modulo (number of cache blocks). If the cache is set associative, the block can be placed in a restricted set of places in the cache. A set contains one or more blocks. So, a block is first mapped to a set, calculated by (block address) modulo (number of sets); further, it can take any place within the set. If a set contains n blocks, the cache is called n-way set-associative. In a fully associative cache, the block can be placed anywhere in the cache. A direct mapped cache is also known as one-way set-associative, and a fully associative cache containing m blocks is known as an m-way associative cache. As shown in Figure 2.29, block 12 is placed differently in caches with different structures. The internal structure of the cache as explained above helps in understanding how the data is accessed in the cache during the encryption and decryption process. During execution, when the processor needs data for processing, it requests the data from the cache. If the data is available in the cache (known as a 'hit'), it is given to the processor in a short time, denoted by Th in Figure 2.27. If the data needs to be brought from the main memory, it takes a longer time, known as a 'miss' and denoted by Tm in Figure 2.27. In the case of a cache hit, the data is available in one or two clock cycles, while in the case of a cache miss, there is a miss penalty. The 'miss penalty' is the time taken to copy the block from the lower level of memory to the upper level plus the time consumed to deliver this block to the processor [118]. This time can be in the order of 10s to 100s of clock cycles. In the process of writing the missing data, the cache might be full if all cache lines are already in use. In such a case, a cache line will be replaced with the new data. There are many possible strategies to choose which cache line needs to be evicted. Two such policies are Random selection and Least Recently Used (LRU). In the case of Random selection, the blocks to be replaced are randomly chosen, while in the case of LRU, the blocks that have not been used for a long time become candidate blocks for replacement. The replacement policy preserves the most frequently used blocks, in case they need to be accessed again. As explained, a 'cache hit' is fast, hence takes less time, while a 'cache miss' takes longer since it needs to bring data from other levels of memory. The cache based (time-driven) side channel attacks, explained in the following sections, exploit this time difference in accessing memory addresses in the cache.
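The mapping described above can be summarised with a small C sketch (hypothetical cache parameters, for illustration only): an address is split into a block offset, an index selecting the cache line, and a tag that is compared against the stored Tag field.

    #include <stdint.h>

    #define BLOCK_SIZE 32u   /* assumed block (line) size in bytes */
    #define NUM_BLOCKS 64u   /* assumed number of lines in a direct-mapped cache */

    /* Decompose an address following
     * cache line = (block address) modulo (number of cache blocks). */
    static void cache_decompose(uint32_t addr, uint32_t *offset,
                                uint32_t *index, uint32_t *tag)
    {
        uint32_t block_addr = addr / BLOCK_SIZE;
        *offset = addr % BLOCK_SIZE;
        *index  = block_addr % NUM_BLOCKS;  /* which cache line to check */
        *tag    = block_addr / NUM_BLOCKS;  /* compared with the stored Tag */
    }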

2.6 Classification of Cache Based Attacks

Most general purpose processors use caches to reduce memory access time and hence the total execution time. Researchers have shown that the information stored in the cache can be used to reveal the internal computations of encryption algorithms. Hence, cache-based attacks pose a serious threat to the system. The two main classes of cache based attacks are Time-Driven and Access-Driven attacks, depending on the method used to mount the attack.

2.6.1 Time-Driven Attacks

Kocher et al. [7] first noted that cache behaviour can be used for side channel attacks, mentioning the relation of memory accesses to the execution time. Kelsey et al. [119] showed that the timing dependent behaviour of table accesses in CAST [87], Blowfish [60] and Khufu [120] could leak confidential information to the attacker. Tsunoo et al. [121] proposed a cache attack in which the cipher structure was used to obtain the complete secret key; the attack recovered the entire key using 2^16 plaintexts in a real-world scenario. D. J. Bernstein [56] proved that cache attacks could be mounted remotely, without the adversary having direct access to the machine. To implement the attack, cache Hit/Miss profiles were recorded using a known key and an unknown key for millions of plaintexts. Further, the two sets of data were analyzed to recover the complete secret key. According to Bernstein, a miss observed in the first instance (i.e., with a known key) will recur at a different instance as well. So, the comparative analysis of both sets of data shows the XOR relation between the key elements. The key elements can then be computed by resolving the XOR logical equations. Wang et al. [11] experimented with D. J. Bernstein's attack [56] using known plaintext and recording the encryption time. A correlation analysis using the encryption times for a known key and an unknown key resulted in recovery of the complete secret key used for AES encryption. Jayasinghe et al. [30] implemented Bernstein's attack [56] on a real system

and recovered the complete secret key. The experimental setup includes two servers, one being the victim server and the other being a replica of the first. When data packets are sent to the server, a time stamp is added, and another time stamp is added after encryption using the server's secret key. The packet containing the two time stamps and the encrypted data is then sent back to the client. Using the two time stamps, the encryption time is calculated. The process is repeated using an unknown key and the encryption timings are recorded. Using a correlation program, the two sets of timing data are compared and a possible key space is generated. For all possible keys identified by the experiment, the corresponding ciphertext values are recorded with a known all-zero plaintext, which is known as the scrambled-zero. The resulting ciphertext values are compared with the scrambled-zero ciphertext from the server with the unknown key. The matching ciphertext points to the correct secret key. As seen in Figure 2.30, the minimum number of data packets needed for a successful cache attack is 2^24.

Figure 2.30: Number of data packets versus key combinations [30]

However, these attacks could be avoided if the user has a large cache, which could accommodate all of the table look-ups [13]. The attack described in [11] is simple and can be easily deployed but has many shortcomings. First, the attacker needs to use a reference key and build the graph, which needs to be followed by an analysis with an unknown key. In most cases, the analysis with a known key is not practically possible. Secondly, the attack needs millions of plaintexts to be encrypted and analyzed for the statistical analysis.

Figure 2.31: Cache based Attack results [31]

Additionally, some time-driven attacks are also based on 'cache collisions'. A 'cache collision' refers to the state when two table look-ups point to the same value. Bonneau et al. [55] suggested that a higher number of cache collisions leads to a smaller number of cache misses in an encryption, resulting in a shorter encryption time compared to an encryption with a smaller number of cache collisions. Kong et al. confirmed this by experimenting with the last round AES attack explained in [31] on both a simulated processor model and a Pentium 4 machine. The attack works on data obtained by the last round of AES. If a cache collision occurs, it leads to equation 2.18,

ki ⊕ xi = kj ⊕ xj → ki ⊕ kj = xi ⊕ xj (2.18)

(where ⊕ denotes XOR; ki and kj are last round key bytes while xi and xj are ciphertext bytes). Using a large number of samples, the adversary can predict that among all possible values of xi ⊕ xj (i.e., 256 values), the one with the smallest mean encryption time implies a 'cache collision' and hence shows the correct key difference (i.e., ki ⊕ kj). For the purpose of the experiments, 16 million encryptions are executed to compute the mean encryption time, using the same key on both the simulated processor and the Pentium 4 machine. The result of such an attack is shown in Figure 2.31(a) and Figure 2.31(b). The difference between ki and kj is 254. As seen in the graph of Figure 2.31, the smallest encryption time shows the correct key difference. The attack is possible only if the adversary can record the internal collisions. Brumley and Boneh [122] proved that timing attacks are not only possible on smart cards but also on an OpenSSL-based web server running on a machine in a local network. It has also been proven that if two virtual machines are running on the same machine, the network server virtual machine can recover the secret key from the secure virtual machine. Felten et al. [123] mounted a cache attack using Web-browsing history and Web cookies set accordingly. The attack uses the Web browser file cache and the DNS cache, which store information about the websites visited and the network addresses to which the machine was recently connected. Such attacks can be easily deployed remotely without the victim's knowledge. Koeune and Quisquater [124] presented a timing attack using branch statements during the SubBytes Substitution step of an AES implementation. Fournier et al. [20] proposed a trace-driven attack where cache Hit/Miss profiles were collected during the execution of an AES program using power analysis. The focus was to use the cache access information during the SubBytes Substitution of AES while the values from the SBox look-up table are fetched from the cache (if available) or memory. Keeping the

first byte of plaintext p1 constant, the second byte p2 was varied until there is a cache hit, which means that p1 ⊕ k1 ≈ p2 ⊕ k2 (considering a cache granularity of 16 bytes), which leads to p1 ⊕ p2 ≈ k1 ⊕ k2. Similarly, the third byte of plaintext (i.e., p3) is varied until another cache hit. Similar XOR equations are derived between key bytes. Iterating this process a few times, the high nibble of each key byte becomes known as a function of the high nibble of the first key byte. With 240 acquisitions, the search space to recover the secret key was reduced from 2^128 to 2^68.
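The last-round collision analysis of equation 2.18 can be sketched as follows (an illustrative C implementation, not taken from [31]): for each candidate value of xi ⊕ xj, the mean encryption time is computed over the collected samples and the candidate with the smallest mean is taken as ki ⊕ kj.

    #include <stdint.h>
    #include <stddef.h>

    /* xi[s], xj[s]: the two ciphertext bytes of sample s;
     * enc_time[s]: the measured encryption time of sample s. */
    static int best_key_difference(const uint8_t *xi, const uint8_t *xj,
                                   const double *enc_time, size_t n)
    {
        double sum[256] = { 0 };
        size_t cnt[256] = { 0 };

        for (size_t s = 0; s < n; s++) {
            uint8_t delta = xi[s] ^ xj[s];
            sum[delta] += enc_time[s];
            cnt[delta]++;
        }

        int best = -1;
        double best_mean = 0.0;
        for (int d = 0; d < 256; d++) {
            if (cnt[d] == 0)
                continue;
            double mean = sum[d] / cnt[d];
            if (best < 0 || mean < best_mean) { best = d; best_mean = mean; }
        }
        return best;   /* candidate for ki XOR kj */
    }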

2.6.2 Access-Driven Attacks

By definition, access-driven attacks can be implemented if an adversary gets access to the data stored in the cache. In 2002, Page [8] proposed an attack to recover the secret key based on the fact that a cache hit would occur when the inputs to two SBoxes are equivalent. The attack was implemented on DES [125] by collecting the cache Hit/Miss profiles derived by measuring power consumption or electromagnetic radiation. The attack starts by finding a plaintext which results in a cache hit between the SBox0 of the first round and the SBox0 of the second round. The step is repeated to find 32 plaintexts which satisfy "input to SB0 of first round = input to SB0 of second round" [8], which is followed by an exhaustive search for key bits. The technique is repeated for SBox1. Page demonstrated that the secret key could be recovered using 2^10 plaintexts. Page's attack [8] is based on the assumption that the cache is flushed before encryption and that the cache hits are detected by measuring power or electromagnetic emanations. If the user implements cache warming [13] (i.e., filling the cache before encryption so that it is not empty before the attack) or randomizes the encryption time [121], the attacker will not have accurate cache Hit/Miss profiles.

Percival [52] discovered an access driven attack using the OpenSSL implementation of RSA. The approach of simultaneous multithreading was used, where multiple programs can be executed at the same time, each sharing the cache. The attacker (i.e., the author, Percival) managed to run a program simultaneously with the victim's RSA encryption. An array was repeatedly accessed, which loaded the attacker's data into the cache, and the execution time was measured for each access to record cache misses. A miss is due to the eviction of cache lines by the victim's process. So, depending on which line is replaced, the attacker inferred the corresponding look-up table entry accessed during the RSA encryption. The table entry gave the index into the look-up table and hence was used to recover the secret key.

Bertoni et al. [53] devised an access driven attack which was implemented by exploiting the SubBytes Substitution step of the AES [126] encryption algorithm. As explained in Section 2.2.2, in SubBytes Substitution the input value (i.e., the XOR of plaintext and secret key) is used as the index into the SBox look-up table. In the attack, the encryption function is first executed, which fills the cache with the data of the SBox and other computations. Then the attacker reads one element of an array A (i.e., the spy program), replacing a single element of the SBox at a known position. The encryption is executed again using the same plaintext and the corresponding power is monitored. From these power values, cache profiles are found. A cache miss will show that the XOR of key and plaintext has accessed the SBox element which was replaced by the attacker. Hence, the attacker gets information about the placement of data in the SBox, which can be mapped to the table indices, and tracing back will lead to the secret key. Such steps are iterated to discover the secret key. The number of encryptions required to mount the attack is 2 ∗ 256 (since the SBox has 256 elements) and the number of power traces required is 256.

Figure 2.32: Access-driven cache based Attack [31]

Kong et al. [31] experimented with a typical access-driven attack in a real-world environment (i.e., a simulated processor and a Pentium 4 processor). The attack starts by filling the entire cache with the spy program of the attacker, as seen in Figure 2.32, which is achieved by loading the cache with an array of the size of the cache. Then, the encryption program (i.e., AES, in this case) is executed on the processor. The encryption program evicts some of the cache lines and replaces the attacker's spy data with new values. Then, the attacker reads its array once again and records the cache Hit/Miss profiles. The cache lines that have been replaced by the encryption program's data will result in cache misses, as shown in Figure 2.32(b). By knowing which cache lines were replaced by AES, the attacker can infer which SBox values have been accessed by the encryption algorithm and hence recover the secret key. A cache line contains multiple table elements and such elements are accessed iteratively in multiple rounds of encryption. Careful statistical analysis of the data stored in the cache is used to recover the complete secret key [127]. The above mentioned access-driven attack by Kong et al. is possible due to the sharing of the cache among processes/threads.
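The prime-and-probe structure of such an attack can be sketched in C as below (a simplified illustration with assumed parameters, not the code of [31]; __rdtsc() assumes an x86 machine and a GCC/Clang toolchain, and serialization details are omitted).

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define LINE_SIZE 64
    #define NUM_LINES 512    /* assumed: cache size divided by line size */

    static volatile uint8_t spy[NUM_LINES * LINE_SIZE];

    /* Prime: touch one byte per line so the spy array fills the cache. */
    static void prime(void)
    {
        for (int i = 0; i < NUM_LINES; i++)
            (void)spy[i * LINE_SIZE];
    }

    /* Probe: after the victim's encryption has run, re-read each line and
     * time it; lines evicted by the victim's table look-ups take longer. */
    static void probe(uint64_t times[NUM_LINES])
    {
        for (int i = 0; i < NUM_LINES; i++) {
            uint64_t t0 = __rdtsc();
            (void)spy[i * LINE_SIZE];
            times[i] = __rdtsc() - t0;
        }
    }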

There are new cache architectures [11, 12] and other countermeasures [13] suggested against the above mentioned attacks [31, 53, 20, 11]. Adopting these countermeasures, it is possible to avoid time-driven attacks, but access-driven attacks remain a major threat. Further investigations and research are required to devise efficient countermeasures.

2.7 Cache Based Attacks: Countermeasures

The increasing number of cache based side channel attacks poses a serious threat to security. Many countermeasures have been devised to obscure cache access patterns, thus reducing the chances of attack. These include improved cache architectures (e.g., Partition Locked cache [12], Random Permutation cache [11]) and other defence mechanisms like cache warming, large cache line sizes and time skewing [13]. The techniques employed and their associated security issues are elaborated in the following sections.

2.7.1 Simple Countermeasures

Simple and easy to implement countermeasures are listed below:

• Maximize cache line size [13]: If the cache line is increased in size, one line will house data of more than one address which will confuse the adversary who will not be able to correlate the cache profiles accurately with the addresses.

• Cache Warming [13]: If the data (of the SBox in the case of AES) is already loaded in the cache before the start of encryption, there will not be any 'miss' and no variations in access time. Hence, cache warming can defeat both time-driven and access-driven attacks. A similar approach is cache normalization, which brings the entire look-up table into the cache at any entry to or exit from the encryption algorithm. The aim of the approach is to foil time-driven cache attacks by making information about accesses to specific locations of the SBox look-up table unavailable (a minimal preloading sketch is given after this list).

• Disable cache flushing [13]: The approach promotes the idea of maintaining a stable cache state and does not provide an empty cache to the attacker.

• Time Skewing [13]: In order to defeat time-driven attacks, it is possible to add a dummy computation to the encryption algorithm, so that the encryption time is not directly related to the cache Hit/Miss profiles.

• Minimize Timing Accuracy [13]: Denying access to the global clock circuitry can minimize the accuracy of the attack. Without accurate timings, the attacker cannot build the corresponding cache Hit/Miss profiles and hence fails to recover the secret key. One possible way to implement the solution is by removing the clocking circuitry [128] and using an asynchronous design [129]. An additional benefit of using an asynchronous design is low power consumption, which is required in smart card-based designs.

• Random order: Executing the AES algorithm functions in random order makes it difficult to build a correlation between cache Hit/Miss profiles and the AES implementation steps. The approach can be implemented in either software [130] or hardware [13]. Similar techniques are suggested by Tromer et al. in [131], which include reordering of instructions or reading the entire table at the beginning and then using only the entries which are needed. Fournier [20] demonstrated an algorithm to compute xtime (as needed in the SubBytes Substitution of AES) on a 32-bit architecture. The authors claimed it to be faster since it takes only 8 instruction cycles, calculating 4 bytes in parallel, and avoids any memory accesses.

• Oswald [105] and Paar [101] suggested masking techniques for AES where the intermediate data is masked using random values, so that correlation analysis does not yield the correct secret key. However, most of the masking techniques are significantly slower or have proved to be insecure in real systems [131].

• Tromer [131] suggested providing a secure execution environment for a user's processes. This includes marking a portion of code as a 'sensitive section' which is handled in a special way by the operating system (i.e., either disabling task switching and parallelism, or running the cache in 'no-fill' mode, where data that hits is served from the cache while data that misses is copied directly from memory without filling the cache). Some similar special cases are discussed in [52] and [56].
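As a concrete illustration of the cache warming/normalization idea mentioned in the list above, the following sketch simply touches every SBox entry before the encryption call. The table and function names are assumptions, not code from [13].

#include <stdint.h>

extern const uint8_t sbox[256];            /* AES SBox, assumed to be defined elsewhere */

/* Bring every SBox entry (and hence every SBox cache line) into the cache,
 * so that look-ups during encryption never miss. */
static void warm_sbox(void) {
    volatile uint8_t sink = 0;
    for (int i = 0; i < 256; i++)
        sink ^= sbox[i];
    (void)sink;
}

/* Usage (sketch): warm_sbox(); aes_encrypt(plaintext, key, ciphertext); */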

The rest of the section explains more complex countermeasures with architectural changes.

2.7.2 Partition Locked cache (PLcache)

The concept of cache partitioning was first devised by Page [12]. The purpose of parti- tioning is to avoid cache interference among the protected processes. However, there still exists some interference within the same process, which is further addressed by locking the critical data inside the cache.

Figure 2.33: Partition Locked cache architecture [11]. Each cache line is extended with an L bit and an ID field.

The additional hardware includes two new tags, L and ID, as shown in Figure 2.33. The L bit indicates the locking state of the cache line and the ID field denotes the owner of the cache line, usually a process. The cache hit handling process remains the same, except that the L bit is updated if needed, while in the case of a cache miss the replacement policy depends upon the status of the L bit of the existing cache lines. Let D denote the new data and R the cache line chosen to be evicted. The following cases are considered [11]: (a) D replaces R if D does not need to be locked and R is not locked either; (b) if D needs to be locked, it can replace any line in the cache that is not locked; (c) if D does not need to be locked and R is locked, it cannot replace R, so the data is written to the next level of memory and the Least Recently Used (LRU) list is updated so that R is not chosen next time for replacement. The above approach provides protection against attacks based on cache interference and cache collisions. At the same time, there could be problems such as excessive locking, and information can be leaked during the initial preloading. All the cache lines could also become locked, resulting in denial-of-service. The security issues of the PL cache are discussed in the next section.
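The three replacement cases described above can be summarised in a short decision routine. This is a behavioural sketch of the policy only; the structure and field names are assumptions and not the actual cache hardware.

struct cache_line { int locked; int owner_id; /* tag, data, ... */ };

/* PLcache policy sketch: may new data D evict the candidate line R? */
static int plcache_may_replace(const struct cache_line *R, int D_needs_lock) {
    if (!R->locked)
        return 1;   /* cases (a) and (b): an unlocked line may always be replaced */
    (void)D_needs_lock;
    return 0;       /* case (c): R is locked, so D bypasses the cache and the LRU list is updated */
}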

2.7.3 Security Issues of PL cache

The main weakness of PL cache is the denial-of-service due to its inherent locking prob- lem and the vulnerability of the initial loading procedure. To secure the PL cache, Page [13] suggested cache warming which means loading the critical data into the cache before the encryption starts. Further, Kong et al. [31] proposed the use of cache warming along with locking of critical data to avoid any interference. Another solution found was to allow the replacement of locked cache lines if the process is currently not active (or switched off). When the process becomes active again, the related cache lines will be reloaded and get locked. Both these solutions [31] need extra software in addition to new cache architecture, which is not feasible to implement in all systems.

2.7.4 Random Permutation cache (RPcache)

Wang et al. [11] suggested the randomization of memory addresses in the cache, which allows cache sharing but obfuscates the resulting interference, making it difficult to derive accurate cache Hit/Miss profiles. Similar to the PL cache, in the RP cache each cache line is augmented by one protection bit P and an ID field that denotes the owning process, as shown in Figure 2.34.

Figure 2.34: Random Permutation Cache Architecture [11]

In the case of interference between the victim and attacker processes, instead of replacing the cache line, say R, another line R' is chosen for eviction from a randomly selected set S'. The RP cache randomizes the mapping between cache sets and memory addresses. The new mapping between memory and cache is stored in a hardware permutation table, as shown in Figure 2.34. Only the protected processes use the permutation table and hence the address indirection. Randomization provides security against access-driven cache attacks. However, the approach is still vulnerable to time-driven (collision-based) attacks.
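The address indirection of the RP cache can be pictured as an extra level of set-index translation. The sketch below is purely illustrative; the table layout, sizes and names are assumptions and not the hardware of [11].

#define LINE_BITS  6                       /* 64-byte lines (assumption) */
#define SET_BITS   7                       /* 128 sets (assumption) */
#define NUM_SETS   (1 << SET_BITS)
#define MAX_PIDS   8                       /* assumption */

/* One random permutation of set indices per protected process. */
static unsigned perm_table[MAX_PIDS][NUM_SETS];

static unsigned rp_set_index(unsigned addr, unsigned pid, int is_protected) {
    unsigned set = (addr >> LINE_BITS) & (NUM_SETS - 1);
    return is_protected ? perm_table[pid][set] : set;   /* only protected processes use the permutation */
}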

2.7.5 Security Issues of RP cache

Kong et al. [132] suggested a solution to the vulnerability of the RP cache to collision-based time-driven attacks. The solution proposes the use of informing loads to access the critical data. Informing loads are special instructions which inform the system about a load 'miss' in the cache. In the case of AES, the table look-ups are implemented with informing loads while the other data accesses are restricted to normal load instructions. When a cache miss occurs for the look-up tables, a user-level exception is generated, followed by the execution of an exception handler which loads the critical data into the cache. The aim is that, after the exception handling, the accesses to critical data will always hit, avoiding any timing variations.

Most of the hardware or software countermeasures mentioned above provide security against side channel attacks but they are less deterministic in nature; they do not provide 100% protection against security attacks. For instance, the 'cache warming' method prevents the attacker from recording accurate cache Hit/Miss profiles, but it cannot avoid the situation where an attacker runs a spy program which replaces SBox look-up table entries in the cache. Similarly, the 'time skewing' method will fail if the attacker discovers the constant factor added to the encryption time. Recently, a number of hardware-based approaches have been proposed as instruction set architecture (ISA) extensions to improve the performance of AES [133, 134]. Given that they use custom hardware to perform the whole encryption, they do not use a stored SBox look-up table and therefore are not vulnerable to cache based side channel attacks. However, they incur a higher hardware overhead due to the extra computation logic used.

2.8 Summary

This Chapter examined some of the influential papers in the area of power and cache based side channel attacks and their countermeasures. Both the drawbacks and the strengths of the countermeasures were described. 'There are no shortcuts to any place worth going'. Beverly Sills

Chapter 3

Cache SCA Framework

This chapter describes a time-driven cache attack framework implemented on a commercial processor using cache Hit/Miss profiles. The attack is based upon a trace-driven cache attack first proposed by Fournier et al. in [20]. However, this is the first time the proposal has been implemented on a processor to recover the complete secret key of an AES implementation. The uniqueness of this approach is the low number of inputs used, leading to reduced effort and improved speed. The framework can also be used as a testbench to qualify countermeasures against cache based Side Channel Attacks. This chapter covers an overview of the attack followed by the methodology, implementation, results and a comparative analysis with existing attacks.

3.0.1 The attack overview

Fournier and Tunstall [20] proposed a trace-driven cache attack using power analysis. Their attack is based on a collection of cache traces (hit and miss events) during the first round of AES encryption, followed by further computations to recover the secret key, with a cache line size of 16 bytes. The author of this thesis implemented the attack with a cache line size containing a single element of the SBox look-up table and recovered the complete secret key. This attack was chosen because if it fails against a countermeasure, then other trace based and timing based cache attacks would also fail against it.

Hence this attack can be used as a testbench to evaluate cache based hardware or software countermeasures. A stepwise procedure of the attack is explained below. For the attack implementation, a unique set of inputs (i.e., plaintexts) is applied to AES-phase1 of the AES algorithm (Section 2.2.3) and the cache Hit/Miss profiles are recorded. The first step (i.e., AddRoundKey) executes the XOR operation of the 'plaintext' and 'secret key' bytes, as seen in Figure 3.1.

Figure 3.1: AddRoundKey step of AES. The 16-byte plaintext p[0]..p[15] is XORed with the unknown 16-byte secret key k[0]..k[15] to produce the resultant bytes r[0]..r[15].

1. For the first byte of the 'plaintext' and the 'secret key', the XOR of the values, r[0], is shown by Equation 3.1. Here, p[0] is chosen as a random number and k[0] is an unknown value. The resultant r[0] is used as an index into the SBox look-up table and the corresponding value fetched from the table is a new value which is brought into the cache from main memory, hence resulting in a miss (i.e., a longer execution time).

p[0] ⊕ k[0] = r[0] (3.1)

2. For the second byte of the 'plaintext' and the 'secret key', the XOR of the values, r[1], is shown in Equation 3.2. The second plaintext byte (i.e., p[1]) is varied from 0x00 to 0xff while recording the cache Hit/Miss profiles until the ideal pattern is achieved (i.e., a hit). This happens only if r[1] = r[0] and the processor has already brought the SBox entry for r[0] into the cache.

p[1] ⊕ k[1] = r[1] (3.2)

Similarly, the rest of the plaintext bytes are varied (i.e., p[i], i ∈ [2, 15]) so as to achieve the result described by Equation 3.3.

r[15] = r[14] = .... = r[3] = r[2] = r[1] = r[0] (3.3)

Since all the resultant values are the same, the logical relations between the key bytes and the plaintext bytes can be derived as shown in the following equations:

p[0] ⊕ p[2] = k[0] ⊕ k[2] → k[2] = p[0] ⊕ p[2] ⊕ k[0]

p[0] ⊕ p[3] = k[0] ⊕ k[3] → k[3] = p[0] ⊕ p[3] ⊕ k[0]

p[0] ⊕ p[4] = k[0] ⊕ k[4] → k[4] = p[0] ⊕ p[4] ⊕ k[0]

p[0] ⊕ p[5] = k[0] ⊕ k[5] → k[5] = p[0] ⊕ p[5] ⊕ k[0]

. .

p[0] ⊕ p[14] = k[0] ⊕ k[14] → k[14] = p[0] ⊕ p[14] ⊕ k[0]

p[0] ⊕ p[15] = k[0] ⊕ k[15] → k[15] = p[0] ⊕ p[15] ⊕ k[0]

Hence, as seen from the above equations, all key bytes are related to the plaintext bytes and key byte 0, where the plaintext bytes are known values while k[0] is unknown. Therefore, by varying k[0] from 0x00 (i.e., 0) to 0xff (i.e., 255), the other key bytes k[1], k[2], ..., k[15]

can be found, resulting in 256 candidate keys. These keys are then tested to choose the right one.

In order to select the right key from the 256 possibilities seen above, a reference ciphertext D is computed by executing AES with a random plaintext A and the unknown secret key B, as shown in step 19 of Algorithm 4. The 256 possible keys are then passed through AES with the same plaintext (i.e., A), resulting in 256 values of ciphertext C. Each value of C is compared with the reference ciphertext D, and when they match the secret key has been discovered. Steps 21-27 show the validation process, starting from the first key out of the list of 256 possible keys, computing the final encrypted value (i.e., the ciphertext) from AES and then comparing it with the reference one until the correct match is found.
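The candidate construction and validation described above can be summarised in a short sketch. The function aes_encrypt and the arrays are assumptions for illustration; this is not the framework code itself.

#include <stdint.h>
#include <string.h>

void aes_encrypt(const uint8_t key[16], const uint8_t pt[16], uint8_t ct[16]);  /* assumed available */

/* p[0..15]: plaintext bytes that produced the ideal MHH...H pattern.
 * A: a random plaintext, D: its reference ciphertext under the unknown key. */
int recover_key(const uint8_t p[16], const uint8_t A[16],
                const uint8_t D[16], uint8_t key_out[16]) {
    uint8_t cand[16], C[16];
    for (int k0 = 0; k0 < 256; k0++) {
        cand[0] = (uint8_t)k0;
        for (int i = 1; i < 16; i++)
            cand[i] = p[0] ^ p[i] ^ (uint8_t)k0;   /* k[i] = p[0] XOR p[i] XOR k[0] */
        aes_encrypt(cand, A, C);
        if (memcmp(C, D, 16) == 0) {               /* ciphertexts match: key found */
            memcpy(key_out, cand, 16);
            return 1;
        }
    }
    return 0;
}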

The attack is also portrayed in the form of Algorithm 4 for better understanding. As seen in the first five steps of Algorithm 4, the cache is flushed before starting the process. k[i] is a secret key byte, where i varies from 0 to 15, forming the secret key B. p[i] is a plaintext byte, where i varies from 0 to 15. Furthermore, as shown in step 6, a random value S is assigned to the first byte of the plaintext (i.e., p[0]), with the rest of the plaintext bytes being 0x00. As shown in steps 9 and 10, the next plaintext byte is selected, 0x01 is added to its value (initially set to 0x00), AES-phase1 (Section 2.2.3) is executed using the secret key B, and the cache Hit/Miss profiles are recorded. Then, the cache Hit/Miss profiles of the selected plaintext byte are matched against the ideal ones (i.e., first miss and the rest hits) in step 11. These steps are repeated for each plaintext byte until the ideal pattern of profiles is achieved, as shown in Figure 3.2 (i.e., first byte miss and the rest hits, MHH...H). The ideal pattern figure denotes the recorded cache profiles after the execution of AES-phase1 (Section 2.2.3). The inputs are a 16-byte plaintext and a secret key of the same size, resulting in cache profiles for the resultant 16 bytes, which are shown as a 4x4 matrix. The theory behind the ideal pattern (i.e., first miss and the rest hits) is as follows: the first time, the look-up value corresponding to the XOR of the input plaintext and the secret key is brought from main memory to the processor for computation and is also copied into the cache.

1:  Assume: k[i] (i ∈ [0, 15]) - secret key bytes, forming secret key B
2:  Assume: p[i] (i ∈ [0, 15]) - plaintext bytes, all initialized to 0x00
3:  Assume: Π[i] (i ∈ [0, 15]) - hit/miss pattern for p[i]
4:  Assume: idealPattern - first 'miss' for p[0] followed by hits up to the byte concerned
5:  Clear cache
6:  p[0] = S, where S is any random value
7:  j = 0
8:  j = j + 1
9:  p[j] = p[j] + 0x01
10: Π[j] = AES-phase1(plaintext p[j], k[j])
11: if (Π[j] != idealPattern) then
12:     clear cache; go to 9
13: else
14:     if (j != 15) then
15:         clear cache; go to 8
16:     else
17:         Use respective p[i] (i ∈ [0, 15]) and build XOR relations between k[i] (i ∈ [0, 15])
18:         Build 256 keys; Key[m], (m ∈ [0, 255])
19:         D = AES(plaintext A, secret key B)
20:         m = 0
21:         Key = Key[m]
22:         C = AES(plaintext A, secret key Key[m])
23:         if (C != D) then
24:             m = m + 1
25:             go to 21
26:         else
27:             Key[m] = B
28:         end if
29:     end if
30: end if

Algorithm 4: Cache Attack

Keeping Π[0] as a reference, the second plaintext byte p[1] is changed successively by adding 0x01 to 0x00 (i.e., the initial value) until the resulting cache profile after AES-phase1 (Section 2.2.3) matches the ideal pattern (i.e., a hit for the second access). A hit for the second byte means that the resultant already exists in the cache (i.e., it matches Π[0]). This relation helps in building equations between the secret key and plaintext bytes. Similarly, the third byte of the plaintext (i.e., p[2]) is varied until another hit occurs, and the rest of the plaintext bytes are modified in the same way to obtain the ideal pattern (i.e., MHH...H).

Figure 3.2: Ideal Pattern. The 4x4 matrix of cache profiles after SubBytes Substitution shows a miss (M) for the first access and hits (H) for the remaining fifteen.

After obtaining the ideal pattern by varying each plaintext byte from 0x00 to 0xff, the respective plaintext bytes are recorded. Using these values, logical relationships are built between the key bytes and the plaintext bytes, as shown in step 17 of the algorithm, followed by step 18, resulting in 256 key guesses. The candidate keys then undergo a validation process to obtain the correct secret key. It is a deterministic method of recovering the complete secret key by varying plaintext bytes. The number of plaintexts to be tried, and hence the total execution time, varies depending on how quickly the secret key is discovered.

3.0.2 System Overview

Figure 3.3: Framework. Implementation phase: the Modifiable System Simulator runs AES with plaintext bytes 0-255 and the secret key k[B], records the hit/miss profile after the first phase, relates the key bytes and builds the candidate keys k[x], x: 0-255. Validation phase: the candidate keys are run through AES and their encrypted values are compared until k[x] = k[B].

The centerpiece of the framework is a Modifiable System Simulator (MSS) [135]. This simulator utilizes Tensilica's XTMP environment, which is interfaced to DRAMsim [136], a DRAM simulator (for both power and performance). Tensilica's Xtensa processor range [137] can be simulated on it, with accurate timing and power results obtained for both the processor and the memory. The elegance of this solution stems from the ability to customize instructions, register widths and cache configurations (line sizes, associativity, number of lines, etc.), and to add additional hardware units using SystemC. The author was able to measure the power used by the memory of the processor, as well as monitor any part of the external hardware units or the busses, allowing multiple side channels to be observed. Into this MSS the author implemented AES, as shown in Figure 3.3. On the left hand side of Figure 3.3, 256 possible keys are chosen, while on the right hand side the comparative analysis isolates the correct key. Note that for this part the MSS is not used, but a standard computer running AES, to which the 256 possible keys are supplied with a random plaintext. Hence, AES is executed with the 256 possible keys (k[0] to k[255]) and a known plaintext, and with the same plaintext and the key under attack within the MSS/AES. The output of one of the keys k[i] (i ∈ [0, 255]) will match the output of the execution with the secret key (and is therefore the correct secret key). This framework requires fewer executions of the AES algorithm than previous methods, allows testing of countermeasures and can be used to observe a variety of side channel manifestations.

3.1 Experimental Setup


Figure 3.4: Experimental Setup

Figure 3.4 shows the experimental setup used for the attack. The processor used was the Tensilica Xtensa LX2 [137]. To speed up the experimental process, only the first round of encryption, up to the SubBytes Substitution step, was used. The cache used for the experiments was a 4-way set-associative 32KB cache with a 4-byte cache line size. The execution of the code results in encryption and the filling of the cache with the related values. A profile report is generated which gives a detailed trace of the execution of each instruction as stored in memory. This trace is used to capture the cache Hit/Miss profiles after each byte is fetched from the SBox look-up table. This is similar to monitoring the bus from the processor to the memory. By examining the code, the memory location of the instruction used for fetching the respective value from the SBox look-up table is known. Then, using the detailed execution trace, the number of cycles consumed to execute a specific instruction can be recorded. The number of cycles used gives an indication of a 'hit' or a 'miss'. In this attack framework, approximately 33 cycles were taken for a miss and 1 cycle for a hit (thus the difference between a hit and a miss was readily apparent). Using the information from the time taken for a hit and a miss, the framework was able to perform the attack for a variety of input keys. The whole framework resides on an Opteron quad core machine with 8GB of RAM executing at 2.15GHz. The simulator model has an LX2 processor with a clock speed of 563MHz. Each attack took 300 minutes in the worst case, though in many cases it finished more quickly (i.e., 150 minutes).
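Classifying each SBox access from the cycle counts in the execution trace reduces to a simple threshold test. The sketch below is illustrative only; the threshold and names are assumptions based on the observed 33-cycle miss and 1-cycle hit.

#define HIT_MISS_THRESHOLD 16              /* any value well above 1 and below 33 cycles */

/* cycles[i]: cycles taken by the i-th SBox load in the trace.
 * pattern[i]: 'M' for a miss, 'H' for a hit (compared against the MHH...H ideal pattern). */
static void classify_accesses(const int cycles[16], char pattern[16]) {
    for (int i = 0; i < 16; i++)
        pattern[i] = (cycles[i] > HIT_MISS_THRESHOLD) ? 'M' : 'H';
}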

3.2 Results

As mentioned in Section 3.0.1, unique plaintexts and an unknown secret key are fed to AES-phase1 (Section 2.2.3) and the corresponding cache profiles are recorded until the desired ideal pattern is matched (i.e., MHH...H). The experimental results are shown in Figure 3.5, where a miss is denoted by 33 clock cycles while a hit is denoted by one clock cycle. Then, by resolving the equations in Section 3.0.1, 256 possible keys are built, as shown in Figure 3.6.


Figure 3.5: Ideal Pattern

Figure 3.6: Results

3.2.1 Comparative Analysis

The speed of attacks is usually compared using the number of encryptions needed to obtain a key. The number of encryptions needed by the framework is compared against that of previously known attacks in Table 3.1. Note that Tsunoo's attack was on the MISTY encryption algorithm and not on AES.

3.3 Assumptions

In order to carry out the attack, the cache should be cleared before starting the AES engine. This leaves the complete cache available to store the intermediate data while AES is executed in the processor.

Cache Attack                Number of Encryptions
Bernstein [56]              2^27.5
Kong et al. [31]            2^24
Tsunoo et al. [121]         2^16
Framework in Discussion     2^12

Table 3.1: Performance Comparison

3.3.1 Implementation in a real system

In a real system (not a simulator) the same methodology can be used to execute the attack. By varying the plaintext in a similar way, the adversary can aim to achieve the ideal pattern (i.e., first miss and the rest hits), as shown in Figure 3.2. Such a system will result in a cache Hit/Miss pattern as shown in Figure 3.5, which can be measured in different ways: the power of the memory can be monitored, the bus to the memory can be monitored, or a combination of both techniques can be used to perform the attack.

3.4 Summary

This Chapter presented a framework, or testbench, for a working time-driven cache attack which recovers the complete secret key with the least number of encryptions compared to previous attacks. This attack exploits the execution time variations during the first round of AES encryption. It has been implemented in a framework using a commercial processor simulator and can be used for testing countermeasures. 'We need to be the change we wish to see in the world'. Mohandas Karamchand Gandhi

Chapter 4

Cache Attack Countermeasure

This chapter provides a hardware/software countermeasure to prevent cache based side channel attacks. The principle is to bypass the cache during the most vulnerable stage of AES execution, which is accomplished by adding new hardware and modifying the existing AES software code. Most of the countermeasures which exist in the literature result in performance degradation in terms of processing time or have high hardware and power penalties. The countermeasure proposed by the author has been tested using the framework explained in Chapter 3 and proven to be secure and efficient against the time-driven cache attack. In addition, it has desirable properties such as low overheads.

4.0.1 Methodology

As shown in Section 2.2.3, the vulnerable part of AES encryption is AES-phase1, which includes an XOR and the SubBytes Substitution. If information is leaked from the cache at this stage, it is possible to perform reverse calculations and recover the secret key. In the proposed approach, the SBox look-up table used in the SubBytes Substitution step of AES is implemented in hardware. Hence, during the SubBytes Substitution step, the system will not access memory at all and will instead look up values in a hardware table within the processor. Since there is no external memory access, there will be no telltale

cache misses, thus securing the system against cache attacks. The approach is a hardware/software countermeasure based on commercially available tools [137]. The use of such tools enables relatively rapid verification of the system. The architectural changes and the software modifications are given below.

4.0.1.1 Hardware Addition

In order to protect the system, some extra hardware has been added into the processor which stores the values of the SBox look-up table as used in the SubBytes Substitution step of AES-phase1. This additional hardware bypasses the use of the cache when the SBox look-up table is used in AES encryption.


Figure 4.1: (a) Conventional Processor (b) Processor with extra hardware added using TIE instructions

Figure 4.1(a) shows a conventional processor while Figure 4.1(b) shows the additional hardware unit used to store the SBox look-up values. In the conventional approach, the SBox look-up table is stored in main memory and, as AES is executed, the processor fetches the SBox values from main memory while storing a copy in the cache. Since this storage (i.e., the cache) is based on temporal and spatial locality, the attacker can find these values and use the statistical methodology described in Section 2.5 on cache side channel attacks to recover the secret key. The proposed approach adds extra hardware which holds the SBox look-up values. In the modified processor, AES encryption will read the

SBox look-up values from the hardware instead of the main memory. Since there is no communication with memory, the cache is bypassed in the whole process. Therefore, the cache never holds the SBox look-up data and the attacker will never be able to use information derived from the cache to recover the secret key. The proposed approach is a countermeasure against both time-driven and access-driven cache attacks. Time-driven attacks are based on the encryption time: the adversary measures time to derive 'hit' and 'miss' patterns and hence a relationship among the key bytes. Similarly, in the case of trace-driven attacks, the traces of the cache (i.e., hits and misses) are used to recover the secret key. When the extra hardware is added, the hits and misses after the first round are not related to the SubBytes Substitution, since the SBox look-up values are fetched from the hardware instead of memory. Since the processor does not fetch the SBox look-up table from memory, there is no partial SBox look-up table in the cache for exploitation by an attacker, resulting in the failure of trace-driven, time-driven and access-driven cache attacks. Tensilica's Xtensa processor [137] platform has been used for the purpose of experimentation. This platform enables additional instructions within the processor using a proprietary language called Tensilica Instruction Extension (TIE). This language is used to build a very simple additional instruction which, when given an index, returns an SBox look-up value. Tensilica provides commercial tools for the creation and compilation of programs, and for the simulation of such processors.

4.0.1.2 Software Modification

In terms of software modifications to AES, the SBox look-up table has been replaced by an additional instruction which accesses the new hardware module that stores the SBox look-up table. As shown in Figure 4.2(a), the original AES code contains the SBox look-up table in the SubBytes Substitution step. The code is modified to add a customized TIE module by using an 'include' statement at the beginning of the code, and the module is instantiated in the SubBytes Substitution step, as seen in Figure 4.2(b). The code of the TIE module is shown in Figure 4.3.

int getSBoxValue(int num) {
    int sbox[256] = {
        0x63, 0x7c, 0x77, ... 0x76,
        0x70, 0x3e, 0xb5, ... 0x9e,
        .
        0x8c, 0xa1, 0x89, ... 0x16
    };
    return sbox[num];
}

void AddRoundKey() { ... }

void SubBytes()
// state[i][j] represents the 4x4 state matrix.
// getSBoxValue indexes the SBox array defined above.
// The input to getSBoxValue is taken from the AddRoundKey step.
{
    state[i][j] = getSBoxValue(state[i][j]);
}

void ShiftRows() { ... }

void MixColumns() { ... }

int main() {
    AddRoundKey(0);
    ...
    SubBytes();
    ShiftRows();
    MixColumns();
}

(a)

#include   // Above is the inclusion of the TIE header with the extra hardware

void AddRoundKey() { ... }

void SubBytes()
// state[i][j] represents the 4x4 state matrix.
// Aes_STORE is the instruction built inside the processor using TIE.
// The input to Aes_STORE is taken from the AddRoundKey step.
{
    state[i][j] = Aes_STORE(state[i][j]);
}

void ShiftRows() { ... }

void MixColumns() { ... }

int main() {
    AddRoundKey(0);
    ...
    SubBytes();
    ShiftRows();
    MixColumns();
}

(b)

Figure 4.2: (a) Original AES code (b) Modified AES code to add extra hardware through TIE

table sbox 8 256 { 8'h63, 8'h7c, 8'h77, 8'h7b, 8'hf2, 8'h6b, 8'h6f, 8'hc5, 8'h30, 8'h01, 8'h67, 8'h2b, 8'hfe, 8'hd7, 8'hab, 8'h76, //0 8'hca, 8'h82, 8'hc9, 8'h7d, 8'hfa, 8'h59, 8'h47, 8'hf0, 8'had, 8'hd4, 8'ha2, 8'haf, 8'h9c, 8'ha4, 8'h72, 8'hc0, //1 8'hb7, 8'hfd, 8'h93, 8'h26, 8'h36, 8'h3f, 8'hf7, 8'hcc, 8'h34, 8'ha5, 8'he5, 8'hf1, 8'h71, 8'hd8, 8'h31, 8'h15, //2 8'h04, 8'hc7, 8'h23, 8'hc3, 8'h18, 8'h96, 8'h05, 8'h9a, 8'h07, 8'h12, 8'h80, 8'he2, 8'heb, 8'h27, 8'hb2, 8'h75, //3 8'h09, 8'h83, 8'h2c, 8'h1a, 8'h1b, 8'h6e, 8'h5a, 8'ha0, 8'h52, 8'h3b, 8'hd6, 8'hb3, 8'h29, 8'he3, 8'h2f, 8'h84, //4 8'h53, 8'hd1, 8'h00, 8'hed, 8'h20, 8'hfc, 8'hb1, 8'h5b, 8'h6a, 8'hcb, 8'hbe, 8'h39, 8'h4a, 8'h4c, 8'h58, 8'hcf, //5 8'hd0, 8'hef, 8'haa, 8'hfb, 8'h43, 8'h4d, 8'h33, 8'h85, 8'h45, 8'hf9, 8'h02, 8'h7f, 8'h50, 8'h3c, 8'h9f, 8'ha8, //6 8'h51, 8'ha3, 8'h40, 8'h8f, 8'h92, 8'h9d, 8'h38, 8'hf5, 8'hbc, 8'hb6, 8'hda, 8'h21, 8'h10, 8'hff, 8'hf3, 8'hd2, //7 8'hcd, 8'h0c, 8'h13, 8'hec, 8'h5f, 8'h97, 8'h44, 8'h17, 8'hc4, 8'ha7, 8'h7e, 8'h3d, 8'h64, 8'h5d, 8'h19, 8'h73, //8 8'h60, 8'h81, 8'h4f, 8'hdc, 8'h22, 8'h2a, 8'h90, 8'h88, 8'h46, 8'hee, 8'hb8, 8'h14, 8'hde, 8'h5e, 8'h0b, 8'hdb, //9 8'he0, 8'h32, 8'h3a, 8'h0a, 8'h49, 8'h06, 8'h24, 8'h5c, 8'hc2, 8'hd3, 8'hac, 8'h62, 8'h91, 8'h95, 8'he4, 8'h79, //A 8'he7, 8'hc8, 8'h37, 8'h6d, 8'h8d, 8'hd5, 8'h4e, 8'ha9, 8'h6c, 8'h56, 8'hf4, 8'hea, 8'h65, 8'h7a, 8'hae, 8'h08, //B 8'hba, 8'h78, 8'h25, 8'h2e, 8'h1c, 8'ha6, 8'hb4, 8'hc6, 8'he8, 8'hdd, 8'h74, 8'h1f, 8'h4b, 8'hbd, 8'h8b, 8'h8a, //C 8'h70, 8'h3e, 8'hb5, 8'h66, 8'h48, 8'h03, 8'hf6, 8'h0e, 8'h61, 8'h35, 8'h57, 8'hb9, 8'h86, 8'hc1, 8'h1d, 8'h9e, //D 8'he1, 8'hf8, 8'h98, 8'h11, 8'h69, 8'hd9, 8'h8e, 8'h94, 8'h9b, 8'h1e, 8'h87, 8'he9, 8'hce, 8'h55, 8'h28, 8'hdf, //E 8'h8c, 8'ha1, 8'h89, 8'h0d, 8'hbf, 8'he6, 8'h42, 8'h68, 8'h41, 8'h99, 8'h2d, 8'h0f, 8'hb0, 8'h54, 8'hbb, 8'h16 } //F

operation STORE_Aes {out AR sbox_value, in AR i} {} {
    assign sbox_value = sbox[i];
}

Figure 4.3: Tensilica Instruction Extension (TIE) language code with the embedded SBox look-up table

On compiling the modified version of the code, an extra hardware unit is added to the processor which implements the SBox look-up table. It is important to note that there is no SBox look-up table in the modified AES code, since the SBox values are now read from the hardware unit added by this new customized code.

4.1 Experimental Platform


Figure 4.4: Experimental Setup

The experimental platform used for the countermeasure is shown in Figure 4.4. The processor used is Tensilica's Xtensa LX2. To speed up the experimental process, only the first round of encryption is used (i.e., up to the SubBytes Substitution step). Additional hardware is implemented using Tensilica's Xtensa processor to store the SBox look-up table. As seen in Figure 4.2(b), a customized TIE module has been added to the processor which carries the instructions for the additional hardware. Compilation of the TIE module results in a new processor with the additional hardware. The AES encryption program is modified to instantiate the new hardware so as to use its SBox look-up values instead of memory. In order to test the new protected processor, multiple plaintexts are used while recording results in terms of encryption time and data misses. Such results are used for security analysis, as elaborated in the following sections. The protected processor with the new hardware module and modified AES code is tested using the cache attack framework explained in Chapter 3. When the attack framework is used on the modified processor, the resulting cache profiles do not match the ideal trace of the attack and hence it cannot recover the secret key. The comparative analysis of the traces, seen in Figure 4.5, confirms the security features of the new processor.

4.2 Results

This section covers the security analysis of the countermeasure and its impact on the performance, power and area of the processor.

4.2.1 Security Analysis

The countermeasure is tested by implementing Fournier's attack [20] and comparing the execution times (power profiles could be used as well; however, the data cache misses give more insight into the efficacy of the countermeasure). A comparative analysis of the Data Cache (DCache) miss profiles of the protected and unprotected processors is shown in Figures 4.5, 4.6 and 4.7.

• Using the framework from Chapter 3, the cache attack has been implemented on the modified processor (i.e., the processor with the countermeasure) and, as expected, the secret key could not be recovered. The basis of the attack is the time difference due to

the hits and misses after the first round of encryption (other rounds could also be exploited). The processor with the countermeasure uses SBox look-up values from the hardware instead of the main memory. Hence, there is no appreciable difference between hits and misses during DCache accesses. The ideal pattern expected when the plaintext is correctly aligned is a miss followed by 15 hits. As can be seen in Figure 4.5, the graph on the left shows the conventional processor with the first access to the SBox look-up taking 33 cycles and the rest (15 additional accesses) taking one cycle each. The graph on the right in Figure 4.5 shows that the processor with the countermeasure takes a constant one cycle per access (purely due to internal accesses). Note that the two y axes are not scaled equally.


Figure 4.5: Comparison of SBox bytewise accesses (x axis) vs. cycle delays (y axis) for both processors (a) Conventional Processor (b) Processor with countermeasure

• 'DCache misses' (while executing the AES encryption algorithm) for the conventional processor versus the one with the countermeasure are presented in Figures 4.6 and 4.7 respectively. In the conventional processor, the data misses vary according to the different plaintexts, hence revealing the information being processed. Note that in Figure 4.6(a) 100 encryptions are used to show the difference, and Figure 4.6(b) zooms into the first 20 encryptions for greater clarity. As seen in Figure 4.6, as the processor executes AES encryption, it fetches data from main memory, resulting in varying cache misses. The miss rate depends on the variation in the 'resultant XOR' value after the AddRoundKey step of AES-phase1.


Figure 4.6: Plaintexts (x axis) vs. Data Cache (DCache) misses (y axis) on conventional processor (a) With 100 plaintexts (b) With 20 plaintexts


Figure 4.7: Plaintexts (x axis) vs. Data Cache (DCache) misses (y axis) on processor with countermeasure (a) With 100 plaintexts (b) With 20 plaintexts

The data cache misses collected from the processor with the countermeasure are shown in Figure 4.7. The graph shows that there are only two possible 'DCache miss' values, which do not reveal any relevant information about the encryption (the same set of random plaintexts was used as for Figure 4.6). A clearer picture is shown in Figure 4.7(b) with 20 encryptions. If the attacker records this information through time or power analysis, he/she cannot derive any relation between these patterns and the computation of the processor, thus misleading the adversary.

4.2.2 Performance, Power, and Footprint

Table 4.1 compares various parameters for executing AES on a standard Tensilica processor and a modified Tensilica processor that includes the countermeasure. As seen in Table 4.1, the number of cycles is reduced by approximately 15%, energy consumption is reduced by approximately 30% and the dynamic power is reduced by 17%. The additional hardware takes 7.6% more area (without taking memory into account).

Features               Typical     Countermeasure
No. of cycles          982736      834329
Gate Count             39K         42K
Dynamic Power (mW)     5.378       4.458
Leakage (mW)           0.857       0.874
Energy (uJ)            51.16       35.75

Table 4.1: Performance Comparison

4.3 Summary

This Chapter described a novel hardware/software countermeasure implemented on a commercial processor and successfully tested using a cache attack testbench. The principle of bypassing the cache during AES execution foils both time-driven and access-driven cache attacks. The countermeasure is energy efficient, fast and requires only a small area overhead. It is versatile since the solution can be easily implemented using available commercial tools. 'The greater the obstacle, the more glory in overcoming it'. Moliere

Chapter 5

A Double-width Algorithmic Balancing to prevent Power Analysis Side Channel Attacks in AES

In this Chapter, the author examines a novel double-width algorithmic balancing technique to counter the Differential Power Analysis (DPA) attack on AES. Several solutions exist to counter power analysis attacks, such as random masking [14], hardware balancing [138] and algorithmic balancing [139]. Masking techniques can still be attacked using advanced attacks, such as higher order DPA [14, 102]. Hardware balancing techniques are known to be very costly in terms of area and power consumption [138].

The algorithmic balancing technique proposed in MUTE [29] claims to provide balancing similar to hardware techniques, but reduces cost by utilizing an additional processor only when it is necessary. The additional processor is able to perform other tasks while not employed for balancing.

The MUTE solution proposed by Ambrose et al. [29] utilizes a Multiprocessor System-on-Chip (MPSoC) to execute the balanced AES algorithm. As shown in Figure 5.1(a), one processor executes the original AES program (with the original secret key) while a complemented AES program (with the algorithm modified to maintain an exact instruction-level complementary execution) is executed on the second processor.


Figure 5.1: Algorithmic Balancing. (a) MUTE: original and complemented cores executing in lock-step (b) DWSC: a single core executing the double-width (complemented + original) computation.

The execution in the MPSoC is performed in parallel and in lock-step. As a result, when the adversary collects power traces of the AES encryption, the complementary operation balances the bit flips and hence the secret key is not revealed. Even though MUTE is less costly than hardware techniques and provides better security than masking techniques, the complexity involved in performing lock-step execution, especially for an MPSoC with interrupts enabled, is very high. A slight imbalance might leak secure information.

As an alternative to the above countermeasure for power analysis attacks, a simple and effective single core solution has been devised by the author, which is more practical and can easily be deployed. The new approach, as shown in Figure 5.1(b), receives the input (i.e., plaintext) and the original secret key, which are then converted to double-width (the input is duplicated, whereas the secret key is inverted and concatenated with the original), and executed with the modified double-width AES algorithm (Section 2.2.2). Thus, for the first time, the author has proposed a single core algorithmic balancing solution which obfuscates the bit flips, masking the secret key during encryption to defeat power analysis. For the purposes of comparison, the standard and the balanced AES encryption algorithms are implemented on a 32-bit processor, since most popular embedded systems use 32-bit processors. The countermeasure only involves code/algorithmic modifications and hence can be easily deployed in any embedded system with a 16-bit wide (or wider) processor. This Chapter explains the new Double Width Single Core (DWSC) technique to combat power analysis attacks, including the design, the methodology followed, the software modifications needed, the experimental setup and the results.

5.1 Methodology

The DWSC countermeasure is based on an algorithmic balancing method applied to AES in order to neutralize the power variations resulting from signal transitions (it is worth noting that the author is not trying to flatten the power profile, but to balance the processed information in the power profile). The underlying idea is to execute the original and complementary operations together, with a single operation and double-width data. For every 0→1 transition during the AES execution a 1→0 transition is induced, and vice versa. This is implemented by executing a double-width (i.e., 16-bit) AES which is modelled in such a way that the two 8-bit halves are complements of each other. Hence every intermediate step results in complementary signal transitions, balancing the power variations. As shown in Figure 5.2(a), the MUTE technique follows a similar balancing approach but requires lock-step execution with a dual core setup.
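The balancing argument can be checked with a small sanity test: if a 16-bit intermediate word always holds a value in one half and its bitwise complement in the other, its Hamming weight is constant (eight), so the weight of the processed data leaks nothing. The snippet below is a stand-alone illustration of that property and is not part of the thesis code.

#include <assert.h>
#include <stdint.h>

static int hamming_weight(uint16_t x) {
    int w = 0;
    for (; x; x >>= 1)
        w += x & 1;
    return w;
}

/* Complement in the high byte, original value in the low byte. */
static uint16_t balanced(uint8_t v) {
    return (uint16_t)((uint8_t)~v) << 8 | v;
}

int main(void) {
    for (int v = 0; v < 256; v++)
        assert(hamming_weight(balanced((uint8_t)v)) == 8);   /* constant weight for every value */
    return 0;
}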

Figure 5.2: Methodology. (a) MUTE: the secret key is split into typical and inverted 8-bit blocks executed on two cores (b) DWSC: 16-bit blocks combining the inverted and typical halves executed on a single processor.

In contrast to MUTE, as shown in Figure 5.2(b), the DWSC solution expands the 8-bit blocks in the AES engine to 16-bit blocks, to execute on a 32-bit processor. The 8-bit input data is duplicated, whereas the 8-bit secret key is inverted and concatenated with the original to form a 16-bit secret key. Each step of AES, as shown in Section 5.2, is modified to create a completely balanced algorithm.

5.2 Software Modifications

The AES operations mentioned in Section 2.2.2 are modified to perform the double-width balancing. Note that the modifications are based on an AES encryption with 8-bit operations; such 8-bit operations are typically performed in embedded systems. Figure 5.3 shows the original approach and the balanced approach for the AddRoundKey operation.

Figure 5.3: AddRoundKey Operation. (a) Original 8-bit AddRoundKey (plaintext 0xFF, secret key 0x12) (b) Modified 16-bit AddRoundKey (plaintext 0xFFFF, secret key 0xED12)

In the original approach the plaintext 0xFF is XORed with the secret key 0x12. In the balanced approach the 8-bit plaintext is duplicated to form the 16-bit value 0xFFFF, and the secret key 0x12 is concatenated with its own inverted value 0xED to create 0xED12.
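The double-width construction of the inputs and the balanced AddRoundKey can be sketched as follows. The function names are assumptions; the values match the worked example above, where 0xFF and 0x12 become 0xFFFF and 0xED12.

#include <stdint.h>

static uint16_t widen_plaintext(uint8_t p) {   /* duplicate the byte: 0xFF -> 0xFFFF */
    return (uint16_t)p << 8 | p;
}

static uint16_t widen_key(uint8_t k) {         /* inverted byte followed by original byte: 0x12 -> 0xED12 */
    return (uint16_t)((uint8_t)~k) << 8 | k;
}

static uint16_t add_round_key_dw(uint16_t p16, uint16_t k16) {
    return p16 ^ k16;                          /* 0xFFFF XOR 0xED12 = 0x12ED */
}

The 16-bit result 0x12ED keeps the complementary structure (0x12 is the bitwise inverse of 0xED), which is what the merged SBox of Figure 5.4(d) relies on.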

Figure 5.4: SubBytes Substitution Operation. (a) Original SBox (b) Inverted and Transposed SBox (c) Transposed SBox (d) Merged SBox (16 x 16 double bytes)

In the SubBytes Substitution step, the output of the AddRoundKey step is used as an index to look up a value in the SBox. Thus, in the original AES, the location SBox[0x12 XOR 0xFF] is looked up. In the balanced AES, an expanded SBox look-up table is designed for the index 0xFFFF XOR 0xED12. The original SBox look-up, shown in Figure 5.4(a), is inverted and transposed as shown in Figure 5.4(b), while the transposed version alone is shown in Figure 5.4(c). Figure 5.4(d) shows the concatenation of the inverted transposed SBox look-up and the transposed SBox look-up, generating a table with 256 double-byte elements. This enables the new 16-bit output from AddRoundKey in the balanced AES core to retrieve the correct value from the merged SBox look-up table.

Figure 5.5: ShiftRows Operation. (a) Original 8-bit ShiftRows (b) Modified 16-bit ShiftRows

In ShiftRows, the rows are cyclically shifted with different offsets. Row 0 stays the same, whereas Row 1 is shifted by one position, Row 2 by two positions and Row 3 by three positions, as shown in Figure 5.5(a). Figure 5.5(b) depicts the ShiftRows step of the balanced method. As can be seen, there is no difference in shift order between the original and balanced versions other than the size of the elements. In the MixColumns step, the columns of the array are treated as polynomials over Rijndael's Galois Field GF(2^8) [41] and are multiplied modulo x^4 + 1 by a fixed polynomial c(x), where c(x) = 03x^3 + 01x^2 + 01x + 02. For the original AES, this can be written as the matrix multiplication c(x) ⊗ b(x) = a(x), where b(x) is the input column, c(x) is the fixed polynomial matrix and a(x) is the transformed column, as seen in Figure 5.6(a).

Figure 5.6: MixColumns Operation. (a) Original 8-bit MixColumns (b) Modified 16-bit MixColumns

Here, b(x) is the column with b0 = 0x55, b1 = 0xD1, b2 = 0xCF, b3 = 0x26; a(x) is the column with a0 = 0x2B, a1 = 0x80, a2 = 0x6B, a3 = 0xAD; and c(x) is the MDS matrix formed from c0 = 0x02, c1 = 0x03, c2 = 0x01, c3 = 0x01. This can also be represented in the form of equations as follows:

a0 = 2b0 ⊕ 3b1 ⊕ 1b2 ⊕ 1b3 (5.1)

a1 = 1b0 ⊕ 2b1 ⊕ 3b2 ⊕ 1b3 (5.2)

a2 = 1b0 ⊕ 1b1 ⊕ 2b2 ⊕ 3b3 (5.3)

a3 = 3b0 ⊕ 1b1 ⊕ 1b2 ⊕ 2b3 (5.4)

3b1 = (0x03)b1 = (0x02)b1 ⊕ (0x01)b1 (5.5)

The balancing operation also uses the above procedure (as seen in Figure 5.6(b)) with some modification. As shown in Equation 5.1, b0 is multiplied by 2, which is implemented by shifting left by one bit; if the Most Significant Bit (MSB) of the operand is 1, the shifted result is XORed with 0x1B in the original 8-bit algorithm. Similarly, b1 is multiplied by 03, which is done by multiplying b1 by 02 and then XORing with the original value (i.e., 01 times b1), as shown in Equation 5.5. In the balanced algorithm, the input to MixColumns is 16 bits wide, with the original 8-bit value concatenated with the inverted 8-bit value. Shifting the original part towards the left might cause an overflow, corrupting the inverted part; this would violate the complete balancing that is necessary. To eliminate such a possibility, the left shifting of the original part is restricted to the second half of the element. Hence, if the MSB of the original part is 1, it is not passed to the first half (i.e., the inverted part). The result of the left shifting can be XORed with three possible values: (a) with 0x001B if only the original part's MSB is 1; (b) with 0x1B00 if only the inverted part's MSB is 1; and (c) with 0x1B1B if both the original and inverted parts' MSBs are 1. An example of the balanced algorithm's MixColumns is shown in Figure 5.6. It can be calculated as follows,

0xD42B = 02(0xAA55) ⊕ 03(0x2ED1) ⊕ 01(0x30CF) ⊕ 01(0xD926) (5.6)

03(0x2ED1) = 02(0x2ED1) ⊕ 01(0x2ED1) (5.7)

01(0x2ED1) = 0010111011010001 (5.8)

02(0x2ED1) = 0010111011010001 << 1 (5.9)

= 0101110110100010 (5.10)

= 0101110110100010 ⊕ 0x001B (5.11)

As seen in the example, in Equation 5.9, the left shifting of 0x2ED1 causes a '1' bit to enter as the Least Significant Bit (LSB) of the first byte. This bit is prevented from entering the first byte and the whole 16-bit element is XORed with 0x001B, as shown in Equation 5.11.
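A minimal sketch of the restricted-shift multiply-by-02 described above is given below. The function name and masks are assumptions, and the handling of the dropped carry follows the prose description (the shift is confined to each 8-bit half, followed by a conditional XOR with 0x001B, 0x1B00 or 0x1B1B).

#include <stdint.h>

/* Double-width xtime: multiply both halves of a balanced 16-bit element by 02,
 * keeping each half's shift inside its own byte so no carry crosses the boundary. */
static uint16_t xtime_dw(uint16_t x) {
    uint16_t r = (uint16_t)((x << 1) & 0xFEFE);   /* shift each byte; drop the bits that would cross over */
    if (x & 0x0080) r ^= 0x001B;                  /* MSB of the original (low) half was 1 */
    if (x & 0x8000) r ^= 0x1B00;                  /* MSB of the inverted (high) half was 1 */
    return r;                                     /* both conditions set corresponds to case (c), i.e., XOR with 0x1B1B */
}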

Figure 5.7: Key Scheduling Operation. (a) Original SBox and Rcon look-up tables (b) Modified SBox and Rcon tables for the 16-bit key schedule, with each original byte duplicated into a double byte (e.g., 0x63 becomes 0x63 63; the modified Rcon table holds 255 double bytes).

The Key Scheduling (KS) process generates the round keys needed for the various encryption rounds of AES. As seen in Section 2.2.2, the AES encryption steps are iterated 10 times in 10 rounds for a 128-bit secret key. The Key Scheduling step includes various operations: (a) rotation of a 32-bit word; (b) application of Rijndael's SBox look-up table, as used in the SubBytes Substitution step, to every byte; (c) XOR with a corresponding entry from the Rcon table, where Rcon is an exponentiation of 2 to a user-specified value, as described in Rijndael's documentation [41]. The Key Scheduling operation is performed in Rijndael's finite field in polynomial form to generate the round keys used in the encryption algorithm. Figure 5.7 shows the SBox and Rcon look-up tables used in this step for both the original and balanced algorithms. As seen in Figure 5.7(b), a merged SBox look-up table is used for the balanced algorithm, which is generated by concatenating two original SBoxes. A new Rcon table is formulated by concatenating the individual elements of the original Rcon table. The Key Scheduling process is carefully modified such that the result of each encryption step remains balanced (i.e., when the round keys from Key Scheduling are used in the AddRoundKey step, the result is in the 'inverted typical' form, maintaining the balance at each step of the encryption).
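Building the doubled tables of Figure 5.7(b) amounts to duplicating each original byte into a double byte. The sketch below shows this construction; the array names are assumptions and the original sbox/rcon contents are taken as given.

#include <stdint.h>

extern const uint8_t sbox[256];        /* original SBox, assumed defined elsewhere */
extern const uint8_t rcon[255];        /* original Rcon table, assumed defined elsewhere */

static uint16_t sbox_ks_dw[256];       /* merged SBox for the 16-bit key schedule (Figure 5.7(b)) */
static uint16_t rcon_dw[255];          /* doubled Rcon table (Figure 5.7(b)) */

static void build_ks_tables(void) {
    for (int i = 0; i < 256; i++)
        sbox_ks_dw[i] = (uint16_t)sbox[i] << 8 | sbox[i];   /* e.g., 0x63 becomes 0x6363 */
    for (int i = 0; i < 255; i++)
        rcon_dw[i] = (uint16_t)rcon[i] << 8 | rcon[i];      /* e.g., 0x8D becomes 0x8D8D */
}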

5.3 Experimental Setup

Figure 5.8 depicts the experimental setup used to test the efficacy of the DWSC solution against the Differential Power Analysis (DPA) attack. A synthesizable single processor (in RTL) was developed to support the PISA Instruction Set Architecture (ISA), using the ASIPMeister tool flow [140], by implementing the instructions in micro-operations. The SimpleScalar tool set [141], which also supports the PISA ISA, is integrated with memory generation to create the instruction and data memory for the single processor (Harvard architecture). The original AES program and the balanced program are fed into the SimpleScalar compiler to create the respective memory contents. Manual code modification is performed on the AES algorithm to create the balanced AES program, as explained in Section 5.2. The Synopsys Design Compiler [142] was used to synthesize the single processor and generate the gate level version. The author executed zero delay simulations of the synthesized processor and memory modules using ModelSim [143] to create the VCD file, which is then fed into PrimeTime PX [144] to extract power. Since an 8-bit block of the secret key was the target of the attack, 256 x 2 power measurements were performed to attack the original AES implementation and the balanced AES. The DPA attack was performed on the power measurements. The AES programs and the DPA attack were programmed in C.

Figure 5.8: The Experimental Setup. ASIPMeister (PISA ISA) and the SimpleScalar compiler generate the synthesizable processor and the memory contents; Synopsys Design Compiler, ModelSim and PrimeTime PX provide synthesis, simulation and power measurements for the DPA attack.

5.4 Results

This section describes the results of the DPA attack on a single processor with the original 8-bit AES, followed by a comparative DPA attack on the balanced approach.


Figure 5.9: DPA Plot of original AES (a) At LW (Load) instruction (b) At SW (Store) instruction


Figure 5.10: DPA Plot of Balanced AES (a) At LW (Load) instruction (b) At SW (Store) instruction

5.4.1 DPA Attack on original and Balanced AES

The author described the DPA attack for the first byte of the secret key; the rest of the bytes can be attacked in a similar fashion. It is worth noting that AES is a block cipher, hence a byte of the input is associated with a specific byte of the secret key. A power trace is recorded for all possible values of the first input byte of the plaintext (i.e., 0 to 255).

AES Algorithm               Balanced    Original    Overhead
Instr. Mem. Size (Bytes)    96624       80576       16.6%
Data Mem. Size (Bytes)      7360        6192        15.8%

Table 5.1: Comparative Analysis

The DPA was performed by choosing the least significant bit (LSB) of the SBox look-up output after the SubBytes Substitution step of the first encryption round of AES as the selection bit. Further details on the selection bit and this attack point can be found in [145]. Figure 5.9(a) shows a DPA plot of the original AES corresponding to the load (LW) instruction. The x-axis of the plot indicates the possible key guesses and the y-axis shows the corresponding DPA values. The highest peak reveals the actual secret key used during encryption (i.e., 18 in decimal). Similar results are derived from the DPA plot of the store (SW) instruction, seen in Figure 5.9(b), where the highest peak successfully reveals the actual secret key as well. Figure 5.10 shows the DPA plot of the balanced AES implementation. The attack at LW in Figure 5.10(a) shows a significant peak at 0x4F (i.e., 79 in decimal, referred to as a ghost peak [146]) while the actual secret key used is 0x12 (i.e., 18). Figure 5.10(b) shows the DPA plot at the SW instruction. As seen in this plot, the power variations look similar over all possible key guesses except for the value 0xFE (i.e., 254), which is a ghost peak as well [146]. As described in Section 5.1, the balanced double-width encryption nullifies the power variations resulting from the original AES encryption. The DPA attack shows that the Double Width Single Core solution is protected from DPA.
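The DPA procedure used here follows the classic single-bit difference-of-means approach. The sketch below is a generic illustration of that approach (function and array names, trace length and storage are assumptions), not the thesis attack code.

#include <stdint.h>

extern const uint8_t sbox[256];                     /* AES SBox, assumed defined elsewhere */

/* traces[p][t]: power sample t of the trace recorded for plaintext byte p (0..255).
 * num_samples <= 1000. Returns the key guess with the largest difference-of-means peak. */
int dpa_first_byte(double traces[256][1000], int num_samples) {
    int best_guess = 0;
    double best_peak = 0.0;
    for (int guess = 0; guess < 256; guess++) {
        double sum1[1000] = {0}, sum0[1000] = {0};
        int n1 = 0, n0 = 0;
        for (int p = 0; p < 256; p++) {
            int sel = sbox[p ^ guess] & 1;          /* selection bit: LSB of the SBox output */
            double *s = sel ? sum1 : sum0;
            for (int t = 0; t < num_samples; t++)
                s[t] += traces[p][t];
            if (sel) n1++; else n0++;
        }
        for (int t = 0; t < num_samples; t++) {
            double d = sum1[t] / n1 - sum0[t] / n0;
            if (d < 0) d = -d;
            if (d > best_peak) { best_peak = d; best_guess = guess; }
        }
    }
    return best_guess;
}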

5.4.2 Comparative Analysis

Table 5.1 shows a comparative analysis of the single core balancing approach versus the approach without balancing (i.e., the original AES). The balanced approach uses a larger instruction memory footprint, which increases by only 16.6%, while the data memory increases by 15.8%.

5.5 Discussion

The author utilized a 32-bit processor for the purposes of comparison, since these are used in most modern embedded systems. The results would not change if implemented on a 16-bit processor. However, the DWSC solution cannot be implemented on an 8-bit processor, although the standard AES algorithm can be implemented on such a processor. If balancing on an 8-bit processor is necessary, the methods described in [29] have to be used.

5.6 Summary

This chapter presented a single core Double Width Algorithmic Balancing approach against power based side channel attacks. The principle is to execute the AES encryption algorithm in both its typical and complementary forms so as to balance the power variations. The author modified the original 8-bit AES algorithm to 16-bit (i.e., double width), adding complementary operations at every step. The solution was successfully tested using a DPA attack, which did not reveal the secret key. Compared to the typical 8-bit AES, the double width implementation results in a 16.6% increase in instruction code size and a 15.8% increase in data memory size. The approach adds no extra hardware; the modifications are software only and can therefore be easily deployed in most modern embedded systems. 'and the journey continues...'.

Chapter 6

Conclusions

This thesis examined cache based and power based side channel attacks and their respective countermeasures. The key focus points discussed in the thesis are listed below.

• A cache based attack framework which exploits the execution time variations during the first round of AES encryption. The framework has been successfully demonstrated on a commercial processor simulator, implementing Fournier's attack [20] to recover the complete secret key. The framework requires only 2^12 encryptions, compared to previous attacks (Bernstein with 2^27.5, Kong et al. with 2^24, Tsunoo et al. with 2^16). The framework also enables the verification of countermeasures.

• A hardware/software countermeasure against cache based attacks. The solution embeds the values of the S-Box table used in AES in registers added to the processor. The first encryption round of AES is considered vulnerable and is the most common point of attack. With this countermeasure, as the first encryption round is executed, the data is available in registers, resulting in faster computation and bypassing the cache, which avoids both timing based and access driven cache attacks. The solution is implemented in hardware and software. The hardware modification is the addition of extra registers to the commercial Tensilica processor using the proprietary language TIE (Tensilica Instruction Extension). On the software side, the S-Box in the AES code is replaced by an instantiation of the new hardware module, whose values are used in the encryption (a conceptual sketch of this replacement is given after this list). The combined hardware and software solution results in a 15% reduction in the number of cycles, a 17% reduction in dynamic power and a 30% reduction in energy consumption compared with a typical processor while AES is being executed. The area overhead of the extra hardware is 7.6% (without taking memory into account).

• A single core algorithmic balancing approach against power based side channel attacks on AES cryptography was implemented. The underlying idea is to create a double-width algorithm, introducing complementary signal transitions in conjunction with the original ones in order to balance the power variations. The balancing is implemented at every stage of the AES algorithm so that there is no correlation between internal computations and power variations. The author experimented with DPA on the original AES and on the double-width 16-bit AES, both running on a 32-bit processor. The results show that the DPA plot of the balancing approach does not reveal the secret key byte. The approach uses a single core, compared to the two cores used in MUTE [147], while providing similar obfuscation of the secret key. Instruction code size increases by 16.6% and data memory size by 15.8%, as reported in Chapter 5. These increases are offset by avoiding the hardware cost of alternative countermeasures (e.g., architectural modifications or the addition of another core [17] [15] [29]). The approach requires no hardware modification, which enables it to be easily deployed in most modern embedded systems.
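As a conceptual illustration of the software change referred to in the second point (the sketch promised there), the first-round SubBytes lookups can be redirected from the in-memory table to a register-backed custom instruction. The intrinsic name SBOX_LOOKUP below is hypothetical and merely stands in for the instruction generated from the TIE description; the TIE code itself is not reproduced here.

    #include <stdint.h>

    /* Hypothetical intrinsic exposed by the TIE extension: returns SBox[x] from
     * the registers added to the core, so the lookup never touches the cache.  */
    extern uint8_t SBOX_LOOKUP(uint8_t x);

    extern const uint8_t sbox[256];   /* conventional in-memory S-Box table */

    /* Unprotected first-round SubBytes: each lookup goes through the data cache
     * and can leak information through cache timing and access patterns.       */
    static void sub_bytes_table(uint8_t state[16])
    {
        for (int i = 0; i < 16; i++)
            state[i] = sbox[state[i]];
    }

    /* Protected first-round SubBytes: each lookup is served by the custom
     * instruction, so the first round causes no cache hits or misses.          */
    static void sub_bytes_registers(uint8_t state[16])
    {
        for (int i = 0; i < 16; i++)
            state[i] = SBOX_LOOKUP(state[i]);
    }

Serving the first-round lookups from registers removes the cache hits and misses that timing driven and access driven attacks rely on, which is the effect reflected in the cycle, power and energy figures quoted above.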

6.1 Future Work

The following directions could extend this research in the future:

• The countermeasures against cache based attacks and power based attacks can be combined into a single solution. The implementation of such a solution includes both hardware and software modifications. The hardware modification is to add extra registers, described in the TIE language, embedding the S-Box table values. The software modification is to replace the S-Box table with an instantiation of the new hardware module. Taking this modified AES as a reference, it can then be changed further into a double width algorithm with the original and complementary AES executing in parallel. The solution would bypass the cache, hence avoiding cache attacks, and balance signal transitions to combat power based attacks.

• One implementation of the AES software uses four T-tables, which can be exploited to build a countermeasure against cache based attacks. The solution stores all four AES T-tables in registers and instantiates them in the AES code, thus bypassing the cache for every encryption round. This would protect all rounds of AES from cache based attacks, compared to the countermeasure developed by the author, which protects only the first round.

• Implementing the double width single core solution in an Application Specific Integrated Circuit (ASIC), i.e., a complete hardware approach. Implementing AES in customized hardware with the typical and complementary computations executing in parallel would balance the power variations.

• Finally, the double width single core solution could be extended to counter Correlation Power Analysis (CPA).

Bibliography

[1] J. Kelsey, B. Schneier, D. Wagner, and C. Hall, "Side channel cryptanalysis of product ciphers," J. Comput. Secur., vol. 8, no. 2,3, pp. 141–158, 2000.

[2] P. Kocher, J. Jaffe, and B. Jun, “Introduction to differential power analysis and related attacks.” www.cryptography.com/public/pdf/DPATechInfo.pdf, 1998.

[3] J. Quisquater and D. Samyde, “ElectroMagnetic Analysis (EMA): Measures and Counter-Measures for Smart Cards.,” in E-smart, pp. 200–210, 2001.

[4] E. Biham and A. Shamir, “Differential fault analysis of secret key cryptosystems,” in Proceedings of the 17th Annual International Cryptology Conference on Ad- vances in Cryptology, CRYPTO ’97, (London, UK, UK), pp. 513–525, Springer- Verlag, 1997.

[5] J. Park, H. Lee, and M. Ahn, “Side-channel attacks against on active rfid de- vice,” in Convergence Information Technology, 2007. International Conference on, pp. 2163 –2168, 2007.

[6] N. Lawson, “Side-channel attacks on cryptographic software,” Security Privacy, IEEE, vol. 7, no. 6, pp. 65 –68, 2009.

[7] P. C. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems," in Proceedings of the 16th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO '96, (London, UK), pp. 104–113, Springer-Verlag, 1996.

[8] D. Page, “Theoretical use of cache memory as a cryptanalytic side-channel.,” IACR Cryptology ePrint Archive, vol. 2002, p. 169, 2002.

[9] P. C. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Proceedings of the 19th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO ’99, (London, UK, UK), pp. 388–397, Springer-Verlag, 1999.

[10] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards. Advances in Information Security Series, Springer Sci- ence+Business Media, LLC, 2007.


[11] Z. Wang and R. B. Lee, "New cache designs for thwarting software cache-based side channel attacks," SIGARCH Comput. Archit. News, vol. 35, pp. 494–505, June 2007.

[12] D. Page, "Partitioned cache architecture as a side-channel defense mechanism," Cryptology ePrint Archive Report, 2005.

[13] D. Page, "Defending against cache based side-channel attacks," Information Security Technical Report, vol. 8, pp. 30–44, April 2003.

[14] T. S. Messerges, "Securing the aes finalists against power analysis attacks," in Proceedings of the 7th International Workshop on Fast Software Encryption, FSE '00, (London, UK, UK), pp. 150–164, Springer-Verlag, 2001.

[15] K. Tiri, M. Akmal, and I. Verbauwhede, "A dynamic and differential cmos logic with signal independent power consumption to withstand differential power analysis on smart cards," in Proceedings of the 2002 Solid-State Circuits Conference, ESSCIRC 2002, pp. 403–406, 2002.

[16] K. Tiri, D. Hwang, A. Hodjat, B. Lai, S. Yang, P. Schaumont, and I. Verbauwhede, "A side-channel leakage free coprocessor ic in 0.18um cmos for embedded aes-based cryptographic and biometric processing," in DAC '05, pp. 222–227, ACM Press, 2005.

[17] K. Tiri and I. Verbauwhede, "A digital design flow for secure integrated circuits," Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 25, pp. 1197–1208, July 2006.

[18] K. Tiri and I. Verbauwhede, "A logic level design methodology for a secure dpa resistant asic or fpga implementation," in DATE '04: Proceedings of the conference on Design, automation and test in Europe, (Washington, DC, USA), p. 10246, IEEE Computer Society, 2004.

[19] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, "Design and analysis of dual-rail circuits for security applications," IEEE Trans. Comput., vol. 54, pp. 449–460, Apr. 2005.

[20] J. Fournier and M. Tunstall, "Cache based power analysis attacks on aes," in Proceedings of the 11th Australasian conference on Information Security and Privacy, ACISP'06, (Berlin, Heidelberg), pp. 17–28, Springer-Verlag, 2006.

[21] Microsoft, "Intelligent systems and the next information age." http://www.microsoft.com/en-us/news/features/2011/oct11/10-27EmbeddedDYK.aspx.

[22] P. Kocher, R. Lee, G. McGraw, and A. Raghunathan, "Security as a new dimension in embedded system design," in Proceedings of the 41st annual Design Automation Conference, DAC '04, (New York, NY, USA), pp. 753–760, ACM, 2004. Moderator: Srivaths Ravi.

[23] Y. Zhou and D. Feng, “Side-channel attacks: Ten years after its publication and the impacts on cryptographic module security testing.” Cryptology ePrint Archive, Report 2005/388, 2005.

[24] T. S. Messerges, E. A. Dabbish, and R. H. Sloan, “Examining smart-card security under the threat of power analysis attacks.,” IEEE Trans. Computers, vol. 51, no. 5, pp. 541–552, 2002.

[25] S. Mangard, “A simple power-analysis (spa) attack on implementations of the aes key expansion,” in Proceedings of the 5th international conference on Information security and cryptology, ICISC’02, (Berlin, Heidelberg), pp. 343–358, Springer- Verlag, 2003.

[26] J. A. Ambrose, N. Aldon, A. Ignjatovic, and S. Parameswaran, “Anatomy of differ- ential power analysis for aes,” in Proceedings of the 2008 10th International Sym- posium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC ’08, (Washington, DC, USA), pp. 459–466, IEEE Computer Society, 2008.

[27] S. B. Ors, F. Gurkaynak, E. Oswald, and B. Preneel, “Power-analysis attack on an asic aes implementation,” itcc, vol. 02, p. 546, 2004.

[28] M.-L. Akkar and C. Giraud, “An implementation of des and aes, secure against some attacks,” in Cryptographic Hardware and Embedded Systems CHES 2001 (e. Ko, D. Naccache, and C. Paar, eds.), vol. 2162 of Lecture Notes in Computer Science, pp. 309–318, Springer Berlin Heidelberg, 2001.

[29] J. A. Ambrose, S. Parameswaran, and A. Ignjatovic, “Multiprocessor information concealment architecture to prevent power analysis-based side channel attacks,” Computers & Digital Techniques, IET, vol. 5, no. 1, pp. 1 – 15, 2011.

[30] D. Jayasinghe, J. Fernando, R. Herath, and R. Ragel, “Remote cache timing at- tack on advanced encryption standard and countermeasures,” in Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on, pp. 177–182, 2010.

[31] J. Kong, O. Aciicmez, J.-P. Seifert, and H. Zhou, “Deconstructing new cache de- signs for thwarting software cache-based side channel attacks,” in Proceedings of the 2nd ACM workshop on Computer security architectures, CSAW ’08, (New York, NY, USA), pp. 25–34, ACM, 2008.

[32] I. news Release, “Computing No Longer Confined to the PC-It’s Everywhere.” (http://www.intel.com/pressroom/archive/releases/2010/20100107corp.htm).

[33] S. Ravi, A. Raghunathan, P. Kocher, and S. Hattangady, "Security in embedded systems: Design challenges," ACM Trans. Embed. Comput. Syst., vol. 3, pp. 461–491, Aug. 2004.

[34] “Cyber crime and security survey report 2012.” http://apo.org.au/research/cyber- crime-and-security-survey-report-2012.

[35] I. Damgård, "A design principle for hash functions," in Proceedings of the 9th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO '89, (London, UK, UK), pp. 416–427, Springer-Verlag, 1990.

[36] G. J. Simmons, “Symmetric and asymmetric encryption,” ACM Comput. Surv., vol. 11, pp. 305–330, Dec. 1979.

[37] R. L. Rivest, “The MD5 message digest algorithm.” Internet RFC 1321, April 1992.

[38] D. Eastlake, 3rd (Motorola) and P. Jones (Cisco Systems), "US Secure Hash Algorithm 1 (SHA1)." RFC 3174, http://www.ipa.go.jp/security/rfc/RFC3174EN.html.

[39] X. Zijie, “Dynamic sha2 - the ehash main page.” ehash.iaik.tugraz.at/uploads/5/5b/DyamicSHA2.pdf.

[40] W. G. Barker, Introduction to the Analysis of the Data Encryption Standard (DES). Laguna Hills, CA, USA: Aegean Park Press, 1991.

[41] J. Daemen and V. Rijmen, “AES Proposal: Rijndael.” http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.640.

[42] S. Mister and S. E. Tavares, “Cryptanalysis of rc4-like ciphers,” in Proceedings of the Selected Areas in Cryptography, SAC ’98, (London, UK, UK), pp. 131–143, Springer-Verlag, 1999.

[43] P. Zimmerman, “A proposed standard format for rsa cryptosystems,” Computer, vol. 19, pp. 21–34, Sept. 1986.

[44] W. Diffie and M. E. Hellman, “New Directions in Cryptography,” IEEE Transac- tions on Information Theory, vol. IT-22, no. 6, pp. 644–654, 1976.

[45] N. Doraswamy and D. Harkins, IPSec: The New Security Standard for the Internet, Intranets, and Virtual Private Networks. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1999.

[46] E. Rescorla, SSL and TLS: Designing and Building Secure Systems. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2001.

[47] J. Mizusawa, N. Shigematsu, and H. Itoh, “Virtual private network control system concept,” in Private Switching Systems and Networks, 1988., International Con- ference on, pp. 137–141, 1988.

[48] P. Wohlmacher, "Digital certificates: a survey of revocation methods," in Proceedings of the 2000 ACM workshops on Multimedia, MULTIMEDIA '00, (New York, NY, USA), pp. 111–114, ACM, 2000.

[49] B. Rosenblatt, “Drm, law and technology: an american perspective,” Online Infor- mation Review, vol. 31, no. 1, pp. 73–84, 2007.

[50] H. Fuchs and N. Farber, “Isma interoperability and conformance,” IEEE MultiMe- dia, vol. 12, pp. 96–102, Apr. 2005.

[51] R. Torrance and D. James, “The state-of-the-art in ic reverse engineering,” in Pro- ceedings of the 11th International Workshop on Cryptographic Hardware and Em- bedded Systems, CHES ’09, (Berlin, Heidelberg), pp. 363–381, Springer-Verlag, 2009.

[52] C. Percival, “Cache missing for fun and profit,” in Proc. of BSDCan 2005, 2005.

[53] G. Bertoni, V. Zaccaria, L. Breveglieri, M. Monchiero, and G. Palermo, “Aes power attack based on induced cache miss and countermeasure,” in Proceedings of the International Conference on Information Technology: Coding and Com- puting (ITCC’05) - Volume I - Volume 01, ITCC ’05, (Washington, DC, USA), pp. 586–591, IEEE Computer Society, 2005.

[54] C. C. Tiu, “A new frequency-based side channel attack for embedded systems,” 2005. Master’s thesis, University of Waterloo.

[55] J. Bonneau and I. Mironov, “Cache-collision timing attacks against aes,” in Cryp- tographic Hardware and Embedded Systems - CHES 2006 (L. Goubin and M. Mat- sui, eds.), vol. 4249 of Lecture Notes in Computer Science, pp. 201–215, Springer Berlin / Heidelberg, 2006.

[56] D. J. Bernstein, “Cache-timing attacks on AES,” 2004. http://cr.yp.to/papers.html#cachetiming.

[57] D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi, “The em side-channel(s),” in Revised Papers from the 4th International Workshop on Cryptographic Hard- ware and Embedded Systems, CHES ’02, (London, UK, UK), pp. 29–45, Springer- Verlag, 2003.

[58] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.

[59] G. C. Kessler, “An overview of cryptography,” 1998. http://garykessler.net/library/crypto.html#intro.

[60] B. Schneier, "Description of a new variable-length key, 64-bit block cipher (blowfish)," in Fast Software Encryption (R. Anderson, ed.), vol. 809 of Lecture Notes in Computer Science, pp. 191–204, Springer Berlin Heidelberg, 1994.

[61] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, and N. Ferguson, “Twofish: A 128-bit block cipher,” in in First Advanced Encryption Standard (AES) Confer- ence, 1998.

[62] K. Aoki, T. Ichikawa, M. Kanda, M. Matsui, S. Moriai, J. Nakajima, and T. Tokita, "Camellia: A 128-bit block cipher suitable for multiple platforms - design and analysis," 2000.

[63] M. Matsui, “New block encryption algorithm misty,” in Fast Software Encryption, 4th International Workshop, FSE ’97, Haifa, Israel, January 20-22, 1997, Pro- ceedings, vol. 1267 of Lecture Notes in Computer Science, pp. 54–68, Springer, 1997.

[64] J. Massey, G. Khachatrian, and M. Kuregian, “Nomination of SAFER+ as Candi- date Algorithm for the Advanced Encryption Standard,” in AES Candidate Confer- ence, 1998.

[65] 3GPP TS 35.201, "Specification of the 3gpp confidentiality and integrity algorithms; document 1: f8 and f9 specifications." http://www.3gpp.org.

[66] H. Lee, J. Yoon, S. Lee, and J. Lee, “The SEED Cipher Algorithm and Its Use with IPsec.” RFC 4196 (Proposed Standard), October 2005.

[67] D. Kwon, J. Kim, S. Park, S. H. Sung, Y. Sohn, J. H. Song, Y. Yeom, E.-J. Yoon, S. Lee, J. Lee, S. Chee, D. Han, and J. Hong, “New block cipher: Aria.,” in ICISC (J. I. Lim and D. H. Lee, eds.), vol. 2971 of Lecture Notes in Computer Science, pp. 432–445, Springer, 2003.

[68] L. R. Knudsen and D. Wagner, “On the structure of skipjack.,” Discrete Applied Mathematics, vol. 111, no. 1-2, pp. 103–116, 2001.

[69] FIPS, “Announcing the advanced encryption standard (aes),” Tech. Rep. 197, Fed- eral Information Processing Standards, November 2001.

[70] J. Park, H. Lee, J. Ha, Y. Choi, H. Kim, and S. Moon, “A differential power analy- sis attack of block cipher based on the hamming weight of internal operation unit,” in Computational Intelligence and Security (Y. Wang, Y.-m. Cheung, and H. Liu, eds.), vol. 4456 of Lecture Notes in Computer Science, pp. 417–426, Springer Berlin Heidelberg, 2007.

[71] T. S. Messerges, E. A. Dabbish, and R. H. Sloan, "Investigations of power analysis attacks on smartcards," in WOST'99: Proceedings of the USENIX Workshop on Smartcard Technology, (Berkeley, CA, USA), pp. 17–30, USENIX Association, 1999.

[72] R. Mayer-Sommer, “Smartly Analyzing the Simplicity and the Power of Simple Power Analysis on Smartcards,” in CHES ’00: Proceedings of the Second Inter- national Workshop on Cryptographic Hardware and Embedded Systems, (London, UK), pp. 78–92, Springer-Verlag, 2000.

[73] S. Chari, C. Jutla, J. R. Rao, and P. Rohatgi, “A cautionary note regarding evalua- tion of aes candidates on smart-cards,” in In Second Advanced Encryption Standard (AES) Candidate Conference, pp. 133–147.

[74] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, and N. Ferguson, “Twofish: A 128-bit block cipher,” in in First Advanced Encryption Standard (AES) Confer- ence, 1998.

[75] J. Daemen and V. Rijmen, The Design of Rijndael. Secaucus, NJ, USA: Springer- Verlag New York, Inc., 2002.

[76] C. H. Lim, “A revised version of - crypton v1.0,” in FSE, pp. 31–45, 1999.

[77] L. Knudsen, “Deal - a 128-bit block cipher,” in NIST AES Proposal, 1998.

[78] A. D. Angeli, S. Brahnam, P. Wallis, and A. Dix, “Ntt-nippon telegraph and tele- phone corporation, e2: Efficient encryption algorithm,” in AES Proposal, pp. 1647– 1650, 1998.

[79] C. Burwick, D. Coppersmith, E. D'Avignon, R. Gennaro, S. Halevi, C. Jutla, S. M. Matyas Jr., L. O'Connor, M. Peyravian, D. Safford, and N. Zunic, "Mars - a candidate cipher for aes," NIST AES Proposal, 1999.

[80] M. J. Jacobson and K. Huber, “The MAGENTA Block Cipher Algorithm,” in AES Candidate Conference.

[81] R. Anderson, E. Biham, and L. Knudsen, “Serpent: A Proposal for the Advanced Encryption Standard,” in Proceedings of the First AES Candidate Conference, (Ventura, CA, USA), National Institute of Standard and Technology, June 1998.

[82] R. L. Rivest, M. J. B. Robshaw, R. Sidney, and Y. L. Yin, "The RC6 block cipher," in First Advanced Encryption Standard (AES) Conference, p. 16, 1998.

[83] L. Brown and J. Pieprzyk, “Introducing the new LOKI97 Block Cipher,” in AES Candidate Conference, 1998.

[84] R. Schroeppel, “An overview of the Hasty Pudding Cipher,” in AES Candidate Conference, 1998.

[85] D. Georgoudis, D. Leroux, and B. S. Chaves, "The FROG Encryption Algorithm," in AES Candidate Conference, 1998.

[86] H. Gilbert, M. Girault, P. Hoogvorst, and F. Noilhan, "Decorrelated Fast Cipher: an AES Candidate," in AES Candidate Conference, 1998.

[87] C. Adams, "The CAST256 Encryption Algorithm," in AES Candidate Conference, 1998.

[88] E. Biham and A. Shamir, "Differential cryptanalysis of des-like cryptosystems," in Proceedings of the 10th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO '90, (London, UK, UK), pp. 2–21, Springer-Verlag, 1991.

[89] M. Matsui, "Linear cryptanalysis method for des cipher," in Advances in Cryptology EUROCRYPT 93 (T. Helleseth, ed.), vol. 765 of Lecture Notes in Computer Science, pp. 386–397, Springer Berlin Heidelberg, 1994.

[90] T. S. Messerges, E. A. Dabbish, and R. H. Sloan, "Investigations of power analysis attacks on smartcards," in USENIX Workshop on Smartcard Technology, pp. 151–162, 1999.

[91] J.-S. Coron and L. Goubin, "On boolean and arithmetic masking against differential power analysis," in Proceedings of the Second International Workshop on Cryptographic Hardware and Embedded Systems, CHES '00, (London, UK, UK), pp. 231–237, Springer-Verlag, 2000.

[92] T. S. Messerges, "Using second-order power analysis to attack dpa resistant software," in CHES '00: Proceedings of the Second International Workshop on Cryptographic Hardware and Embedded Systems, (London, UK), pp. 238–251, Springer-Verlag, 2000.

[93] J. Waddle and D. Wagner, "Towards efficient second-order power analysis," in the Proceedings of the 1st International Workshop on Cryptographic Hardware and Embedded Systems, vol. 3156 of Lecture Notes in Computer Science, pp. 1–15, 2004.

[94] E. Brier, C. Clavier, and F. Olivier, "Correlation power analysis with a leakage model," in Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop, Cambridge, MA, USA, August 11-13, 2004, Proceedings, vol. 3156 of Lecture Notes in Computer Science, pp. 16–29, Springer, 2004.

[95] J.-S. Coron and A. Tchulkine, "A new algorithm for switching from arithmetic to boolean masking," in Cryptographic Hardware and Embedded Systems - CHES 2003 (C. Walter, Ç. K. Koç, and C. Paar, eds.), vol. 2779 of Lecture Notes in Computer Science, pp. 89–97, Springer Berlin Heidelberg, 2003.

[96] L. Goubin, "A sound method for switching between boolean and arithmetic masking," in Cryptographic Hardware and Embedded Systems CHES 2001 (Ç. K. Koç, D. Naccache, and C. Paar, eds.), vol. 2162 of Lecture Notes in Computer Science, pp. 3–15, Springer Berlin Heidelberg, 2001.

[97] L. Goubin and J. Patarin, "Des and differential power analysis: the duplication method," in Cryptographic Hardware and Embedded Systems (Ç. K. Koç and C. Paar, eds.), vol. 1717 of Lecture Notes in Computer Science, pp. 158–172, Springer Berlin Heidelberg, 1999.

[98] S. Chari, C. S. Jutla, J. R. Rao, and P. Rohatgi, "Towards sound approaches to counteract power-analysis attacks," pp. 398–412, Springer-Verlag, 1999.

[99] M.-L. Akkar and L. Goubin, "A generic protection against high-order differential power analysis," in Fast Software Encryption (T. Johansson, ed.), vol. 2887 of Lecture Notes in Computer Science, pp. 192–205, Springer Berlin Heidelberg, 2003.

[100] E. Trichina, D. Seta, and L. Germani, "Simplified adaptive multiplicative masking for aes," in Cryptographic Hardware and Embedded Systems - CHES 2002 (B. Kaliski, Ç. K. Koç, and C. Paar, eds.), vol. 2523 of Lecture Notes in Computer Science, pp. 187–197, Springer Berlin Heidelberg, 2003.

[101] K. Schramm and C. Paar, "Higher order masking of the aes," in Topics in Cryptology CT-RSA 2006 (D. Pointcheval, ed.), vol. 3860 of Lecture Notes in Computer Science, pp. 208–225, Springer Berlin Heidelberg, 2006.

[102] J. D. Golic and C. Tymen, "Multiplicative masking and power analysis of aes," in Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, CHES '02, (London, UK, UK), pp. 198–212, Springer-Verlag, 2003.

[103] J. Blömer, J. Guajardo, and V. Krummel, "Provably secure masking of aes," in Selected Areas in Cryptography (H. Handschuh and M. Hasan, eds.), vol. 3357 of Lecture Notes in Computer Science, pp. 69–83, Springer Berlin Heidelberg, 2005.

[104] E. Trichina, "Combinational logic design for aes subbyte transformation on masked data," tech. rep., IACR report, 2003.

[105] E. Oswald, S. Mangard, N. Pramstaller, and V. Rijmen, "A side-channel analysis resistant description of the aes s-box," in Fast Software Encryption (H. Gilbert and H. Handschuh, eds.), vol. 3557 of Lecture Notes in Computer Science, pp. 413–423, Springer Berlin Heidelberg, 2005.

[106] E. Oswald and K. Schramm, "An efficient masking scheme for aes software implementations," in Information Security Applications (J.-S. Song, T. Kwon, and M. Yung, eds.), vol. 3786 of Lecture Notes in Computer Science, pp. 292–305, Springer Berlin Heidelberg, 2006.

[107] E. Trichina, T. Korkishko, and K. H. Lee, "Small size, low power, side channel-immune aes coprocessor: design and synthesis results," in Proceedings of the 4th international conference on Advanced Encryption Standard, AES'04, (Berlin, Heidelberg), pp. 113–127, Springer-Verlag, 2005.

[108] J. D. Golic and R. Menicocci, “Universal masking on logic gate level,” 2004. http://www.etsi.org/services/security-algorithms/3gpp-algorithms.

[109] W. Fischer and B. M. Gammel, “Masking at Gate Level in the Presence of Glitches,” in Cryptographic Hardware and Embedded Systems CHES 2005, pp. 187–200, 2005.

[110] Z. Chen and Y. Zhou, “Dual-rail random switching logic: A countermeasure to re- duce side channel leakage,” in Cryptographic Hardware and Embedded Systems - CHES 2006, 8th International Workshop, Yokohama, Japan, October 10-13, 2006, Proceedings (L. Goubin and M. Matsui, eds.), vol. 4249 of Lecture Notes in Com- puter Science, pp. 242–254, Springer, 2006.

[111] "Specifications of the 3gpp confidentiality and integrity algorithms." http://www.etsi.org/services/security-algorithms/3gpp-algorithms, 1999.

[112] J. Coron, P. C. Kocher, and D. Naccache, “Statistics and secret leakage,” in FC ’00: Proceedings of the 4th International Conference on Financial Cryptography, (London, UK), pp. 157–173, Springer-Verlag, 2001.

[113] K. Tiri and I. Verbauwhede, “A dynamic and differential cmos logic style to re- sist power and timing attacks on security ics..” Cryptology ePrint Archive, Report 2004/066, 2004.

[114] R. Muresan and C. H. Gebotys, “Current flattening in software and hardware for security applications.,” in CODES+ISSS, pp. 218–223, 2004.

[115] J. A. Ambrose, R. G. Ragel, and S. Parameswaran, “Randomized instruction injec- tion to counter power analysis attacks,” ACM Trans. Embed. Comput. Syst., vol. 11, pp. 69:1–69:28, Sept. 2012.

[116] D. May, H. L. Muller, and N. P. Smart, “Non-deterministic processors,” in Pro- ceedings of the 6th Australasian Conference on Information Security and Privacy, ACISP ’01, (London, UK, UK), pp. 115–129, Springer-Verlag, 2001.

[117] J. Irwin, D. Page, and N. P. Smart, “Instruction stream mutation for non- deterministic processors,” in ASAP ’02: Proceedings of the IEEE Interna- tional Conference on Application-Specific Systems, Architectures, and Processors, (Washington, DC, USA), p. 286, IEEE Computer Society, 2002.

[118] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 4th ed., 2008.

[119] J. Kelsey, B. Schneier, D. Wagner, and C. Hall, “Side channel cryptanalysis of product ciphers,” in Proceedings of the 5th European Symposium on Research in Computer Security, (London, UK), pp. 97–110, Springer-Verlag, 1998.

[120] J. Nakahara, Jorge, “A linear analysis of blowfish and khufu,” in Information Secu- rity Practice and Experience (E. Dawson and D. Wong, eds.), vol. 4464 of Lecture Notes in Computer Science, pp. 20–32, Springer Berlin Heidelberg, 2007.

[121] Y. Tsunoo, E. Tsujihara, M. Shigeri, H. Kubo, and K. Minematsu, “Improving cache attacks by considering cipher structure,” International Journal of Informa- tion Security, vol. 5, pp. 166–176, 2006. 10.1007/s10207-005-0079-7.

[122] D. Brumley and D. Boneh, “Remote timing attacks are practical,” in Proceedings of the 12th conference on USENIX Security Symposium - Volume 12, SSYM’03, (Berkeley, CA, USA), pp. 1–1, USENIX Association, 2003.

[123] E. W. Felten and M. A. Schneider, “Timing attacks on web privacy,” in Proceedings of the 7th ACM conference on Computer and communications security, CCS ’00, (New York, NY, USA), pp. 25–32, ACM, 2000.

[124] F. Koeune and J.-J. Quisquater, "A timing attack against Rijndael," 1999. http://www.dice.ucl.ac.be/crypto/techreports.html.

[125] Computer Systems Laboratory (U.S.), Data Encryption Standard (DES), 1994. Category: computer security, subcategory: cryptography. Supersedes FIPS PUB 46-1–1988 January 22. Reaffirmed December 30, 1993. Shipping list no.: 94-0171- P.

[126] National Institute of Standards and Technology, Advanced Encryption Standard (AES), 2001. Supersedes FIPS PUB 197–2001 November.

[127] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: the case of aes,” in Topics in Cryptology - CT-RSA 2006, The Cryptographers Track at the RSA Conference 2006, pp. 1–20, Springer-Verlag, 2005.

[128] S. Moore, R. Anderson, and M. Kuhn, “Improving Smart card Security Using Self- Timed Circuit Technology,” in ACiD-WG ’00: Forth ACiD-WG Workshop, (Greno- ble), 2000.

[129] J. McCardle and D. Chester, "Measuring an Asynchronous Processor's Power and Noise," 2001.

[130] T. S. Messerges, Power analysis attacks and countermeasures for cryptographic algorithms. PhD thesis, Chicago, IL, USA, 2000. AAI9978665.

[131] E. Tromer, D. A. Osvik, and A. Shamir, "Efficient cache attacks on aes, and countermeasures," J. Cryptol., vol. 23, pp. 37–71, Jan. 2010.

[132] J. Kong, O. Acıiçmez, J.-P. Seifert, and H. Zhou, "Hardware-software integrated approaches to defend against software cache-based side channel attacks," in HPCA, pp. 393–404, IEEE Computer Society, 2009.

[133] S. Gueron, “White paper: Advanced encryption standard (aes) instruction set,” July 2008.

[134] S. Tillich and J. Groschdl, “Instruction set extensions for efficient aes implementa- tion on 32-bit processors.,” in CHES, pp. 270–284, 2006.

[135] S. M. Min, J. Peddersen, and S. Parameswaran, “Realizing cycle accurate processor memory simulation via interface abstraction,” in VLSI Design, pp. 141–146, 2011.

[136] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob, “Dram- sim: a memory system simulator,” SIGARCH Comput. Archit. News, vol. 33, pp. 100–107, November 2005.

[137] T. Inc., “Xtensa Processor.” Tensilica Inc. (http://www.tensilica.com).

[138] S. Danil, M. Julian, B. Alexander, and Y. Alex, “Design and analysis of dual-rail circuits for security applications,” IEEE Trans. Comput., vol. 54, no. 4, pp. 449– 460, 2005.

[139] J. A. Ambrose, S. Parameswaran, and A. Ignjatovic, “Mute-aes: a multiproces- sor architecture to prevent power analysis based side channel attack of the aes algorithm,” in Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’08, (Piscataway, NJ, USA), pp. 678–684, IEEE Press, 2008.

[140] “The PEAS Team. ASIP Meister.” ASIP (http://www.asip-solutions.com/english/).

[141] “Simplescalar.” SimpleScalar LLC (http://www.simplescalar.com/).

[142] “Synopsys design compiler.” Synopsys (http://www.synopsys.com/Tools/Implementation/RTLSynthesis /DesignCompiler/Pages/default.aspx).

[143] “Modelsim.” Mentor Graphics (http://www.model.com).

[144] “Synopsys primetime.” Synopsys (http://www.synopsys.com/Tools/Implementation/SignOff /Pages/PrimeTime.aspx).

[145] P. Kocher, J. Jaffe, and B. Jun, “Differential Power Analysis,” Lecture Notes in Computer Science, vol. 1666, pp. 388–397, 1999.

[146] E. Brier, C. Clavier, and F. Olivier, "Correlation power analysis with a leakage model," in CHES, pp. 16–29, 2004.

[147] J. A. Ambrose, S. Parameswaran, and A. Ignjatovic, "MUTE-AES: A Multiprocessor Architecture to prevent Power Analysis based Side Channel Attack of the AES Algorithm," in ICCAD, pp. 489–492, 2008.