
PHYSICAL ATTACKS AND COUNTERMEASURES ON THE ADVANCED ENCRYPTION STANDARD

DISSERTATION

for the degree of Doktor-Ingenieur of the Faculty of Electrical Engineering and Information Technology at the Ruhr-Universität Bochum, Germany

by Oliver Marc Mischke
Bochum, April 2015

Copyright © 2015 by Oliver Marc Mischke. All rights reserved. Printed in Germany.

To Elfi, Norbert, and Melanie

Oliver Marc Mischke
Place of birth: Frankfurt am Main, Germany
Author's contact information: [email protected]
http://www.sha.rub.de/

Thesis Advisor: Prof. Dr.-Ing. Tim Güneysu, Ruhr-Universität Bochum, Germany
Secondary Referee: Prof. Dr.-Ing. Stefan Mangard, Graz University of Technology, Austria
Tertiary Referee: Dr. Amir Moradi, Ruhr-Universität Bochum, Germany
Thesis submitted: April 14, 2015
Thesis defense: May 18, 2015
Last revision: May 2, 2016


Abstract

With the increasing pervasion of embedded computing devices in our everyday life, there also arises the need to protect these devices by means of strong cryptography. This may be required to protect the intellectual property of a vendor, to secure the confidentiality of sensitive data, or to establish secure means of communication.

The preferred cryptographic algorithm in many applications, especially commercial ones, is the Advanced Encryption Standard (AES). It was selected in 2001 by the National Institute of Standards and Technology (NIST) in a public competition whose aim was to find the most suitable successor to the outdated Data Encryption Standard (DES) algorithm. Due to its short key size and low performance in software implementations, DES could no longer satisfy the requirements imposed by many applications.

While AES remains a very secure algorithm in a black-box attack scenario, care has to be taken when designing a physical implementation for embedded devices. Since these devices are in the field and must therefore be considered as operating in a hostile environment, they are susceptible to a multitude of physical attacks. This includes passive attacks like measuring the data-dependent power consumption while computing on sensitive data (so-called power analysis), and also active attacks where the device is forced into faulty behavior by being operated outside the defined operating conditions (e.g., clock or voltage spikes).

Many countermeasures have been proposed to protect implementations of AES against those attacks, but the resistance of these countermeasures when deployed on actual hardware is seldom evaluated in sufficient detail. For example, even recently, some countermeasures were proposed claiming resistance to power analysis purely considering a Hamming Distance (HD) leakage metric on registers. Considering that glitches in the underlying hardware gates are a major reason for the leakage of supposedly masked data, designs based on such a pure HD metric can never provide a sufficient level of protection when implemented in hardware.

This dissertation aims to address the problems arising from the practical utilization of theoretical countermeasures in hardware implementations. We have evaluated the suitability of several countermeasure proposals for achieving a high level of resistance when implemented on FPGAs. Using collision attacks, we are able to detect leakages without relying on hypothetical power models, which are usually not able to adequately capture real device behavior. We also propose a new technique for implementing a Boolean masking scheme in a glitch-free manner, making use of special FPGA resources and characteristics.

In addition, we present new variants of an active fault attack. They allow the recovery of the data-dependent timing behavior of S-boxes and can thereby extract the secrets. It is also shown how a Zero-Value vulnerability in S-boxes implemented using a composite field approach can be exploited to break implementations even if they are equipped with sophisticated error detection schemes.

Keywords.

Physical Attacks, Side-Channel Attacks, Side-Channel Countermeasures, Power Analysis, Fault Analysis, Advanced Encryption Standard (AES), Masking Schemes, Concurrent Error Detection (CED), Collision Attacks, Fault Sensitivity Analysis (FSA), Glitches.

Kurzfassung

Physical Attacks and Countermeasures on the Advanced Encryption Standard Block Cipher

With the growing prevalence of embedded processors in everyday devices, the need to protect them by means of strong cryptography is growing as well. This may be required to protect intellectual property, to keep sensitive user data confidential, or to establish secure communication channels.

The preferred algorithm, especially for commercial applications, is the Advanced Encryption Standard (AES). In 2001, the National Institute of Standards and Technology (NIST) selected AES in a public competition as the most suitable successor to the Data Encryption Standard (DES). Because of its insufficient key length and inadequate execution speed in software implementations, DES could no longer meet current and future requirements.

Although AES can be regarded as mathematically highly secure, the picture changes for physical realizations of the algorithm in hardware. Since the devices are in the hands of the user, there are many opportunities to mount physical attacks. One example of a passive attack is measuring the data-dependent power consumption while encryption and decryption operations are carried out (power analysis). Active attacks, such as forcing a faulty computation in the device by means of voltage spikes (fault injection attacks), are also feasible in these settings.

Numerous countermeasures have been proposed in the past to make physical attacks on AES implementations more difficult, but their effectiveness in practice has often been examined only insufficiently. Only recently, a countermeasure was presented whose security rests on the assumption that only the dynamic power consumption caused by overwriting registers in the circuit needs to be protected. Given that physical effects at the gate level are one of the main reasons for the insecurity of supposedly protected implementations, such a countermeasure cannot meet expectations.

The focus of this dissertation is the practical investigation of countermeasures that often have only a theoretical foundation. The achievable security level of a variety of countermeasures is evaluated on reconfigurable hardware platforms (FPGAs). With the help of collision attacks, it was possible to find even those information leaks that do not correspond to known theoretical models. One of the research contributions of this dissertation is a new implementation technique with which cryptographic circuits can be realized securely in FPGAs.

In addition, two novel variants of an active fault injection attack are presented, which make it possible to determine the data-dependent propagation delay of signals in cryptographic S-boxes and thereby break the implementation. It is also demonstrated how a particular weakness of S-boxes implemented in an extension field allows key extraction even when implementations are protected with special fault detection algorithms.

Keywords.

Physical Attacks, Side-Channel Attacks, Side-Channel Countermeasures, Power Analysis, Fault Analysis, Advanced Encryption Standard (AES), Masking Schemes, Fault Detection, Collision Attacks, Fault Sensitivity Analysis (FSA), Glitches.

Acknowledgements

This thesis is the outcome of three and a half years in the Hardware Security group (SHA) of the Horst Görtz Institute for IT-Security (HGI), Ruhr University Bochum (RUB). I found not only colleagues and co-authors, but also close friends who made sure that my time in and outside the university was always enjoyable. The same is true for everyone in the Embedded Security group (EMSEC), with whom we shared the office space. Thanks to all of you for a wonderful time, I will not forget you!

Special thanks go to my advisor Tim Güneysu, who accepted me as a PhD student when I wanted to leave industry to pursue an academic career. A big shout-out also to Amir Moradi, who took me with him on his exciting journey of side-channel research. Thanks a lot to both of you, without your guidance and support I would not be who I am today. I would also like to thank Stefan Mangard for taking the time to be my secondary referee and for providing excellent feedback on my thesis.

Our groups would not have been the same without our non-scientific staff, Irmgard Kühn and Horst Edelmann, who kept a lot of administrative and technical issues away from us so that we could focus on the science. Thanks for all your support and for always providing kind words when needed. Another special shout-out goes to Elif Kavun and Alexander Wild, who endured sharing offices with me and made sure that we always had fun no matter how close the next deadline was.

This thesis would also not have been possible without the help of all my co-authors (in alphabetical order): Georg Becker, Wayne Burleson, Benedikt Driessen, Thomas Eisenbarth, Tim Güneysu, Markus Kasper, Elif Kavun, Yang Li, Amir Moradi, Kazuo Ohta, Christof Paar, Thomas Pöppelmann, Christopher Pöpper, Kazuo Sakiyama, Pascal Sasdrich, Michal Varchola. Thanks to all of you for the fruitful discussions and collaborations!

I would also like to express my gratitude to all the people who proof-read this thesis, and especially to Gesine Hinterwälder for keeping me sane during the last days of thesis writing and defense. To all the loved ones, my family and friends: you provided me the support I needed all the way along. This work would not have been possible without you, thank you for being there! Last but not least, to all other people I forgot to mention here: You know I owe you, thanks!

Table of Contents

Imprint
Abstract
Kurzfassung
Acknowledgements

1 Introduction
  1.1 Physical Attacks
  1.2 Motivation
  1.3 Research Contributions and Outline

I Side-Channel Analysis of AES

2 Preliminaries
  2.1 Introduction
  2.2 Glitches in Hardware Circuits
  2.3 Side-Channel Attacks
    2.3.1 Differential Power Analysis (DPA)
    2.3.2 Template Attacks
    2.3.3 Correlation-Enhanced Collision Attack (CECA)
  2.4 Side-Channel Countermeasures
    2.4.1 Hiding
    2.4.2 Masking
  2.5 Contributions

3 Analysis of Boolean Masking & Hiding
  3.1 Introduction
  3.2 Architecture
    3.2.1 32-bit Architecture
    3.2.2 Unrolled Architecture
    3.2.3 Target Platform and Measurement Setup
  3.3 Evaluation of Masking & Shuffling
    3.3.1 Profile A: No Countermeasures
    3.3.2 Profile B: Column-wise Shuffling Only
    3.3.3 Profile C: Masking Only
    3.3.4 Profile D: Masking and Column-wise Shuffling
    3.3.5 Profile E: Masking, Column-wise and Instance Shuffling
  3.4 Evaluation of Masking & Pipelining
    3.4.1 One Round per Clock Cycle
    3.4.2 Two Rounds per Clock Cycle
    3.4.3 Three Rounds per Clock Cycle
    3.4.4 Four Rounds per Clock Cycle
  3.5 Conclusion

4 Glitch-free Implementation of Masking
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Masked AES S-box
    4.2.2 Xilinx FPGA Resources
  4.3 Design
  4.4 Evaluation
  4.5 Conclusion

5 Implementation of a Glitch-resistant Masking Scheme
  5.1 Introduction
  5.2 Target Scheme
  5.3 Implementation & Caveats
  5.4 Conclusion

6 Are Dual Ciphers a Side-Channel Countermeasure?
  6.1 Introduction
  6.2 Dual Cipher Concept
  6.3 Design
  6.4 Evaluation
    6.4.1 Mask Reuse
    6.4.2 Concurrent Processing of Mask and the Masked Data
    6.4.3 Unbalance
    6.4.4 Zero Value
    6.4.5 Practical Investigations
  6.5 Conclusion

II Fault Analysis of AES

7 Preliminaries
  7.1 Introduction
  7.2 Fault Injection Attacks
    7.2.1 Differential Fault Analysis (DFA)
    7.2.2 Fault Sensitivity Analysis (FSA)
  7.3 Countermeasures
    7.3.1 Sensors
    7.3.2 Concurrent Error Detection (CED) Schemes
  7.4 Joint Motivation and Contributions

8 Correlation Timing Analysis
  8.1 Introduction
  8.2 Notations
    8.2.1 How to Measure the Timing
    8.2.2 Definitions
  8.3 Attack Description
  8.4 Evaluation
  8.5 Conclusion

9 Zero-Value Fault Sensitivity Analysis
  9.1 Introduction
  9.2 Designing an Architecture for Evaluation
  9.3 Zero-Value Fault Sensitivity Analysis (ZV-FSA)
    9.3.1 Zero-Value Vulnerability
    9.3.2 Attack Methodology
  9.4 Evaluation of the Attack
    9.4.1 Profile 1: Time redundancy-based CED
    9.4.2 Profile 2: Invariance-based CED
    9.4.3 Profile 3: Randomized permutations
  9.5 Conclusion

III Conclusion

10 Conclusion

IV Appendix

Bibliography
List of Abbreviations
List of Figures
List of Tables
About the Author
Author's Publications

Chapter 1

Introduction

This chapter briefly introduces physical attacks, states the motivation, the research contributions, and outlines the structure of the complete dissertation.

Contents of this Chapter
1.1 Physical Attacks
1.2 Motivation
1.3 Research Contributions and Outline

We are surrounded by an increasing number of embedded computing platforms. Besides traditional smart cards used in the banking sector, mobile phones have also grown into powerful processing platforms and are often accompanied by other electronic gadgets like smart watches or fitness trackers. Home electronics are also upgraded with more and more computing and remote control capabilities, creating the so-called Smart Home. All those devices handle sensitive data and therefore require cryptographic solutions to protect the privacy of the user, keep data confidential, and enable secure means of communication.

There exists a multitude of cryptographic algorithms capable of providing the required level of protection in a traditional black-box attack scenario, where only the input and output of a cipher can be observed. However, since the discovery of physical attacks in the late 1990s [Koc96,KJJ99], implementations without dedicated countermeasures can nowadays be broken easily. Instead of relying on mathematical cryptanalysis, attackers can measure the data-dependent timing behavior or power consumption of a device during the execution of a cryptographic algorithm and extract secret data.

This chapter provides a broader overview on physical attacks, the research motivation, and outlines the structure of the dissertation and the research contributions.


1.1 Physical Attacks

Physical attacks can be categorized in different ways. The main distinction is whether an attack is invasive or non-invasive. In addition, attacks can be further classified into active and passive attacks.

In an invasive attack a chip package is opened to allow probing or modifications. If the chip's functionality is not altered but, e.g., micro-probes are used to read out confidential data from a bus, this constitutes a passive invasive attack. Forcing signals on bus lines or rerouting signals by means of a Focused Ion Beam (FIB) are examples of active invasive attacks. Often a third category, semi-invasive attacks, is used, in which the package is removed but the passivation layer remains intact. Faults can then be introduced using, e.g., lasers, UV light, X-rays, or alpha radiation. For further reading on (semi-)invasive attacks we refer to [Sko05].

Non-invasive attacks are harder to detect since no observable damage is done to the chip. This makes them especially dangerous: if an attack cannot be detected by the owner, mitigation steps like revocation or exchange of compromised keys might not be performed. Examples of active non-invasive attacks are the use of clock glitches or voltage spikes to introduce faulty behavior into the chip. Faults might make the chip accept wrong passwords as correct or skip a number of encryption rounds, thereby weakening the cryptographic primitives in use and allowing mathematical cryptanalysis.

Passive non-invasive attacks require neither chip manipulations nor fault injections. In those so-called side-channel attacks, physical and logical characteristics of a device are measured. Examples include the timing [Koc96], the power consumption [KJJ99], or the electromagnetic emanation. If the execution time, i.e., the instruction flow, depends on some secret information, or if the power consumption depends on the processed data, as is the case in CMOS technology, gathering this information is often enough to deduce the secrets. For more information on passive non-invasive attacks, especially power analysis, we refer to [MOP07] as a starting point. Chapter 2 and Chapter 7 also provide more insights into relevant non-invasive passive and active attacks.

1.2 Motivation

The AES algorithm is the de-facto industry standard for symmetric encryption solutions. While many countermeasures have been proposed to protect hardware implementations of AES from physical attacks, they often lack a sufficient practical security evaluation, i.e., an evaluation on a real hardware platform. The problem lies in the fact that many countermeasures are designed based on specific assumptions about the leakage of the intermediate states of an algorithm in hardware. Even if applying one of those countermeasures should in theory make an implementation perfectly secure against a specific attack, it cannot provide the promised level of security if the practical leakage behavior of the target platform does not perfectly match the model.

In this dissertation we therefore aim to analyze different ways to secure hardware implementations of AES against non-invasive physical attacks. Our evaluations of countermeasures are performed on different Field Programmable Gate Array (FPGA) platforms, sold under the name SASEBO [sas], which were specifically designed to facilitate power analysis attacks. By utilizing collision attacks, we avoid relying on hypothetical power models, which usually do not adequately cover real device behavior.

Using FPGA platforms has several advantages, not only for an evaluation of the considered algorithms, but also in commercial settings. Their reconfiguration capabilities enable the implementation and subsequent analysis of various configurations of a target countermeasure in a timely manner. The same reconfigurability in FPGA-based commercial products also makes it possible to upgrade any integrated cryptographic engines. This could be required in case flaws are found in the implementations of countermeasures or if new attacks make more robust security measures necessary. Because of the reduced time-to-market and the lower development costs, FPGAs are often preferred over Application Specific Integrated Circuits (ASICs) if the product volume does not exceed a certain threshold.

1.3 Research Contributions and Outline

Besides this introductory chapter, this dissertation consists of three parts. Part I, Side-Channel Analysis (SCA) of AES, presents the relevant work we have performed on countermeasures to protect AES. It includes evaluations of externally proposed countermeasures as well as a proposal of our own. Part II, Fault Analysis (FA) of AES, showcases some advancements on FSA, namely the Collision Timing Attack (CTA) and ZV-FSA. Part III, Conclusion, concludes the dissertation and outlines future directions.

Note that the thesis author has partnered with Amir Moradi during all the practical evaluations presented in this thesis. With the exception of Chapter 9, where the thesis author was solely responsible, the workload was split with the thesis author being responsible for all implementation aspects while the evaluations were led by Amir Moradi. To keep the focus on the thesis author's contributions, the evaluation aspects of some chapters are therefore shortened or cut in this thesis, with references being given to the complete publications when required.

In the following, we list the parts and their chapters together with the respective contributions and the corresponding publications.


• Chapter 1: A general introduction to this dissertation is given. This includes the motivation, the general structure, and a summary of the research contributions.

• Part I

  ◦ Chapter 2: This chapter provides an introduction into the field of side-channel attacks and countermeasures. Techniques relevant to this dissertation are discussed, which helps to sort the dissertation into the larger field of SCA research. In particular, the Correlation-Enhanced Collision Attack (CECA) is introduced, which will be used to evaluate the resistance of the analyzed countermeasures.

  ◦ Chapter 3: First a Boolean masking scheme is evaluated, then different hiding techniques are also considered in combination to strengthen the resistance of the masking scheme under test. Results of this work were published at the IEEE International Symposium on Hardware-Oriented Security and Trust (HOST) in 2011 [MMP11a].

  ◦ Chapter 4: As a follow-up to the previous chapter, the problem of glitches in hardware gates is addressed by manually mapping the masking scheme to the available FPGA resources and by using special enable signals. Results of this work were accepted to HOST in 2012 [MM12a].

  ◦ Chapter 5: A masking scheme based on Shamir's secret sharing and Multi-Party Computation (MPC) is presented with a focus on the intricate implementation aspects. Results from this research were presented at the International Workshop on Cryptographic Hardware and Embedded Systems (CHES) in 2013 [MM13b].

  ◦ Chapter 6: This chapter analyzes if the idea of using dual ciphers of AES as an SCA countermeasure has any merits. Detailed reasoning is provided on the infeasibility of this idea. Results of this work were published at the International Conference on Information and Communications Security (ICICS) in 2013 [MM13a].

• Part II

  ◦ Chapter 7: The purpose of this chapter is to provide an introduction into the field of non-invasive fault injection attacks and countermeasures. In particular, Fault Sensitivity Analysis (FSA) is described, which is the basis for the advanced attacks in the following chapters.

  ◦ Chapter 8: The Collision Timing Attack (CTA) is presented, which merges the benefits of the CECA and FSA. Results from this work were first accepted to CHES in 2011 [MMP+11b]. An extended version was published in IEEE Transactions on Computers in 2013 [MMP13].

  ◦ Chapter 9: Taking a closer look at the Zero-Value vulnerability of composite-field S-boxes, we present another advancement to FSA which is able to recover the secret from an AES implementation protected by a state-of-the-art CED scheme. Results of this work were published at the Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC) in 2014 [MMG14].

• Part III

  ◦ Chapter 10: We summarize our results and discuss open topics and guidelines for future work.

Note that the author of this thesis has contributed to other topics, such as Elliptic Curve Cryptography (ECC) accelerators; however, they are not included in this thesis due to its focus on physical attacks and countermeasures on AES. A detailed list of all publications can be found at the end of this thesis.


Part I

Side-Channel Analysis of AES

Chapter 2

Preliminaries

This chapter provides a brief introduction into side-channel attacks and countermeasures. In particular the Correlation-Enhanced Collision Attack (CECA) is described, which is used as a means to evaluate the security of our implemented countermeasures in the following chapters. Lastly, a brief outline of the remaining work presented in this SCA part of the dissertation is given.

Contents of this Chapter
2.1 Introduction
2.2 Glitches in Hardware Circuits
2.3 Side-Channel Attacks
2.4 Side-Channel Countermeasures
2.5 Contributions

2.1 Introduction

With the increasing pervasion of cryptography in more and more embedded systems, used either to protect the intellectual property of a vendor or to preserve privacy by enabling secure communication, the need for secure implementations of cryptographic primitives like AES is at an all-time high. These implementations should not only be resistant to classical attacks but also be protected against side-channel attacks like power analysis [KJJ99,MOP07].

Many different kinds of countermeasures have been proposed for the protection of either software implementations or hardware platforms (see [MOP07] for instance). Masking of sensitive values is one of the most considered solutions, and the scientific community has shown a huge interest in different aspects of masking countermeasures, e.g., [AG01,GT02,BGK04,OMPR05,NRR06,CB08,RP10,GPQ10,GPQ11,NRS11,PR11]. They have been presented to the community in an arms race against the equally evolving side-channel attacks. Implementations of most of these earlier masking schemes, while considered secure under the security model used at that time, still exhibit a detectable univariate first-order leakage which is caused by glitches in the combinatorial circuits of hardware. For instance, the schemes presented in [OMPR05] and [CB08] have later been shown to be vulnerable in [MPO05] and [MME10], respectively. Taking these glitches into account, new masking schemes have been developed that claim glitch resistance, e.g., Threshold Implementation (TI) [NRR06,NRS08,NRS11].

In the following we briefly explain the problem of glitches in hardware and give an overview of the side-channel attacks and countermeasures which are relevant to this dissertation.

2.2 Glitches in Hardware Circuits

When the inputs of a hardware gate change, the output can change its value more than once. If the inputs do not arrive at exactly the same time, the first arriving input can toggle the circuit output and the next arriving input can create further output activity. Since combinatorial functions usually consist of many underlying hardware gates, glitches in one gate propagate and cause even more glitches in the following gates, whose inputs now switch multiple times.

An illustrative example of this phenomenon is given in Fig. 2.1. As can be seen, when the input of the AES S-box changes, the circuit output toggles multiple times before reaching a stable state. Also note that the amount of activity and the critical path length appear to be data-dependent, which indicates that information can leak if the input depends on a secret.

Figure 2.1: Glitches at the AES S-box output after the input has changed from 0xaa to 0x00, to 0x55, or to 0xff.

In addition to being aware of glitches in combinatorial functions, a designer should also make sure that control signals are hazardless. Otherwise, glitches on, e.g., multiplexer lines, can create additional sources of exploitable leakage.
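The effect of unequal input arrival times can be sketched with a toy discrete-time model of a single two-input XOR gate. This is only an illustration under simplifying assumptions (an ideal zero-delay gate, unit time steps, and the hypothetical helper names `step`, `xor_gate`, and `count_toggles`); it does not model a real FPGA netlist.

```python
def step(old, new, arrival, length):
    """Signal that holds `old` and switches to `new` at time index `arrival`."""
    return [old if t < arrival else new for t in range(length)]

def xor_gate(a_wave, b_wave):
    """Ideal zero-delay XOR gate evaluated at every time step."""
    return [a ^ b for a, b in zip(a_wave, b_wave)]

def count_toggles(wave):
    """Number of output transitions, a rough proxy for dynamic power."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev != cur)

T = 6
# Both inputs change from 0 to 1, but input b arrives two time steps later.
skewed = xor_gate(step(0, 1, 1, T), step(0, 1, 3, T))
# Same logical transition with both inputs arriving at the same time.
aligned = xor_gate(step(0, 1, 1, T), step(0, 1, 1, T))

print("skewed :", skewed, "toggles:", count_toggles(skewed))
print("aligned:", aligned, "toggles:", count_toggles(aligned))
```

In the skewed case the output starts and ends at 0 but toggles twice in between, which is exactly the kind of transient activity that a register-only leakage model does not capture.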

2.3 Side-Channel Attacks

In the following, we explain the basic ideas of Differential Power Analysis (DPA), Correlation Power Analysis (CPA), template attacks, as well as the Correlation-Enhanced Collision Attack (CECA). For an in-depth mathematical explanation we refer to [MOP07] as well as the referenced original publications.

2.3.1 Differential Power Analysis (DPA)

The concept of Differential Power Analysis (DPA) was presented at CRYPTO'99 by Kocher et al. [KJJ99]. It was the first work demonstrating how the secret key of a DES implementation can be recovered by exploiting the data-dependent power consumption of a device implemented in CMOS technology, and it was a main driving factor behind the interest in research on side-channel attacks and countermeasures.

DPA aims to recover the complete secret key by employing a divide-and-conquer approach. An attacker must gather a large number of power consumption traces for variable plaintexts and a fixed secret key, and must be able to predict intermediate values of the algorithm, which requires access to either the plaintext or the ciphertext of each computation. In the case of AES, a choice for the intermediate value $V$ could be the output of a first-round S-box $i$, since it depends only on the corresponding plaintext byte $p_i$ and the unknown secret key byte $k_i$: $V_{k,i} = \mathrm{Sbox}(p_i \oplus k_i)$.

Difference of Means

For each key candidate and power trace, an attacker computes the hypothetical intermediate value and applies a selection function to sort each trace into one of two sets $S_0$ and $S_1$. As selection function, usually the value of a single bit of the intermediate value $V_{k,i}$, for example the LSB, is used. If it is 0, the trace is added to the set $S_0$, otherwise it belongs to $S_1$.

When computing the difference of means between the two sets for each key candidate, the key candidate which yields the highest difference should be the correct one. The intuition is that if a wrong key candidate is chosen, the traces are sorted based on wrongly computed intermediate values. Therefore the difference between the two sets should be close to zero. On the other hand, if the correct key candidate is used to compute the intermediate values, then all traces in set $S_0$ have a certain bit set to 0 while it is 1 in set $S_1$, and a slight difference in the power consumption traces should be visible.
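The difference-of-means procedure can be sketched on simulated data. The following is a minimal illustration, not the measurement setup of this dissertation: it assumes single-sample traces that leak the Hamming weight of the first-round S-box output plus Gaussian noise, and the names `simulate_traces`, `difference_of_means`, and `TRUE_KEY` are hypothetical.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox():
    """AES S-box: field inversion followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def simulate_traces(key_byte, count, noise_sigma, rng):
    """Assumed leakage: Hamming weight of Sbox(p XOR k) plus Gaussian noise."""
    plaintexts = [rng.randrange(256) for _ in range(count)]
    traces = [bin(SBOX[p ^ key_byte]).count("1") + rng.gauss(0, noise_sigma)
              for p in plaintexts]
    return plaintexts, traces

def difference_of_means(plaintexts, traces, key_guess):
    """Sort traces into S0/S1 by the LSB of the predicted value, compare means."""
    sums, counts = [0.0, 0.0], [0, 0]
    for p, t in zip(plaintexts, traces):
        bit = SBOX[p ^ key_guess] & 1
        sums[bit] += t
        counts[bit] += 1
    return abs(sums[1] / counts[1] - sums[0] / counts[0])

rng = random.Random(1)
TRUE_KEY = 0x2B  # hypothetical secret key byte of the simulated device
plaintexts, traces = simulate_traces(TRUE_KEY, 4000, 0.5, rng)
scores = [difference_of_means(plaintexts, traces, k) for k in range(256)]
recovered = max(range(256), key=lambda k: scores[k])
print(hex(recovered), round(scores[recovered], 2))
```

For the correct candidate the two sets differ by roughly one Hamming-weight unit in this model, while wrong candidates sort the traces almost randomly and yield a much smaller difference.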

Correlation Power Analysis (CPA)

Instead of using the difference of means between two sets of power traces, in a Correlation Power Analysis (CPA) [BCO04] the attacker computes Pearson's correlation coefficient between each point of the real power measurement traces and hypothetical power values $H$. The hypothetical power values are generated by applying a power model to the computed intermediate values $V$. The strength of the attack therefore depends on finding a suitable power model which closely approximates the real power consumption of the target platform.

Choices for a suitable power model often are the Hamming Weight (HW) of an intermediate value, the Hamming Distance (HD) of two intermediate values, e.g., if one overwrites the other in a register, and the zero value power model. The zero value model works on the assumption that a combinatorial circuit will consume the least amount of power if all input bits are zero. This is for example the case in the AES S-box, since the zero input is a special case in the inversion part of the function and is mapped to the zero output.
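A CPA with the Hamming Weight model can be sketched in the same simulated setting. The device model below (a scaled and offset Hamming weight plus Gaussian noise) and the names `pearson`, `cpa_scores`, and `SECRET_KEY_BYTE` are assumptions made for illustration only; on real hardware the quality of the result depends on how well the chosen model fits the device.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox():
    """AES S-box: field inversion followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def hamming_weight(value):
    return bin(value).count("1")

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def cpa_scores(plaintexts, measured):
    """Correlate the HW prediction H of every key guess with the measurements."""
    scores = []
    for guess in range(256):
        hypothetical = [hamming_weight(SBOX[p ^ guess]) for p in plaintexts]
        scores.append(pearson(hypothetical, measured))
    return scores

rng = random.Random(7)
SECRET_KEY_BYTE = 0xA5  # hypothetical key byte of the simulated device
plaintexts = [rng.randrange(256) for _ in range(1000)]
# Assumed device behavior: scaled and offset Hamming weight plus noise.
measured = [3.0 * hamming_weight(SBOX[p ^ SECRET_KEY_BYTE]) + 10.0 + rng.gauss(0, 2.0)
            for p in plaintexts]

scores = cpa_scores(plaintexts, measured)
best_guess = max(range(256), key=lambda k: scores[k])
print(hex(best_guess), round(scores[best_guess], 2))
```

Because the correlation coefficient is invariant to scaling and offset, the attacker does not need to know the constants 3.0 and 10.0 of the simulated device, only that the leakage is roughly proportional to the Hamming weight.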

2.3.2 Template Attacks

Template Attacks, which were first discussed in [CRR02], do not rely on a hypothetical power model. An attacker obtains a device similar to the target one, over which he has more direct control of the cryptographic operations. By setting a known key he can analyze the real-world power consumption of different operations and create power consumption templates in a profiling stage. During an attack, the measured power consumption is not compared to a hypothetical model, but to the created templates.
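The two stages can be sketched as follows. This is a strongly simplified illustration: the templates here consist only of a mean value per S-box output and are matched by least squares, whereas practical template attacks typically use multivariate Gaussian templates. The leakage function, its bit weights, and all names (`leakage`, `build_templates`, `match_key`, `TARGET_KEY`) are hypothetical.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox():
    """AES S-box: field inversion followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def leakage(value, rng):
    """Assumed device leakage, unknown to the attacker: weighted bits plus noise."""
    weights = (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
    signal = sum(w for bit, w in enumerate(weights) if (value >> bit) & 1)
    return signal + rng.gauss(0, 1.0)

def build_templates(known_key, repetitions, rng):
    """Profiling stage on the attacker-controlled device with a known key."""
    sums = [0.0] * 256
    for _ in range(repetitions):
        for p in range(256):
            v = SBOX[p ^ known_key]
            sums[v] += leakage(v, rng)
    return [s / repetitions for s in sums]

def match_key(templates, plaintexts, traces):
    """Attack stage: pick the key guess whose templates fit the traces best."""
    def score(guess):
        return -sum((t - templates[SBOX[p ^ guess]]) ** 2
                    for p, t in zip(plaintexts, traces))
    return max(range(256), key=score)

rng = random.Random(3)
templates = build_templates(known_key=0x00, repetitions=30, rng=rng)

TARGET_KEY = 0x3C  # hypothetical secret key byte of the target device
attack_plaintexts = [rng.randrange(256) for _ in range(200)]
attack_traces = [leakage(SBOX[p ^ TARGET_KEY], rng) for p in attack_plaintexts]
recovered_key = match_key(templates, attack_plaintexts, attack_traces)
print(hex(recovered_key))
```

Note that the attacker code never uses the bit weights of the leakage function; they enter only through the profiling measurements.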

2.3.3 Correlation-Enhanced Collision Attack (CECA)

The Correlation-Enhanced Collision Attack (CECA) has been introduced in [MME10]. In contrast to classical power analysis attacks, it requires neither a hypothetical power model –as in CPA–, nor an offline profiling phase –as in template attacks–. Also, unlike other collision attacks, it works well against certain masked implementations which still have some kind of first order leakage.


Similar to other collision attacks [SWP03], it recovers the differences between parts of the secret, e.g., the XOR between the key bytes in the case of AES, which finally allows an easy key recovery. Since big combinatorial circuits, e.g., AES S-boxes, are usually shared in the computation of a cipher round because of area constraints, collision attacks can compare the side-channel leakage of the same instance of the circuit at two different instants of time. Due to the bijectivity of the AES S-box, output collisions, i.e., two different S-box computations at times t1 and t2 taking the same value, require input collisions. That is, ∆ = i1 ⊕ i2 = k1 ⊕ k2 when Sbox(i1 ⊕ k1) = Sbox(i2 ⊕ k2), which is known as a linear collision attack on AES [Bog08]. The advantage of the correlation collision attack in comparison to the other collision attacks is that it tries to detect the case in which the maximum number of collisions occurs, thereby selecting the correct ∆.

In order to perform the correlation collision attack, the power values corresponding to time t1 are sorted based on the values of input byte i1 such that all traces where i1 = α are summed and averaged to an average power consumption M1(α). Hence, due to the 256 possible values that α can take for AES, we get 256 different M1(α). Repeating the same procedure for input byte i2 on power values at time t2 leads to 256 different M2(α). The attack now assumes that the power consumption of the S-box computation for two different bytes at t1 and t2 has the same leakage if the same values are processed. For a known key difference ∆ = k1 ⊕ k2, the S-box inputs are the same if i1 = i2 ⊕ ∆, and hence the average power consumptions M1(α) ≈ M2(α ⊕ ∆) should be highly similar. Such a similarity can be detected by computing the correlation between all possible (M1, M2)-pairs for all possible key differences ∆. The correlation for the correct key difference ∆ is very high, as each (M1, M2)-pair is a direct collision, while for false ∆’s the correlation is low as unrelated computations are correlated. Repeating the same scheme for different S-box computations corresponding to different input bytes recovers sufficient ∆’s to reveal the complete secret.
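The steps above can be sketched as follows. The example is a simulation under assumptions that are not part of the original description: the same circuit is modeled by one arbitrary leakage table used at both points in time, the two trace sets are already aligned, and the names `mean_traces` and `ceca` as well as all parameters are hypothetical.

```python
import numpy as np

def mean_traces(traces, byte_values):
    """M(alpha): average of all traces whose input byte equals alpha."""
    return np.stack([traces[byte_values == a].mean(axis=0) for a in range(256)])

def ceca(m1, m2):
    """Correlate M1(alpha) with M2(alpha XOR delta) for every delta and sample."""
    alphas = np.arange(256)
    a = m1 - m1.mean(axis=0)
    corr = np.zeros((256, m1.shape[1]))
    for delta in range(256):
        b = m2[alphas ^ delta]
        b = b - b.mean(axis=0)
        corr[delta] = (a * b).sum(axis=0) / np.sqrt(
            (a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return corr

# Simulated measurements of one shared S-box instance at times t1 and t2.
rng = np.random.default_rng(2)
leak = rng.normal(size=256)            # unknown leakage per S-box input value
k1, k2, n_traces, leak_sample = 0x1F, 0xB4, 50_000, 5
i1 = rng.integers(0, 256, n_traces, dtype=np.uint8)
i2 = rng.integers(0, 256, n_traces, dtype=np.uint8)
traces_t1 = rng.normal(size=(n_traces, 10))
traces_t2 = rng.normal(size=(n_traces, 10))
traces_t1[:, leak_sample] += leak[i1 ^ k1]
traces_t2[:, leak_sample] += leak[i2 ^ k2]

corr = ceca(mean_traces(traces_t1, i1), mean_traces(traces_t2, i2))
delta = int(np.argmax(corr.max(axis=1)))
print(hex(delta), hex(k1 ^ k2))
```

No power model enters the computation: only the assumption that equal S-box inputs cause equal leakage is used, which is why the sketch recovers k1 ⊕ k2 rather than the key bytes themselves.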

Since this attack compares the power consumption characteristics of two combinatorial circuits, as illustrated in [MME10], the best result is achieved by comparing the power consumption of one instance of the target combinatorial circuit, e.g., an S-box, in different clock cycles. Therefore, the best target for this attack is when only one instance of the S-box is implemented and shared in all S-box computations. If there are more instances of the S-box, e.g., a 32-bit architecture where four instances of the S-box are implemented and four S-box computations are performed in a clock cycle, the attack should compare the power consumption of each instance of the S-box in different clock cycles. If the architecture does not share any S-box for a round computation, and comparing the leakages of the same instance of the circuit in different clock cycles is not possible any more, the effectiveness of such an attack depends strongly on the similarity of power consumption characteristics of different instances of the S-box whose leakages are compared.

2.4 Side-Channel Countermeasures

The two main classes of side-channel countermeasures are hiding and masking. A good overview is given in [MOP07], and the following explanations closely follow the descriptions of that publication.

2.4.1 Hiding

In implementations applying purely hiding countermeasures, the same intermediate values as in unprotected implementations are computed. The goal of the countermeasure is to still have a power consumption which is independent of the intermediate values. In theory this could be achieved either by having the same computation consume random amounts of power in different clock cycles, or by making every possible computation consume exactly the same amount of power.

Time Domain

One practical solution is to apply hiding in the time domain. By inserting dummy clock cycles, the power consumption traces can be de-synchronized, making it very hard for an attacker to properly align the traces to attack a specific operation. Care has to be taken that the dummy operations cannot be easily distinguished from real operations, since this would render the countermeasure ineffective. Another solution is to shuffle the order of executions. Let us consider an AES design employing a single S-box to compute the whole SubBytes transformation on 16 input bytes in 16 consecutive clock cycles. If the order in which the different inputs are processed by the S-box is randomized, then an attacker does not know in which cycle the target S-box of his attack was processed, which makes performing a successful attack more difficult.
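The effect of shuffling the 16 S-box computations can be illustrated with a small simulation. It rests on assumptions that go beyond the text: each of the 16 computations contributes an independent Gaussian leakage, the noise level is arbitrary, and a fresh uniformly random permutation is drawn for every trace. Under these assumptions the correlation observed in one fixed clock cycle should shrink by roughly a factor of 16.

```python
import numpy as np

rng = np.random.default_rng(1)
n_traces, n_bytes = 100_000, 16

# Hypothetical data-dependent leakage of each of the 16 S-box computations.
leak = rng.normal(size=(n_traces, n_bytes))
noise = rng.normal(scale=0.5, size=(n_traces, n_bytes))

# Fixed order: byte j is always processed in clock cycle j.
fixed = leak + noise

# Shuffled order: a fresh random permutation of the 16 bytes for every trace.
order = rng.permuted(np.tile(np.arange(n_bytes), (n_traces, 1)), axis=1)
shuffled = np.take_along_axis(leak, order, axis=1) + noise

def corr(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# The attacker always looks at clock cycle 0 and targets byte 0.
rho_fixed = corr(leak[:, 0], fixed[:, 0])
rho_shuffled = corr(leak[:, 0], shuffled[:, 0])
ratio = rho_shuffled / rho_fixed
print(rho_fixed, rho_shuffled, ratio)
```

A common rule of thumb is that the number of required traces grows roughly with the inverse square of the correlation, so a reduction by a factor of 16 would translate into about 256 times more traces. This is only an approximation and ignores preprocessing such as integrating over several clock cycles.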

Amplitude Domain

Instead of applying hiding in the time domain, one can also try to influence the power consumption in the amplitude domain. By adding noise generators, the Signal to Noise Ratio (SNR) of the real cryptographic operations can be reduced. Also some special logic styles were proposed [PM05, TV04, TAV02] to reduce the leakage of a signal or to consume the same amount of power for any given input.


2.4.2 Masking

In designs implementing masking, the goal is to perform all computations only on masked intermediate states xm, where x denotes an unmasked intermediate value and m is a mask: xm = x ∗ m. In Boolean masking schemes –and considering the AES algorithm–, the ∗ operation is the exclusive-OR ⊕, while in multiplicative masking schemes it denotes a finite field multiplication ×.

The idea is that if a device performs operations only on masked data, which is randomized for each computation, then the data-dependent power consumption no longer leaks information about the unmasked intermediate states. Care has to be taken that all intermediate states are masked at all times, and that operations with two masked values still lead to a masked result and do not leak additional information during storage. As an example, let us consider a Boolean masked value xm which is stored in a register. If it is overwritten in a later clock cycle by another masked value ym which uses the same mask m, then the Hamming distance of the two unmasked values will leak:

HD(xm, ym) = HW(xm ⊕ ym) = HW(x ⊕ m ⊕ y ⊕ m) = HW(x ⊕ y) = HD(x, y).
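The register example can be checked exhaustively for one pair of byte values. The values x and y below are arbitrary choices for illustration; the comparison with two independent masks m and n is added here only to show the contrast and is not part of the original example.

```python
def hw(value):
    """Hamming weight of a non-negative integer."""
    return bin(value).count("1")

def hd(a, b):
    """Hamming distance of two values, i.e. the Hamming weight of their XOR."""
    return hw(a ^ b)

x, y = 0x3A, 0x3B   # arbitrary unmasked values with HD(x, y) = 1

# Same mask m on both values: the mask cancels out for every m.
same_mask = {hd(x ^ m, y ^ m) for m in range(256)}

# Independent masks m and n: the average distance no longer depends on x and y.
fresh = [hd(x ^ m, y ^ n) for m in range(256) for n in range(256)]
fresh_mean = sum(fresh) / len(fresh)

print(same_mask)    # {1}
print(fresh_mean)   # 4.0
```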

Boolean and Multiplicative Masking

In a cryptographic algorithm like AES, some circuit parts will perform linear operations, e.g., key addition or MixColumns, and some will perform non-linear operations, e.g., SubBytes. Masking a linear function L(x) is trivial using a Boolean masking scheme, since the relation L(x ⊕ m) = L(x) ⊕ L(m) holds. However, this does not work with non-linear functions NL(x), since NL(x ⊕ m) ≠ NL(x) ⊕ NL(m). A non-linear function, like the inversion part of the AES S-box, could be masked by a multiplicative scheme, since NL(x × m) = NL(x) × NL(m). Using a multiplicative masking scheme, however, has one important weakness. For all possible values of the mask, the zero input will not be masked, since multiplication with zero always yields a zero result [AG01, GT02].

There are two options for implementing a Boolean masking scheme for S-boxes. The first is to implement the S-box as a masked table lookup [GM11b, SMMG15] –using independent input masks m and output masks n–, which is recomputed with fresh masks before every computation:

yn = Sboxmasked(xm), with Sboxmasked(xm) = Sbox(xm ⊕ m) ⊕ n.

For the AES S-box, the second implementation option is using a tower-field approach [Paa94, SMTM01]. If the inversion is performed in GF(2^2) instead of GF(2^8), it becomes linear, since inversion in GF(2^2) is equivalent to squaring, which can be implemented as a bit swap [OMPR05, CB08]. Note that it has been demonstrated that naive implementations of those schemes are still susceptible to first-order SCA attacks due to glitches in the circuit (see Section 2.2).
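The masking properties discussed in this subsection can be verified numerically. The sketch below builds GF(2^8) arithmetic and the AES S-box from their standard definitions and then checks the linearity relation, the behavior of the inversion under both masking types, the zero-value problem, and the masked table lookup. Multiplication by the constant 2 is used as one example of a linear function; the function names, the example value, and the masks are arbitrary assumptions for illustration. The tower-field variant is not covered.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

# Inversion table with the AES convention that 0 is mapped to 0.
INV = [0] + [next(b for b in range(1, 256) if gf_mul(a, b) == 1)
             for a in range(1, 256)]

def affine(x):
    """Affine transformation of the AES S-box."""
    y = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i))
        y |= (bit & 1) << i
    return y

SBOX = [affine(INV[x]) for x in range(256)]
pairs = [(x, m) for x in range(256) for m in range(256)]

# Linear function: L(x XOR m) = L(x) XOR L(m) holds for all x and m.
linear_ok = all(gf_mul(2, x ^ m) == gf_mul(2, x) ^ gf_mul(2, m) for x, m in pairs)

# Inversion is not additive, but it is compatible with multiplicative masks.
additive_ok = all(INV[x ^ m] == INV[x] ^ INV[m] for x, m in pairs)
multiplicative_ok = all(INV[gf_mul(x, m)] == gf_mul(INV[x], INV[m])
                        for x, m in pairs)

# Zero-value problem: no multiplicative mask hides a zero input.
zero_unmasked = all(gf_mul(0, m) == 0 for m in range(1, 256))

def masked_table(m, n):
    """Recomputed table for fresh masks: T[a] = Sbox(a XOR m) XOR n."""
    return [SBOX[a ^ m] ^ n for a in range(256)]

x, m, n = 0x53, 0x9D, 0x27          # example value and masks
y_n = masked_table(m, n)[x ^ m]     # lookup on the masked input x_m
print(linear_ok, additive_ok, multiplicative_ok, zero_unmasked, hex(y_n ^ n))
# prints: True False True True 0xed
```

Recomputing the 256-entry table for every pair of fresh masks is what makes this option costly in practice; the sketch does not model that cost.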

Glitch-Resistant Masking

Several masking schemes have been proposed to deal with the issue of glitches in hardware. These include Threshold Implementation (TI), which utilizes Boolean masking and multi-party computation, and another scheme using Shamir’s secret sharing and multi-party computation [PR11]. Creating a TI-based implementation of the AES S-box is very challenging, and so far only proposals masking certain parts of the AES S-box exist [MPL+11, BGN+14]. Note that smaller S-boxes used in schemes other than AES, e.g., 4-bit S-boxes, can be efficiently masked using TI [BNN+12, BNN+15]. The scheme of [PR11] is presented in great detail in Section 5.2.

2.5 Contributions

The practical evaluation of different side-channel countermeasures on our targeted FPGA evaluation platform is the topic of this part of the dissertation. Since we are unable to make modifications to the hardware platform, e.g., adding analog sensors or implementing the actual hardware resources in a special logic style, we kept our focus on algorithmic countermeasures, i.e., hiding in the time domain and different masking schemes. Since hiding in the time domain mainly reduces the SNR, thereby requiring an attacker to gather a larger amount of power traces and/or use sophisticated alignment techniques, we have –with one exception in Chapter 3– avoided using hiding countermeasures when evaluating masking schemes in order to create a best case scenario for an attacker. This includes reducing the noise by making sure that only the design under test is active on the target FPGA during our practical experiments. Using the Correlation-Enhanced Collision Attack (CECA) we are also independent of a hypothetical power model and do not require a profiling stage, which leads to –in our eyes– very strong but at the same time realistic assumptions on the capabilities of an attacker.

In Chapter 3 we analyze the robustness of a Boolean masking scheme [CB08] versus the CECA. We evaluate the achievable resistance of the countermeasure either alone or in combination with time randomization or noise addition hiding techniques (shuffling, unrolling). We show that the masking scheme, while perfectly secure under the theoretical model under which it was developed, in practice exhibits an exploitable first-order leakage because of glitches in the circuit. While neither the masking nor the hiding countermeasure alone is able to provide a sufficient level of protection, combining both weak techniques leads to a considerably higher number (>1 million) of required traces to break the implementation.
As an alternative to implementing a glitch-resistant masking scheme, Chapter 4 proposes a new technique to implement the Boolean masking scheme used in Chapter 3 in a glitch-free manner. This is achieved by manually mapping the masking circuit to the available resources of the FPGA evaluation platform and avoiding any glitches by special enable signals. The resulting implementation showed no vulnerability to the CECA even using 50 million power measurements. As a comparison, we have also implemented the glitch-resistant scheme of [PR11] in Chapter 5. While the security claims have been confirmed during practical evaluation, the extremely high area demand and low throughput of the design – even in the simplest setting – hinder a practical utilization of the scheme.

The question about the validity of using AES dual ciphers as an SCA countermeasure was raised as early as 2002 [BB02a]. However, a sufficient practical evaluation of this idea has never been performed. In Chapter 6 we provide detailed reasoning why no variant of AES dual ciphers is able to provide a meaningful level of SCA resistance.


Chapter 3

Analysis of Boolean Masking & Hiding

This chapter examines the effectiveness of well-known DPA countermeasures versus the Correlation-Enhanced Collision Attack. The considered countermeasures include masking, shuffling, and noise addition, when applied in hardware. Practical evaluations, which have all been performed using power traces acquired from a SASEBO platform, show an increase in the number of required traces, e.g., from 10,000 to 1,500,000, when combining different countermeasures. This study allows for a fair comparison between the hardware countermeasures and helps to identify an appropriate key lifetime.

Contents of this Chapter

3.1 Introduction
3.2 Architecture
3.3 Evaluation of Masking & Shuffling
3.4 Evaluation of Masking & Pipelining
3.5 Conclusion

3.1 Introduction

When the Correlation-Enhanced Collision Attack (CECA) was first published at CHES 2010 [MME10], its target was an 8-bit serialized masked implementation of the AES. Since this architecture only computes one S-box per clock cycle, it provides a suitable situation for a collision attack. A question may arise about the efficiency of such an attack on different architectures and in the presence of different countermeasures. Although some notes about a 32-bit architecture have been given by [MME10], it seems necessary to have a comprehensive study on the influence of different randomizing and noise-additive countermeasures. Two different architectures are considered in this chapter. The first one, which has a 32-bit data path, in contrast to the architecture of [MME10] does not employ any and implements four S-box instances. Masking and shuffling are options which can be enabled in this architecture. The second architecture is based on the unrolling scheme proposed in [BGSD10] as a DPA countermeasure. Using this approach, we are able to execute a whole AES encryption in 10, 5, 4, and 3 clock cycles.

We investigate the efficiency of the CECA when masking, shuffling, unrolling, and their possible combinations are enabled in our target architectures. The practical evaluations are performed using power consumption traces measured from the same platform as in [MME10], i.e., a Virtex-II Pro FPGA. The result of these investigations can be summarized as follows: none of the previously mentioned countermeasures can perfectly provide resistance against the considered attack. The reasons, which are well-known to the community, are that (i) implementing masking in hardware still leads to a kind of first-order leakage caused by glitches in the circuit that is detectable by our considered attack, (ii) shuffling, which performs randomization in the time domain, is also defeated by increasing the number of traces or using a “windowing” scheme, and (iii) unrolling, which seems to have the most effect on collision-like attacks, adds noise to the measurements and is overcome by averaging, which is done inherently by the Correlation-Enhanced Collision Attack. However, enabling each (or a combination) of these countermeasures leads to an increase in the number of required traces. Depending on the target application this can be considered as an important parameter helping to define the key lifetime of the device under evaluation.

The remainder of this chapter is organized as follows: The different implemented designs and countermeasures are presented in Section 3.2. The evaluation results of the attack on the 32-bit architecture employing masking and shuffling are shown in Section 3.3, while the results of the attack on the unrolled architectures are presented in Section 3.4. A conclusion is finally given in Section 3.5. Note that the Correlation-Enhanced Collision Attack (CECA) has already been introduced in Section 2.3.3.

Results of this work were published at HOST 2011 [MMP11a] in a joint work with Amir Moradi.

3.2 Architecture

This section gives an overview of the architectures used to evaluate the countermeasures. We also describe characteristics of our implementation platform and the setup used for side-channel measurements.


Figure 3.1: Architecture of the 32-bit implementation, allowing masking as well as column-wise and S-box instance shuffling

3.2.1 32-bit Architecture

The objective of choosing a 32-bit architecture was to use a real-world scenario during our measurements. While choosing an 8-bit architecture would be the best choice from an attacker’s point of view because of the reduced amount of noise and the option to observe each processed byte independently, selecting a 32-bit architecture provides a good compromise between size and throughput while still enabling us to use countermeasures like shuffling.

32-bit architecture in this context means that all module interconnections are using a 32-bit datapath including the outside ports. Module internals are not bound to this restriction, so the ShiftRows transformation and the key scheduling are performed in a single clock cycle using an internal 128-bit datapath.

The complete datapath of the AES engine excluding the and key registers is masked. Similarly to the scheme used in [MME10], we apply the additively masked AES S-box by Canright and Batina [CB08] that uses a tower-field approach [Paa94] to implement the inversion in GF(2^2). Each S-box operation needs two mask bytes, one for the input masking and one to mask the output byte. These mask values are independent of each other, and are generated by a PRNG with uniform distribution. The general order of mask switches is as follows: at the beginning of an encryption each input byte is masked by the corresponding input mask (let us call it m for one input byte; see footnote 1). While passing through the inversion part of the S-box the input mask is replaced by the output mask n. This process is reversed by the masked MixColumns module, while in the last round, where no MixColumns operation is performed, the S-box output mask n is removed after the final AddRoundKey operation, and the final result of the AES operation is stored unmasked in the state register. An overview of this architecture is depicted in Fig. 3.1.

Since the S-boxes are shared between the normal data operation and the key schedule and four instances of the S-box are implemented in our target architecture, each round of the AES needs five clock cycles. During the first clock cycle the S-boxes are used by the key schedule unit. Since the key schedule is not masked, the input and output masks of the S-boxes are set to all zero by means of AND gates. Simultaneously, while the S-boxes are utilized by the key schedule, the ShiftRows transformation is performed both on the data state and the mask m registers. This is necessary because the data state is masked by the m masks which therefore have to be transformed as well to keep the relation between the mask bytes and the data state. In the next four clock cycles the SubBytes, MixColumns, and AddRoundkey transformations are applied on one column at a time. This allows implementing the second countermeasure, i.e., shuffling, which needs each column of the data, mask, or key registers to be selected and stored independently. It is therefore possible to switch the processing order of the columns during each encryption (we call this option column-wise shuffling). We also implemented another option to switch which byte of each column is processed by which S-box instance (we call this option instance shuffling). It should be noted that the same procedure and options have been considered for the decryption operation which shares some building blocks with the encryption unit.
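The five-cycle round structure with both shuffling options can be sketched as a simple schedule generator. This is a hypothetical model for illustration only: the text does not specify how the random orders are generated in hardware, so uniformly random permutations are assumed here, and the question of whether the order is redrawn per round or per encryption is not modeled. The function name and the tuple layout are assumptions.

```python
import random

def round_schedule(rng, column_shuffling, instance_shuffling):
    """Clock-cycle plan of one round: (cycle, task, column, S-box instance per row)."""
    columns = list(range(4))
    if column_shuffling:
        rng.shuffle(columns)            # column-wise shuffling
    # Cycle 0: S-boxes serve the key schedule, ShiftRows runs on state and masks.
    plan = [(0, "key schedule + ShiftRows", None, None)]
    for cycle, column in enumerate(columns, start=1):
        instances = list(range(4))
        if instance_shuffling:
            rng.shuffle(instances)      # instance shuffling within the column
        plan.append((cycle, "SubBytes/MixColumns/AddRoundKey", column, instances))
    return plan

rng = random.Random(2011)
for entry in round_schedule(rng, column_shuffling=True, instance_shuffling=True):
    print(entry)
```

With both options disabled the plan degenerates to the fixed order of an unprotected round, which corresponds to the behavior without shuffling.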

Footnote 1: No mask reuse is applied in a computation of a cipher round; two 128-bit masks are required for each encryption or decryption run.


Our final architecture has different options, i.e., masking, column-wise shuffling, and instance shuffling, which can be selected during the operation of the target device. Based on these options we define five different profiles, and later investigate the efficiency of each to a Correlation-Enhanced Collision Attack. These profiles are as follows:

- Profile A: no countermeasure, using always zero for all the masks and turning off both shuffling options
- Profile B: column-wise shuffling only
- Profile C: masking only
- Profile D: masking and column-wise shuffling
- Profile E: masking, column-wise shuffling and instance shuffling

3.2.2 Unrolled Architecture

In addition to the masking and shuffling schemes we have tried to examine the effectiveness of unrolling, which has been explained in [BGSD10], in counteracting Correlation-Enhanced Collision Attacks. An overview of an unrolled design is shown in Fig. 3.2. In order to reduce the required area of our unrolled architecture, we chose the very compact unmasked S-box by Canright [Can05] in an encryption-only scenario. Since the key scheduling is unrolled as well, twenty S-boxes are needed to implement each round function. We implemented four different designs, varying the number of rounds which are computed per clock cycle. In the smallest design only one complete round is computed in each clock cycle. The second design features two complete rounds, the third computes three, and the last design computes four complete rounds of the AES in each clock cycle, creating a highly glitching circuit.

Figure 3.2: Architecture of the unrolled designs

3.2.3 Target Platform and Measurement Setup

All designs have been implemented on a Xilinx Virtex-II Pro FPGA (xc2vp7) of a SASEBO circuit board which is particularly designed for side-channel attack experiments [sas]. All experiments are performed on the power consumption of the Virtex FPGA containing our implementation. Measurements are performed using a LeCroy WP715Zi 1.5 GHz oscilloscope at a sampling rate of 1 GS/s and by means of a differential probe which captures the voltage drop over a 1 Ω resistor in the VDD (1.6 V) supply path of the FPGA. In all the experiments the clock signal of our cryptographic engine is supplied by a stable oscillator at a frequency of 3 MHz.
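A few quantities follow directly from these setup parameters and are useful when reading the traces. The short calculation below uses only the sampling rate, clock frequency, and shunt resistance stated above, plus the five clock cycles per round of the 32-bit architecture from Section 3.2.1. The conversion from voltage to current assumes an ideal resistor and ignores any gain of the probe, which is not stated in the text.

```python
sample_rate_hz = 1e9     # 1 GS/s oscilloscope sampling rate
clock_hz = 3e6           # 3 MHz clock of the cryptographic engine
shunt_ohm = 1.0          # resistor in the VDD supply path
cycles_per_round = 5     # 32-bit architecture, see Section 3.2.1

# Number of samples the oscilloscope records per clock cycle and per round.
samples_per_cycle = sample_rate_hz / clock_hz
samples_per_round = cycles_per_round * samples_per_cycle

def current_ma(voltage_mv):
    """Supply current in mA for a measured voltage drop in mV (Ohm's law)."""
    return voltage_mv / shunt_ohm

print(round(samples_per_cycle, 1), round(samples_per_round, 1), current_ma(1.0))
```

Because the number of samples per clock cycle is not an integer, cycle boundaries have to be rounded when a trace is cut into clock cycles.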

3.3 Evaluation of Masking & Shuffling

The later parts of this section deal with evaluating the resistance/vulnerability of different profiles of the 32-bit architecture addressed in Section 3.2.1 to Correlation-Enhanced Collision Attacks. Note that the evaluation described in this section was performed by Amir Moradi, not the dissertation author.

3.3.1 Profile A: No Countermeasures

Performing the Correlation-Enhanced Collision Attack as described in Section 2.3.3, we start by creating two sets of 256 mean traces according to the plaintext byte values corresponding to the target S-box instances. As explained in [MME10], the best situation for a collision attack is when the side-channel leakages of an S-box instance in two different clock cycles are compared. We therefore have selected two plaintext bytes which are processed by the same S-box instance. Looking at the variance traces computed over a set of mean traces, e.g., Fig. 3.3(b), clarifies in which clock cycle the corresponding plaintext byte is processed.
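The use of a variance trace to locate the relevant clock cycle can be sketched on simulated data. The trace layout (six clock cycles of ten samples each), the leakage table, and the noise level are assumptions chosen only for illustration; `variance_of_means` is a hypothetical helper name.

```python
import numpy as np

def variance_of_means(traces, byte_values):
    """Variance, per sample point, over the 256 mean traces of one plaintext byte."""
    means = np.stack([traces[byte_values == a].mean(axis=0) for a in range(256)])
    return means.var(axis=0)

rng = np.random.default_rng(4)
n_traces, n_cycles, cycle_len = 30_000, 6, 10
target_cycle = 2                       # cycle in which the byte is processed
leak = rng.normal(size=256)            # hypothetical data-dependent leakage
plaintext_byte = rng.integers(0, 256, n_traces, dtype=np.uint8)

traces = rng.normal(size=(n_traces, n_cycles * cycle_len))
traces[:, target_cycle * cycle_len + 4] += leak[plaintext_byte]

var_trace = variance_of_means(traces, plaintext_byte)
found_cycle = int(np.argmax(var_trace)) // cycle_len
print(found_cycle)
```

Sample points that do not depend on the selected plaintext byte average out to nearly identical mean traces, so their variance stays close to zero, while the processing cycle stands out.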


Figure 3.3: Profile A: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), result of a CECA (c) using 1,000,000 traces and (d) over the number of traces

In order to perform the attack, aiming at recovering the relation between two key bytes, the mean traces must first be aligned to have both S-box executions – virtually – at the same instant of time. Though 1,000,000 traces have been used for the attack result depicted in Fig. 3.3(c), plotting the result over the number of traces, i.e., Fig. 3.3(d), shows that for this case even 10,000 traces are sufficient to mount a successful attack.

3.3.2 Profile B: Column-wise Shuffling Only

When column-wise shuffling is enabled, the target S-box computation does not take place in a specific clock cycle, but the computation will always be performed by the same S-box instance. Therefore, as can be seen in Fig. 3.4(b), the variance over the mean traces shows high values in all four clock cycles in which the S-boxes are computed. This can in fact be considered as evidence of the existing time-randomization countermeasure. Without taking this countermeasure into account and just performing the last attack (see footnote 2), as depicted in Fig. 3.4(c), the correct difference between the target key bytes is still detectable and appears in all four mentioned clock cycles. However, it requires a higher number of traces, i.e., 50,000, since on average only one fourth of the mean traces are aligned to each other.

Footnote 2: It is not necessary to shift the mean traces and align them in this case.

Figure 3.4: Profile B: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), result of a CECA (c) using 1,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

Figure 3.5: Profile C: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), result of a CECA (c) using 5,000,000 traces and (d) over the number of traces

As mentioned in [MME10], one can divide a trace into clock cycles and sum them up to have a sum trace with a length of one clock cycle, which is known as “windowing” (integration over a sliding comb) [CCD00]. Doing so on the traces of this profile, considering only those four clock cycles where the SubBytes transformations of the first round are performed, guarantees that the mean traces, which now are as long as a clock cycle, are aligned and contain the desired information. Performing the same attack on the combined traces reveals the correct secret and decreases the required number of traces to 20,000, as depicted in Fig. 3.4(e) and Fig. 3.4(f).
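The windowing step described above can be sketched as follows. The simulation is a simplified model under several assumptions: each of the four column computations contributes an independent Gaussian leakage at a fixed offset within its clock cycle, the processing order is a fresh random permutation per trace, and a clock cycle is assumed to span exactly eight samples. The function name `window` and all parameters are hypothetical.

```python
import numpy as np

def window(traces, cycle_starts, cycle_len):
    """Sum the selected clock cycles sample by sample into one cycle-long trace."""
    folded = np.zeros((traces.shape[0], cycle_len))
    for start in cycle_starts:
        folded += traces[:, start:start + cycle_len]
    return folded

rng = np.random.default_rng(3)
n_traces, cycle_len, n_cycles, offset = 50_000, 8, 4, 3

# Leakage of the four column computations and their shuffled processing order.
leak = rng.normal(size=(n_traces, n_cycles))
order = rng.permuted(np.tile(np.arange(n_cycles), (n_traces, 1)), axis=1)

traces = rng.normal(scale=0.5, size=(n_traces, n_cycles * cycle_len))
for cycle in range(n_cycles):
    traces[:, cycle * cycle_len + offset] += leak[np.arange(n_traces), order[:, cycle]]

folded = window(traces, [c * cycle_len for c in range(n_cycles)], cycle_len)

# Correlation of the target column's leakage with one cycle and with the sum trace.
rho_single = np.corrcoef(leak[:, 0], traces[:, offset])[0, 1]
rho_folded = np.corrcoef(leak[:, 0], folded[:, offset])[0, 1]
print(rho_single, rho_folded)
```

After folding, the target leakage is present in every trace at the same position, at the price of also accumulating the noise and the leakage of the other three cycles. In this toy model the correlation still improves, which is in line with the reduced number of traces reported for the windowed attack.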

3.3.3 Profile C: Masking Only

While column-wise shuffling had low area and power overheads, implementing the masked S-boxes (as explained in Section 3.2.1) needs significantly more area and leads to a high power consumption. This can be seen when comparing the sample power traces of these two architectures in Fig. 3.4(a) and Fig. 3.5(a).

Since no shuffling is enabled in this profile, the mean traces must be aligned according to the clock cycles reported by the variance traces, e.g., Fig. 3.5(b). It should be noted that, as expressed in [MME10] and as can be seen in the variance trace, the masked S-box implementation still has a first-order leakage which is due to the glitches that occur in the combinatorial functions. The fact that the variance is lower than that of the previous profiles implies that a higher number of traces is necessary to reveal the correct key relation. The result of the attack using 5,000,000 traces is shown in Fig. 3.5(c), but based on Fig. 3.5(d) 150,000 measurements are sufficient to reveal the desired secret.


Figure 3.6: Profile D: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

3.3.4 Profile D: Masking and Column-wise Shuffling

An implementation that combines both of the previously applied countermeasures proves to be considerably more resistant against the CECA. Observing the variance traces, e.g., Fig. 3.6(b), shows that the dependency of the mean traces on the plaintext byte values is decreased because of the used masking scheme, and is spread over four clock cycles because of the column-wise shuffling. Fig. 3.6(c) shows the result of the attack using 10,000,000 traces. As expected after comparing the variance traces to those of the previous profiles, Fig. 3.6(d) reports around 4,500,000 as the number of traces we required, which is considerably higher than in the previous cases. If the same windowing scheme is used to overcome the time-randomization effect of the shuffling, the number of traces decreases to 700,000 as depicted in Fig. 3.6(e) and Fig. 3.6(f).

Figure 3.7: Profile E: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

3.3.5 Profile E: Masking, Column-wise and Instance Shuffling

As stated before, the Correlation-Enhanced Collision Attack works best if the target plaintext bytes are processed by the same S-box instance. Randomizing which of the four S-box instances computes which byte of a column (called instance shuffling in Section 3.2.1) should further increase the resistance of the implementation. This is confirmed by comparing a variance trace over the mean traces of this profile (Fig. 3.7(b)) with that of profile D. The results of the attack using 10,000,000 traces, both with and without windowing, in addition to their required number of traces, are shown in Fig. 3.7. While around 5,500,000 traces are necessary to distinguish the correct guess when performing the considered attack in a straightforward manner, employing windowing reduces this number to 1,500,000, which is significantly higher than for the previous profiles.
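The trace counts reported in Sections 3.3.1 to 3.3.5 can be put side by side with a short calculation. The numbers are taken from the text above; the dictionary layout and the choice of expressing each profile relative to Profile A, using the cheaper of the two attack variants, are merely one possible way to summarize them.

```python
# Required traces as reported in Sections 3.3.1 to 3.3.5 (None: not reported).
required = {
    "A": {"straightforward": 10_000, "windowing": None},
    "B": {"straightforward": 50_000, "windowing": 20_000},
    "C": {"straightforward": 150_000, "windowing": None},
    "D": {"straightforward": 4_500_000, "windowing": 700_000},
    "E": {"straightforward": 5_500_000, "windowing": 1_500_000},
}

def best_attack(profile):
    """Smallest reported number of traces over the available attack variants."""
    return min(c for c in required[profile].values() if c is not None)

baseline = best_attack("A")
factors = {profile: best_attack(profile) / baseline for profile in required}

for profile, factor in factors.items():
    print(f"Profile {profile}: {best_attack(profile):>9,} traces, factor {factor:g}")
```

Seen this way, combining masking with both shuffling options raises the effort by a factor of 150 compared to the unprotected profile, which matches the increase from 10,000 to 1,500,000 traces stated at the beginning of this chapter.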

3.4 Evaluation of Masking & Pipelining

The same attacks which have been carried out on the 32-bit architecture are performed on the traces measured from the unrolled implementations. Since in our smallest unrolled architecture one round of the cipher encryption is performed per clock cycle, there is no shared hardware unit during the computation of a round. Therefore, one cannot compare the side-channel leakage of a unit in different clock cycles, and needs to consider different S-box instances to perform a collision attack. This, of course, decreases the efficiency of the attack since two different circuits are compared which, even with the same netlist, are differently placed and routed by the hardware design tool. Moreover, the switching noise level generated by the other parts of the circuit, e.g., S-boxes, which are not considered in the attack is considerably higher than in the case of the 32-bit architecture. We therefore expected a higher number of required traces and collected more traces compared to the 32-bit cases. The results of this attack on different unrolled implementations are given in this section. Again, the practical side-channel measurements in this section have been performed by Amir Moradi.

3.4.1 One Round per Clock Cycle

Observing a sample power trace of a whole encryption run by our smallest unrolled implementation, depicted in Fig. 3.8(a), verifies the execution of one round per clock cycle.³ As before, two S-box instances and hence their corresponding plaintext bytes have been selected to build two sets of 256 mean traces. Since both selected S-boxes are evaluated in the same clock cycle, their corresponding traces are already aligned, and we do not need to shift the mean traces when performing the Correlation-Enhanced Collision Attack. Fig. 3.8(c) shows the attack result using 1,000,000 traces. As depicted in Fig. 3.8(d), the attack is still successful using 100,000 measurements. We should emphasize that the side-channel leakage of the 16 implemented S-box instances differs because their placement and routing are similar to varying degrees. Therefore, the efficiency of the attack also varies with the selected S-box instances. The result shown in Fig. 3.8 is one of the best cases.

³ In Fig. 3.8(a), 11 clock cycles with high power consumption are detectable. The last one occurs when the ciphertext is saved into the state register and appears at the input of the combinatorial circuit again.

Figure 3.8: One round unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 1,000,000 traces, and (d) over the number of traces
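The collision step described above (two sets of 256 mean traces, correlated for every hypothesis on the key difference) can be sketched on simulated data. This is a minimal illustration and not the evaluation code of this work: the single-sample Hamming-weight leakage, the noise level, the random permutation standing in for the AES S-box, and all function names are assumptions made for the example.

```python
import random


def hamming_weight(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mean_traces(sbox, key_byte, n_per_value, sigma, rng):
    """One averaged (single-sample) trace per plaintext byte value."""
    means = []
    for p in range(256):
        signal = hamming_weight(sbox[p ^ key_byte])   # assumed leakage model
        samples = [signal + rng.gauss(0.0, sigma) for _ in range(n_per_value)]
        means.append(sum(samples) / n_per_value)
    return means


def ceca(mean_a, mean_b):
    """Correlate mean_a[p] with mean_b[p ^ d] for every key difference d."""
    scores = [pearson(mean_a, [mean_b[p ^ d] for p in range(256)])
              for d in range(256)]
    best = max(range(256), key=scores.__getitem__)
    return best, scores


rng = random.Random(1)
sbox = list(range(256))
rng.shuffle(sbox)                 # stand-in permutation, NOT the AES S-box
k_a, k_b = 0x3C, 0xA5             # hypothetical key bytes of the two S-box inputs
mean_a = mean_traces(sbox, k_a, 100, 1.0, rng)
mean_b = mean_traces(sbox, k_b, 100, 1.0, rng)
best, scores = ceca(mean_a, mean_b)
```

With real measurements each mean trace has many sample points and the correlation is computed per sample point; the sketch only shows why the correct difference k_a XOR k_b stands out without any hypothetical power model being used by the attacker.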

3.4.2 Two Rounds per Clock Cycle

When two rounds of the cipher are unrolled, the whole encryption is performed in 5 clock cycles, as can be seen in Fig. 3.9(a). Comparing the variance traces of one and two unrolled rounds, shown in Fig. 3.8(b) and Fig. 3.9(b) respectively, reveals a significant increase of the noise level. We repeated the same attack procedure as before on 7,500,000 traces collected from the two-round unrolled implementation. This led to the result shown in Fig. 3.9(c), which is among the most successful cases. Fig. 3.9(d) reports around 300,000 as the number of traces required to recover the correct relation between the selected key bytes. Although the evaluation of the later rounds with known inputs when more than one round is unrolled was included in the original proposal of the unrolling countermeasure [BGSD10], we have not reported the results of the corresponding collision attacks because of their similarity to the case of attacking the first round. Moreover, knowing the input of, e.g., the second round of the AES reveals, contrary to the DES case, all the secrets used in the first round, so one does not need to perform the attack on the second round.

Figure 3.9: Two rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 7,500,000 traces, and (d) over the number of traces

3.4.3 Three Rounds per Clock Cycle

Fig. 3.10 shows the results of a similar attack on the first round when three unrolled rounds are implemented and the whole encryption process is completed in 4 clock cycles. As a reference, Fig. 3.10(a) and Fig. 3.10(b) show a sample power trace of this implementation and a variance trace over a set of 256 mean traces. Given the low variance (Fig. 3.10(b)) and the unclear distinguishability of the correct hypothesis among the others (Fig. 3.10(c)), a high number of required traces is expected, e.g., around 3,000,000 as shown in Fig. 3.10(d).

Figure 3.10: Three rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 7,500,000 traces, and (d) over the number of traces

3.4.4 Four Rounds per Clock Cycle

In our last unrolled architecture, where four unrolled encryption rounds are implemented, every encryption run needs 3 clock cycles (see Fig. 3.11(a) as an example), and the switching noise has the highest level of all examined architectures (see Fig. 3.11(b) for a variance trace over the mean traces). The results of the attack, which are shown in Fig. 3.11, are practical evidence that the CECA defeats unrolling as a countermeasure using around 3,500,000 traces. Although the required number of traces is higher than in the previous cases, the correlation collision attack employs mean traces, so increasing the number of traces averages out the switching noise and finally recovers the relation between the key bytes.

3.5 Conclusion

The results of the Correlation-Enhanced Collision Attack (CECA) on different hardware countermeasures have been presented. We have chosen this attack scheme for our investigations since no hypothetical power model is required and its efficiency does not rely on the leakage model of the target device, which allows for a fair comparison. It is not a surprise that no countermeasure on its own is able to prevent the attacks, since even (theoretically) perfectly Boolean masked implementations still exhibit a slight first-order leakage in practice due to glitches in the circuit. Similarly, time randomization or noise addition countermeasures, which diminish the SNR, can be overcome by increasing the number of measurements.

Figure 3.11: Four rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 30,000,000 traces, and (d) over the number of traces

However, when different countermeasures are combined, it is possible to significantly strengthen the resistance against DPA attacks. Applying all implemented countermeasures of the 32-bit architecture, the number of traces necessary for a successful CECA can be increased from 10,000 to 1,500,000. If area constraints are not an issue (e.g., unused FPGA resources are available), unrolling can further increase the resistance. Increasing the number of rounds per clock cycle from one to four increases the number of required traces from 100,000 to 3,500,000. The combination of unrolling and masking has not been considered because of the rather large and impractical area requirements.

Since our implementation platform has been specifically designed for side-channel purposes, with an appropriate measurement setup and a well-defined trigger point, even more measurements can be expected to be required in real-world scenarios, especially when the crypto cores are not the only circuits computing at a given time. However, knowing the number of traces required for an attack on an implementation in a low-noise environment helps in choosing appropriate key lifetimes to further protect the secrets.


Chapter 4

Glitch-free Implementation of Masking

Due to the propagation of glitches in combinatorial circuits, the side-channel leakage of most masked S-boxes realized in hardware is a known issue. Our contribution in this chapter is to adapt a masked AES S-box circuit to the FPGA resources in order to avoid the glitches. Our design is suitable for the 5, 6, and 7 FPGA series of Xilinx, although our practical investigations are performed using a Virtex-5 chip. In short, compared to the original design synthesized by automatic tools, our design requires the same area (slice count) while reducing power consumption, critical path delay, and, more importantly, the side-channel leakage. In our practical investigations we could not recover any first-order leakage of our design using up to 50 million traces. However, since the targeted S-box realizes a first-order Boolean masking, the second-order leakage could be revealed using around 25 million measurements.

Contents of this Chapter
4.1 Introduction
4.2 Preliminaries
4.3 Design
4.4 Evaluation
4.5 Conclusion

4.1 Introduction

Contrary to the goals of Threshold Implementation (TI) [NRS11] and the scheme of [PR11], in this work we do not try to create a glitch-resistant implementation but instead aim to avoid causing any glitches at all. The target of our implementation is the Virtex-5 LX-50 FPGA of the readily available side-channel evaluation platform SASEBO-GII [sas].


For this we take the very compact masked S-box by Canright-Batina [CB08] and manually map the combinatorial functions to the resources of our target platform. By efficiently using special enable signals in each FPGA Look-Up Table (LUT), we can suppress any glitches at the LUT outputs by enabling them only sequentially. We have evaluated different versions of our design, including a fully pipelined one achieving a very high clock frequency. Note that although our design was initially optimized for the 6-input LUT architecture of the Xilinx Virtex-5 FPGA, the same architecture is used in the newer Series 6 and 7 FPGAs, which allows using the same design on these recent platforms.

When evaluating the side-channel leakage of our final design, contrary to the original S-box implementation, our design did not show any first-order leakage when analyzing 50 million measurements. Since the scheme only implements a first-order masking, a second-order attack is expected to be successful, which is practically confirmed using a very high number of 25 million measurements.

In the next section we briefly describe the reasons why we have selected the Canright-Batina masked S-box as the basis of our implementation. Moreover, we introduce the Xilinx LUT architecture and how we have used it to eliminate glitches. Section 4.3 gives an overview of our S-box design and names the implementation profiles used in the evaluation, whose results are presented in Section 4.4. Finally, Section 4.5 concludes this chapter.

Results from this chapter were generated as joint work with Amir Moradi, who led the SCA evaluation, and were accepted to HOST in 2012 [MM12a].

4.2 Preliminaries

In the following we will first give a short summary of the recent masked S-box designs and state why we have chosen the one of Canright and Batina as basis for our modifications to create a glitch-free version. Afterwards we will describe the architecture of the Xilinx 6-Input LUT and how we use it to minimize the possible leakage.

4.2.1 Masked AES S-box

As stated previously, the currently known glitch-resistant schemes come with some drawbacks. Threshold implementation has been shown to be quite effective when using small S-boxes [PMK+11], but because of the large size of the AES S-box, no expressions have been found so far to rewrite it using this scheme. Note that the implementation reported in [MPL+11] was made by masking the multipliers of a tower-field implementation of the AES S-box, which could not follow the requirements of the threshold implementation. At CHES 2011 a mixture of Shamir's secret sharing scheme and multi-party computation was introduced [PR11]. Unfortunately, its hardware resource requirements are quite high (see Chapter 5). Furthermore, because of the sequential way of computing the inversion of the S-box, a large number of clock cycles are necessary to compute only one S-box output. These area and time overheads may hinder its practical feasibility.

Figure 4.1: Masked GF(2^8) Inverter by Canright-Batina (taken from [MME10])

Instead of focusing on glitch resistance, in this work we try to avoid any glitches at the FPGA LUTs altogether. Among the more traditional masking schemes currently known, the one of Canright-Batina [CB08] uses an additive masking and implements the S-box in a tower-field approach using carefully chosen normal bases to minimize the circuit size. It is based on the area-optimized S-box by Canright [Can05], and it is still considered to be the most compact design available. While it was claimed to be perfectly secure by the definition of [BGK04], it was shown in [MME10] that because of glitches in the circuit there still exists an exploitable first-order leakage. Figure 4.1 shows an overview of the GF(2^8) inverter design, omitting the tower-field conversions. The GF(2^4) inverter is implemented using the same design, the only difference being that the inversion in GF(2^2) is also merged into this module. The authors of the original design were kind enough to supply the HDL source code online,¹ which we used as the basis for our modifications detailed in the following.

¹ http://faculty.nps.edu/drcanrig/pub/index.html

Figure 4.2: Two possible LUTs in Virtex-5: (a) 6-input LUT, (b) 32-bit shift register [Xil09]

4.2.2 Xilinx FPGA Resources

When not using dedicated hardware blocks like multipliers/DSPs, a combinatorial logic circuit in an FPGA is usually implemented by means of many-to-one Look-Up Tables. In general, a LUT consists of a number of single-bit storage elements whose values are initialized by the bitstream during the configuration of the FPGA. The inputs of the LUT control the setting of internal multiplexers, thereby choosing which stored bit value is available at the output of the LUT. As an example, the 6-to-1 LUT of the Xilinx Series 5, 6, and 7 FPGAs is realized as two 5-to-1 LUTs and a multiplexer, as can be seen in Fig. 4.2(a). Each of these 5-to-1 LUTs can in turn be seen as two 4-to-1 LUTs and a multiplexer, and so on.

In our device under test, the Xilinx Virtex-5 LX50 FPGA mounted on a SASEBO-GII board, each slice consists of four LUT6 and four single-bit flip-flops. The LUT6, as depicted in Fig. 4.2(a), can be hard-instanced in two different configurations. As a LUT6_1, any combinatorial function having up to 6 input signals and one output signal can be implemented. Using the LUT in a LUT5_2 configuration allows providing two output signals from the 5 inputs, but only if these 5 inputs are the same for both internal 5-to-1 LUTs, i.e., the inputs must be shared.


Glitches at the output of a LUT happen because the input signals arrive at different points in time, depending on the routing in the device. In order to avoid this, the output of the LUT must be held stable until all input signals have arrived. We achieve this by using one of the input signals as an active-low enable signal, i.e., in our case, as long as this input signal is set to logic '1', the LUT output will always be logic '0' regardless of the values of the other input signals. It is important to carefully select which LUT input is used as the enable signal. Let us consider choosing the input I5 in Fig. 4.2(a) as the enable signal. While the output of the LUT6 will indeed not change during the transition period of the other input signals, there will still be glitches at the output of one of the internal LUT5 instances. We therefore have to choose the input signal which controls the very first multiplexer stage, so that toggles at the select signals of the following multiplexers do not cause any glitches.

Figure 4.3: Design of our full-custom optimized S-box (inversion part only)


Although the details of the internal architecture of the FPGA resources are not publicly available, this input signal can be identified by looking at the architecture of the SRLC32E depicted in Fig. 4.2(b). This is a special mode of operation for LUTs in some slices of Xilinx FPGAs that realizes a shift register. In this mode the content of the LUT storage cells can be changed serially during the operation of the FPGA. By using the LUT inputs as select lines, the length of the shift register can be set dynamically. Since the all-zero input sets the length to 1 bit, and switching the I0 input signal to logic '1' increases the length to 2 bits, i.e., selects the neighboring cell, the I0 signal must control the very first multiplexer stage. Therefore, I0 is the correct choice for the enable signal. Note that the synthesizer normally permutes the LUT input signals (and changes the LUT configuration accordingly) to optimize the routing, so the pin positions of the hard-instanced LUTs have to be locked by special constraints [Xil08].
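The choice of I0 as the enable input can be illustrated with a small software model of a 6-input LUT. The sketch below is only an illustration under stated assumptions: it assumes that the LUT content is indexed by the inputs read as the binary number I5..I0 with I0 as the least significant bit, it models the first multiplexer stage as 32 two-to-one multiplexers selected by I0 (which is what the argument above deduces, not a documented fact), and the helper names and the XOR payload function are hypothetical.

```python
def gated_init(func):
    """Build a 64-bit LUT content for O = func(I5..I1) AND NOT I0."""
    init = 0
    for addr in range(64):      # addr = inputs I5..I0 read as a binary number
        enable_n = addr & 1     # I0: active-low enable
        data = addr >> 1        # I5..I1: the five data inputs
        if enable_n == 0 and func(data):
            init |= 1 << addr
    return init


def first_stage(init, i0):
    """Outputs of the 32 first-stage multiplexers, assumed to be selected by I0."""
    return [(init >> (2 * pair + i0)) & 1 for pair in range(32)]


def lut6(init, addr):
    """Functional view: the addressed storage bit appears at the LUT output."""
    return (init >> addr) & 1


def xor5(data):
    """Example payload: parity of the five data inputs."""
    return bin(data).count("1") & 1


init = gated_init(xor5)
disabled = first_stage(init, 1)   # enable input at '1': LUT switched off
enabled = first_stage(init, 0)    # enable input at '0': LUT evaluates xor5
```

In this model every first-stage multiplexer outputs '0' while I0 is '1', so the later multiplexer stages only ever choose between zeros and their select inputs can toggle without any effect on internal nodes or on the output.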

4.3 Design

The detailed structure of our design is given in Fig. 4.3. Omitting the tower-field conversion, 15 LUT stages are required to perform the full inversion in GF(2^8). We give performance figures for 6 different implementation profiles: from the original unmodified design to our optimized one, with or without pipelining stages, and with or without the special enable signals that minimize glitches in the circuit. The implementation profiles of the S-box are as follows:

(1) The original HDL code optimized by the ISE synthesizer
(2) The original HDL but avoiding any optimizations or trimming by the synthesizer, i.e., one LUT per gate to keep all hierarchy levels
(3) Our modified design using hard-instanced LUTs, all enable signals always '0', no pipeline registers
(4) Our modified design without pipelining but activating each stage sequentially by the enable signals
(5) Our modified design using pipelining to hinder glitch propagation, but all enable signals always '0'
(6) Our modified design using both pipelining to hinder the glitch propagation and the enable signals to avoid glitches in the circuit

In Profiles 1, 2, and 3 the implementations are purely combinatorial functions where the full S-box is computed at once in each clock cycle. Glitches in the first LUT stage are therefore passed through the whole S-box, generating a highly glitching circuit until all signals become stable. Since all three profiles share this behavior, we do not consider Profiles 1 and 2 in our side-channel evaluations (Section 4.4), but restrict the presentation of results to Profile 3.


Profile 4 avoids this issue. Here only one LUT stage is activated in each clock cycle, thereby not only hindering the propagation of glitches, but also not causing any glitches at all. That is because the input signals of the next LUT stage are already stable when that stage is activated in the following clock cycle. The downside of this profile is its apparent impracticality. One needs 15 clock cycles to compute a single S-box output while the inputs must be held stable. To make matters worse, one would need to spend another 15 clock cycles to deactivate each stage in reverse order before the next S-box computation can begin. In Profile 5 the pipelining stages hinder the glitch propagation. On the other hand, since all enable signals are kept at '0', glitches will still occur at the LUT outputs of each stage.

Finally, in the last profile, Profile 6, we combine both the pipelining to avoid any glitch propagation and the use of the active-low enable signals to completely shut down glitches at the LUT outputs. In order to reach our goal in a straightforward way one would need to i) first disable all LUTs, ii) clock every second pipelining register after enabling its corresponding LUTs, iii) disable all LUTs again, iv) clock the other half of the pipelining registers with their corresponding LUTs enabled, and so on. This means that a new S-box input can be fed into the circuit only every four clock cycles, and it leads to a latency of 30 clock cycles from input to output. This is necessary because if one simply merged clocking every second register and disabling the connected LUT stage at the same time, the routing of the signals would determine whether the disable signal arrives at the LUT first or whether other inputs arrive earlier, which in the latter case would cause glitches at the LUT output.

To avoid this issue we can use the special way the clock signal is routed in the FPGA. The clock is routed on special dedicated paths to each switch box separately to avoid race problems in synchronous circuits. However, the LUT output signals first need to go back to the corresponding slice's switch box and from there travel to the destination LUT inputs, where more switch boxes might be passed. As a result, a transition, e.g., low-to-high, on the clock signal arrives at the registers and LUTs of each slice earlier than the other signals. Therefore, by tying our active-low LUT enable signals to the clock signal, the LUT gets deactivated at each rising clock edge before the new inputs arrive. At the falling edge of the clock the LUT becomes active and provides the output signal to the next flip-flop stage, where it will be stored at the next rising edge. This way the pipelining registers can be active in every clock cycle and no glitches will occur. Please note that the clock period in this case cannot be shorter than twice the longest critical path delay of the S-box circuit, i.e., the maximum clock frequency is halved compared to Profile 5. In order to provide a better understanding, Fig. 4.4 showcases the different signal timings. Also, the performance results of each implementation profile for only the inversion module of the S-box are given in Table 4.1.
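The clock-tied enable can be pictured with a coarse two-phase model of a single LUT and flip-flop stage. This is only a behavioral sketch under simplifying assumptions (ideal edges, inputs that settle within the clock-high phase, hypothetical function names), not a timing simulation of the FPGA.

```python
def simulate_stage(func, inputs_per_cycle):
    """Return the LUT output in both clock phases and the values the flip-flop stores."""
    high_phase = []   # LUT output while clk = enable = '1' (LUT switched off)
    low_phase = []    # LUT output while clk = enable = '0' (LUT evaluates)
    stored = []       # value captured by the following flip-flop at each rising edge
    lut_output = 0
    for data in inputs_per_cycle:
        stored.append(lut_output)   # rising edge: register samples the previous result
        high_phase.append(0)        # LUT is forced to '0' while the new inputs settle
        lut_output = func(data)     # falling edge: inputs are stable, output changes once
        low_phase.append(lut_output)
    return high_phase, low_phase, stored


def gated_max_frequency_mhz(critical_path_ns):
    """The clock period has to cover two critical-path delays, one per clock phase."""
    return 1000.0 / (2.0 * critical_path_ns)


def and2(data):
    """Example stage function: AND of the two least significant input bits."""
    return (data & 1) & ((data >> 1) & 1)


high, low, stored = simulate_stage(and2, [0b11, 0b01, 0b11])
f_profile6 = gated_max_frequency_mhz(1000.0 / 641.026)
```

In the model the LUT output makes at most one transition per clock phase, and the frequency helper reproduces the factor of two between a purely pipelined stage and a stage with clock-tied enables.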

Figure 4.4: Signal timings on LUT inputs and outputs (waveforms shown: clk/LUT en; LUT output alternating between '0' and output (i), output (ii); LUT data in with inputs (i), inputs (ii), inputs (iii))

Table 4.1: Synthesis results for all profiles (inversion only)

Profile  Max. Freq.    #LUTs  #FFs  Latency (#clocks)  Throughput (16 Inv./s)
1        105.519 MHz      99     0  0                   6 594 937
2         56.504 MHz     244     0  0                   3 531 500
3         88.300 MHz     100     0  0                   5 518 750
4        641.026 MHz     100     0  30                  1 335 471
5        641.026 MHz     100   649  15 (pipe'd)        20 678 258
6        320.513 MHz     100   649  15 (pipe'd)        10 339 129
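The throughput column of Table 4.1 can be cross-checked from the other columns. The cycle counts used below are an interpretation that is consistent with the listed figures, not a statement taken from the table itself: 16 clock cycles per block of 16 inversions for the combinatorial profiles, 30 cycles per inversion for Profile 4, and 16 + 15 cycles for the pipelined profiles.

```python
BLOCK = 16        # S-box inversions per SubBytes transformation
PIPE_DEPTH = 15   # register stages of the pipelined profiles

# profile: (max. frequency in Hz, assumed clock cycles per block, throughput from Table 4.1)
TABLE = {
    1: (105_519_000, BLOCK, 6_594_937),
    2: (56_504_000, BLOCK, 3_531_500),
    3: (88_300_000, BLOCK, 5_518_750),
    4: (641_026_000, 30 * BLOCK, 1_335_471),
    5: (641_026_000, BLOCK + PIPE_DEPTH, 20_678_258),
    6: (320_513_000, BLOCK + PIPE_DEPTH, 10_339_129),
}


def blocks_per_second(freq_hz, cycles_per_block):
    """Number of 16-inversion blocks completed per second at the given clock."""
    return freq_hz / cycles_per_block


computed = {profile: blocks_per_second(freq, cycles)
            for profile, (freq, cycles, _) in TABLE.items()}

for profile, (_, _, listed) in TABLE.items():
    print(profile, round(computed[profile]), listed)
```

Under these assumptions each computed value agrees with the listed throughput to within one block per second.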

4.4 Evaluation

We used a SASEBO-GII [sas] board as the target platform to examine the side-channel leakage of our designs. Different profiles of our design were implemented on the Virtex-5 (XC5VLX50) FPGA embedded on the target platform, and the power consumption traces were gathered using a LeCroy WP715Zi 1.5 GHz oscilloscope with the sampling rate set to 1 GS/s. Since the aim of some of our designs is to minimize the number of toggles in each clock cycle, and only a single S-box instance was implemented, the peak-to-peak amplitude of the signal in the power traces was very low. Therefore, we utilized a DC blocker and an amplifier while measuring the power traces by means of a passive probe with a 1 Ω resistor in the VDD path. Furthermore, a bandwidth limit of 20 MHz was set on the oscilloscope to reduce the electrical noise. All measurements were performed while our designs were driven by a 3 MHz external clock signal to avoid any overlap of power traces. Note that the actual evaluation was not performed by the dissertation author but by his co-author Amir Moradi.

In order to evaluate the resistance of our different profiles in a low-noise environment, we created an exemplary architecture in which only the AddRoundKey module and one instance of the targeted S-box exist. After the initial key addition with the already masked data, the result is given sequentially to the S-box module, one byte per clock cycle. The method we used to examine the side-channel leakage of our targeted designs is the Correlation-Enhanced Collision Attack (CECA). Since the best case for the attack is a circuit that is reused in multiple clock cycles, it suits our implemented architecture perfectly, where the targeted S-box instance is shared for all SubBytes transformations. Note that the target masked S-box [CB08] requires two different random mask bytes per input byte, i.e., one to mask the input byte and another one to mask the S-box output. We have therefore provided two different 128-bit masks for each computation of our exemplary architecture.
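The input/output contract of such a masked inversion, with one mask on the input and an independent mask on the output, can be written down as a small reference model. The sketch below is only a functional check of that contract: it unmasks internally and is therefore not a protected implementation, it uses plain exponentiation in GF(2^8) with the AES polynomial instead of the tower-field construction of [CB08], it omits the affine transformation of the S-box, and all function names are hypothetical.

```python
import random


def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(x):
    """Inversion as x^254; maps 0 to 0 as in the AES S-box."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result


def masked_inverse_reference(masked_in, mask_in, mask_out):
    """Expected masked output for a masked input (reference only, unmasks internally)."""
    return gf_inv(masked_in ^ mask_in) ^ mask_out


rng = random.Random(7)
state = [rng.randrange(256) for _ in range(16)]      # state after the key addition
mask_in = [rng.randrange(256) for _ in range(16)]    # first 128-bit mask
mask_out = [rng.randrange(256) for _ in range(16)]   # second 128-bit mask
masked_state = [s ^ m for s, m in zip(state, mask_in)]
masked_result = [masked_inverse_reference(x, m, n)
                 for x, m, n in zip(masked_state, mask_in, mask_out)]
```

Removing the output mask from masked_result yields the plain inverses of the state bytes, which is the property a testbench for a masked inverter would check.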

To have a baseline for comparison purposes, we start our evaluations by analyzing Profile 3. It corresponds to a naive implementation where glitches are not controlled and can propagate. Please note that the same is true for Profiles 1 and 2. We have, however, omitted their evaluation results since they exhibit the same side-channel leakage as Profile 3.

A sample power trace of this design is depicted in Fig. 4.5(a). The sixteen S-box computations of the SubBytes transformation can be clearly distinguished. We measured 50 000 traces and performed a CECA on two plaintext bytes which are consecutively processed by the targeted S-box instance. Fig. 4.5(b) depicts the successful attack results, and Fig. 4.5(c) demonstrates the simplicity of recovering the linear key difference if the glitches are not controlled in the circuit. As few as 5 000 measurements are sufficient to extract the secret.

The S-box design of Profile 5 was our next evaluation target. Compared to the previous profile, the difference is that, while glitches at each LUT stage still occur, their propagation is hindered by the pipeline stages. Because of the 15 register stages in the S-box, we require a total of 31 clock cycles to compute the full SubBytes transformation. An example power trace is shown in Fig. 4.6(a). Interestingly, the power consumption of Profile 5 is reduced compared to that of Profile 3 (Fig. 4.5(a)). This is due to the heavily reduced number of glitches in the S-box, which more than offsets the energy consumed by the large number of additional flip-flops.

We had to collect a significantly higher number of power traces compared to Profile 3, i.e., 20 000 000, to successfully recover the relations between key bytes. The attack results are depicted in Fig. 4.6(b), which indicate that a first-order leakage still remains. This demonstrates that controlling the propagation of the glitches is an effective method to significantly reduce the side-channel leakage. However, the leakage is not completely prevented, and we were able to exploit it using 8 000 000 power measurements (Fig. 4.6(c)).

The last design we considered for evaluation is Profile 6, where glitches in the circuit are not only prevented from propagating, but are avoided completely by a sophisticated control over the LUT enable signals. The level of power consumption of this design, as shown in Fig. 4.7(a), is basically equal to that of Profile 5. We measured 50 000 000 traces of this design, but even with this high number of measurements we were unable to perform a successful attack, as depicted in Fig. 4.7(b). This demonstrates that preventing the occurrence of glitches significantly increased the resistance of the target S-box design against first-order attacks.

Figure 4.5: Profile 3: evaluation results (a) a sample trace (voltage [mV] over time [µs]), (b) attack result using 50 000 traces, (c) over the number of traces.

Since the underlying masking scheme only realizes a first-order masking, we should be able to recover the secret key using a second-order attack. As is illustrated in [Mor12], we can employ the variance traces of our measurements instead of the mean traces and perform a similar CECA to examine the second-order moments. The result of this attack is presented in Fig. 4.7(c). While the second-order leakage can be exploited, it still requires around 25 000 000 measurements to reveal the secret (see Fig. 4.7(d)).
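The step from mean traces to variance traces can be illustrated on simulated data. The sketch below is a toy model and not the evaluation of this work: the assumed single-sample leakage HW(v XOR m) + HW(m) of a masked value and its mask, the noise level, the random permutation standing in for the S-box, and all function names are assumptions chosen so that the mean is independent of the processed value while the variance is not.

```python
import random


def hw(x):
    return bin(x).count("1")


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def moment_traces(sbox, key_byte, n_per_value, sigma, rng):
    """Per plaintext byte value: mean and variance of the simulated masked leakage."""
    means, variances = [], []
    for p in range(256):
        v = sbox[p ^ key_byte]
        samples = []
        for _ in range(n_per_value):
            m = rng.randrange(256)                    # fresh uniform mask per trace
            samples.append(hw(v ^ m) + hw(m) + rng.gauss(0.0, sigma))
        mu = sum(samples) / n_per_value
        means.append(mu)
        variances.append(sum((s - mu) ** 2 for s in samples) / n_per_value)
    return means, variances


def collision_scores(a, b):
    """Correlation of a[p] with b[p ^ d] for every key difference d."""
    return [pearson(a, [b[p ^ d] for p in range(256)]) for d in range(256)]


rng = random.Random(3)
sbox = list(range(256))
rng.shuffle(sbox)                 # stand-in permutation, NOT the AES S-box
k_a, k_b = 0x11, 0xE4             # hypothetical key bytes
mean_a, var_a = moment_traces(sbox, k_a, 400, 0.5, rng)
mean_b, var_b = moment_traces(sbox, k_b, 400, 0.5, rng)
first_order = collision_scores(mean_a, mean_b)
second_order = collision_scores(var_a, var_b)
best_second = max(range(256), key=second_order.__getitem__)
```

In this model the mean traces carry no information about the key difference, whereas the variance traces single out k_a XOR k_b, mirroring the move from a first-order to a second-order CECA.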

Figure 4.6: Profile 5: evaluation results (a) a sample trace (voltage [mV] over time [µs]), (b) attack result using 20 000 000 traces, (c) over the number of traces.

4.5 Conclusion

In this work we have taken the very compact masked S-box by Canright and Batina, which is highly optimized for ASICs, and ported it to the resources of the current Xilinx FPGA series (Virtex-5 onward) in a size-optimized manner. Compared to a design created by an automatic synthesizer, this led to the same number of LUTs and a slight decrease of the operating frequency. We could also confirm, as already pointed out in [MME10], the first-order leakage that remains in this S-box design when it is implemented in a straightforward manner. Since this leakage is caused by glitches in the circuit, we first eliminated the glitches by placing enable signals in each used LUT, so that no output is propagated while the inputs are not stable. By combining this solution with pipelining stages and exploiting the special way the clock signals are routed for the LUT enable signals, we could create an implementation which operates at a high clock frequency (320.513 MHz, see Table 4.1) while showing no first-order leakage in 50 million power consumption measurements. While not specifically focusing on this, we also achieved a quite high resistance against univariate second-order attacks. In this case, 25 million traces is the threshold after which the secrets slowly become distinguishable using the attacks of [Mor12]. It is worth comparing our results with those of a threshold implementation of AES reported in [MPL+11] and [Mor12]. Although their implementation platform differs from ours, their scheme required roughly the same number of traces to exploit the second-order leakage, while the area overhead of their design, excluding all the internal PRNGs, is much higher than that of our optimized one. In order to allow further study of our design and to use it in real applications, the HDL source code of our masked S-box design is available online at http://www.sha.rub.de/research/publications.

Figure 4.7: Profile 6: evaluation results (a) a sample trace (voltage [mV] over time [µs]), (b) attack result using 50 000 000 traces, (c) attack result on 50 000 000 squared mean-free traces, (d) over the number of traces.


Chapter 5

Implementation of a Glitch-resistant Masking Scheme

Only a few of the masking schemes proposed to protect cryptographic implementations against side-channel attacks also consider glitches in the logic circuits. One, which is based on multi-party computation protocols and utilizes Shamir’s secret sharing scheme, was presented at CHES 2011. It aims at providing security for hardware implementations, mainly of AES, against those sophisticated side-channel attacks that also take glitches into account. This chapter deals with the practical issues and relevance of the aforementioned masking scheme. Following the recommendations given in the extended version of the mentioned article, we provide a guideline on how to implement the scheme for the simplest settings. Constructing an exemplary design of the scheme reveals the impractical area requirements and low throughput, which will most likely prevent it from being used in practice.

Contents of this Chapter
5.1 Introduction
5.2 Target Scheme
5.3 Implementation & Caveats
5.4 Conclusion

5.1 Introduction

A new masking countermeasure was proposed at CHES 2011 [PR11]. Similar to TI, it is based on multi-party computation, but instead of Boolean masking it utilizes Shamir’s secret sharing scheme [Sha79]. It claims security not only against first-order attacks but also, depending on the number of shares, against higher-order multivariate ones.¹ The contribution of this chapter is to give guidelines on how to implement the scheme, thereby allowing its practical realization on a hardware platform. More details on its practicability as well as its ambiguous points were later also given in [RP12] by the original authors. In order to build an exemplary architecture of this scheme, we have chosen a parameter set based on the minimum number of shares that supposedly provides protection against any univariate attack. We address a couple of challenges on the way to its practical realization, which arise from the very high time and area overheads.

This chapter is joint work with Amir Moradi and was also published as part of [MM13b] at CHES 2013. That publication additionally contains a detailed evaluation and proposals for how the scheme can still be overcome by changing the measurement setup so that the remaining multivariate leakage is detected by a univariate attack.

5.2 Target Scheme

In [PR11] a very general description of the scheme is given. We use the most simplified settings given in the original publication, i.e., a polynomial of degree 1, and adjust it to the AES algorithm. The number of shares and accordingly the number of Players is fixed to 3, which should still provide protection against any univariate attack. Multiplication in GF(2^8) using the AES irreducible polynomial is denoted as ⊗, and finite-field addition as ⊕.

Before starting the shared operations, one needs to select 3 distinct non-zero elements, so-called public points, α1, α2, α3 in GF(2^8). Moreover, it is required to precompute the first row (λ1, λ2, λ3) of the inverse of the Vandermonde (3 × 3)-matrix (αi^j) with 1 ≤ i,j ≤ 3 as

\[
\begin{aligned}
\lambda_1 &= \alpha_2 \otimes \alpha_3 \otimes (\alpha_1 \oplus \alpha_2)^{-1} \otimes (\alpha_1 \oplus \alpha_3)^{-1} \\
\lambda_2 &= \alpha_1 \otimes \alpha_3 \otimes (\alpha_1 \oplus \alpha_2)^{-1} \otimes (\alpha_2 \oplus \alpha_3)^{-1} \\
\lambda_3 &= \alpha_1 \otimes \alpha_2 \otimes (\alpha_1 \oplus \alpha_3)^{-1} \otimes (\alpha_2 \oplus \alpha_3)^{-1},
\end{aligned}
\]

where x^{-1} denotes the multiplicative inverse of x in GF(2^8) (again using the Rijndael irreducible polynomial). These elements, α1, α2, α3 and λ1, λ2, λ3, are available to all 3 Players.

¹ A similar masking scheme using Shamir’s secret sharing with a software platform as target has also been presented at CHES 2011 [GM11a].


Sharing a secret x is done by randomly selecting a secret coefficient a and computing 3 shares x1, x2, x3 as

x1 = x ⊕ (a ⊗ α1), x2 = x ⊕ (a ⊗ α2), x3 = x ⊕ (a ⊗ α3).

Each Player i receives only one share xi without having any information about the other shares.

Reconstructing the secret x from the 3 shares x1, x2, x3 is possible knowing the pre- computed values λ1, λ2, λ3 as

x = (x1 ⊗ λ1) ⊕ (x2 ⊗ λ2) ⊕ (x3 ⊗ λ3).
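The sharing and reconstruction steps above can be sketched in a few lines of Python. This is only an illustrative software model, not our hardware design: the public points 02, 03, 04 and all function names are hypothetical choices made for this example.

```python
import secrets

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def lagrange_coefficients(alphas):
    """lambda_1..3 exactly as in the three formulas given in the text."""
    a1, a2, a3 = alphas
    l1 = gf_mul(gf_mul(a2, a3), gf_inv(gf_mul(a1 ^ a2, a1 ^ a3)))
    l2 = gf_mul(gf_mul(a1, a3), gf_inv(gf_mul(a1 ^ a2, a2 ^ a3)))
    l3 = gf_mul(gf_mul(a1, a2), gf_inv(gf_mul(a1 ^ a3, a2 ^ a3)))
    return l1, l2, l3


def share(x, alphas, a=None):
    """Compute x_i = x XOR (a * alpha_i) for a random secret coefficient a."""
    if a is None:
        a = secrets.randbelow(256)
    return tuple(x ^ gf_mul(a, alpha) for alpha in alphas)


def reconstruct(shares, lambdas):
    """Compute x = (x_1 * lambda_1) XOR (x_2 * lambda_2) XOR (x_3 * lambda_3)."""
    x = 0
    for s, lam in zip(shares, lambdas):
        x ^= gf_mul(s, lam)
    return x


ALPHAS = (0x02, 0x03, 0x04)  # hypothetical public points (distinct, non-zero)
LAMBDAS = lagrange_coefficients(ALPHAS)
```

Reconstruction works for every coefficient a because the three λ values sum to 1 while the products λi ⊗ αi sum to 0, so the masking term cancels.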

In the following we consider the essential operations required for an AES S-box computation and discuss the role of each Player. Let us suppose an (unshared) constant c and two secrets x and y, each represented by 3 shares, i.e., x1, x2, x3 and y1, y2, y3. The shares were constructed using the same public points α1, α2, α3 and the secret coefficients a and b, respectively.

Addition with a constant, i.e., z = c ⊕ x, in the shared mode can be performed by each Player as

Player 1 : z1 = x1 ⊕ c = x ⊕ (a ⊗ α1) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α1)

Player 2 : z2 = x2 ⊕ c = x ⊕ (a ⊗ α2) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α2)

Player 3 : z3 = x3 ⊕ c = x ⊕ (a ⊗ α3) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α3).

Therefore, z1, z2, z3 correctly provide the shared representation of z without requiring interaction between the Players.

Multiplication with a constant, i.e., z = c ⊗ x, c ≠ 0, can be executed in a similar way as

Player 1 : z1 = x1 ⊗ c = (x ⊕ (a ⊗ α1)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α1)

Player 2 : z2 = x2 ⊗ c = (x ⊕ (a ⊗ α2)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α2)

Player 3 : z3 = x3 ⊗ c = (x ⊕ (a ⊗ α3)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α3),

since z1, z2, z3 represent a correctly shared z if we consider a ⊗ c as the secret coefficient.


Addition of two shared secrets, i.e., z = x ⊕ y, is easily performed by

Player 1 : z1 = x1 ⊕ y1 = x ⊕ (a ⊗ α1) ⊕ y ⊕ (b ⊗ α1) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α1)

Player 2 : z2 = x2 ⊕ y2 = x ⊕ (a ⊗ α2) ⊕ y ⊕ (b ⊗ α2) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α2)

Player 3 : z3 = x3 ⊕ y3 = x ⊕ (a ⊗ α3) ⊕ y ⊕ (b ⊗ α3) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α3).

Here z1, z2, z3 provide the shared representation of z, again considering a ⊕ b as the secret coefficient.
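The three share-wise operations described so far (addition of a constant, multiplication with a constant, and addition of two shared secrets) need no communication between the Players. A minimal software sketch, again assuming the hypothetical public points 02, 03, 04 and helper names invented for this example, could look as follows.

```python
import secrets


def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


ALPHAS = (0x02, 0x03, 0x04)  # hypothetical public points


def lagrange_at_zero(alphas):
    """lambda_i = product over j != i of alpha_j * (alpha_i XOR alpha_j)^-1."""
    lambdas = []
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if j != i:
                lam = gf_mul(lam, gf_mul(aj, gf_inv(ai ^ aj)))
        lambdas.append(lam)
    return tuple(lambdas)


LAMBDAS = lagrange_at_zero(ALPHAS)


def share(x):
    a = secrets.randbelow(256)  # secret coefficient
    return tuple(x ^ gf_mul(a, alpha) for alpha in ALPHAS)


def reconstruct(shares):
    x = 0
    for s, lam in zip(shares, LAMBDAS):
        x ^= gf_mul(s, lam)
    return x


def add_constant(xs, c):
    """Each Player XORs the public constant onto its own share."""
    return tuple(xi ^ c for xi in xs)


def mul_constant(xs, c):
    """Each Player multiplies its own share by the public constant."""
    return tuple(gf_mul(xi, c) for xi in xs)


def add_shared(xs, ys):
    """Each Player XORs its two shares; the new coefficient is a XOR b."""
    return tuple(xi ^ yi for xi, yi in zip(xs, ys))
```

For example, `reconstruct(add_shared(share(x), share(y)))` returns `x ^ y` for any byte values x and y, independent of the random coefficients.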

Multiplication of two shared secrets, i.e., z = x ⊗ y, is the challenging part. If each Player computes the multiplication of two shares as

\[
\begin{aligned}
\text{Player 1}: \; t_1 &= x_1 \otimes y_1 = (x \otimes y) \oplus (((a \otimes y) \oplus (b \otimes x)) \otimes \alpha_1) \oplus (a \otimes b \otimes \alpha_1^2) \\
\text{Player 2}: \; t_2 &= x_2 \otimes y_2 = (x \otimes y) \oplus (((a \otimes y) \oplus (b \otimes x)) \otimes \alpha_2) \oplus (a \otimes b \otimes \alpha_2^2) \\
\text{Player 3}: \; t_3 &= x_3 \otimes y_3 = (x \otimes y) \oplus (((a \otimes y) \oplus (b \otimes x)) \otimes \alpha_3) \oplus (a \otimes b \otimes \alpha_3^2),
\end{aligned}
\]

t1, t2, t3 are not a correctly shared representation of z because, according to [PR11], the underlying polynomial is of a higher degree and does not have a uniform distribution. The solution given in [PR11] is as follows:

(1) Each Player i after computing ti, randomly selects a coefficient ai, remasks ti as

qi,1 = ti ⊕ (ai ⊗ α1), qi,2 = ti ⊕ (ai ⊗ α2), qi,3 = ti ⊕ (ai ⊗ α3),

and sends each qi,j with j ≠ i to the corresponding Player j.

(2) Now each Player i has three elements q1,i, q2,i, q3,i, and reconstructs zi as

zi = (q1,i ⊗ λ1) ⊕ (q2,i ⊗ λ2) ⊕ (q3,i ⊗ λ3).

Indeed, z1, z2, z3 is now a correctly shared representation of z considering (a1 ⊗ λ1) ⊕ (a2 ⊗ λ2) ⊕ (a3 ⊗ λ3) as the secret coefficient.
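Steps (1) and (2) of the shared multiplication can be modelled in software as shown below. The sketch is a functional model under stated assumptions only: the public points 02, 03, 04, the function names, and the use of Python's `secrets` module as randomness source are choices made for illustration, and the model says nothing about the timing separation of the Players that the hardware has to enforce.

```python
import secrets


def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


ALPHAS = (0x02, 0x03, 0x04)  # hypothetical public points


def lagrange_at_zero(alphas):
    """lambda_i = product over j != i of alpha_j * (alpha_i XOR alpha_j)^-1."""
    lambdas = []
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if j != i:
                lam = gf_mul(lam, gf_mul(aj, gf_inv(ai ^ aj)))
        lambdas.append(lam)
    return tuple(lambdas)


LAMBDAS = lagrange_at_zero(ALPHAS)


def share(x):
    a = secrets.randbelow(256)
    return tuple(x ^ gf_mul(a, alpha) for alpha in ALPHAS)


def reconstruct(shares):
    x = 0
    for s, lam in zip(shares, LAMBDAS):
        x ^= gf_mul(s, lam)
    return x


def shared_mul(xs, ys):
    """Return a fresh degree-1 sharing of x * y."""
    # Local products t_i = x_i * y_i (a degree-2 sharing of x * y).
    ts = [gf_mul(xi, yi) for xi, yi in zip(xs, ys)]
    # Step (1): Player i remasks t_i with a fresh a_i; q[i][j] is sent to Player j.
    q = []
    for t in ts:
        a_i = secrets.randbelow(256)
        q.append([t ^ gf_mul(a_i, alpha) for alpha in ALPHAS])
    # Step (2): Player j combines q[0][j], q[1][j], q[2][j] using lambda_1..3.
    zs = []
    for j in range(3):
        z = 0
        for i in range(3):
            z ^= gf_mul(q[i][j], LAMBDAS[i])
        zs.append(z)
    return tuple(zs)
```

Calling `reconstruct(shared_mul(share(x), share(y)))` yields `gf_mul(x, y)`.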

Square of a shared secret, i.e., z = x^2, cannot be computed in a straightforward way in contrast to what is stated in [PR11]. If each Player i squares its share xi as

\[
\begin{aligned}
\text{Player 1}: \; z_1 &= x_1^2 = x^2 \oplus (a^2 \otimes \alpha_1^2) \\
\text{Player 2}: \; z_2 &= x_2^2 = x^2 \oplus (a^2 \otimes \alpha_2^2) \\
\text{Player 3}: \; z_3 &= x_3^2 = x^2 \oplus (a^2 \otimes \alpha_3^2),
\end{aligned}
\]

z1, z2, z3 do not provide a correctly shared representation of z unless, as also stated in [GM11a], the public points α1, α2, α3 as well as λ1, λ2, λ3 are squared. If the result of the squaring z1, z2, z3 is used in later computations where other secrets shared by the original public points α1, α2, α3 are involved, z1, z2, z3 must be remasked to provide a correctly shared representation of z using the original public points. To do so a FreshMasks scheme is proposed in [GM11a]. Moreover, in [RP12], the extended version of the original scheme, a specific condition is defined for the public points to simplify the operation. In the simplest settings α1 is set to 1 and the other public points have to satisfy the conditions (α2)^2 = α3 and (α3)^2 = α2. Therefore, after each Player has squared its share, two Players must exchange their secrets, which is called reordering in [RP12].

Figure 5.1: Block diagram of sequential operations necessary for an AES S-box and a fourth of MixColumns. (a) S-box, with intermediate values x^2, x^3, x^6, x^12, x^15, x^30, x^60, x^120, x^240, x^252; (b) MixColumns, with coefficients c1, c2, c3, c4.

However, we instead realize the squaring operation by simply reusing our shared multiplication circuit. This creates a correct representation of z without requiring specific public points or a reordering operation. Admittedly, following the conditions for the public points given in [RP12] would lead to less computation overhead and higher performance compared to our solution. On the other hand, since we are implementing the scheme as a hardware circuit, reusing the multiplication reduces the area requirements and leads to a simpler control structure.
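The issue with share-wise squaring can be checked numerically. The following sketch, which assumes the hypothetical public points 02, 03, 04 and a helper `is_degree_one_sharing` invented for this example, tests whether three values lie on a common polynomial of degree 1 over a given set of public points.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


ALPHAS = (0x02, 0x03, 0x04)  # hypothetical public points


def is_degree_one_sharing(shares, alphas):
    """Return (ok, secret); ok is True if shares_i = secret XOR (a * alpha_i)
    holds for one common coefficient a."""
    s1, s2, s3 = shares
    a1, a2, a3 = alphas
    a = gf_mul(s1 ^ s2, gf_inv(a1 ^ a2))  # coefficient fixed by shares 1 and 2
    secret = s1 ^ gf_mul(a, a1)
    return s3 == secret ^ gf_mul(a, a3), secret


x, a = 0x53, 0xCA  # example secret and non-zero secret coefficient
shares = tuple(x ^ gf_mul(a, alpha) for alpha in ALPHAS)
squared_shares = tuple(gf_mul(s, s) for s in shares)
squared_points = tuple(gf_mul(alpha, alpha) for alpha in ALPHAS)

# Over the original points the squared shares are not a degree-1 sharing ...
ok_original, _ = is_degree_one_sharing(squared_shares, ALPHAS)
# ... but over the squared points they are, with secret x^2.
ok_squared, secret = is_degree_one_sharing(squared_shares, squared_points)
```

In this model `ok_original` is false and `ok_squared` is true with `secret` equal to x^2, which matches the statement that the public points have to be squared along with the shares.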


The inversion part of the AES S-box can be computed by the scheme presented in [RP10] as

\[
x^{-1} = x^{254} = \left[ \left( (x^2 \otimes x)^4 \otimes (x^2 \otimes x) \right)^{16} \otimes (x^2 \otimes x)^4 \right] \otimes x^2 .
\]

Since this scheme contains only a couple of square and multiply operations, the inversion can be realized by our shared multiplication scheme. In contrast to what is stated in both [PR11] and [GM11a], the affine transformation following the inversion cannot be performed in a straightforward manner. The reason is, as also addressed in [epr11], that the linear part of the affine transformation of the AES S-box is a linear function over GF(2), not over GF(2^8). The solution for this problem, as also stated in [RP12], is to represent the affine transformation over GF(2^8) (using the Rijndael irreducible polynomial). This has actually been presented before in [MR02] and [DR02] as

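\[
\begin{aligned}
\mathrm{Affine}(x) = \; & \mathtt{63} \oplus (\mathtt{05} \otimes x) \oplus (\mathtt{09} \otimes x^{2}) \oplus (\mathtt{f9} \otimes x^{4}) \oplus (\mathtt{25} \otimes x^{8}) \\
& \oplus (\mathtt{f4} \otimes x^{16}) \oplus (\mathtt{01} \otimes x^{32}) \oplus (\mathtt{b5} \otimes x^{64}) \oplus (\mathtt{8f} \otimes x^{128}) .
\end{aligned}
\]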
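That this polynomial over GF(2^8) indeed equals the usual bit-level affine transformation of the AES S-box can be checked exhaustively. The following sketch compares both forms for all 256 inputs; the function names are chosen for this example, and the bit-level version uses the common rotate-and-XOR formulation of the AES affine step.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def rotl8(x, s):
    """Rotate an 8-bit value left by s positions."""
    return ((x << s) | (x >> (8 - s))) & 0xFF


def affine_gf2(x):
    """AES affine transformation as a GF(2)-linear map plus the constant 63."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


# Coefficients of x, x^2, x^4, ..., x^128 in the formula above.
COEFFS = (0x05, 0x09, 0xF9, 0x25, 0xF4, 0x01, 0xB5, 0x8F)


def affine_gf256(x):
    """The same transformation as a linearized polynomial over GF(2^8)."""
    acc = 0x63
    power = x
    for c in COEFFS:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)  # next power x^(2^k)
    return acc


mismatches = [x for x in range(256) if affine_gf2(x) != affine_gf256(x)]
```

In the shared setting only `affine_gf256` is usable, since it consists solely of squarings, multiplications with constants, and additions in GF(2^8).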

Fig. 5.1(a) displays the sequence of operations for a complete S-box computation considering the secret sharing restated above. Note that the modules denoted by a black ⊗ indicate the shared multiplication, and those denoted by a gray ⊗ the multiplication with a constant.

5.3 Implementation & Caveats

In order to implement the aforementioned scheme one needs to follow the requirements addressed in [PR11]. The goal of the scheme is to separate the side-channel leakage of the computations done by each Player in order to prevent any univariate leakage. As stated in [PR11], there are two possible ways to separate the leakage. Either the circuit of each Player is realized by dedicated hardware, e.g., one FPGA per Player, which does not seem to be practical, or the operations of each Player are separated in time. We follow the second option and have tried to mount the whole of the scheme in one FPGA – with the goal of a global minimum area-overhead – by the design shown in Fig. 5.2.

By means of a dedicated and carefully designed control unit we made sure that the Players become active sequentially. In other words, no computation or activity is done by the other Players when one Player is active. The design of the shared multiplication module is slightly different from that of the other modules. In contrast to the others, where the computation on each share by the corresponding Player is independent of that of the other shares, the Players in the shared multiplication module need to communicate with each other. Therefore, we had to divide the computations of each share in this module into two parts by inserting a register between the two steps as explained in Section 5.2 (see registers marked by qi,j in Fig. 5.2).

Figure 5.2: Our design of the shared multiplication and addition to realize the AES S-box

Another important issue regarding our design is the way the multiplexers are controlled. Since the shared multiplication module needs to get different inputs in order to realize a multiplication or a square, there must be a multiplexer to switch between different inputs. That is because, considering Fig. 5.1(a), the shared multiplication module always performs a squaring except in steps 2, 5, 10, and 11. Control signals which select the appropriate multiplexer input must be hazardless.² Otherwise, as an example, glitches on the select signals of Player 1 while Player 2 is active will lead to concurrent side-channel leakage of two shares. Therefore, as a solution we provided some registers to control which input is given to the target module.

² In the area of digital logic, a dynamic hazard means undesirable transient changes in the output as a result of a single input change.


For simplicity, we first explain how the shared multiplication module works:

• In the first clock cycle, by activating the enable signal em1, the first shares of both appropriate inputs are saved into their corresponding registers, get selected by the select signal selm1, and are therefore multiplied. At the same time the remasking process using a new random a1 and the public points α1, α2, α3 is performed. Note that the results of these computations are not saved in this clock cycle.

• The same procedure as in the first clock cycle is performed on the second and the third shares, one after the other, in the second and the third clock cycles by activating the enable signals em2 and em3, respectively.

• The results of the remasking for Player 1 (provided by all 3 Players), which are available at the input of registers q1,1, q2,1, q3,1, are stored in the fourth clock cycle by enabling signal em4. Thereby, the second step of the module becomes active and performs the unmasking using λ1, λ2, λ3 to provide the first share of the multiplication output. Note that again the result is not saved in this clock cycle.

• In the next two clock cycles (fifth and sixth) the same operation as in the previous clock cycle is performed for Player 2 and Player 3 consecutively by the enable signals em5 and em6.

Note that to save x^2, x^3, and x^12 (see Fig. 5.1(a)) in the appropriate step, one of the signals es_i^2, es_i^3, and es_i^12 (i ∈ {1,2,3}) gets enabled at the same time as the corresponding emi signal. In total, we need six clock cycles to completely perform a shared multiplication or a square. Since we use only one shared multiplication module in our design, the inverse of the given shared input is computed in 6 × 11 = 66 clock cycles.

Afterwards, in order to realize the affine transformation, the multiplication-addition module (modules AFF1, AFF2, and AFF3 in Fig. 5.2) must also contribute to the computations. The Players in this module do not need to communicate, and their computation is restricted to their own shares. Therefore, by appropriately selecting selai (i ∈ {1,2,3}) and enabling the eai signal, the multiplication with a constant and the shared addition can both be done in one clock cycle per share, i.e., three clock cycles in total. Note that the same techniques as before to make the control signals hazardless are used in the design of the multiplication-addition module. Also, the sequence of operations is similar to that described for the first three clock cycles of the shared multiplication module. According to Fig. 5.1(a), during the affine transformation a multiplication-addition operation must be performed prior to each square and after the last one. Therefore, the operations of an affine transformation are completed after 3 × 8 + 6 × 7 = 66 clock cycles, resulting in 132 clock cycles in total to compute a shared S-box output. One optimization option is to perform the multiplication-addition and the first three clock cycles of the squaring at the same time to save 24 clock cycles per S-box computation.
According to the definition and the requirements of the scheme, this optimization should not cause any security loss. However, since our main goal is to practically examine the side-channel leakage of this scheme, we ignored this optimization to be able to separately localize the side-channel leakage of each operation.

Table 5.1: Area and time overhead of our design based on XC5VLX50 Virtex-5 FPGA (excluding state register, KeySchedule, PRNGs, initial masking, and final unmasking)

Design      FF # (%)      LUT # (%)       Slice # (%)   SB [CLK]   MC+ARK [CLK]   Encryption [CLK]
1 SB MC     315 (1%)      1387 (5%)       859 (12%)     2112       192            22 896
16 SB MC    4275 (15%)    21 328 (74%)    no fit        132        12             1431

Though an optimized scenario to perform MixColumns is proposed in [PR11], our presented design can also realize MixColumns and AddRoundKey by adding more multiplexers (and select registers) to the multiplication-addition module. This can be done according to the diagram given in Fig. 5.1(b) by selecting the appropriate coefficients c1, c2, c3, c4 corresponding to the rows of the matrix representation of MixColumns. After finishing all SubBytes transformations of one encryption round, i.e., 132 × 16 = 2112 clock cycles, every output byte of the MixColumns transformation together with the corresponding AddRoundKey can be computed in 3 × 4 = 12 clock cycles. That is, 12 × 16 = 192 clock cycles for the whole MixColumns and AddRoundKey transformations. In total, ignoring the time required for the initial masking of the input and the key and for (pre)computing the round keys, a whole encryption process takes 2112 × 10 + 192 × 9 + 3 × 16 = 22 896 clock cycles.³

We should stress that, except for the mentioned one, no time-optimization option exists for our single-S-box design since no more than one share is allowed to be processed at the same time. It is possible to reach a higher throughput by placing multiple, e.g., 16, instances of our design inside the target FPGA and processing all SubBytes and later all MixColumns in parallel. This, in fact, leads to a very high area overhead (see Table 5.1) that cannot even fit into the slices available in our target FPGA, which belongs to a medium-size modern series. We should emphasize that the GF(2^8) multiplier we employed here is a highly optimized and purely combinatorial circuit, and the design is made for arbitrary public values αi (i ∈ {1,2,3}) and λi.
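The cycle counts given in this section and in Table 5.1 follow from a few products and sums, which the short sketch below reproduces. The constant and function names are hypothetical; the per-operation cycle counts are the ones stated in the text.

```python
CYCLES_SHARED_MUL = 6  # two steps of three clock cycles (one per Player)
CYCLES_MUL_ADD = 3     # one clock cycle per share


def sbox_cycles():
    """Clock cycles for one shared S-box (inversion plus affine part)."""
    inversion = 11 * CYCLES_SHARED_MUL                   # 11 squarings/multiplications
    affine = 8 * CYCLES_MUL_ADD + 7 * CYCLES_SHARED_MUL  # 8 mul-adds, 7 squarings
    return inversion + affine


def encryption_cycles(instances):
    """Clock cycles for one AES-128 encryption with 1 or 16 parallel instances."""
    bytes_per_instance = 16 // instances
    sub_bytes = sbox_cycles() * bytes_per_instance
    mc_ark = 4 * CYCLES_MUL_ADD * bytes_per_instance  # MixColumns + AddRoundKey
    last_ark = CYCLES_MUL_ADD * bytes_per_instance    # final AddRoundKey only
    return 10 * sub_bytes + 9 * mc_ark + last_ark


print(sbox_cycles())          # 132
print(encryption_cycles(1))   # 22896
print(encryption_cycles(16))  # 1431
```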

³ In the last round MixColumns is skipped and each separate AddRoundKey on one shared state value takes 3 clock cycles.


5.4 Conclusion

In this work we have demonstrated how to correctly implement the provably secure glitch-resistant masking scheme of [PR11]. By making certain that at each point in time only operations on a single share are performed, there should in theory exist no exploitable univariate leakage, which was also confirmed by practical evaluations when using a low operation frequency and a basic measurement setup. For details on the evaluation we refer to [MM13b]. While the countermeasure is valuable from a theoretical point of view, its use in real-world implementations is unlikely because of the large overheads in area and performance.

Chapter 6

Are Dual Ciphers a Side-Channel Countermeasure?

While keeping the inputs and outputs of a dual cipher equal to those of the original AES, all the intermediate values and operations can be different from those of the original one. A comprehensive list of these dual ciphers is given by an article presented at ASIACRYPT 2002, where it is mentioned that they might be used as a kind of side-channel attack countermeasure if the dual cipher is randomly selected. Later, performance figures and the overhead penalty of hardware implementations of this scheme were reported in a couple of works. However, the suitability of using randomly selected dual ciphers as a power analysis countermeasure has never been thoroughly evaluated in practice. In this chapter we address the pitfalls and flaws of this scheme when used as a side-channel countermeasure. As evidence of our claims, we provide practical evaluation results based on a Virtex-5 FPGA platform. We realized a design which randomly selects between the 240 different dual ciphers at each AES computation and examined its vulnerability to SCA attack models. As a result, we show that the protection provided by the scheme is negligible considering the increased costs in terms of area and lower throughput.

Contents of this Chapter
6.1 Introduction
6.2 Dual Cipher Concept
6.3 Design
6.4 Evaluation
6.5 Conclusion


6.1 Introduction

In the early 2000s, there were attempts to better understand the algebraic specification of AES-Rijndael. One is about how to make dual ciphers which are equivalent to the original Rijndael in all aspects [BB02a]. By replacing all the constants in Rijndael, including the replacement of the irreducible polynomial, the coefficients of the MixColumns, the affine transformation in the S-box, etc., the idea is to make different ciphers which generate the same ciphertext as the original Rijndael for the given plaintext and key. As explained in [BB02a], there exist 240 non-trivial Rijndael dual ciphers, and a comprehensive list of the matrices and coefficients is given in [BB02b]. Later in [Rad04], it was shown that one can include field mappings from GF(2^8) to GF(2)^8 as well as intermediate isomorphic mappings to GF(2^2) and GF(2^4) to build 61 200 similar Rijndael dual ciphers.

This idea was taken up by the authors of [WLL04], who investigated by means of the gate count which of those 240 dual ciphers can be implemented in hardware using a smaller area, and which ones can speed up the implementation. Since the intermediate values of the dual ciphers during encryption are different from Rijndael’s, it is mentioned in [BB02a] that one can randomly change the constants of the cipher, thereby realizing different dual ciphers and providing security against power analysis attacks. This led to other contributions. For instance, a hardware-software co-design of a system based on an Altera FPGA, where the content of the lookup tables is dynamically changed according to the randomly chosen parameters, is presented in [JCCC07a, JCCC07b].¹ Moreover, the authors of [GL08] and [GL09] presented a hardware implementation which can realize every selected dual cipher amongst those 240 ones. They reported the performance and area loss when the scheme is realized in order to increase the security against side-channel attacks.

In this work we examine this scheme, i.e., random selection of constants to choose a dual cipher out of 240, from a side-channel point of view. We address its flaws and weaknesses which can lead to easily breaking the corresponding implementation. In order to examine our findings in practice, we implemented the scheme on a Virtex-5 FPGA by means of precomputed matrices and constants and – in contrast to [JCCC07b] – by avoiding the use of any lookup table. Our practical side-channel evaluations confirm our claims indicating that the protection provided by the scheme is negligible while having high area and performance overheads. We show that the implementation can be easily broken when a suitable attack model is taken by the adversary.

The next section restates the concept of Rijndael dual ciphers with respect to the original work [BB02a]. Our design of the scheme for our targeted FPGA platform, together with its performance and area overhead figures, is presented in Section 6.3. Our discussions about the side-channel resistance of the scheme and practical investigations are given in Section 6.4, while Section 6.5 concludes our research on dual ciphers.

¹ In fact, the cipher which is realized by their design is not always equivalent to the original AES-Rijndael.

Results of this chapter were published at ICICS in 2013 [MM13a] as joint work with Amir Moradi.

6.2 Dual Cipher Concept

Two ciphers E and E′ are called dual ciphers if they are isomorphic, i.e., if there exist invertible transformations f(·), g(·) and h(·) such that

\[
\forall P, K: \quad E_K(P) = f\bigl(E'_{g(K)}(h(P))\bigr),
\]

where plaintext and key are denoted by P and K, respectively.

The concept of dual ciphers for AES-Rijndael was first published in 2002 [BB02a]. The authors demonstrate how to build a square dual cipher of the original AES and show that this process can be iterated multiple times, creating more square dual ciphers. This way 8 dual ciphers can be derived for each possible irreducible polynomial of GF(2^8). Since it is also shown how to create dual ciphers by porting the cipher to any of the 30 irreducible polynomials of GF(2^8), a total of 240 non-trivial dual ciphers for AES exist. Here non-trivial means that we are only considering those dual ciphers which actually change the inner core of AES and do not only consist of invertible transformations of the input and output of the cipher.

As an example, closely following the explanation in [BB02a], let us consider a square dual cipher of the original AES-Rijndael. In order to create this dual cipher one first has to multiply all AES constants by a matrix which performs the squaring operation under the original AES-Rijndael polynomial 0x11b. These constants include the round constant of the key schedule, the coefficients of the MixColumns transformation, as well as the input data and the key. In this special example this matrix is generated by taking a generator a, in this case the polynomial x^2 in GF(2^8), and building a matrix of the form R = (a^0, a^1, a^2, a^3, a^4, a^5, a^6, a^7), where each of these elements represents a column of the matrix and the result of the exponentiation is reduced by the original AES reduction polynomial. The resulting matrix is

\[
R = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}.
\]

Furthermore, we also need to make changes to the SubBytes transformation. If we consider SubBytes to be purely a table look-up of constants S(x), we can compute a new look-up table S^2 by applying the R matrix and its inverse R^{-1} as follows: S^2(x) = R S(R^{-1} x). If we consider the SubBytes transformation as inversion in GF(2^8) followed by a multiplication by the affine matrix A and addition of the constant b, then the inversion stays unchanged while the new affine matrix A^2 is computed as A^2 = R A R^{-1} and the new constant b^2 is computed (similar to the other constants, i.e., those of MixColumns and the key schedule) by multiplying it with the transformation matrix R: b^2 = R b. Note that in the case of S^2 or A^2 no actual squaring is taking place.
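Both views of the transformed SubBytes can be compared in software. The following sketch models R as squaring in GF(2^8) and R^{-1} as taking the square root (seven further squarings), builds the table S^2, and checks it against the structured form with A^2 = R A R^{-1} and b^2 = R b. All function names are chosen for this illustration and are not taken from [BB02a].

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, s):
    return ((x << s) | (x >> (8 - s))) & 0xFF


def linear_a(x):
    """Multiplication by the original affine matrix A (without the constant)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4)


def sbox(x):
    """Original AES S-box: inversion, then A, then addition of b = 63."""
    return linear_a(gf_inv(x)) ^ 0x63


def apply_r(x):
    """R acts on a byte like squaring in GF(2^8)."""
    return gf_mul(x, x)


def apply_r_inv(x):
    """R^-1 is the square root, i.e. x^128 (seven squarings)."""
    for _ in range(7):
        x = gf_mul(x, x)
    return x


# Table view: S^2(y) = R S(R^-1 y)
S2 = [apply_r(sbox(apply_r_inv(y))) for y in range(256)]


def sbox2_structured(y):
    """Structured view: unchanged inversion, A^2 = R A R^-1, b^2 = R b."""
    return apply_r(linear_a(apply_r_inv(gf_inv(y)))) ^ apply_r(0x63)
```

In this model the two views agree for all inputs, and the duality relation S^2(R x) = R S(x) holds for every byte x.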

If we consider the original cipher as E and the above described square dual cipher as E^2, by applying the same squaring routines again we can create a total of 8 square dual ciphers (up to E^128, since E^256 is equal to E in GF(2^8)). These square dual ciphers all use different constants and SubBytes transformations. According to the dual cipher concept, if the R matrices are multiplied with all input data bytes and key bytes and the result is transformed back by multiplying each output byte with the inverse matrix R^{-1}, the results of all ciphers when given the same input data and key will be equal. The differences in the internal structure, like the different S-box in SubBytes or the different coefficients in MixColumns, also translate into, e.g., different power consumption and EM emanations of a circuit implementing this technique. As noted in [BB02a], these differences in the internal structure of the dual ciphers might be usable as some kind of side-channel countermeasure. If the used dual cipher is randomly chosen, this could be comparable to a normal masking countermeasure.

Besides using square dual ciphers of the original AES-Rijndael, one can use the same transformation techniques as above to change all constants by using different generators a and reducing the a^i by the new irreducible polynomial. If the SubBytes transformation is not implemented as a table look-up but as inversion plus affine transformation, the inversion as well as all field multiplications, as in MixColumns, are then also performed using the new irreducible polynomial instead of the original one. This works for all 30 irreducible polynomials of GF(2^8). Since there exist 8 generators for each irreducible polynomial, representing the 8 square dual ciphers, a total of 240 different non-trivial dual ciphers in GF(2^8) exist, as stated previously. All generators, polynomials and constants of each of the 240 dual ciphers can be found in [BB02b]. Note that we consider only dual ciphers using mappings in GF(2^8), not those where other composite field representations are utilized, e.g., those presented in [Rad04].

Figure 6.1: Inversion circuit in GF(2^8). Starting from x, the operation sequence SQ, MUL, SQ, SQ, MUL, SQ, SQ, SQ, SQ, MUL, MUL produces x^2, x^3, x^6, x^12, x^15, x^30, x^60, x^120, x^240, x^252, and finally x^-1.

6.3 Design

The first design decision one has to make is whether to implement the SubBytes transformation purely based on look-up tables or to use a general inversion circuit together with the affine matrix multiplication and constant addition. Since the area overhead to store 240 different complete S-boxes is massive, similar to [GL08] and [GL09] we opted to implement a general inversion circuit. Because we want to analyze the side-channel resistance of the original submission of dual ciphers [BB02a], this requires the inversion to be implemented in GF(2^8) without the option to save on resources by utilizing inversions in composite fields or using a tower field approach [Paa94, SMTM01]. In other words, the inversion circuit must be general and valid for all the 30 irreducible polynomials mentioned in Section 6.2.

In order to prevent leakage through the timing channel during the inversion it is important to make the circuit time invariant. To achieve this one can make use of the fact that in GF(2^8) x^256 is equivalent to x, which leads to x^-1 ≡ x^254. Using addition chains this exponentiation can be implemented by a low number of modular multipliers and squaring circuits as depicted in Fig. 6.1. Note that the squaring step itself is free in GF(2^8) and only requires hardware resources for the modular reduction.
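The addition chain of Fig. 6.1 can be written down directly. The sketch below takes the reduction polynomial as a parameter, so that the same fixed sequence of squarings and multiplications works for any irreducible polynomial of degree 8. It is a software model with hypothetical names; the second polynomial used in the example (0x11d) is merely one further irreducible polynomial chosen for illustration.

```python
def gf_mul(a, b, poly):
    """Multiply in GF(2^8) modulo the given degree-8 irreducible polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv_chain(x, poly):
    """x^254 via the chain SQ MUL SQ SQ MUL SQ SQ SQ SQ MUL MUL."""
    def sq(v):
        return gf_mul(v, v, poly)

    x2 = sq(x)                       # SQ
    x3 = gf_mul(x2, x, poly)         # MUL
    x6 = sq(x3)                      # SQ
    x12 = sq(x6)                     # SQ
    x15 = gf_mul(x12, x3, poly)      # MUL
    x30 = sq(x15)                    # SQ
    x60 = sq(x30)                    # SQ
    x120 = sq(x60)                   # SQ
    x240 = sq(x120)                  # SQ
    x252 = gf_mul(x240, x12, poly)   # MUL
    return gf_mul(x252, x2, poly)    # MUL, result is x^254 = x^-1


# Example: the AES polynomial and one other irreducible polynomial.
for poly in (0x11B, 0x11D):
    assert all(gf_mul(x, gf_inv_chain(x, poly), poly) == 1 for x in range(1, 256))
```

The number and order of operations do not depend on the input value, which is the property needed for a time-invariant circuit.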

For each possible dual cipher one needs to store the following parameters:

(1) Initial transformation matrix R (64 bits), which is required to transform the original input data and key to the dual cipher representation.

(2) Inverse transformation matrix R^{-1} (64 bits), required to transform the output of the AES computation from the dual cipher representation back to the original AES representation, precomputed as normal matrix inversion of R in GF(2).

(3) Modular reduction polynomial p̂ (8 bits), to be used during all field multiplications (MixColumns) and the inversion steps (SubBytes).

(4) MixColumns coefficients m̂c (2 × 8 bits). While the MixColumns coefficients originally are 8-bit elements of a 4 × 4 matrix, because the coefficients of each row are only a rotated variant of the first row and only two are not 01_x (in GF(2^8)), it is sufficient to store only these two transformed coefficients (R(02_x), R(03_x)).

(5) Affine matrix of SubBytes Â (64 bits), to apply the affine matrix multiplication step of the affine transformation. The matrix is computed as Â = R A R^{-1}, where A is the original affine matrix of the AES.

(6) Affine constant b̂ of SubBytes (8 bits), final addition step of the affine transformation. As for every other constant transformation this can be computed as b̂ = R b, where b is the original affine constant, i.e., 63_x.

(7) Round constants (rcon) of the key scheduling r̂c (10 × 8 bits). The rcons are constructed as r̂c(r) = (R 02_x)^r mod p̂, with r starting from 1 for the first round, p̂ being the used irreducible polynomial, 02_x the initial element, and R the transformation matrix. The rcons could also be computed on-the-fly, which would only require the storage of the transformed r̂c_init = R 02_x (8 bits). Since this would require another modular multiplier, we have opted to store all the precomputed rcons for each of the 240 dual ciphers.

The overall architecture of our evaluation circuit is depicted in Fig. 6.2. The initial transformations of the input data and key are performed prior to the general AES/dual cipher computation. After the full encryption is complete, the inverse transformation moves the result back to the original AES representation as described previously. The AES/dual cipher computation itself is implemented as a round-based design, i.e., every round of AES requires one clock cycle and the computation is finished after ten clock cycles excluding the initial and final transformations and data loading.
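To get a feeling for the storage behind the parameter list above, the sizes can simply be added up. The sketch below does this with hypothetical dictionary keys; it also tallies, for comparison, what precomputed round keys for all dual ciphers would occupy, assuming AES-128 with 11 round keys of 16 bytes each.

```python
# Bits stored per dual cipher, following items (1) to (7) above.
PARAMETER_BITS = {
    "R": 64,            # initial transformation matrix
    "R_inv": 64,        # inverse transformation matrix
    "p_hat": 8,         # reduction polynomial
    "mc_hat": 2 * 8,    # two transformed MixColumns coefficients
    "A_hat": 64,        # affine matrix of SubBytes
    "b_hat": 8,         # affine constant of SubBytes
    "rc_hat": 10 * 8,   # ten precomputed round constants
}
NUM_DUAL_CIPHERS = 240

bits_per_cipher = sum(PARAMETER_BITS.values())
parameter_bytes = NUM_DUAL_CIPHERS * bits_per_cipher // 8

# Assumption: AES-128, i.e. 11 round keys of 16 bytes per dual cipher.
roundkey_bytes = NUM_DUAL_CIPHERS * 11 * 16

print(bits_per_cipher)        # 304
print(parameter_bytes)        # 9120
print(roundkey_bytes / 1024)  # 41.25
```

Under these assumptions the parameters of all 240 dual ciphers need roughly 9 kBytes, while precomputed round keys alone would need about 41 kBytes.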
We chose to implement a round-based design because this is very common in real-world implementations when a hardware platform is targeted. The on-the-fly key scheduling seems to be the most suitable option since the round keys, which are different for each dual cipher, would otherwise require 41 kBytes of storage. We have implemented the whole design on a Virtex-5 LX50 FPGA mounted on a SASEBO-GII [sas] (Side-channel Attack Standard Evaluation Board). In our implementation all the aforementioned parameters and constants are stored in Block-RAMs (BRAM) and are preloaded before every complete AES computation. The resource utilization is shown in Table 6.1.

Table 6.1: Performance figures (excluding the PRNG)

Version                 #LUTs    #FFs   #BRAMs   FREQ
Random Dual Cipher      13 481   651    6        21 MHz
General AES Enc Only    503      154    6        202 MHz

Figure 6.2: Overall architecture of the AES dual ciphers circuit

Compared to an unprotected design utilizing a more common S-box implementation based on look-up tables, we require significantly more LUT resources. This is due to the 20 large general inversion circuits implemented in parallel (16 for the round function and 4 for the key scheduling), which are required to perform the inversion in every selectable dual cipher representation. The number of LUTs could be heavily reduced by using a composite field or tower field approach in the S-box design which, as stated previously, we have not implemented at this point to enable a side-channel evaluation of the original dual cipher proposal. We should also highlight the very low maximum operating frequency of our design. It is due to the very long critical path of the inversion unit. Since it has to be general for any given irreducible polynomial, it could not be optimized with respect to both delay and area.

6.4 Evaluation

As explained in the previous section, at the beginning of each encryption process a PRNG determines which of the 240 dual ciphers is selected. This dual cipher index i is provided as address to the BRAM, which in turn outputs all the constants and coefficients for the whole circuit. If the dual cipher index is unknown to the adversary, he cannot predict the intermediate values. This can be seen as a kind of masking scheme providing some level of resistance against side-channel attacks. However, a few issues significantly affect the robustness of the scheme; we address them below. Note that the practical evaluations whose results are described in this section were performed by Amir Moradi.

6.4.1 Mask Reuse

All intermediate values and inputs are transformed using the same transformation matrix $R_i$ of the corresponding dual cipher. Therefore, all plaintext bytes are transformed using the same transformation, which leads to an issue similar to mask reuse in masking schemes. Considering a Boolean masking scheme, if the inputs of two S-boxes are masked by the same mask value, the corresponding key byte difference might be recovered by a classical linear collision attack [Bog08, CFG+11]. The same holds true for dual ciphers, since the inputs are transformed by the same matrix and all S-boxes use the same dual cipher parameters. In case a collision in the side-channel leakage between two S-boxes is detected,

$S_i(R_i(x^{(1)} \oplus k^{(1)})) = S_i(R_i(x^{(2)} \oplus k^{(2)}))$,

the linear key difference $k^{(1)} \oplus k^{(2)}$ can be recovered as $x^{(1)} \oplus x^{(2)}$, where the superscript $(j)$ denotes the byte index of the given plaintext and key. However, since our design is realized as a round-based architecture, separating the side-channel leakage of different S-boxes and detecting collisions is infeasible.
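The collision argument can be checked on a toy model. The following minimal sketch is not the evaluated design: it assumes a randomly drawn invertible 8 × 8 matrix over GF(2) as a stand-in for $R_i$ and a random byte permutation as a stand-in for $S_i$ (all function and variable names are hypothetical). Because both maps are bijective, every collision between the two S-box outputs reveals the same key difference.

```python
import random


def random_linear_bijection(rng):
    """Draw random 8x8 matrices over GF(2) until one is invertible.

    The matrix is returned as a 256-entry lookup table (stand-in for R_i).
    """
    while True:
        columns = [rng.randrange(1, 256) for _ in range(8)]
        table = []
        for x in range(256):
            y = 0
            for bit in range(8):
                if (x >> bit) & 1:
                    y ^= columns[bit]
            table.append(y)
        if len(set(table)) == 256:  # bijective, hence invertible
            return table


def random_sbox(rng):
    """Random byte permutation used as a stand-in for the dual-cipher S-box S_i."""
    sbox = list(range(256))
    rng.shuffle(sbox)
    return sbox


rng = random.Random(1)
R = random_linear_bijection(rng)
S = random_sbox(rng)
k1, k2 = 0x3C, 0xA7  # hypothetical key bytes of two S-box positions

# Enumerate every pair of plaintext bytes whose S-box outputs collide.
collisions = [
    (x1, x2)
    for x1 in range(256)
    for x2 in range(256)
    if S[R[x1 ^ k1]] == S[R[x2 ^ k2]]
]

# Each collision yields the same linear key difference x1 XOR x2.
recovered = {x1 ^ x2 for x1, x2 in collisions}
```

In this model `recovered` contains a single value, the key difference `k1 ^ k2`, regardless of which transformation was drawn, which is the mask-reuse effect described above.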

6.4.2 Concurrent Processing of Mask and the Masked Data

Contrary to implementations of masking in software, preventing univariate leakages is a very challenging task on hardware platforms. When a masked S-box is processing both the mask and the masked data at the same time, glitches in the circuit (see Section 2.2) can cause exploitable leakage. This issue has been observed in many realizations of masking schemes [MPO05, MME10, MM12b], and the dual ciphers scheme suffers from the same issue. The SubBytes unit receives the transformed key-whitened input as well as the irreducible polynomial $\hat{p}$, the affine matrix coefficients $\hat{A}$, etc. Due to the glitches, the side-channel leakage of the S-box is therefore not independent of its untransformed input. A univariate attack, e.g., a CPA [BCO04], should be able to recover the secrets if an appropriate power model is used.

Figure 6.3: Distributions of the S-box output for (top) $11_x$ and (bottom) $44_x$ as original input over all 240 dual ciphers

6.4.3 Unbalance

Considering the lemmas and properties of [NRR06, NRS11], we explain this issue as follows. Let us assume a masking scheme where an input value $x$ is transformed into its masked representation $x_m$ with a mask $m$: $x_m = x * m$. The conditional probability

$\Pr(x_m = X_M \mid x)$

must be constant for all $x$ to ensure the balance of the distributions (with $X_M$ we mean a realization of $x_m$). In other words, if $f_x(x_m)$ represents the probability density function of $x_m$ for a given $x$, the probability distributions for two different realizations of $x$ must be the same. If the distributions were different, their corresponding side-channel leakages could be distinguished from each other, making it possible to detect which value of $x$ was processed. For a scheme to be considered secure, this property has to hold for all intermediate values, which is not the case in our dual cipher design. Taking two distinct values for the S-box output and computing the corresponding probability distributions of their inputs for all 240 dual cipher cases clearly indicates that those intermediate values are unbalanced (see Fig. 6.3). Again, a univariate side-channel attack should therefore be able to extract the secrets since leakages for different inputs can be distinguished.
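The balance property can be tested exhaustively for small masking operations. The sketch below is an illustration under stated assumptions, not the dual cipher design: it compares Boolean masking with multiplicative masking in GF($2^8$) (using the AES polynomial $11B_x$ as an example), and the helper names are hypothetical.

```python
from collections import Counter


def gf256_mul(a, b, poly=0x11B):
    """Multiply two bytes in GF(2^8), reducing by the given polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def masked_distribution(x, masks, mask_op):
    """Distribution of x_m = x * m over all masks m, for a fixed x."""
    return Counter(mask_op(x, m) for m in masks)


def is_balanced(values, masks, mask_op):
    """True if Pr(x_m = X_M | x) is the same for every x in values."""
    reference = masked_distribution(values[0], masks, mask_op)
    return all(
        masked_distribution(x, masks, mask_op) == reference for x in values[1:]
    )


all_bytes = list(range(256))
nonzero_bytes = list(range(1, 256))

# Boolean masking with a uniform mask is balanced for every x.
boolean_balanced = is_balanced(all_bytes, all_bytes, lambda x, m: x ^ m)
# Multiplicative masking is balanced only as long as x = 0 is excluded.
mult_balanced_nonzero = is_balanced(nonzero_bytes, nonzero_bytes, gf256_mul)
mult_balanced_with_zero = is_balanced(all_bytes, nonzero_bytes, gf256_mul)
```

The same `is_balanced` check could in principle be applied to the set of 240 transformation matrices by supplying the matrix index as the "mask"; the matrices themselves are not reproduced here.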

6.4.4 Zero Value

A general problem in multiplicative masking schemes is the masking of the zero value. Regardless of the mask $m$, the input $x = 0$ will always be mapped to itself. A CPA attack utilizing the zero-value power model [GT02] can therefore usually extract the secrets. This also holds true for the dual cipher approach. Since the transformation step consists of a linear matrix multiplication with $R$ in GF(2), the zero input remains zero in all 240 dual cipher cases. Therefore, a zero-value CPA attack targeting the S-box input should easily be able to overcome the countermeasure.

Since we have so far only considered the 240 dual ciphers of the original publication in our design, one might argue that some of those issues might be mitigated if one can select the set of dual ciphers from the 61 200 cases of [Rad04]. Regardless of the zero-value issue, even if one could find a set of dual ciphers which satisfies the balance property on the S-box input, keeping the balance property for the S-box output at the same time cannot be guaranteed because each dual cipher employs a different S-box. Therefore, the aforementioned problems remain valid for any selection of dual ciphers, making implementations employing this scheme vulnerable to certain attacks.
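The zero-value weakness can be illustrated with a small simulation. This is a sketch under simplifying assumptions and does not model the measured circuit: a pool of 16 random invertible GF(2) matrices stands in for the dual cipher transformations, the leakage is the Hamming weight of the transformed S-box input plus Gaussian noise, and all names and parameters are hypothetical.

```python
import random


def hamming_weight(value):
    return bin(value).count("1")


def random_linear_bijection(rng):
    """Random invertible 8x8 matrix over GF(2) as a 256-entry lookup table."""
    while True:
        columns = [rng.randrange(1, 256) for _ in range(8)]
        table = []
        for x in range(256):
            y = 0
            for bit in range(8):
                if (x >> bit) & 1:
                    y ^= columns[bit]
            table.append(y)
        if len(set(table)) == 256:
            return table


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2015)
pool = [random_linear_bijection(rng) for _ in range(16)]  # stand-in transformations
secret_key = 0x5A  # hypothetical key byte

# Each plaintext byte value is used 16 times, each time with a random transformation.
plaintexts = [x for x in range(256) for _ in range(16)]
leakage = []
for x in plaintexts:
    R = rng.choice(pool)
    leakage.append(hamming_weight(R[x ^ secret_key]) + rng.gauss(0.0, 1.0))

# Zero-value model: predicts 1 only if the key-whitened input is zero.
scores = {}
for guess in range(256):
    model = [1 if (x ^ guess) == 0 else 0 for x in plaintexts]
    scores[guess] = pearson(model, leakage)

# A zero input leaks the lowest Hamming weight, so the correct key
# shows the most negative correlation.
best_guess = min(scores, key=scores.get)
```

Every linear transformation maps zero to zero, so the traces with `x ^ secret_key == 0` leak identically under all transformations, and the model singles out the key without any knowledge of the selected index.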

6.4.5 Practical Investigations

As stated before, our practical evaluations are based on experiments performed on a SASEBO-GII [sas] FPGA platform. The design was implemented on the crypto FPGA of the board, a Xilinx Virtex-5 LX50. For each encryption operation the dual cipher index i was randomly chosen by means of an internal PRNG. A LeCroy HRO66Zi 600 MHz digital oscilloscope was used to measure the power consumption of the crypto core at a sampling rate of 1 GS/s with a 1 Ω resistor in the VDD path. All experiments were performed while running the crypto core at a low clock frequency of 1.5 MHz. The bandwidth of the oscilloscope was limited to 20 MHz to further reduce electrical noise.

A sample power trace of a complete AES computation is shown in Fig. 6.4. The first peak between 0 and 1 µs is caused by the selection of the dual cipher, where the corresponding parameters of the selected dual cipher appear at the BRAMs' output and propagate through the combinatorial circuit. The following peaks are due to the 10 AES rounds and the final storage of the ciphertext in the state register. Note that the peak-to-peak voltage of more than 300 mV is quite high, which is due to the required general purpose inversion circuits (cf. Section 6.3).

Figure 6.4: A sample power trace, PRNG ON (voltage in units of 100 mV over time in µs)

Figure 6.5: Correlation-Enhanced Collision Attack results, (a) PRNG ON, using 500 000 traces; (b) PRNG ON and (c) PRNG OFF, over the number of traces

First we mounted a Correlation-Enhanced Collision Attack (CECA) (cf. Section 2.3), targeting two S-boxes of the first round. While attacking two different instantiations of an S-box is not the optimal case for the CECA, we were still able to recover the linear key difference, as depicted in Fig. 6.5. Compared to the unprotected implementation, where the PRNG is switched off, thereby selecting the original AES parameters, the number of required traces increases from 5000 to 100 000.

Next, we examined the feasibility of a zero-value attack, whose evaluation results are presented in Fig. 6.6. They confirm our theoretical claim that the zero value is amongst the weakest points of the scheme. A very low number of 10 000 traces is sufficient to overcome the provided protection.

Figure 6.6: Zero-value attack results, (a) PRNG ON, using 100 000 traces; (b) PRNG ON and (c) PRNG OFF, over the number of traces

6.5 Conclusion

In this work we have taken an in-depth look at the AES-Rijndael dual cipher concept from a side-channel point of view. We have implemented an evaluation circuit which is able to perform AES computations using randomly chosen dual ciphers. The inversion part of the circuit operates in GF($2^8$), as in the original dual cipher contribution [BB02a], giving a total choice of 240 different internal computations with correspondingly different side-channel leakage characteristics.

Besides providing practical evidence of the vulnerability of this original dual cipher implementation to several side-channel attacks, we have also described some of the general flaws of the scheme when considered as a side-channel countermeasure. This includes the mask reuse, the concurrent operations on both the mask and the masked data, the violation of the balance property, and the inability to mask the zero value. Because of these properties, the vulnerability of dual cipher implementations is not limited to those which are restricted to a small number of possible transformations by focusing on mappings in GF($2^8$). Even if one were able to select between several thousand dual ciphers using composite fields, as given in [Rad04], the described weaknesses would still exist and would enable an attacker to successfully extract the secret key. In conclusion, even when ignoring the large area overhead of the circuit in comparison to other lighter masking schemes, AES-Rijndael dual ciphers are unsuitable as a side-channel countermeasure and can be broken using modest efforts and simple attack models.

Part II

Fault Analysis of AES

Chapter 7

Preliminaries

This chapter provides an introduction to fault injection attacks and countermeasures. It also outlines the remaining chapters of this part of the dissertation.

Contents of this Chapter
7.1 Introduction
7.2 Fault Injection Attacks
7.3 Countermeasures
7.4 Joint Motivation and Contributions

7.1 Introduction

Fault Injection Analysis (FIA) belongs to the active branch of physical attacks. For an FIA attack a device has to be forced into faulty behavior, which can be achieved by (semi-)invasive or non-invasive means. Examples of invasive attacks are the use of laser light to destroy a memory cell, UV strobes to create electric charges, cutting a wire, or forcing a signal to a constant value by use of a probing station. Operating the device outside the defined environmental conditions (too high or too low operating voltage or temperature) or using voltage or clock glitches are examples of non-invasive fault injection. The fault injection itself might already lead to a successful outcome, e.g., if a certain instruction is skipped by the target device, but more often the output of the faulty computation is used to recover the secret. The following sections present certain fault attack techniques, in particular Fault Sensitivity Analysis (FSA), which is the basis for our proposed attack improvements in later chapters. We also provide an overview of common countermeasures and outline the remaining chapters of this dissertation.

7.2 Fault Injection Attacks

We introduce both Differential Fault Analysis (DFA), which is the most dominant fault-based attack in the literature, and Fault Sensitivity Analysis (FSA), a more recent technique presented at CHES in 2010.

7.2.1 Differential Fault Analysis (DFA)

Differential Fault Analysis [BS97] was proposed at CRYPTO in 1997. While no practical results were presented at that time, it was shown that, considering a certain fault model, the secret key of a DES computation could be recovered in simulation using just a few faulty and non-faulty ciphertext pairs. The attack requires an attacker who is able to cause a single or at least a low number of bit flips during an encryption operation. He must also be able to collect the resulting faulty ciphertext $C'$ as well as the non-faulty ciphertext $C$ belonging to the same plaintext. Based on a key guess, an intermediate state of the cipher is then computed for both $C$ and $C'$. If the relation of both intermediate states fits the assumed fault characteristic, e.g., a single bit flip in the targeted round, the key guess is kept as a candidate; the secret key can be recovered after repeating this procedure for several ciphertext pairs.
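The key-filtering step can be sketched on a toy last round. The example below is an illustrative assumption, not the DES attack of [BS97]: it uses a single byte, a random byte permutation as a stand-in S-box, and a single-bit-flip fault model at the S-box input; all names are hypothetical.

```python
import random


def hamming_weight(value):
    return bin(value).count("1")


rng = random.Random(7)

# Random byte permutation as a stand-in S-box, plus its inverse.
SBOX = list(range(256))
rng.shuffle(SBOX)
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index

secret_key = 0xC3  # hypothetical last-round key byte


def encrypt_last_round(state, fault_mask=0):
    """Toy last round C = S(state) XOR k; fault_mask flips bits of the S-box input."""
    return SBOX[state ^ fault_mask] ^ secret_key


def key_candidates(c, c_faulty):
    """Key guesses for which the two intermediate states differ in exactly one bit."""
    return {
        guess
        for guess in range(256)
        if hamming_weight(INV_SBOX[c ^ guess] ^ INV_SBOX[c_faulty ^ guess]) == 1
    }


# Intersect the candidate sets of several (C, C') pairs.
candidates = set(range(256))
for _ in range(6):
    state = rng.randrange(256)
    fault = 1 << rng.randrange(8)  # assumed fault model: single bit flip
    c = encrypt_last_round(state)
    c_faulty = encrypt_last_round(state, fault)
    candidates &= key_candidates(c, c_faulty)
```

Each pair typically leaves only a handful of guesses that are consistent with the fault model, so a few pairs are enough to isolate the key in this toy setting.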

7.2.2 Fault Sensitivity Analysis (FSA)

Unlike DFA [BS97], a Fault Sensitivity Analysis (FSA) [LSG+10] attack requires no faulty ciphertexts. Instead, the attack works by increasing the fault intensity until a distinguishable characteristic can be observed, e.g., the first appearance of a faulty output. Therefore the value of the faulty output is not required, only the fact that a fault occurred under the used operating conditions. It was practically demonstrated that this attack is able to completely break the AES PPRM1 core of the SASEBO LSI2 (fabricated in 130 nm technology) using 200 faulty operations for each of 50 randomly selected plaintexts. It could also reveal three key bytes of the AES WDDL implementation of the same ASIC using its fault sensitivity leakage obtained from 1200 plaintexts.

The method used in [LSG+10] to increase the fault intensity is based on the shortening of clock glitches. Two normal clock cycles are replaced by a short and a longer one, whereby the length of the short one can be gradually decreased until a faulty output occurs or the fault becomes stable. Since the critical path of some gates, e.g., AND and OR gates, is data dependent, the knowledge of the underlying model for this data dependency helps revealing the secret. For example, by simulation it could be ascertained that the timing delay of computing the output of a PPRM S-box correlates to the Hamming Weight (HW) of its input [LSG+10]. While WDDL should in theory be immune against set-up time violation attacks, by creating templates with a known key it was shown that at least some bits correlate to the timing delay, which led to the aforementioned recovery of three key bytes. However, for an S-box implemented by an inversion circuit followed by an affine transformation, it could not be shown how to map the timing information of the faults to input values of the S-box.

Since no faulty ciphertexts are required, the attack might also be applicable to implementations which apply DFA countermeasures. Therefore, even if a DFA countermeasure is implemented which hinders the propagation of the faulty ciphertext, just knowing that a fault occurred might be enough to break the implementation, as we will demonstrate in Chapter 9.

Another difference to DFA attacks is that in the case of FSA the faults do not need to be restricted to a small sub-space. On the contrary, when for example attacking the last round of the AES PPRM1 implementation, each faulty output byte can be independently observed, and therefore the same complete faulty output can be used to attack all key bytes simultaneously. Moreover, as stated in [LSG+10], while countermeasures like masking are only of limited use against DFA attacks, they may have a larger impact on FSA attacks since the critical path is affected by the random mask bits.

7.3 Countermeasures

One can distinguish between two different types of fault attack countermeasures. By embedding active sensors in a circuit, it is possible to detect the fault injection itself and shut down the running cryptographic operation. A fault injection can also be detected by algorithmic means, e.g., by inserting redundancy into the circuit and verifying that no mismatch occurs.

7.3.1 Sensors

If a designer is in complete control of the fabrication, as is the case for cryptographic algorithms embedded in ASICs, active countermeasures can be employed. External power and clock lines can be monitored and, in case an anomaly is detected, an alarm can be raised, usually resulting in a reset of the whole chip. Important circuit parts can be covered by active shielding to detect or at least increase the difficulty of modification attacks, e.g., rewiring or cutting with a focused ion beam (FIB).

Figure 7.1: CED schemes: (a) information redundancy: parity; (b) information redundancy: robust code; (c) time redundancy; (d) hardware redundancy; (e) hybrid redundancy: inverse function; (f) hybrid redundancy: invariance-based CED (taken from [GMK12]).

Other sensors can be used to detect UV, laser light or any other kind of radiation. However, sensors are costly and, since some fault sources only affect very small parts of the chip, it is impossible to protect a complete chip against every attack option.

7.3.2 Concurrent Error Detection (CED) Schemes

Over the last decades several proposals for CED schemes, i.e., algorithmic FIA countermeasures, have been published. In [GMK12] the authors presented an overview of common techniques, which are depicted in Fig. 7.1. They can be grouped into four categories: information redundancy, time redundancy, hardware redundancy, and hybrid redundancy.

Information redundancy can for example be achieved by adding parity bits [KR10, WKKG04] (Fig. 7.1(a)) or by a robust code (Fig. 7.1(b)) approach [KKT04], which at the round input tries to predict a nonlinear property of the round output. Time redundancy [MSY06] can be reached by computing the same cipher round twice using a single hardware instance in consecutive clock cycles (Fig. 7.1(c)). Hardware redundancy [MSY06] is achieved by duplicating the hardware instance and computing both the real cipher round and its check in the same clock cycle (Fig. 7.1(d)). The last category is hybrid redundancy, where different characteristics of the previous more general schemes are combined and sometimes rely on certain properties of the underlying algorithm. For example, the scheme displayed in Fig. 7.1(e) is a variant of hardware redundancy, but instead of computing the same function twice, it inverts the output of the previous round and checks it against the input of that round [KWMK02, SSHA08]. The newest scheme presented in [GK12] is the invariance-based CED (Fig. 7.1(f)) that is based on a variant of time redundancy. Again, instead of simply computing the same function twice and comparing both results, for the second execution the input and the output are additionally permuted, exploiting internal properties of the Advanced Encryption Standard (AES) [Nat01].
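Two of these categories can be sketched in software on a toy round function. The example is only a behavioral illustration with hypothetical names: a random byte permutation stands in for the cipher round, and a fault is modeled as an XOR mask applied to the result of the first computation.

```python
import random


class FaultDetected(Exception):
    """Raised when a redundancy check finds a mismatch."""


rng = random.Random(3)
SBOX = list(range(256))
rng.shuffle(SBOX)  # random permutation as a stand-in round function
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index


def round_fn(state, key):
    return SBOX[state ^ key]


def inv_round_fn(output, key):
    return INV_SBOX[output] ^ key


def round_time_redundancy(state, key, fault_mask=0):
    """Time redundancy: compute the round twice and compare both results."""
    first = round_fn(state, key) ^ fault_mask  # transient fault hits the first run only
    second = round_fn(state, key)
    if first != second:
        raise FaultDetected("recomputation mismatch")
    return first


def round_inverse_check(state, key, fault_mask=0):
    """Hybrid redundancy: invert the round output and compare with the round input."""
    output = round_fn(state, key) ^ fault_mask
    if inv_round_fn(output, key) != state:
        raise FaultDetected("inverse mismatch")
    return output
```

Both checks suppress the faulty output, which blocks classical DFA, but note that the binary information "a fault was detected" is still visible to an attacker.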

7.4 Joint Motivation and Contributions

Presented at CHES 2010, FSA [LSG+10] is a new fault attack which uses the fact that the critical path of an AES S-box implementation is data dependent because of the nature of the underlying gates. Using clock glitches and a simple hypothetical (timing) model for the S-box – which can be obtained by simulation knowing the design details or by a profiling phase – the authors showed a complete break of one AES ASIC implementation using the critical path details of 50 ciphertexts.

Another contribution to CHES 2010 was a collision attack enhanced by correlation [MME10]. Compared to classical power analysis attacks, its main feature is that it does not rely on the knowledge of an underlying (hypothetical) power model. Instead, it directly correlates power traces to each other and – by finding colliding S-box computations – is able to recover the relations between key parts. Using such an attack, a complete break of a masked FPGA implementation of AES has been demonstrated.

The improvement and combination of these two ideas are the topic of this part of the dissertation. In Chapter 8 we present an attack that exploits the timing characteristics of AES S-boxes, but in order to recover the secret it does not need to know specifically what these characteristics are and how they relate to the inputs. The aim of the proposed attack is to exploit the timing characteristics of combinatorial functions, e.g., S-boxes implemented in ASICs, thereby recovering the relations between key parts and restricting the key search space in a way that the secret key can be revealed knowing a single plaintext-ciphertext pair. Similarly to [LSG+10], we have chosen the SASEBO-R board [sas] as the evaluation platform. The board can hold different ASICs, and the proposed attack is able to break all AES implementations available in the SASEBO LSI2 [LSI09], both in 130 nm and 90 nm technology, as well as the SASEBO LSI3 [LSI10] in 65 nm technology. The implementations themselves differ in the style of the S-box realization and in side-channel countermeasures. Note that in this dissertation we only present some of the evaluation results, namely the successful attack on the AES COMB core, since the evaluation was exclusively performed by our co-author Amir Moradi. The complete results were published in 2013 in IEEE Transactions on Computers [MMP13].

In Chapter 9 we take a closer look at the zero-value vulnerability observed in the previous Chapter 8. We provide the first practical evidence that a zero-value attack model, which was first discovered in the context of multiplicative masking in [GT02], is also applicable in the context of FSA on S-boxes implemented using composite field arithmetic. Because of the special case that a zero input to the inversion circuit of the AES S-box is directly mapped to the zero output, the critical path delay of this input is especially short. After the first publication of this assumption in [MMP+11b], it was later also confirmed by simulation in [MP13].

More importantly, we demonstrate how this weakness can be exploited even if the evaluation circuit is equipped with a state-of-the-art fault detection scheme, which usually renders FSA-like attacks infeasible since they rely on observing the fault rates for each byte individually. Our evaluation circuit, implemented on the SASEBO-G2 FPGA evaluation platform, can mimic different Concurrent Error Detection (CED) schemes, and we will demonstrate a successful complete key recovery on the invariance-based CED scheme presented at DAC 2012 [GK12]. Indeed, the only information required from the circuit to successfully mount our attack is the binary information whether or not a fault occurred while processing chosen 128-bit plaintext inputs. Results of this work were presented at FDTC in 2014 [MMG14].

Chapter 8

Correlation Timing Analysis

When complex functions, e.g., substitution boxes of block ciphers, are realized in hardware, timing attributes of the underlying combinatorial circuit depend on the input/output changes of the function. These characteristics can be exploited with the help of a relatively new scheme called Fault Sensitivity Analysis (FSA). A collision timing attack which exploits the data-dependent timing characteristics of combinatorial circuits is demonstrated in this chapter. The attack is based on the likewise recently published Correlation-Enhanced Collision Attack (CECA), which avoids the need for a hypothetical timing model of the underlying combinatorial circuit to recover the secret materials. While the attack itself is able to break all 14 analyzed AES ASIC cores, each produced in three different process technologies and including those protected by various countermeasures, here we focus on the successful evaluation of the unprotected AES COMB core. It is selected because it showcases the distinct zero-value characteristic which will be further analyzed in Chapter 9.

Contents of this Chapter
8.1 Introduction
8.2 Notations
8.3 Attack Description
8.4 Evaluation
8.5 Conclusion

8.1 Introduction

In Chapter 7 we have presented both the Fault Sensitivity Analysis (FSA) and the Correlation-Enhanced Collision Attack (CECA), originally published in [LSG+10] and [MME10], respectively. As stated in more detail in Section 7.4, the combination of these two ideas, leading to the proposed Collision Timing Attack (CTA), is the main contribution of this chapter. Note again that only some of the evaluation results available in the full publication [MMP13] are presented, and that this was joint work with Amir Moradi.

Figure 8.1: Dependency of the timing of combinatorial circuits on the input changes. The gray bars denote a stable output signal after the S-box input has changed from 0xaa to 0x00, to 0x55, or to 0xff.

8.2 Notations

8.2.1 How to Measure the Timing

The focus of this work is to analyze the timing characteristics of combinatorial circuits like S-boxes. As shown in Fig. 8.1, when the input of a combinatorial circuit changes, its output stops toggling after a certain time (so-called $\Delta t$). The maximum value of $\Delta t$ for different inputs is known as the longest critical path of the circuit, and defines the maximum frequency of the clock signal which triggers the flip-flops providing the input and storing the output of the considered combinatorial circuit. Timing characteristics of a circuit are therefore defined as a set of $\Delta t$ values ($T = \{\Delta t_1, \Delta t_2, \ldots, \Delta t_n\}$), where $\Delta t_i$ is the minimum $\Delta t$ for the given input $i$.

Let us suppose that the target combinatorial circuit is a part of a bigger circuit, e.g., a co-processor, which provides some I/O signals for communication. If the output signals of the target combinatorial circuit are accessible from the outside – which is quite unlikely – one can easily probe the output signals for the given input $i$ and measure the time when the output signals become stable. Otherwise, if the output of the target combinatorial circuit is stored into registers which are triggered by a clock signal that can be controlled from the outside, as shown in Fig. 8.2, one can steadily shorten the time interval between the input transition and the output storage (known as setup time) until an incorrect value is stored into the registers while input $i$ is repeatedly given to the combinatorial circuit. The minimum time interval at which the considered register still stores the correct value can be taken as $\Delta t_i$.

Figure 8.2: A simplified example on how to use clock glitches to extract a $\Delta t_i$ (input transition from 0xaa to 0xff; the correct output 0x16 is still captured at 12.6 ns and 12.4 ns, whereas the faulty value 0x0e is captured at 12.2 ns, i.e., $\Delta t_i$ = 12.4 ns)

Note that this procedure is similar to clock glitches, which are mostly used in DFA to intentionally inject faults or to skip the execution of an instruction and analyze the faulty outputs based on the target algorithm [BS97, BS03, PQ03]. However, measuring $\Delta t$ in this case does not deal with the faulty outputs; once a faulty output is detected, $\Delta t_i$ can be concluded. It should be noted that, because of the environmental noise, it might be required to repeat the same procedure and shorten the clock glitch period until the probability of detecting a faulty output gets higher than a threshold. Also, if the target combinatorial circuit is not a single-bit function and it is possible to detect which output bit is faulty, one can measure $\Delta t$ for each output bit independently. Therefore, we define the adversary model and his capabilities required to measure $\Delta t_i$ of the target combinatorial circuit for the given input $i$ as follows:

• The adversary can access and control the clock signal which triggers the registers providing the input and saving the output of the target combinatorial circuit.

• He knows in which clock cycle the target combinatorial circuit processes the desired data, e.g., known or guessed input or output.

• He can control the target device in a way that the same input value $i$ is repeatedly processed by the target combinatorial circuit while the time interval of the clock glitch is shortened.

• He is equipped with appropriate instruments to shorten the duration of the clock glitch with suitable accuracy.

8.2.2 Definitions

Bitwise Capture: let us define $BitCap^{i}_{b,\Delta t}$ as the result of a Bernoulli trial whether the output of the target combinatorial circuit at bit $b$ is faulty while processing the input $i$ and when $\Delta t$ is the time interval of the clock glitch. Correspondingly, $p^{i}_{b,\Delta t}$ is defined as the probability of "success" in independently repeated $BitCap^{i}_{b,\Delta t}$ trials.

Capture: defining $Cap^{i}_{\Delta t} = \bigvee_{b} BitCap^{i}_{b,\Delta t}$ means that $Cap^{i}_{\Delta t}$ is the same as the above defined trial regardless of a certain output bit. It is meaningful when differentiating between different faulty output bits is not possible, e.g., if a circuit is equipped with a fault detection scheme and prevents the propagation of faulty results. $p^{i}_{\Delta t}$ is also defined as the probability of "success" in independently repeated $Cap^{i}_{\Delta t}$ trials.

Time: To represent the timing characteristics of the target combinatorial circuit, we define $T^{i}_{b} = \Delta t;\ p^{i}_{b,\Delta t} \approx p_{TH}$ as the time required to compute the corresponding output bit $b$ when input $i$ is given, where $p_{TH}$ is a threshold for the probability and is defined based on physical characteristics of the target circuit and also on the maximum probability achieved by shortening $\Delta t$. Accordingly, the time required to complete the computation of all bits when processing input $i$ is defined as $T^{i} = \Delta t;\ p^{i}_{\Delta t} \approx p_{TH}$.

Depending on the target device, its architecture, and the role of the target combinatorial circuit inside the target device, it might not be possible to know the processed input $i$. However, if the output of the target combinatorial circuit is accessible, one can define all the above terms based on the fault-free output $o$, i.e., $BitCap^{o}_{b,\Delta t}$, $p^{o}_{b,\Delta t}$, $Cap^{o}_{\Delta t}$, $p^{o}_{\Delta t}$, $T^{o}_{b}$, and $T^{o}$.
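The definitions of $Cap^{i}_{\Delta t}$, $p^{i}_{\Delta t}$, and $T^{i}$ can be made concrete with a simulated device. The following sketch is a simplified model under stated assumptions, not the measurement setup: the critical path delay of one input is a fixed value with Gaussian jitter, times are integers in picoseconds, and all names and parameter values are hypothetical.

```python
import random


def make_capture(true_delay_ps, jitter_ps, rng):
    """Return a simulated Cap trial for one fixed input i.

    The trial is a "success" (fault observed) if the glitch period is
    shorter than the momentary critical path delay.
    """
    def capture(dt_ps):
        return dt_ps < true_delay_ps + rng.gauss(0.0, jitter_ps)
    return capture


def fault_probability(capture, dt_ps, trials):
    """Estimate p_dt as the success rate of repeated Cap trials."""
    return sum(capture(dt_ps) for _ in range(trials)) / trials


def estimate_T(capture, dt_start_ps, dt_stop_ps, step_ps, p_th, trials):
    """Shorten the glitch period step by step until the fault rate reaches p_TH."""
    dt_ps = dt_start_ps
    while dt_ps >= dt_stop_ps:
        if fault_probability(capture, dt_ps, trials) >= p_th:
            return dt_ps
        dt_ps -= step_ps
    return None  # threshold never reached in the swept range


rng = random.Random(42)
capture = make_capture(true_delay_ps=12350, jitter_ps=50, rng=rng)
estimate = estimate_T(
    capture, dt_start_ps=14000, dt_stop_ps=11000, step_ps=100, p_th=0.5, trials=200
)
```

With these assumed parameters the sweep stops at the first glitch period below the true delay; repeating it for every input $i$ (or output $o$) yields the set of timing characteristics used by the attack.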

8.3 Attack Description

For simplicity let us suppose that the targeted combinatorial circuit is an S-box of the last round of an AES encryption, i.e., o = S(i) ⊕ k, where o is the corresponding output ciphertext byte and k the target key byte. o o If (bitwise) timing characteristics of an S-box implementation, i.e., T (T b), exhibit different values of ∆t depending on the output o, it is possible to perform an attack and recover the secret. However, this requires knowledge on how the secret k contributes in o o T (T b). In other words, if the timing characteristics of the S-boxes are known or could be obtained in a profiling stage on an equivalent circuit, one could create a hypothetical model or template to compare estimated timing characteristics for each key guess with


Algorithm 1: Correlation Timing Attack (at the last round of an AES encryption)
Require: ${}_1T^o$: $\{\Delta t^{o=0}, \ldots, \Delta t^{o=255}\}$; $o = S(i) \oplus k_1$
Require: ${}_2T^o$: $\{\Delta t^{o=0}, \ldots, \Delta t^{o=255}\}$; $o = S(i) \oplus k_2$
for $\Delta \in \{0, 1, \ldots, 255\}$ do
    $C^{\Delta} = \mathrm{Correlation}({}_1T^o, {}_2T^{o \oplus \Delta})$
end for
return $\arg\max_{\Delta} C^{\Delta}$

the actually measured $T^o$ ($T^o_b$). In [LSG+10] a comparable approach was presented. By profiling the timing characteristics of a PPRM AES S-box implementation it could be ascertained that the timing delay correlates with the Hamming weight and a successful correlation attack could be performed. In fact, a set of $Cap^o_{\Delta t}$ for a specific $\Delta t$ is used in [LSG+10] to mount the attack at the last round of the AES encryption.

Using information theoretic tools, e.g., Mutual Information Analysis (MIA) [GBTP08], could be another option to overcome the uncertainty of the leakage model. However, this still requires selecting a suitable leakage model, which is not possible without extra knowledge about the (timing) characteristics of the target combinatorial function, or several different models must be examined to find a suitable one [VS11]. Also note that the leakages ($Cap^o_{\Delta t}$) consist of only two values ("fail" and "success"), i.e., probability distributions (as used in, e.g., mutual information analysis) are only represented by two bins in a histogram.

Instead of relying on a suitable leakage model to perform a correlation attack or MIA, we apply a Correlation-Enhanced Collision Attack (CECA) [MME10] to avoid the necessity of considering any such model. We can use the CECA to compare the timing characteristics $T^o$ ($T^o_b$) of two S-box instances generating two output sets, each of which is later XORed with a secret key byte. Suppose that ${}_1T^o$ and ${}_2T^o$ (or their corresponding bitwise versions) are the timing characteristics of the S-box when processing $o = S(i) \oplus k_1$ and $o = S(i) \oplus k_2$, respectively. As stated in [MME10], the aim of a Correlation-Enhanced Collision Attack is to find the linear difference between $k_1$ and $k_2$, i.e., $\Delta = k_1 \oplus k_2$. This is achieved by comparing ${}_1T^o$ and ${}_2T^o$ for all possible guesses of $\Delta$. For clarification of the attack scheme see Algorithm 1.
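The comparison performed by Algorithm 1 can be sketched in a few lines of Python. This is only an illustration on synthetic data, not the evaluation code used for the thesis: the function names, the noise levels and the example key bytes are assumptions, and the per-value delays are drawn at random instead of being measured.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def collision_timing_attack(t1, t2):
    """t1[o], t2[o]: timing characteristic observed for output byte o of two
    S-box instances. Returns the best guess for k1 xor k2 and all correlations."""
    corr = [pearson(t1, [t2[o ^ delta] for o in range(256)])
            for delta in range(256)]
    best = max(range(256), key=corr.__getitem__)
    return best, corr

# Synthetic demo (assumed values): one hidden delay per S-box output value,
# observed through two instances with slightly different noise.
rng = random.Random(1)
base = [rng.gauss(5700.0, 300.0) for _ in range(256)]   # delay in ps
k1, k2 = 0x3A, 0xC5                                     # hypothetical key bytes
t1 = [base[o ^ k1] + rng.gauss(0.0, 20.0) for o in range(256)]
t2 = [base[o ^ k2] + rng.gauss(0.0, 20.0) for o in range(256)]
delta, correlations = collision_timing_attack(t1, t2)
```

Only the difference of the two key bytes is recovered, so in the synthetic demo `delta` is compared against `k1 ^ k2` and not against either key byte itself.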

8.4 Evaluation

In this dissertation we focus our evaluation on the AES COMB core of one of the three ASIC test chips built for the SASEBO-R board, namely the SASEBO LSI2 in 130nm process technology. For evaluation results of the other cores and LSIs in 90nm and 65nm, we refer to the full article published in IEEE Transactions on Computers in 2013 [MMP13]. Note that the thesis author's main responsibility was in developing the test setup, i.e., interfacing the ASIC with the control FPGA, the function generator, and the glitch generation circuit. The practical evaluations were not performed by the author of this dissertation but by his co-author Amir Moradi.

Figure 8.3: The block diagram of the experimental setup and the timing diagram of the relevant signals. Note in the figure: the START signal goes high 8 clock cycles before the first round of encryption and goes low 8 clock cycles after the last round [LSI09].

As stated previously, we use a similar approach for fault injection as in [LSG+10]. We first tried to generate the glitchy clock inside the control FPGA without using an external function generator, but the width of the glitchy clock could only be adjusted in large steps (e.g., of around 170ps [ESH+11]), which were not small enough for our purposes. Therefore, we had to use a programmable digital function generator to externally provide the precise clock frequencies. The external clock is fed into the SASEBO-R control FPGA where it is multiplied by a factor of 32 using a Digital Clock Manager (DCM) unit. This fast clock signal is then used together with some logic to shape the glitchy clock signal. An internal circuit controls the clock signal of the LSI to inject the glitchy clock at the preferred instant of time, synchronized to the AES computation of the target core. The block diagram of the setup and the timing diagram of the signals are presented in Fig. 8.3.

As presented in the following, we change the width of the glitchy clock in steps of 25ps to 5ps. The multiplication of the clock frequency is necessary because of the limitation of the function generator we have used (maximum frequency of 15 MHz), while the frequencies necessary to inject a fault in the combinatorial circuit range up to 300 MHz. Also, the DCMs inside the Virtex-II control FPGA of the SASEBO-R can, when fed with a low frequency input signal, only generate output frequencies up to 210 MHz. Since some of the cores require a higher frequency for fault injection, for these cores it was necessary to daisy chain two DCMs, one for generating a high frequency signal out of the function generator output and another one to reach the maximum supported output frequency, which can only be generated by a DCM using a high frequency input [XIL07]. Note that with this configuration care has to be taken that the DCMs actually lock after each frequency shift of the function generator because of the excessive jitter.

We should mention that we kept the core voltage of all LSIs at 1.1 V during the experiments shown here. The maximum clock frequencies drop significantly when decreasing the core voltage, e.g., to 0.6 V. Therefore, if the adversary is able to modify the core voltage in the target setup, the problem mentioned above due to the necessity of two daisy chained DCMs can be avoided, and the width of the glitchy clock can be modified in larger steps.
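The relation between the function generator setting and the glitch width can be illustrated with a short calculation. The sketch below assumes that the glitch width equals one period of the multiplied clock; the text only states that the fast clock is used "together with some logic" to shape the glitch, so this is a simplification, and the function names are hypothetical.

```python
MULTIPLIER = 32  # DCM multiplication factor stated in the text

def glitch_width_ps(f_gen_hz, multiplier=MULTIPLIER):
    """Period of the multiplied clock in picoseconds (assumed glitch width)."""
    return 1e12 / (f_gen_hz * multiplier)

def generator_frequency_hz(width_ps, multiplier=MULTIPLIER):
    """Function generator frequency needed for a given glitch width."""
    return 1e12 / (width_ps * multiplier)

# Example: a 5 MHz generator output becomes a 160 MHz clock, i.e. 6250 ps.
width = glitch_width_ps(5e6)
# Shortening the glitch by 10 ps requires raising the generator frequency
# by roughly 8 kHz under this assumption.
step_hz = generator_frequency_hz(width - 10.0) - 5e6
```

Under this assumption, picosecond-scale steps of the glitch width translate into kilohertz-scale frequency steps at the function generator.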

In the evaluated core, 16 instances of the S-box are implemented in parallel to perform the complete SubBytes operation in a single clock cycle (see Fig. 8.4 as the general architecture of the encryption datapath). The core realizes a round based architecture, i.e., S-boxes and MixColumns are performed consecutively during each clock cycle except for the last round where MixColumns is absent [LSI09]. Therefore, extracting the timing characteristics of the S-boxes in the first 9 rounds is not easily possible, and one needs

Figure 8.4: General architecture of the encryption datapath of the attacked AES cores (taken from [LSI09,LSI10])

to inject faults by shortening the width of the clock glitches in the last round, when the target core only computes the ShiftRows and SubBytes operations followed by the final key addition and the result is stored in a 128-bit register (similar scheme as used in [LSG+10]). In addition, one can see from the design architecture of the core (Fig. 8.4) that the round key of the last round is already computed in the previous round and is stored in the key register. The glitchy clock at the last round, hence, does not affect the key scheduling computations. Note that while the core supports both encryption and decryption functionality, the attack was performed when the core was operated in encryption mode.

In the following the result of the attack on the unprotected AES COMB core of the SASEBO LSI2 is presented. The S-boxes of that core have been implemented using a composite field approach [SMTM01]. As stated before, 16 separate S-box instances are active at the same time. Therefore, it is not possible to compare the timing characteristics of one S-box instance when processing, e.g., two values with different key bytes, which would be the ideal case for a Collision Timing Attack (CTA). Instead, the timing characteristics of different S-box instances must be compared, which may vary slightly because of different placement and routing even when based on the same netlist.

Since changing the width of the glitchy clock in our measurement setup requires resetting the DCM(s), we did not fix the plaintexts and collect the timing characteristics while reducing $\Delta t$. Instead, we opted to provide random plaintexts to the core at a fixed $\Delta t$, and collected the corresponding $BitCap^o_{b,\Delta t}$. This was repeated while shortening $\Delta t$ in steps of 10ps, after which the bitwise timing characteristics $T^o_b$ were extracted.

The $p^o_{b,\Delta t}$ of the LSB (i.e., $b = 0$) for a few output byte values of two S-box instances are depicted in Figure 8.5.
They were extracted from their corresponding bitwise captures when using 10 000 captures for each $\Delta t$. The bitwise timing characteristics $T^o_{b=0}$ of these two S-box instances are shown in Figure 8.6. Note that they were obtained by defining $p_{TH} = 0.1$ (see Fig. 8.5). Since the $\Delta t$ values differ across the output values of both S-box instances, this demonstrates the dependency between the timing characteristics and the processed values.

Figure 8.5: Some $p^o_{b=0,\Delta t}$ curves for S-box instances no. (left) 0 and (right) 4 of AES Comp (130nm). Axes: probability (0.1 to 0.7) over $\Delta t$ [ps] (6300 down to 5100).

Figure 8.6: Bitwise timing characteristics $T^o_{b=0}$ of S-box instances no. (left) 0 and (right) 4 of AES Comp (130nm). Axes: $\Delta t$ [ps] (5000 to 6400) over output value (0 to 255).
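The extraction of the bitwise timing characteristics from the captures can be sketched as follows. This is a minimal illustration under two assumptions that go beyond the text: the captures for one fixed $\Delta t$ are available as (output byte, faulty bit) pairs, and "$p \approx p_{TH}$" is read as "the $\Delta t$ whose estimated probability is closest to $p_{TH}$". All function and parameter names are hypothetical.

```python
def estimate_probabilities(captures):
    """captures: iterable of (o, faulty) pairs recorded at one fixed glitch
    width, where faulty is 0 or 1 for the chosen output bit b.
    Returns a dict mapping each observed output byte o to its estimated p."""
    counts, faults = {}, {}
    for o, faulty in captures:
        counts[o] = counts.get(o, 0) + 1
        faults[o] = faults.get(o, 0) + faulty
    return {o: faults[o] / counts[o] for o in counts}

def timing_characteristic(p_by_dt, p_th=0.1):
    """p_by_dt: dict mapping a glitch width (ps) to the dict returned by
    estimate_probabilities for that width. Returns a dict o -> glitch width
    whose estimated probability is closest to the threshold p_th."""
    outputs = {o for probs in p_by_dt.values() for o in probs}
    result = {}
    for o in outputs:
        candidates = [(abs(probs[o] - p_th), dt)
                      for dt, probs in p_by_dt.items() if o in probs]
        result[o] = min(candidates)[1]
    return result

# Tiny usage example with made-up numbers.
p_by_dt = {6000: {0: 0.00, 1: 0.12}, 5990: {0: 0.09, 1: 0.50}}
t_b0 = timing_characteristic(p_by_dt)
```

With a finer $\Delta t$ grid one could interpolate between neighbouring glitch widths instead of picking the closest one; the sketch keeps the simpler choice.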

Computing $T^o_{b=0}$ of all S-box instances and performing the attack of Algorithm 1 allowed us to recover the key relations between all key bytes. Two exemplary attack results and the minimum number of required captures for a successful key relation recovery are shown in Fig. 8.7. Besides our choice of using the LSB in the attack, all other choices for $b$ equally allow extracting all key relations. Note that the interval of the used $\Delta t$, i.e., the step size used when shortening the width of the clock glitch, also influences how many captures are required.

When having a closer look at the timing characteristics depicted in Fig. 8.6, it becomes apparent that one $\Delta t$ is significantly different from all other cases. It corresponds to the zero value at the S-box input. It seems that, comparable to the case of SCA on multiplicative masking, where a zero value power model can be used to recover the secrets (cf. Section 2.4.2), a similar vulnerability exists in the context of FSA and composite field S-boxes. In fact, instead of mounting a full CTA we would have been able to recover the secret key bytes from observing the $T^o_b$ of each S-box instance separately. However, as shown in the full article, this property does not hold for the other cores realized by different S-boxes, and mounting our proposed attack is essential to reveal the secrets in other implementations.
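The zero-value shortcut mentioned above can be sketched for one S-box instance. Since $o = S(i) \oplus k$ and the AES S-box maps the zero input to 0x63, the output value with the outlying $\Delta t$ gives $k = o \oplus \mathtt{0x63}$. The outlier rule used here (largest distance from the median) and the function name are assumptions for illustration only.

```python
from statistics import median

SBOX_OF_ZERO = 0x63  # AES S-box output for input 0

def zero_value_key_byte(t):
    """t[o]: timing characteristic (ps) per ciphertext byte value o of one
    S-box instance. Returns the key byte implied by the strongest outlier."""
    med = median(t)
    outlier = max(range(256), key=lambda o: abs(t[o] - med))
    return outlier ^ SBOX_OF_ZERO

# Synthetic usage example: all values near 5800 ps, one clear outlier.
secret_k = 0xA7                                  # hypothetical key byte
timings = [5800.0 + (o % 7) * 10.0 for o in range(256)]
timings[SBOX_OF_ZERO ^ secret_k] = 5100.0        # zero S-box input
recovered = zero_value_key_byte(timings)
```

This only works where such an outlier exists, which the text reports for the composite field core but not for the other cores.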

8.5 Conclusion

We have presented a collision attack which efficiently utilizes the data-dependent timing characteristics of combinatorial circuits to reveal the secrets. While the attack is based on the idea of [LSG+10], it is significantly more powerful since it does not require any knowledge about the characteristics of the target combinatorial circuit.

We have presented how the proposed Collision Timing Attack (CTA) allows a complete key recovery on the AES COMB core of the SASEBO LSI2 in 130nm technology. In addition, we highlighted a very distinct zero value vulnerability which allows key recovery even without applying a CTA. CTA is powerful enough to break any of the available AES implementations on the SASEBO LSIs, including the ones applying SCA countermeasures, as demonstrated in the full version [MMP13]. Therefore, randomizing countermeasures alone cannot prevent data-dependent timing characteristics of combinatorial circuits, and such designs remain vulnerable to the attack introduced here. This implies the need for a special unit in designs, especially side-channel protected ones, which is able to detect clock glitches to thwart this kind of attack.

Figure 8.7: Result of the attack on the last round of AES Comp (130nm) recovering $\Delta k$ between key bytes (top to bottom) (0,1) and (0,2), (left) using 10 000 captures and (right) over the number of captures. Axes of the left panels: correlation (−0.3 to 0.9) over $\Delta k$ (0 to 255).

Chapter 9

Zero-Value Fault Sensitivity Analysis

Previous works have shown that the combinatorial path delay of a cryptographic function, e.g., the AES S-box, depends on its input value. Since the relation between critical path delay and input value seems to be relatively random and highly dependent on the routing of the circuit, up to now only template or some collision attacks could reliably extract the used secret key of implementations not protected against fault attacks. Here we present a new attack which is based on the fact that, because of the zero-to-zero mapping of the AES S-box inversion circuit, the critical path when processing the zero input is notably shorter than for all other inputs. Applying the attack to an AES design protected by a state-of-the-art fault detection scheme, we are able to fully recover the secret key in less than eight hours. Note that we neither require a known key measurement step (template case) nor a high similarity between different S-box instances (collision case). The only information gathered from the device is whether a fault occurred when processing a chosen plaintext.

Contents of this Chapter
9.1 Introduction
9.2 Designing an Architecture for Evaluation
9.3 Zero-Value Fault Sensitivity Analysis (ZV-FSA)
9.4 Evaluation of the Attack
9.5 Conclusion


9.1 Introduction

In Chapter 7 we have already given the general motivation for our interest in Fault Sensitivity Analysis (FSA) and introduced the original scheme. Additional motivation is stated below.

In [MMP+11b] two improvements on FSA were presented. Both proposals allowed the recovery of the linear difference between key bytes of the analyzed ASIC cores by utilizing the principles of Correlation-Enhanced Collision Attacks first presented in [MME10]. One of the improvements was presented in compressed form in Chapter 8; the other one requires the analysis of the distribution of faulty ciphertext bytes.

Both presented improvements rely on the similarity of different S-box instances in the targeted ASIC implementations, which is usually not given in FPGA-based implementations. In addition, if a fault detection scheme is considered in the implementation, these and other FSA attacks are infeasible since the required byte-wise fault information or faulty outputs are no longer available from the circuit.

We therefore require a different attack technique, and in this chapter we demonstrate how a fault detection scheme can be circumvented by our proposed Zero-Value Fault Sensitivity Analysis (ZV-FSA) if the S-boxes are realized using a composite-field approach. The main reason for this to be possible is a very distinct zero value vulnerability when looking at the critical path details of those S-box implementation techniques.

9.2 Designing an Architecture for Evaluation

A major goal of our evaluation architecture is to test different countermeasures using exactly the same hardware instance. The invariance-based design of [GK12] has been reported to provide almost complete fault coverage under different models, hence we adopted their design (depicted in Fig. 7.1(f)) as the base of our AES implementation. The evaluation architecture used in this work is depicted in Fig. 9.1.

The used abbreviations are: SR for ShiftRows, SB for SubBytes, and MC for MixColumns. P denotes a Permutation, which is a slightly differently designed module compared to that proposed in [GK12]. As shown by Fig. 7.1(f), the original design contains a multiplexer to skip the permutation step. The permutation step itself was designed as a fixed column-based permutation. It is stated in [GK12] that only three permutations, denoted as P1, P2, and P3, are possible if random inputs are given.

Each round is computed in two clock cycles. In the first clock cycle, the plaintext saved in register RegX in a prior round passes through the AES data path without entering the permutation, and the result of this operation is stored in RegY. In the following we call this cycle the computation step. In the second clock cycle, the data path including the permutation is used. Simultaneously with storing the computation result in RegX for the next round, the result is also compared to the previous result stored in RegY (denoted as checking step in the following). In case they differ, an exception is raised since a fault has been detected.

Figure 9.1: (a) Architecture of our evaluation circuit; (b) computation step; (c) checking step
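The two-step round described above can be mimicked by a small toy model. It is not the circuit of Fig. 9.1: `column_round` is a hypothetical stand-in for the per-column work of one round, the way results are routed back to their home position before the comparison is an assumption, and a fault is modelled as a single flipped bit in one hardware instance.

```python
def column_round(col, key_col):
    """Hypothetical stand-in for the per-column round computation."""
    return tuple(((b * 7 + 3) ^ k) & 0xFF for b, k in zip(col, key_col))

def datapath(cols, key_cols, perm=(0, 1, 2, 3), faulty_instance=None):
    """Hardware instance h processes column perm[h]; each result is routed
    back to the home position of its column."""
    out = [None] * 4
    for h, c in enumerate(perm):
        res = column_round(cols[c], key_cols[c])
        if h == faulty_instance:
            res = (res[0] ^ 0x01,) + res[1:]   # single-bit fault in instance h
        out[c] = res
    return out

def ced_round(cols, key_cols, perm_check, fault_compute=None, fault_check=None):
    """Returns the round result, or None if the checking step detects a fault."""
    reg_y = datapath(cols, key_cols, faulty_instance=fault_compute)   # computation step
    reg_x = datapath(cols, key_cols, perm_check, fault_check)         # checking step
    return reg_x if reg_x == reg_y else None

cols = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
keys = [(0x10, 0x20, 0x30, 0x40)] * 4
ok = ced_round(cols, keys, perm_check=(1, 2, 3, 0))
detected = ced_round(cols, keys, perm_check=(1, 2, 3, 0), fault_compute=2)
```

In this toy model a fault that hits the same instance in both steps goes unnoticed when `perm_check` is the identity, but is flagged when the columns are rotated, because the faulty instance then processes a different column in the checking step.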

Note that for our design (Fig. 9.1) we slightly modify this approach by moving the ShiftRows operation to the beginning of the data path. Since SubBytes works solely on single bytes, P only operates on full columns, and ShiftRows is simply a fixed wire permutation, this is possible without impact on area or performance. However, this simple but beneficial modification makes it possible to implement P not as a fixed permutation but as a dynamic switch matrix which allows swapping columns in any way and differently in each clock cycle. The number of possible valid permutations is thereby increased to 4! = 24, raising the complexity of an attack compared to the original countermeasure as stated in [GK12].

Since P can be adapted on-the-fly as part of the data path, we are able to imitate the behavior of several different CED schemes. If P is configured to operate as pass-through in both clock cycles of a round, this leads to a CED implementing time redundancy. If it operates as pass-through in the first clock cycle but performs a cyclic right shift of the columns in the second clock cycle, the scheme is identical to the invariance-based hybrid redundancy scheme of [GK12] (using the proposed permutation P1). Finally, by randomly selecting the column order in both clock cycles we can implement a type of

Figure 9.2: Timing behavior of the used AES S-box [Can05], panels (a), (b), and (c)

shuffling countermeasure which is sometimes applied to thwart side-channel attacks¹. This also allows us to analyze possible side effects of combining fault and side-channel countermeasures. Note that, if a different permutation for each clock cycle is selected and when looking at a single column, there exist only four different possibilities for which column will be processed by a certain hardware instance. This still significantly increases the difficulty for an attacker since there is only a 25% chance that the target column is processed by the target instance, while in the original scheme the column order for both the computation and checking step is fixed, resulting in a 100% chance for the attacker.
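The counts quoted above (24 column orders, a 25% chance per clock cycle) can be checked by enumeration. The sketch assumes that every column order is selected with equal probability, which the text does not state explicitly; the names are illustrative.

```python
from itertools import permutations

# order[h] is the column handed to hardware instance h in one clock cycle.
column_orders = list(permutations(range(4)))

def hit_probability(target_instance, target_column):
    """Fraction of column orders in which the target instance processes
    the target column, assuming uniformly chosen orders."""
    hits = sum(1 for order in column_orders
               if order[target_instance] == target_column)
    return hits / len(column_orders)

number_of_orders = len(column_orders)
chance = hit_probability(target_instance=0, target_column=2)
```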

9.3 Zero-Value Fault Sensitivity Analysis (ZV-FSA)

In the following we will briefly restate the input-dependent timing behavior of the AES S-box when implemented in combinatorial logic. Based on this, we show the zero-value vulnerability and describe the corresponding methodology to fully recover the secret key of our round-based AES design which is equipped with a state-of-the-art fault detection scheme as described in Section 9.2.

¹ Here, since our implementation realizes a round-based architecture and all S-boxes are performed at the same time, by shuffling we mean randomly assigning different S-box instances (resp. MixColumns) to the cipher state bytes.

9.3.1 Zero-Value Vulnerability

In [LSG+10] an input-dependent timing behavior of the AES S-box was first observed and exploited as Fault Sensitivity Analysis (FSA). Their results were confirmed in [MMP13] by post place-and-route simulation. In the following, we review these findings by simulations based on the S-box of [Can05] as part of our architecture. The results shown here are based on the Virtex-5 LX50 FPGA of our target platform.

Figure 9.2(a) depicts the timing behavior until the S-box output becomes stable after a given input switches to three different exemplary values. First, it confirms that the critical path delay is highly dependent on the new input given to the circuit. However, considering the results in Fig. 9.2(b) for switching back and forth, it can be observed that not only the next state but also the previous (stable) state of the S-box influences the timing behavior of the circuit².

This shows that an attack model that takes both the next input and the previous state into account should be superior to one which relies only on either the next input or the previous state. Another important observation on the timing behavior is depicted by Fig. 9.2(c). It can be seen that the time required to have a stable output if the S-box input switches to the zero value is notably shorter compared to the other cases, which was also discovered in [MMP+11b] and [MP13]. In fact, this is a variant of the zero-value vulnerability first described in [GT02]. In that work it was noticed that, when using multiplicative masking as a countermeasure to thwart differential power analysis, the S-box circuit consumes notably less power when processing the zero input compared to all other cases since the zero value cannot be masked by multiplication. The same effect can be observed even if no masking countermeasure is applied to the AES S-box if the implementation is based on composite field arithmetic (as is the case in our designs). This is because during the inversion part of the S-box (at least) one operand of all multiplications is zero [MOP07].

The zero-to-zero mapping of the S-box inversion circuit is a special case: multiplications by zero complete quickly, hence the critical path is particularly short. This distinctive feature is obviously easier to exploit than the complex models based on the previous state and the next value. Therefore, we will now investigate the resistance of our CED design against an FSA-like attack using this zero-value attack model.
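The zero-to-zero mapping of the inversion can be reproduced with a plain software model of the AES S-box. The sketch below computes the inverse in GF(2^8) by exponentiation followed by the affine transformation; it is not the composite field circuit of [Can05] and says nothing about timing, it only shows that the zero input is the one value the inversion leaves at zero.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a):
    """Inverse as a^254; by construction the zero input is mapped to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sbox(x):
    """AES S-box: field inversion followed by the affine transformation."""
    y = gf_inv(x)
    res = 0
    for i in range(8):
        bit = ((y >> i) ^ (y >> ((i + 4) % 8)) ^ (y >> ((i + 5) % 8))
               ^ (y >> ((i + 6) % 8)) ^ (y >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        res |= bit << i
    return res

zero_inverse = gf_inv(0)      # stays zero
zero_output = sbox(0)         # only the affine constant remains
```

For the zero input every product inside the inversion has a zero operand, which is the software counterpart of the short critical path discussed above.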

² It is, in fact, the same concept used in power analysis attacks when Hamming Distance (HD) is applied to model the leakage associated to a sequential circuit.


9.3.2 Attack Methodology

Notations: Let us suppose that a clock glitch with length $\Delta t$ is injected into the target device while it processes $n$ plaintexts $X^{i \in \{1,\ldots,n\}} = (x^i_1, \ldots, x^i_{16})$. Let also $f^i_{\Delta t}$ denote a random variable indicating whether a fault is detected while $X^i$ was being processed. Now we define the total error rate as

$$p_{\Delta t,\mathrm{total}} = \frac{\sum_i f^i_{\Delta t}}{n}.$$

We also define the local error rate associated to plaintext byte $j$ as

$$p_{\Delta t,j}(x) = \frac{\sum_{i \in S^x_j} f^i_{\Delta t}}{|S^x_j|},$$

where $S^x_j$ stands for the set of plaintexts whose $j$-th byte with $j \in \{1,\ldots,16\}$ is equal to $x$: $S^x_j = \{i \mid x^i_j = x\}$. For the sake of simplicity we omit the $\Delta t$ term in later specification.

Attack Description: The target of our attack is the first computation step in the first AES round. For clarity, we explicitly highlighted the data path in Fig. 9.1(b) in black. We assume that the critical path for each S-box is minimal whenever it processes a zero input, hence our final goal is to find the 16-byte plaintext $X$ that forces all S-box inputs to zero in the target clock cycle. While gradually shortening the injected clock glitch $\Delta t$, this byte configuration $X$ will be the one which remains unaffected for the longest time. Since the S-box inputs of the first round are computed by XORing the plaintext bytes $x_j$ with the corresponding key bytes $k_j$, this implies that the plaintext bytes must be equal to the key bytes: $\forall j,\ x_j = k_j$.

Note that it is not possible to directly measure the influence of each plaintext byte for a single computation $i$. If a fault is caused by the clock glitch during the computation on plaintext $X^i$, we can only deduce that at least one of the 16 byte values $x^i_j$ caused the faulty execution, and is therefore not one of the candidates for the key bytes $k_j$. However, by repeating the experiments for a considerably large $n$, we can observe the local error rates $p_j$ for each byte and discard those key candidates $k_j$ which exhibit a significantly high local error rate ($p_j \gg p_{\mathrm{total}}$). The rationale behind this decision is that, if a local error rate $p_j$ is high compared to other values and the corresponding $k_j$ therefore appeared in a larger number of faulty $X^i$, it can be deduced that the critical path length of the candidate $k_j$ is longer than the average. It can therefore not be the correct candidate, which should have one of the shortest critical path lengths. This is exactly the strategy that we follow as described in Algorithm 2. We can iteratively restrict the key space until only the correct key remains. The number of required iteration

Figure 9.3: Block diagram of the experimental setup and timing diagram of the core clock

steps is influenced by the chosen threshold $\epsilon$, which determines how many values are discarded from the set of remaining key candidates $K_j$ (Algorithm 2, line 11). The smaller $\epsilon$, the more key candidates are discarded per iteration. Choosing a smaller $\epsilon$ also increases the chance to falsely exclude a correct key candidate $k_j$. In case the correct key candidate for a byte $j$ has been excluded in a previous run, Algorithm 2 will later discard all other candidates in $K_j$ and run Algorithm 3 to recover a valid set $K_j$ as an intermediate step (line 15). While this situation is unlikely and was not observed during our evaluations, if for some reason a valid intermediate set $K_j$ cannot be recovered, or the recovered full key after completion of Algorithm 2 is not the correct one, then, analogous to a power analysis attack, the most likely candidates for each byte $j$ could be ranked according to the order in which they were excluded. This would allow an efficient brute force attack assuming that the number of uncertain sets $K_j$ or their size is sufficiently small.
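The error rates defined above and one exclusion step can be sketched as follows. This is not Algorithm 2 of the thesis: the exclusion rule (drop a candidate whose local error rate lies more than a fraction `eps` of the way from the total rate to the maximum local rate) is a hypothetical choice that merely reproduces the stated behaviour that a smaller threshold discards more candidates. Byte positions are 0-based here, while the text counts from 1.

```python
def total_error_rate(faults):
    """faults[i] is 1 if a fault was detected for plaintext i, else 0."""
    return sum(faults) / len(faults)

def local_error_rates(plaintexts, faults, j):
    """Local error rate p_j(x) for every value x observed at byte position j."""
    hits, counts = {}, {}
    for x_i, f in zip(plaintexts, faults):
        x = x_i[j]
        counts[x] = counts.get(x, 0) + 1
        hits[x] = hits.get(x, 0) + f
    return {x: hits[x] / counts[x] for x in counts}

def exclude_candidates(candidates, rates, p_total, eps):
    """Hypothetical rule: keep x only if its local error rate does not exceed
    p_total + eps * (p_max - p_total)."""
    p_max = max(rates.values())
    limit = p_total + eps * (p_max - p_total)
    return {x for x in candidates if rates.get(x, 0.0) <= limit}

# Tiny usage example with made-up observations for byte position 0.
plaintexts = [bytes([1, 0]), bytes([1, 0]), bytes([2, 0]), bytes([3, 0])]
faults = [1, 1, 0, 0]
p_total = total_error_rate(faults)
rates = local_error_rates(plaintexts, faults, j=0)
remaining = exclude_candidates({1, 2, 3}, rates, p_total, eps=0.5)
```

In a real run this step would be repeated per byte position with fresh plaintexts drawn only from the remaining candidate sets.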

9.4 Evaluation of the Attack

In order to practically verify the feasibility of our attack we have chosen a SASEBO-GII [sas] as evaluation platform. We implemented our fault-protected AES architecture in the Virtex-5 (XC5VLX50) FPGA on the evaluation platform, and injected faults by clock glitches generated with an Agilent 33521A Function Generator. Since this function generator only supplies signals with a maximum frequency of 30 MHz, we feed this clock signal to the control FPGA of the SASEBO-GII to increase its frequency using the clock multiplier of the digital clock manager (DCM). The control FPGA supplies the clock line to the target FPGA and can, at a desired clock cycle, create a clock glitch by switching between a slow clock and the faster clock originating from the function generator. The complete setup is depicted in Fig. 9.3. In total we have analyzed three different settings of the implemented CED architecture given in Section 9.2, all requiring two clock cycles to compute a single AES round.

• In profile 1 we have configured the permutation matrix P to act as pass-through, i.e., no actual permutation is taking place in any clock cycle, mimicking a simple time redundancy CED scheme. We use this design only for exploration purposes.

• During the evaluation of profile 2, which is the main focus of our work, the design acts equivalently to the invariance-based CED, where P1 (defined in [GK12]) is used as the permutation matrix P.

• Finally, for profile 3 we configure P to randomize the order of the columns during every clock cycle of the AES computation, i.e., in both the computation and the checking step.

Note that the CED schemes implemented here all aim at detecting faults and, by subsequently zeroing the output, try to make it impossible for an attacker to use the faulty output in, e.g., differential fault analysis. When using an FSA-based technique, like in our case, we do not require the faulty output. Just the information that a fault occurred is sufficient to extract the secret. Also note that the CED schemes of profile 1 and profile 2 are practically equivalent during the computation step, which is the target of our fault injection. In other words, in all of the experiments shown below we injected the faults by a clock glitch during the first computation step, i.e., when RegY stores the result of the first cipher round (see Fig. 9.1). Therefore, the checking step of all our design profiles has no impact on the efficiency of the performed attack.

9.4.1 Profile 1: Time redundancy-based CED

In order to practically verify the results achieved by simulation (as given in Fig. 9.2), we performed some experiments on profile 1.

By setting all plaintext bytes $x_j$ to the correct key bytes $k_j$ we could confirm that we are still able to inject a fault, but only with a very short $\Delta t$. By sending plaintexts $X^i$, where only one plaintext byte $x^i_{j^*}$ is random and the others are still equal to the correct key bytes ($x^i_{j \neq j^*} = k_j$), we are able to observe the local error rates $p_{j^*}$ for all possible values $x_{j^*}$.

Figure 9.4 shows the local error rates $p_8$ when shortening $\Delta t$ from 17ns to 12ns in steps of 250ps. One can see that the local error rates for each input usually rise from 0% to 100% over a $\Delta t$ span of 1.5ns to 2ns and are distributed over the whole tested spectrum. While this allowed us to clearly discern the order in which each plaintext value gets faulty when shortening $\Delta t$ for S-box 8, examining the same for other S-boxes showed dissimilarity between the local error rates as well as the critical path delays of different S-boxes. Because of routing differences the S-boxes are also too different from each other to mount, e.g., a collision attack. However, more importantly, we could verify that when the S-box input equals zero, faults can be induced only with a very short $\Delta t$.

Figure 9.4: Evolution of local error rates of S-box 8 while shortening the clock glitch length $\Delta t$ on profile 1. The local error rate of the correct key candidate is highlighted in black.

9.4.2 Profile 2: Invariance-based CED

We now perform the complete key recovery attack described in Section 9.3 on the invariance-based CED scheme. Note that we do not force resets on any registers, therefore increasing the difficulty of the attack by accepting somewhat randomly precharged values on the registers during the attack execution.

Figure 9.5(a) shows the local error rates of all 16 S-boxes during the first exclusion round of Algorithm 2 where the key space has not been restricted yet. It can clearly be seen that the S-box computing on the $x^i_{10}$'s has the highest impact on the observed local error rates. A $p_{10}(k)$ of 90% here means that 90% of the $X^i$ where $x^i_{10} = k$ were faulty and therefore candidate $k$ is unlikely to be the correct key byte. Redoing the same experiment without restricting the key space and computing Pearson's correlation coefficient between the local error rates of each S-box (Fig. 9.5(b)) shows that only S-box 10 and, to a smaller degree, S-box 12 have an impact on the observed local error rates. This is the reason why we exclude improbable key candidates based on the observed mean and maximum local error rates. A different straightforward choice would be to define a threshold $\epsilon$ (Algorithm 2, line 11) for key exclusion individually for each S-box. In that case we would falsely discard key candidates based on local error rates which were caused not by the critical path delay of this S-box but by that of others.

Note that the reason why only S-box 10 and S-box 12 are having an impact on the error rates initially is that their critical path is longer compared to other S-box instances and they therefore are affected by the shortened clock glitches first. Once the wrong key candidates corresponding to the longer critical path delays have been excluded, the other S-boxes start to get affected by the faults as well.


Figure 9.5: (a) Local error rates computed by the first round of Algorithm 2 on profile 2 with S-box 10 highlighted and (b) correlation between local error rates obtained by two runs of the first round of Algorithm 2 (with no prior excluded candidates)


Figure 9.6: Remaining key space after each exclusion round; (a) per S-box, (b) for the complete 128-bit key (profile 2)

Figure 9.6 depicts the remaining key space after each exclusion round of Algorithm 2 both per S-box (Fig. 9.6(a)) and for the complete AES key (Fig. 9.6(b)). After 58 iterations of Algorithm 2, the complete secret could be extracted.

Each exclusion round was completed in about 8 minutes, which includes sending each plaintext $X^i$ to the target and receiving the response over the SASEBO-GII UART connection from the control FPGA. The time for a complete recovery could be reduced either by creating the $X^i$ of each round on the control FPGA to reduce the communication bottleneck or by reducing n (Algorithm 2, line 6) for further iterations of the algorithm. The parameter n could be chosen based on the number of remaining key candidates per S-box.


Figure 9.7: (a), (b) local error rates obtained by running Algorithm 3 on profile 3 for (a) S-box 6 and (b) S-box 10; (c) Correlation-Enhanced Collision Attack recovering the correct ∆k = k6 ⊕ k10

9.4.3 Profile 3: Randomized permutations

Randomizing the position of the columns during every clock cycle obviously complicates the attack. Now each plaintext byte will randomly be processed by one of four possible S-box instances (since the row index of each byte remains constant). Because the local error rates no longer depend on a specific S-box instance but are potentially influenced by four instances, it is harder to exclude keys: a high local error rate of a candidate $k_j$ in one S-box instance can be mitigated by lower local error rates in the other instances.
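The averaging effect described above can be illustrated with a small sketch. It assumes, purely for illustration, that each of the four S-box instances of a row is selected with equal probability; the function name `observed_error_rate` and the error rates are hypothetical and not measured values.

```python
def observed_error_rate(per_instance_rates):
    """Expected local error rate of one candidate when the processing S-box
    instance is drawn uniformly at random (an assumption of this sketch)."""
    return sum(per_instance_rates) / len(per_instance_rates)

# Hypothetical rates of one candidate in the four instances of a row:
# clearly faulty in one instance, barely affected in the other three.
instance_rates = [0.9, 0.1, 0.1, 0.1]
fixed_mapping_peak = max(instance_rates)        # visible without shuffling
shuffled = observed_error_rate(instance_rates)  # visible with shuffling
```

In this toy case the rate that stands out at 0.9 with a fixed mapping drops to roughly 0.3 under shuffling, which makes the candidate harder to separate with a threshold.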

Because of this randomization we do not see different timing characteristics for each S-box. The observed behavior of each row of the cipher state will basically be the same since the used S-box instances in each row of the state are randomized for every computation. By iterating Algorithm 2 we can still discard unlikely key candidates, and once the candidates with the highest local error rates are excluded, we can skip to Algorithm 3 to retrieve local error rates for each S-box.

Figure 9.7(a) depicts the local error rates $p_6$ gathered by performing Algorithm 3 after the remaining key space $K_j$ has been reduced to 32 candidates per S-box on average. The same is shown for $p_{10}$ in Fig. 9.7(b). While we cannot directly recover the corresponding key bytes from these measurements, it is possible to recover the linear difference between the correct key bytes. We used the same technique as [MME10], but on local error rates instead of mean power traces. In this setting, in order to recover the correct $\Delta k = k_6 \oplus k_{10}$, the correlation between $p_6(k)$, $\forall k \in \{0,1\}^8$, and $p_{10}(k \oplus \Delta k)$ is estimated. The result, shown in Fig. 9.7(c), indicates the efficiency of the scheme. This way we can recover all $\Delta k$ of each row, reducing the remaining key space to $2^8$ per row and to $2^{32}$ in total.
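The collision step on local error rates can be sketched as follows. This is a minimal illustration, not the evaluation code of this work: the error-rate profile `toy_profile` is synthetic, the key bytes are arbitrary example values, and the helper names are hypothetical. The only property taken from the text is that the shifted error-rate vector of one S-box is correlated with the vector of the other for every candidate difference.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def recover_delta(p_a, p_b):
    """Correlate p_a(k) with p_b(k XOR d) for every difference d and
    return the best-scoring d together with all 256 scores."""
    scores = [pearson(p_a, [p_b[k ^ d] for k in range(256)])
              for d in range(256)]
    return max(range(256), key=lambda d: scores[d]), scores

def toy_profile(v):
    """Synthetic error rate for S-box input v; the zero value is the least
    faulty, mimicking its short critical path (assumption of this sketch)."""
    return 0.05 if v == 0 else 0.3 + 0.5 * ((v * 37) % 256) / 255

k6, k10 = 0x2B, 0xC4  # arbitrary example key bytes
p6 = [toy_profile(x ^ k6) for x in range(256)]
p10 = [toy_profile(x ^ k10) for x in range(256)]
delta, scores = recover_delta(p6, p10)
```

With identical synthetic profiles for both S-boxes, the score peaks at the XOR of the two example key bytes. Once the three differences of a row are known, one free byte per row remains, which corresponds to the $2^8$ candidates per row and $2^{32}$ in total stated above.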


9.5 Conclusion

We have presented a case study of a fault-protected AES encryption engine where S-boxes are realized as combinatorial circuits. We exploit the fact that the critical path delay of these circuits depends on their input and is especially short when the input is the zero value. Contrary to some previous work based on collisions, we do not require that the S-box instances exhibit a high similarity to each other. We were able to extract the secret key from the circuit based on the concept of an enhanced fault sensitivity analysis, even though the applied CED scheme usually renders FSA-like attacks infeasible. The attack works whenever the implemented circuit performs a normal AES encryption round followed by a checking operation in a different clock cycle. This is for example the case if an invariance-based CED [GK12] is used or one utilizing time-redundancy. The time required to completely recover the key of our design is less than eight hours. In our case the attack speed could be increased significantly by improving the measurement setup, since the currently used slow UART connection to request each AES execution is the bottleneck of the attack flow. This would not help, however, if the actual computation is the bottleneck, as can happen when the target is a smart card. We have not verified the attack against other CED schemes like those computing the inverse of a previous round in parallel. However, previous work [MMP13] has shown that some information can be recovered if the targeted part of the circuit is on the critical path and is thereby affected by the attack first. We have also shown that in case a countermeasure like column shuffling is implemented to thwart fault or power analysis attacks, this might even increase the susceptibility of the implementation to successful attacks.
Since the error rates of S-boxes processing the same row are unified and averaged by randomization, collision attacks become easily possible, while they were not feasible before due to the different S-box layouts. However, if the evaluated circuit were equipped with a side-channel countermeasure such as Boolean masking, this would substantially decrease an attacker's success rate with the proposed fault attack, since the zero value vulnerability would be mitigated by most masking schemes. This again demonstrates that countermeasures against different attacks should not be regarded independently of each other and that potential interactions should be carefully analyzed.


Algorithm 2: Fault Sensitivity Analysis Exploiting Zero Values

Output: Key candidate sets $K_j$, $j \in \{1,\dots,16\}$, forming 16-byte secret key(s)

 1  Choose $\Delta t$ to have a low total error rate ($0.1 < p_{total} < 0.2$)
 2  for $j \in \{1,\dots,16\}$ do
 3      $K_j \leftarrow \forall k \in \{0,1\}^8$   /* initializing the sets */
 4  end
 5  while $\exists j, |K_j| > 1$ do
 6      Select $n = 25,000$ random plaintexts $X^{i \in \{1,\dots,n\}}$ as $x_j^i \in K_j$
 7      Send all $n$ plaintexts to the target device while a clock glitch ($\Delta t$) is injected in the first computation of round 1
 8      Compute the local error rates $p_j(k)$, $\forall j$, $\forall k \in K_j$
 9      $\mu = \frac{\sum_j \sum_{k \in K_j} p_j(k)}{\sum_j |K_j|}$   /* overall average */
10      Find $max = p_{j^*}(k^*)$ as $\forall j, \forall k \in K_j$, $p_{j^*}(k^*) \geq p_j(k)$   /* overall maximum */
11      $\lambda = \mu + \frac{max - \mu}{3}$   /* threshold */
12      for $j \in \{1,\dots,16\}$ do
13          $K_j \leftarrow \{k \in K_j \mid p_j(k) < \lambda\}$   /* candidates with low error rate */
14          if $K_j = \emptyset$ then
15              $K_j \leftarrow$ Algorithm 3
16          end
17      end
18      if $\mu < 0.1$ then
19          Shorten $\Delta t$ (nominally by 125 ps)
20      end
21  end
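The exclusion step in lines 9 to 13 of Algorithm 2 can be sketched in a few lines. This is an illustrative sketch only: the function name `exclusion_round`, the data layout, and the toy error rates are assumptions, and the fallback to Algorithm 3 for an emptied candidate set is deliberately left out.

```python
def exclusion_round(rates, divisor=3):
    """One exclusion step over rates = {j: {k: p_j(k)}}.

    Returns the reduced candidate sets and the threshold lambda, computed
    from the overall mean and the overall maximum of all local error rates.
    """
    all_rates = [p for per_sbox in rates.values() for p in per_sbox.values()]
    mu = sum(all_rates) / len(all_rates)  # overall average
    peak = max(all_rates)                 # overall maximum
    lam = mu + (peak - mu) / divisor      # threshold
    kept = {j: {k for k, p in per_sbox.items() if p < lam}
            for j, per_sbox in rates.items()}
    return kept, lam

# Toy rates (not measured): S-box 10 is already affected, S-box 12 is not.
toy_rates = {
    10: {0: 0.9, 1: 0.1, 2: 0.1},
    12: {0: 0.1, 1: 0.1, 2: 0.1},
}
kept, lam = exclusion_round(toy_rates)
```

Because the threshold is global, candidate 0 is removed for S-box 10 while no candidate of the still unaffected S-box 12 is discarded, which is the behavior motivated in Section 9.4.2.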


Algorithm 3: Recover Key Candidates

Input : $p_{j \in \{1,\dots,16\}}(k \in K_j)$, local error rates of candidates for the plaintext bytes $j$; $j^*$, the index of the empty candidate list $K_{j^*} = \emptyset$; $\lambda$, the threshold
Output: $K_{j^*}$, the new set of candidates for the plaintext byte $j^*$

 1  for $\forall j \neq j^*$ do
 2      $k_j = \arg\min_k p_j(k)$   /* the most probable candidates */
 3  end
 4  Select $n = 25,000$ random plaintexts $X^{i \in \{1,\dots,n\}}$ as $x_{j^*}^i \in \{0,1\}^8$ and $\forall j \neq j^*$, $x_j^i = k_j$
 5  Send all $n$ plaintexts to the target device while a clock glitch ($\Delta t$) is injected
 6  Compute the local error rates $p_{j^*}(k)$, $\forall k \in \{0,1\}^8$
 7  $K_{j^*} \leftarrow \{k \mid p_{j^*}(k) < \lambda\}$
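The plaintext selection in lines 1 to 4 of Algorithm 3 can be sketched as follows. The function name `algorithm3_plaintexts`, the zero-based byte indices, and the toy rates are assumptions made for this illustration; the sketch only builds the plaintexts and does not model the glitch injection or the target device.

```python
import random

def algorithm3_plaintexts(rates, target, n, rng):
    """Build n plaintexts in which byte `target` is uniformly random and
    every other byte is fixed to its candidate with the lowest local error
    rate. rates = {j: {k: p_j(k)}} must cover all j != target in 0..15."""
    fixed = {j: min(per_sbox, key=per_sbox.get)
             for j, per_sbox in rates.items() if j != target}
    plaintexts = []
    for _ in range(n):
        x = bytearray(16)
        for j in range(16):
            x[j] = rng.randrange(256) if j == target else fixed[j]
        plaintexts.append(bytes(x))
    return plaintexts

# Toy rates (not measured): candidate 7 has the lowest error rate everywhere.
toy_rates = {j: {5: 0.2, 7: 0.05} for j in range(16)}
pts = algorithm3_plaintexts(toy_rates, target=3, n=100, rng=random.Random(0))
```

Fixing all other bytes to their most probable candidates keeps the remaining S-boxes close to their least faulty input, so the observed faults are dominated by the byte under test. The toy run uses n = 100 instead of the 25,000 plaintexts of the algorithm.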

Part III

Conclusion

Chapter 10

Conclusion

This chapter summarizes our results and discusses open topics and guidelines for future work.

In this thesis we have first taken a closer look at different countermeasures and their effectiveness in protecting hardware implementations of AES against side-channel attacks. Following this, we have introduced some variants of Fault Sensitivity Analysis (FSA), which are able to overcome not only the analyzed SCA countermeasures but also a fault-attack-protected AES implementation if the S-box is implemented using a composite field approach.

In the SCA Part I, we focused mainly on algorithmic countermeasures which are not restricted to ASIC implementations but can be realized in reconfigurable hardware. This allowed easy prototyping on different specialized FPGA side-channel analysis platforms, i.e., SASEBOs, to verify the suitability and security of a countermeasure on real hardware. This is especially important since simulation-based security evaluations are hindered by the accuracy of the simulation software, the computation time required for meaningful numbers of traces, or the fact that some complex interactions of measurements on real hardware cannot be simulated.

In Chapter 3, we have analyzed the resistance of a Boolean masked S-box against the Correlation-Enhanced Collision Attack (CECA). Our results have shown that, although Boolean masking is based on a mathematically sound concept, its straightforward implementation is unable to provide the required level of protection. The reason for this, which is well known in the community, is the occurrence of glitches in the underlying gates of the hardware implementation. The security of the implementation can be increased by additionally integrating hiding techniques such as shuffling or unrolling. However, while the number of traces required to extract the secrets can be increased significantly, the underlying flaw of the scheme, namely not considering glitches in hardware gates, cannot be mitigated completely.

By manually mapping the first-order masked S-box of Canright to the available hardware resources of the FPGA evaluation platform, we created several implementations to further analyze the influence of glitches. The results of our evaluations in Chapter 4 showed that, if the propagation and occurrence of glitches is completely avoided by special enable stages and pipeline registers, the masking countermeasure actually delivers the advertised level of security and could not be broken by a first-order attack. Even when considering second-order attacks the countermeasure provided a very high level of resistance. This makes this countermeasure especially suitable for FPGA implementations, where the high number of required registers has a lower impact compared to ASIC implementations.

As an alternative, a different countermeasure based on multi-party computation and Shamir’s secret sharing was evaluated in Chapter 5. In contrast to the masking scheme of Canright, this scheme should be secure even in the presence of glitches if the computations are correctly spread in time, which was confirmed by our practical evaluation. However, as a result of the extremely large area requirements and high number of clock cycles for the modular inversion circuit, the proposed scheme is unfortunately not of high practical significance.

Lastly, the suitability of AES dual ciphers as countermeasure has been evaluated and dismissed in Chapter 6. Due to a multitude of flaws, like the mask reuse, violation of the balance property, and the inability to mask the zero value, no variation of a dual-cipher countermeasure is able to provide a sufficient level of security required for a protected implementation.

In Part II of this dissertation, we focused on some variants of Fault Sensitivity Analysis (FSA), which are able to overcome side-channel protected as well as fault protected implementations. First, we presented how the concept of FSA can be nicely combined with the Correlation-Enhanced Collision Attack (CECA) that was also predominantly used as evaluation technique in the SCA part of the thesis. We demonstrated that the AES COMB core of an SASEBO ASIC LSI can be broken using a low number of measurements with the proposed Collision Timing Attack (CTA). We have further shown that the zero value vulnerability of S-boxes implemented in a composite field approach, which is a known weakness for multiplicative masking in a SCA setting, is also exploitable by an FSA attack. By iteratively restricting the key search space until only the correct key remains, we showcased that the vulnerability can be exploited to overcome even sophisticated fault attack countermeasures.

This research again shows that when trying to protect a cryptographic algorithm such as the AES, it is not enough to focus only on one type of physical attack branch and the corresponding countermeasures. If an implementation is equipped with the most sophisticated side-channel countermeasures, it will still remain vulnerable to many fault attacks. The same is more obviously true vice versa, since, e.g., a redundancy scheme to detect faults offers no protection against even basic side-channel attacks. Taking other active physical attacks like probing into account, it becomes clear that a more holistic approach has to be taken to protect a cryptographic circuit. This should not only include algorithmic SCA and fault countermeasures, but also active components like light sensors, voltage spike detectors, etc.

There also exists a gap between academia and industry. In academia, a researcher usually tries to get the best results by focusing only on a very specific area (their specialty) without taking other attack possibilities or countermeasures into account, e.g., by focusing on either SCA or fault research alone. In industry a larger attack tree must be taken into account and all attack vectors which threaten the security of the implementation must be mitigated by appropriate countermeasures until an attack becomes financially unattractive for an attacker.

Future research on countermeasures should, if possible, try to take industry concerns more into account. If an SCA countermeasure can provide an almost perfect level of security but uses too much area or power, or has too low performance, it will simply not be used in the real world. The trade-offs between security, area, power, and performance are obviously manifold and will heavily depend on the use case. While very complex countermeasures using high amounts of area and power are useful from an academic point of view to push research boundaries, from an industry perspective it would be more beneficial to concentrate on low-power and low-area countermeasures which, in an ideal world, would also be easy to understand and implement. Unfortunately, this is easier said than done. In general, a network gateway in a data center requires different countermeasures against different attacks compared to a smart card used as payment device. Therefore it would be helpful for a digital designer to have a large selection of countermeasures, which can be adjusted and scaled as required. Moreover, additional care has to be taken that the complex interactions of countermeasures do not open additional attack vectors.


Part IV

Appendix

Bibliography

[AG01] Mehdi-Laurent Akkar and Christophe Giraud. An Implementation of DES and AES, Secure against Some Attacks. In Çetin Kaya Koç, David Naccache, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2001, Third International Workshop, Paris, France, May 14-16, 2001, Proceedings, volume 2162 of Lecture Notes in Computer Science, pages 309–318. Springer, 2001.

[BB02a] Elad Barkan and Eli Biham. In How Many Ways Can You Write Rijndael? In Yuliang Zheng, editor, Advances in Cryptology - ASIACRYPT 2002, 8th International Conference on the Theory and Application of Cryptology and Information Security, Queenstown, New Zealand, December 1-5, 2002, Proceedings, volume 2501 of Lecture Notes in Computer Science, pages 160–175. Springer, 2002.

[BB02b] Elad Barkan and Eli Biham. The Book of Rijndaels. IACR Cryptology ePrint Archive, 2002:158, 2002.

[BCO04] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation Power Analysis with a Leakage Model. In Marc Joye and Jean-Jacques Quisquater, editors, Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11-13, 2004. Proceedings, volume 3156 of Lecture Notes in Computer Science, pages 16–29. Springer, 2004.

[BGK04] Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably Secure Masking of AES. In Helena Handschuh and M. Anwar Hasan, editors, Selected Areas in Cryptography, 11th International Workshop, SAC 2004, Waterloo, Canada, August 9-10, 2004, Revised Selected Papers, volume 3357 of Lecture Notes in Computer Science, pages 69–83. Springer, 2004.

[BGN+14] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent Rijmen. A More Efficient AES Threshold Implementation. In David Pointcheval and Damien Vergnaud, editors, Progress in Cryptology - AFRICACRYPT 2014 - 7th International Conference on Cryptology in Africa, Marrakesh, Morocco, May 28-30, 2014. Proceedings, volume 8469 of Lecture Notes in Computer Science, pages 267–284. Springer, 2014.

[BGSD10] Shivam Bhasin, Sylvain Guilley, Laurent Sauvage, and Jean-Luc Danger. Unrolling Cryptographic Circuits: A Simple Countermeasure Against Side-Channel Attacks. In Josef Pieprzyk, editor, Topics in Cryptology - CT-RSA 2010, The Cryptographers’ Track at the RSA Conference 2010, San Francisco, CA, USA, March 1-5, 2010. Proceedings, volume 5985 of Lecture Notes in Computer Science, pages 195–207. Springer, 2010.

[BNN+12] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, and Georg Stütz. Threshold Implementations of All 3 × 3 and 4 × 4 S-Boxes. In Emmanuel Prouff and Patrick Schaumont, editors, Cryptographic Hardware and Embedded Systems - CHES 2012 - 14th International Workshop, Leuven, Belgium, September 9-12, 2012. Proceedings, volume 7428 of Lecture Notes in Computer Science, pages 76–91. Springer, 2012.

[BNN+15] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, Natalia Tokareva, and Valeriya Vitkup. Threshold implementations of small S-boxes. Cryptography and Communications, 7(1):3–33, 2015.

[Bog08] Andrey Bogdanov. Multiple-Differential Side-Channel Collision Attacks on AES. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems - CHES 2008, 10th International Workshop, Washington, D.C., USA, August 10-13, 2008. Proceedings, volume 5154 of Lecture Notes in Computer Science, pages 30–44. Springer, 2008.

[BS97] Eli Biham and Adi Shamir. Differential Fault Analysis of Secret Key Cryptosystems. In Burton S. Kaliski Jr., editor, Advances in Cryptology - CRYPTO ’97, 17th Annual International Cryptology Conference, Santa Barbara, California, USA, August 17-21, 1997, Proceedings, volume 1294 of Lecture Notes in Computer Science, pages 513–525. Springer, 1997.

[BS03] Johannes Blömer and Jean-Pierre Seifert. Fault Based Cryptanalysis of the Advanced Encryption Standard (AES). In Rebecca N. Wright, editor, Financial Cryptography, 7th International Conference, FC 2003, Guadeloupe, French West Indies, January 27-30, 2003, Revised Papers, volume 2742 of Lecture Notes in Computer Science, pages 162–181. Springer, 2003.

[Can05] David Canright. A Very Compact S-Box for AES. In Josyula R. Rao and Berk Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science, pages 441–455. Springer, 2005.

[CB08] David Canright and Lejla Batina. A Very Compact "Perfectly Masked" S-Box for AES. In Steven M. Bellovin, Rosario Gennaro, Angelos D. Keromytis, and Moti Yung, editors, Applied Cryptography and Network Security, 6th International Conference, ACNS 2008, New York, NY, USA, June 3-6, 2008. Proceedings, volume 5037 of Lecture Notes in Computer Science, pages 446–459, 2008. The corrected version is available at Cryptology ePrint Archive, Report 2009/011. http://eprint.iacr.org/.

[CCD00] Christophe Clavier, Jean-Sébastien Coron, and Nora Dabbous. Differential Power Analysis in the Presence of Hardware Countermeasures. In Çetin Kaya Koç and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2000, Second International Workshop, Worcester, MA, USA, August 17-18, 2000, Proceedings, volume 1965 of Lecture Notes in Computer Science, pages 252–263. Springer, 2000.

[CFG+11] Christophe Clavier, Benoit Feix, Georges Gagnerot, Mylène Roussellet, and Vincent Verneuil. Improved Collision-Correlation Power Analysis on First Order Protected AES. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 49–62. Springer, 2011.

[CRR02] Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi. Template Attacks. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science, pages 13–28. Springer, 2002.

[DR02] Joan Daemen and Vincent Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Information Security and Cryptography. Springer, 2002.

[epr11] Error in Report 2011/516: Protecting AES with Shamir’s Secret Sharing Scheme by Louis Goubin and Ange Martinelli. Discussion forum of ePrint Archive: Report 2011/516 http://eprint.iacr.org/forum/read.php?11,549,549#msg-549, Sep 2011.

[ESH+11] Sho Endo, Takeshi Sugawara, Naofumi Homma, Takafumi Aoki, and Akashi Satoh. An on-chip glitchy-clock generator and its application to safe-error attack. In Constructive Side-Channel Analysis and Secure Design - 2nd International Workshop, COSADE 2011, Darmstadt, Germany, February 24-25, 2011, pages 175–182, 2011.

[GBTP08] Benedikt Gierlichs, Lejla Batina, Pim Tuyls, and Bart Preneel. Mutual Information Analysis. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems - CHES 2008, 10th International Workshop, Washington, D.C., USA, August 10-13, 2008. Proceedings, volume 5154 of Lecture Notes in Computer Science, pages 426–442. Springer, 2008.

[GK12] Xiaofei Guo and Ramesh Karri. Invariance-Based Concurrent Error Detection for Advanced Encryption Standard. In Patrick Groeneveld, Donatella Sciuto, and Soha Hassoun, editors, The 49th Annual Design Automation Conference 2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012, pages 573–578. ACM, 2012.

[GL08] Felipe Ghellar and Marcelo Lubaszewski. A Novel AES Cryptographic Core Highly Resistant to Differential Power Analysis Attacks. In Marcelo Lubaszewski, Michel Renovell, and Rajesh K. Gupta, editors, Proceedings of the 21st Annual Symposium on Integrated Circuits and Systems Design, SBCCI 2008, Gramado, Brazil, September 1-4, 2008, pages 140–145. ACM, 2008.

[GL09] Felipe Ghellar and Marcelo Lubaszewski. A Novel AES Cryptographic Core Highly Resistant to Differential Power Analysis Attacks. Journal Integrated Circuits and Systems, 4(1):29–35, 2009.

[GM11a] Louis Goubin and Ange Martinelli. Protecting AES with Shamir’s Secret Sharing Scheme. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 79–94. Springer, 2011.

[GM11b] Tim Güneysu and Amir Moradi. Generic Side-Channel Countermeasures for Reconfigurable Devices. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 33–48. Springer, 2011.

[GMK12] Xiaofei Guo, Debdeep Mukhopadhyay, and Ramesh Karri. Provably Secure Concurrent Error Detection Against Differential Fault Analysis. IACR Cryptology ePrint Archive, 2012:552, 2012.

[GPQ10] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Secure Multiplicative Masking of Power Functions. In Jianying Zhou and Moti Yung, editors, Applied Cryptography and Network Security, 8th International Conference, ACNS 2010, Beijing, China, June 22-25, 2010. Proceedings, volume 6123 of Lecture Notes in Computer Science, pages 200–217, 2010.

[GPQ11] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Thwarting Higher-Order Side Channel Analysis with Additive and Multiplicative Maskings. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 240–255. Springer, 2011.

[GT02] Jovan Dj. Golic and Christophe Tymen. Multiplicative Masking and Power Analysis of AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science, pages 198–212. Springer, 2002.

[JCCC07a] Ming-Haw Jing, Jian-Hong Chen, Zih-Heng Chen, and Yaotsu Chang. The Secure DAES Design for Embedded System Application. In Mieso K. Denko, Chi-Sheng Shih, Kuan-Ching Li, Shiao-Li Tsao, Qing-An Zeng, Soo-Hyun Park, Young-Bae Ko, Shih-Hao Hung, and Jong Hyuk Park, editors, Emerging Directions in Embedded and Ubiquitous Computing, EUC 2007 Workshops: TRUST, WSOC, NCUS, UUWSN, USN, ESO, and SECUBIQ, Taipei, Taiwan, December 17-20, 2007, Proceedings, volume 4809 of Lecture Notes in Computer Science, pages 617–626. Springer, 2007.

[JCCC07b] Ming-Haw Jing, Zih-Heng Chen, Jian-Hong Chen, and Yan-Haw Chen. Reconfigurable system for high-speed and diversified AES using FPGA. Microprocessors and Microsystems, 31(2):94–102, 2007.

[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis. In Michael J. Wiener, editor, Advances in Cryptology - CRYPTO ’99, 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999, Proceedings, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer, 1999.

[KKT04] Mark G. Karpovsky, Konrad J. Kulikowski, and Alexander Taubin. Robust Protection against Fault-Injection Attacks on Smart Cards Implementing the Advanced Encryption Standard. In 2004 International Conference on Dependable Systems and Networks (DSN 2004), 28 June - 1 July 2004, Florence, Italy, Proceedings, pages 93–101. IEEE Computer Society, 2004.

[Koc96] Paul C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In Neal Koblitz, editor, Advances in Cryptology - CRYPTO ’96, 16th Annual International Cryptology Conference, Santa Barbara, California, USA, August 18-22, 1996, Proceedings, volume 1109 of Lecture Notes in Computer Science, pages 104–113. Springer, 1996.

[KR10] Mehran Mozaffari Kermani and Arash Reyhani-Masoleh. Concurrent Structure-Independent Fault Detection Schemes for the Advanced Encryption Standard. IEEE Trans. Computers, 59(5):608–622, 2010.

[KWMK02] Ramesh Karri, Kaijie Wu, Piyush Mishra, and Yongkook Kim. Concurrent Error Detection Schemes for Fault-Based Side-Channel Cryptanalysis of Symmetric Block Ciphers. IEEE Trans. on CAD of Integrated Circuits and Systems, 21(12):1509–1517, 2002.

[LSG+10] Yang Li, Kazuo Sakiyama, Shigeto Gomisawa, Toshinori Fukunaga, Junko Takahashi, and Kazuo Ohta. Fault Sensitivity Analysis. In Stefan Mangard and François-Xavier Standaert, editors, Cryptographic Hardware and Embedded Systems, CHES 2010, 12th International Workshop, Santa Barbara, CA, USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes in Computer Science, pages 320–334. Springer, 2010.

[LSI09] ISO/IEC 18033-3 Standard Cryptographic LSI – with Side Channel Attack Countermeasures – Specification, ver 1.0. http://staff.aist.go.jp/akashi.satoh/SASEBO/resources/crypto_lsi/CryptoLSI2_Spec_Ver1.0_English.pdf, 2009.

[LSI10] Standard Cryptographic LSI Specification – Countermeasures against Side Channel Attacks (65nm) – Specification, ver 0.9. http://staff.aist.go.jp/akashi.satoh/SASEBO/resources/crypto_lsi/CryptoLSI3_Spec_Ver0.9_English.pdf, 2010.

[MM12a] Amir Moradi and Oliver Mischke. Glitch-Free Implementation of Masking in Modern FPGAs. In 2012 IEEE International Symposium on Hardware-Oriented Security and Trust, HOST 2012, San Francisco, CA, USA, June 3-4, 2012, pages 89–95. IEEE, 2012.

[MM12b] Amir Moradi and Oliver Mischke. How Far Should Theory Be from Practice? - Evaluation of a Countermeasure. In Emmanuel Prouff and Patrick Schaumont, editors, Cryptographic Hardware and Embedded Systems - CHES 2012 - 14th International Workshop, Leuven, Belgium, September 9-12, 2012. Proceedings, volume 7428 of Lecture Notes in Computer Science, pages 92–106. Springer, 2012.

[MM13a] Amir Moradi and Oliver Mischke. Comprehensive Evaluation of AES Dual Ciphers as a Side-Channel Countermeasure. In Sihan Qing, Jianying Zhou, and Dongmei Liu, editors, Information and Communications Security - 15th International Conference, ICICS 2013, Beijing, China, November 20-22, 2013. Proceedings, volume 8233 of Lecture Notes in Computer Science, pages 245–258. Springer, 2013.

[MM13b] Amir Moradi and Oliver Mischke. On the Simplicity of Converting Leakages from Multivariate to Univariate - (Case Study of a Glitch-Resistant Masking Scheme). In Guido Bertoni and Jean-Sébastien Coron, editors, Cryptographic Hardware and Embedded Systems - CHES 2013 - 15th International Workshop, Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, volume 8086 of Lecture Notes in Computer Science, pages 1–20. Springer, 2013.

[MME10] Amir Moradi, Oliver Mischke, and Thomas Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In Stefan Mangard and François-Xavier Standaert, editors, Cryptographic Hardware and Embedded Systems, CHES 2010, 12th International Workshop, Santa Barbara, CA, USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes in Computer Science, pages 125–139. Springer, 2010.

[MMG14] Oliver Mischke, Amir Moradi, and Tim Güneysu. Fault Sensitivity Analysis Meets Zero-Value Attack. In Assia Tria and Dooho Choi, editors, 2014 Workshop on Fault Diagnosis and Tolerance in Cryptography, FDTC 2014, Busan, South Korea, September 23, 2014, pages 59–67. IEEE, 2014.

[MMP11a] Amir Moradi, Oliver Mischke, and Christof Paar. Practical Evaluation of DPA Countermeasures on Reconfigurable Hardware. In HOST 2011, Proceedings of the 2011 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 5-6 June 2011, San Diego, California, USA, pages 154–160. IEEE, 2011.

[MMP+11b] Amir Moradi, Oliver Mischke, Christof Paar, Yang Li, Kazuo Ohta, and Kazuo Sakiyama. On the Power of Fault Sensitivity Analysis and Collision Side-Channel Attacks in a Combined Setting. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 292–311. Springer, 2011.

[MMP13] Amir Moradi, Oliver Mischke, and Christof Paar. One Attack to Rule Them All: Collision Timing Attack versus 42 AES ASIC Cores. IEEE Trans. Computers, 62(9):1786–1798, 2013.

[MOP07] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, 2007.

[Mor12] Amir Moradi. Statistical Tools Flavor Side-Channel Collision Attacks. In David Pointcheval and Thomas Johansson, editors, Advances in Cryptology - EUROCRYPT 2012 - 31st Annual International Conference on the Theory and Applications of Cryptographic Techniques, Cambridge, UK, April 15-19, 2012. Proceedings, volume 7237 of Lecture Notes in Computer Science, pages 428–445. Springer, 2012.

[MP13] Filippo Melzani and Andrea Palomba. Enhancing Fault Sensitivity Analysis Through Templates. In 2013 IEEE International Symposium on Hardware-Oriented Security and Trust, HOST 2013, Austin, TX, USA, June 2-3, 2013, pages 25–28. IEEE Computer Society, 2013.

[MPL+11] Amir Moradi, Axel Poschmann, San Ling, Christof Paar, and Huaxiong Wang. Pushing the Limits: A Very Compact and a Threshold Implementation of AES. In Kenneth G. Paterson, editor, Advances in Cryptology - EUROCRYPT 2011 - 30th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings, volume 6632 of Lecture Notes in Computer Science, pages 69–88. Springer, 2011.

[MPO05] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Successfully Attacking Masked AES Hardware Implementations. In Josyula R. Rao and Berk Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science, pages 157–171. Springer, 2005.

[MR02] Sean Murphy and Matthew J. B. Robshaw. Essential Algebraic Structure within the AES. In Moti Yung, editor, Advances in Cryptology - CRYPTO 2002, 22nd Annual International Cryptology Conference, Santa Barbara, California, USA, August 18-22, 2002, Proceedings, volume 2442 of Lecture Notes in Computer Science, pages 1–16. Springer, 2002.

[MSY06] Tal Malkin, François-Xavier Standaert, and Moti Yung. A Comparative Cost/Security Analysis of Fault Attack Countermeasures. In Luca Breveglieri, Israel Koren, David Naccache, and Jean-Pierre Seifert, editors, Fault Diagnosis and Tolerance in Cryptography, Third International Workshop, FDTC 2006, Yokohama, Japan, October 10, 2006, Proceedings, volume 4236 of Lecture Notes in Computer Science, pages 159–172. Springer, 2006.

[Nat01] National Institute of Standards and Technology (NIST). Announcing the Advanced Encryption Standard (AES). November 2001. Published: Federal Information Processing Standards Publication 197.

[NRR06] Svetla Nikova, Christian Rechberger, and Vincent Rijmen. Threshold Implementations Against Side-Channel Attacks and Glitches. In Peng Ning, Sihan Qing, and Ninghui Li, editors, Information and Communications Security, 8th International Conference, ICICS 2006, Raleigh, NC, USA, December 4-7, 2006, Proceedings, volume 4307 of Lecture Notes in Computer Science, pages 529–545. Springer, 2006.

[NRS08] Svetla Nikova, Vincent Rijmen, and Martin Schläffer. Secure Hardware Implementation of Non-linear Functions in the Presence of Glitches. In Pil Joong Lee and Jung Hee Cheon, editors, Information Security and Cryptology - ICISC 2008, 11th International Conference, Seoul, Korea, December 3-5, 2008, Revised Selected Papers, volume 5461 of Lecture Notes in Computer Science, pages 218–234. Springer, 2008.

[NRS11] Svetla Nikova, Vincent Rijmen, and Martin Schläffer. Secure Hardware Implementation of Nonlinear Functions in the Presence of Glitches. J. Cryptology, 24(2):292–321, 2011.

[OMPR05] Elisabeth Oswald, Stefan Mangard, Norbert Pramstaller, and Vincent Rijmen. A Side-Channel Analysis Resistant Description of the AES S-Box. In Henri Gilbert and Helena Handschuh, editors, Fast Software Encryption: 12th International Workshop, FSE 2005, Paris, France, February 21-23, 2005, Revised Selected Papers, volume 3557 of Lecture Notes in Computer Science, pages 413–423. Springer, 2005.

[Paa94] C. Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields. PhD thesis, Institute for Experimental Mathematics, University of Essen, Germany, 1994.

[PM05] Thomas Popp and Stefan Mangard. Masked Dual-Rail Pre-charge Logic: DPA-Resistance Without Routing Constraints. In Josyula R. Rao and Berk Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science, pages 172–186. Springer, 2005.

[PMK+11] Axel Poschmann, Amir Moradi, Khoongming Khoo, Chu-Wee Lim, Huaxiong Wang, and San Ling. Side-Channel Resistant Crypto for Less than 2,300 GE. J. Cryptology, 24(2):322–345, 2011.

[PQ03] Gilles Piret and Jean-Jacques Quisquater. A Differential Fault Attack Technique against SPN Structures, with Application to the AES and KHAZAD. In Colin D. Walter, Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2003, 5th International Workshop, Cologne, Germany, September 8-10, 2003, Proceedings, volume 2779 of Lecture Notes in Computer Science, pages 77–88. Springer, 2003.

[PR11] Emmanuel Prouff and Thomas Roche. Higher-Order Glitches Free Implementation of the AES Using Secure Multi-party Computation Protocols. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 63–78. Springer, 2011.

[Rad04] Håvard Raddum. More Dual Rijndaels. In Hans Dobbertin, Vincent Rijmen, and Aleksandra Sowa, editors, Advanced Encryption Standard - AES, 4th International Conference, AES 2004, Bonn, Germany, May 10-12, 2004, Revised Selected and Invited Papers, volume 3373 of Lecture Notes in Computer Science, pages 142–147. Springer, 2004.

[RP10] Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of AES. In Stefan Mangard and François-Xavier Standaert, editors, Cryptographic Hardware and Embedded Systems, CHES 2010, 12th International Workshop, Santa Barbara, CA, USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes in Computer Science, pages 413–427. Springer, 2010.

[RP12] Thomas Roche and Emmanuel Prouff. Higher-order glitch free implementation of the AES using Secure Multi-Party Computation protocols - Extended version. J. Cryptographic Engineering, 2(2):111–127, 2012.

[sas] Side-channel Attack Standard Evaluation Board (SASEBO). Further information is available via http://www.risec.aist.go.jp/project/sasebo/ and http://www.morita-tech.co.jp/SAKURA/en/hardware.html.

[Sha79] Adi Shamir. How to Share a Secret. Commun. ACM, 22(11):612–613, 1979.

[Sko05] Sergei P. Skorobogatov. Semi-invasive attacks – A new approach to hardware security analysis. Technical Report UCAM-CL-TR-630, University of Cambridge, Computer Laboratory, April 2005.

[SMMG15] Pascal Sasdrich, Oliver Mischke, Amir Moradi, and Tim Güneysu. Side-Channel Protection by Randomizing Look-Up Tables on Reconfigurable Hardware - Pitfalls of Memory Primitives. In Stefan Mangard and Axel Y. Poschmann, editors, Constructive Side-Channel Analysis and Secure Design - 6th International Workshop, COSADE 2015, Berlin, Germany, April 13-14, 2015. Revised Selected Papers, volume 9064 of Lecture Notes in Computer Science, pages 95–107. Springer, 2015.

[SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact Rijndael Hardware Architecture with S-Box Optimization. In Colin Boyd, editor, Advances in Cryptology - ASIACRYPT 2001, 7th International Conference on the Theory and Application of Cryptology and Information Security, Gold Coast, Australia, December 9-13, 2001, Proceedings, volume 2248 of Lecture Notes in Computer Science, pages 239–254. Springer, 2001.

[SSHA08] Akashi Satoh, Takeshi Sugawara, Naofumi Homma, and Takafumi Aoki. High-Performance Concurrent Error Detection Scheme for AES Hardware. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems - CHES 2008, 10th International Workshop, Washington, D.C., USA, August 10-13, 2008. Proceedings, volume 5154 of Lecture Notes in Computer Science, pages 100–112. Springer, 2008.

[SWP03] Kai Schramm, Thomas J. Wollinger, and Christof Paar. A New Class of Collision Attacks and Its Application to DES. In Thomas Johansson, editor, Fast Software Encryption, 10th International Workshop, FSE 2003, Lund, Sweden, February 24-26, 2003, Revised Papers, volume 2887 of Lecture Notes in Computer Science, pages 206–222. Springer, 2003.

[TAV02] K. Tiri, M. Akmal, and I. Verbauwhede. A Dynamic and Differential CMOS Logic with Signal Independent Power Consumption to Withstand Differential Power Analysis on Smart Cards. In Solid-State Circuits Conference, 2002. ESSCIRC 2002. Proceedings of the 28th European, pages 403–406, Sept 2002.

[TV04] Kris Tiri and Ingrid Verbauwhede. A Logic Level Design Methodology for a Secure DPA Resistant ASIC or FPGA Implementation. In 2004 Design, Automation and Test in Europe Conference and Exposition (DATE 2004), 16-20 February 2004, Paris, France, pages 246–251. IEEE Computer Society, 2004.

[VS11] Nicolas Veyrat-Charvillon and François-Xavier Standaert. Generic Side-Channel Distinguishers: Improvements and Limitations. In Phillip Rogaway, editor, Advances in Cryptology - CRYPTO 2011 - 31st Annual Cryptology Conference, Santa Barbara, CA, USA, August 14-18, 2011. Proceedings, volume 6841 of Lecture Notes in Computer Science, pages 354–372. Springer, 2011.

[WKKG04] Kaijie Wu, Ramesh Karri, Grigori Kuznetsov, and Michael Gössel. Low Cost Concurrent Error Detection for the Advanced Encryption Standard. In Proceedings 2004 International Test Conference (ITC 2004), October 26-28, 2004, Charlotte, NC, USA, pages 1242–1248. IEEE Computer Society, 2004.

[WLL04] Shee-Yau Wu, Shih-Chuan Lu, and Chi-Sung Laih. Design of AES Based on Dual Cipher and Composite Field. In Tatsuaki Okamoto, editor, Topics in Cryptology - CT-RSA 2004, The Cryptographers’ Track at the RSA Conference 2004, San Francisco, CA, USA, February 23-27, 2004, Proceedings, volume 2964 of Lecture Notes in Computer Science, pages 25–38. Springer, 2004.

[XIL07] XILINX. Virtex-II Pro and Virtex-II Pro X FPGA User Guide. Technical report, version 4.2, 2007. http://www.xilinx.com/support/documentation/user_guides/ug012.pdf.

[Xil08] Xilinx. Constraints Guide. Available via http://www.xilinx.com/itp/xilinx10/books/docs/cgd/cgd.pdf, 2008.

[Xil09] Xilinx. Virtex-5 Libraries Guide for HDL Designs. Available via http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/virtex5_hdl.pdf, September 2009.

List of Abbreviations

AES Advanced Encryption Standard

ARC International Symposium on Applied Reconfigurable Computing: Architectures, Tools, and Applications

ASIC Application Specific Integrated Circuit

BRAM Block-RAM

CECA Correlation-Enhanced Collision Attack

CED Concurrent Error Detection

CHES International Workshop on Cryptographic Hardware and Embedded Systems

CMOS Complementary Metal Oxide Semiconductor

COSADE International Workshop on Constructive Side-Channel Analysis and Secure Design

CPA Correlation Power Analysis

CryptArchi International Workshops on Cryptographic Architectures Embedded in Reconfigurable Devices

CRYPTO Advances in Cryptology Conference Series

CTA Collision Timing Attack

DAC Design Automation Conference

DES Data Encryption Standard

DCM Digital Clock Manager

DFA Differential Fault Analysis

DPA Differential Power Analysis

ECC Elliptic Curve Cryptography

EM Electro-Magnetic

FDTC Workshop on Fault Diagnosis and Tolerance in Cryptography

FA Fault Analysis

FF Flip Flop

FIA Fault Injection Analysis

FIB Focused Ion Beam

FPGA Field Programmable Gate Array

FSA Fault Sensitivity Analysis

GF Galois Field

HD Hamming Distance

HDL Hardware Description Language

HGI Horst Görtz Institute for IT Security

HOST IEEE International Symposium on Hardware-Oriented Security and Trust

HW Hamming Weight

ICICS International Conference on Information and Communications Security

IEEE Institute of Electrical and Electronics Engineers

I/O Input/Output

IPSec Internet Protocol Security

LNCS Lecture Notes in Computer Science

LSB Least Significant Bit

LSI Large Scale Integration

LUT Look-Up Table

MIA Mutual Information Analysis

MPC Multi-Party Computation

NIST National Institute of Standards and Technology

PPRM Positive Polarity Reed-Muller

PRNG Pseudo-Random Number Generator

RAM Random Access Memory

ReConFig International Conference on ReConFigurable Computing and FPGAs

SASEBO Side-channel Attack Standard Evaluation BOard

SCA Side-Channel Analysis

SNR Signal to Noise Ratio

TI Threshold Implementation

S-box Substitution Box

UART Universal Asynchronous Receiver/Transmitter

UV Ultraviolet (light)

WDDL Wave Dynamic Differential Logic

ZV-FSA Zero-Value Fault Sensitivity Analysis

List of Figures

2.1 Glitches at the AES S-box output after the input has changed from 0xaa to 0x00, to 0x55, or to 0xff

3.1 Architecture of the 32-bit implementation, allowing masking as well as column-wise and S-box instance shuffling

3.2 Architecture of the unrolled designs

3.3 Profile A: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 1,000,000 traces and (d) over the number of traces

3.4 Profile B: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 1,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

3.5 Profile C: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 5,000,000 traces and (d) over the number of traces

3.6 Profile D: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

3.7 Profile E: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing

3.8 One round unrolled: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 1,000,000 traces and (d) over the number of traces

3.9 Two rounds unrolled: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 7,500,000 traces and (d) over the number of traces

3.10 Three rounds unrolled: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 7,500,000 traces and (d) over the number of traces

3.11 Four rounds unrolled: sample power trace (a) variance of the mean traces (b) the result of a CECA (c) using 30,000,000 traces and (d) over the number of traces

4.1 Masked GF(2^8) Inverter by Canright-Batina (taken from [MME10])

4.2 Two possible LUTs in Virtex-5: (a) 6-input LUT, (b) 32-bit shift-register [Xil09]

4.3 Design of our full-custom optimized S-box (inversion part only)

4.4 Signal timings on LUT inputs and outputs

4.5 Profile 3: evaluation results (a) a sample trace, (b) attack result using 50 000 traces, (c) over the number of traces

4.6 Profile 5: evaluation results (a) a sample trace, (b) attack result using 20 000 000 traces, (c) over the number of traces

4.7 Profile 6: evaluation results (a) a sample trace, (b) attack result using 50 000 000 traces, (c) attack result on 50 000 000 squared mean-free traces, (d) over the number of traces

5.1 Block diagram of sequential operations necessary for an AES S-box and a fourth of MixColumns

5.2 Our design of the shared multiplication and addition to realize the AES S-box

6.1 Inversion circuit in GF(2^8)

6.2 Overall architecture of the AES dual ciphers circuit

6.3 Distributions of the S-box output for (top) 11x and (bottom) 44x as original input over all 240 dual ciphers

6.4 A sample power trace, PRNG ON

6.5 Correlation-Enhanced Collision Attack results, (a) using 500 000 traces, (b) and (c) over the number of traces

6.6 Zero-value attack results, (a) using 100 000 traces, (b) and (c) over the number of traces

7.1 CED schemes: (a) information redundancy: parity; (b) information redundancy: robust code; (c) time redundancy; (d) hardware redundancy; (e) hybrid redundancy: inverse function; (f) hybrid redundancy: invariance-based CED (taken from [GMK12])

8.1 Dependency of the timing of combinatorial circuits on the input changes. The gray bars denote a stable output signal after the S-box input has changed from 0xaa to 0x00, to 0x55, or to 0xff

8.2 A simplified example on how to use clock glitches to extract a ∆t_i

8.3 The block diagram of the experimental setup and the timing diagram of the relevant signals

8.4 General architecture of the encryption datapath of the attacked AES cores (taken from [LSI09,LSI10])

8.5 Some p^o_{b=0,∆t} curves for S-box instances no. (left) 0 and (right) 4 of AES Comp (130nm)

8.6 Bitwise timing characteristics T^o_{b=0} of S-box instances no. (left) 0 and (right) 4 of AES Comp (130nm)

8.7 Result of the attack on the last round of AES Comp (130nm) recovering ∆k between key bytes (top to bottom) (0,1) and (0,2), (left) using 10 000 captures and (right) over the number of captures

9.1 (a) Architecture of our evaluation circuit; (b) computation step; (c) checking step

9.2 Timing behavior of the used AES S-box [Can05]

9.3 Block diagram of the experimental setup and timing diagram of the core clock

9.4 Evolution of local error rates of S-box 8 while shortening the clock glitch length on profile 1. The local error rate of the correct key candidate is highlighted in black

9.5 (a) Local error rates computed by the first round of Algorithm 2 on profile 2 with S-box 10 highlighted and (b) correlation between local error rates obtained by two runs of the first round of Algorithm 2 (with no prior excluded candidates)

9.6 Remaining key space after each exclusion round; (a) per S-box, (b) for the complete 128-bit key (profile 2)

9.7 (a), (b) local error rates obtained by running Algorithm 3 on profile 3 for (a) S-box 6 and (b) S-box 10; (c) Correlation-Enhanced Collision Attack recovering the correct ∆k = k_6 ⊕ k_10

List of Tables

4.1 Synthesis results for all profiles (inversion only)

5.1 Area and time overhead of our design based on XC5VLX50 Virtex-5 FPGA (excluding state register, KeySchedule, PRNGs, initial masking, and final unmasking)

6.1 Performance figures (excluding the PRNG)

About the Author

Author information as of April 2016.

Personal Information

Oliver Marc Mischke

Born in Frankfurt am Main, Germany, on November 14th, 1983

Education

• 06/2011–03/2015: Ph.D. (Dr.-Ing.) in Electrical Engineering and Information Technology, Ruhr-Universität Bochum, Germany

• 10/2004–04/2010: Diploma in IT Security, Ruhr-Universität Bochum, Germany

• 08/2008–05/2009: DAAD Exchange Student at Purdue University, West Lafayette, IN, USA

Professional and Academic Experience

• 01/2015–Present: Digital Design Engineer, Infineon Technologies AG, Chip Card & Security Division, Munich, Germany

• 06/2011–12/2014: Research Associate, Hardware Security Group, Horst Görtz Institute for IT Security, Ruhr-Universität Bochum, Germany

• 05/2013–08/2013: Product Security Intern, Qualcomm Technologies Inc., OOTCS, QPSI, San Diego, CA, USA

• 05/2010–05/2011: Security Engineer, ESCRYPT GmbH – Embedded Security, Bochum, Germany

• 04/2009–08/2009: Intern, ESCRYPT GmbH – Embedded Security, Bochum, Germany

• 05/2007–09/2009: Teaching Assistant, Embedded Security Group, Horst Görtz Institute for IT Security, Ruhr-Universität Bochum, Germany

Author’s Publications

Author’s publications as of March 2015.

Peer-Reviewed Publications in Journals

• A. Moradi, O. Mischke, and C. Paar. One Attack to Rule Them All: Collision Timing Attack versus 42 AES ASIC Cores. In IEEE Transactions on Computers, volume 62, no. 9, pages 1786-1798, IEEE Computer Society, 2013.

• M. Kasper, A. Moradi, G. T. Becker, O. Mischke, T. Güneysu, C. Paar, and W. Burleson. Side Channels as Building Blocks. In Journal of Cryptographic Engineering, volume 2, number 3, pages 143-159. Springer, 2012.

Peer-Reviewed Publications in the Proceedings of International Conferences and Workshops

• P. Sasdrich, A. Moradi, O. Mischke, T. Güneysu. Achieving Side-Channel Protection with Dynamic Logic Reconfiguration on Modern FPGAs. In 2015 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST’15), pages 130-136. IEEE Computer Society, 2015.

• P. Sasdrich, O. Mischke, A. Moradi, T. Güneysu. Side-Channel Protection by Randomizing Look-Up Tables on Reconfigurable Hardware – Pitfalls of Memory Primitives. In 6th International Workshop on Constructive Side-Channel Analysis and Secure Design (COSADE’15), volume 9064 of Lecture Notes in Computer Science, pages 95-107. Springer, 2015.

• O. Mischke, A. Moradi, T. Güneysu. Fault Sensitivity Analysis Meets Zero-Value Attack. In 11th Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC’14), pages 59-67. IEEE Computer Society, 2014.

• C. Pöpper, O. Mischke, T. Güneysu. MicroACP - A Fast and Secure Reconfigurable Asymmetric Crypto-Processor. In 10th International Symposium on Applied Reconfigurable Computing: Architectures, Tools, and Applications (ARC’14), volume 8405 of Lecture Notes in Computer Science, pages 240-247. Springer, 2014.

• A. Moradi, O. Mischke. Comprehensive Evaluation of AES Dual Ciphers as a Side-Channel Countermeasure. In 15th International Conference on Information and Communications Security (ICICS’13), volume 8233 of Lecture Notes in Computer Science, pages 245-258. Springer, 2013.

• A. Moradi and O. Mischke. On the Simplicity of Converting Leakages from Multivariate to Univariate – Case Study of a Glitch-Resistant Masking Scheme. In 15th International Workshop on Cryptographic Hardware and Embedded Systems (CHES’13), volume 8086 of Lecture Notes in Computer Science, pages 1-20. Springer, 2013.

• B. Driessen, T. Güneysu, E. B. Kavun, O. Mischke, C. Paar, and T. Pöppelmann. IPSecco: A Lightweight and Reconfigurable IPSec Core. In 2012 International Conference on ReConFigurable Computing and FPGAs (ReConFig’12), pages 1-7. IEEE Computer Society, 2012.

• A. Moradi and O. Mischke. How Far Should Theory be from Practice? – Evaluation of a Countermeasure. In 14th International Workshop on Cryptographic Hardware and Embedded Systems (CHES’12), volume 7428 of Lecture Notes in Computer Science, pages 92-106. Springer, 2012.

• A. Moradi and O. Mischke. Glitch-Free Implementation of Masking in Modern FPGAs. In 2012 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST’12), pages 89-95. IEEE Computer Society, 2012.

• M. Varchola, T. Güneysu, O. Mischke. MicroECC: A Lightweight Reconfigurable Elliptic Curve Crypto-Processor. In 2011 International Conference on ReConFigurable Computing and FPGAs (ReConFig’11), pages 204-210. IEEE Computer Society, 2011.

• A. Moradi, O. Mischke, C. Paar, Y. Li, K. Ohta, and K. Sakiyama. On the Power of Fault Sensitivity Analysis and Collision Side-Channel Attacks in a Combined Setting. In 13th International Workshop on Cryptographic Hardware and Embedded Systems (CHES’11), volume 6917 of Lecture Notes in Computer Science, pages 292-311. Springer, 2011.

• A. Moradi, O. Mischke, and C. Paar. Practical Evaluation of DPA Countermeasures on Reconfigurable Hardware. In 2011 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST’11), pages 154-160. IEEE Computer Society, 2011.

• A. Moradi, O. Mischke, and T. Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In 12th International Workshop on Cryptographic Hardware and Embedded Systems (CHES’10), volume 6225 of Lecture Notes in Computer Science, pages 125-139. Springer, 2010.

Invited Talks

• 25/04/2013: Masking the AES S-box – Implementation Aspects and Side-Channel Analysis. In HGI Kolloquium, Bochum, Germany.

• 20/06/2012: Glitch-Free Implementation of Masking in Modern FPGAs. In 10th International Workshops on Cryptographic Architectures Embedded in Reconfigurable Devices (CryptArchi’12), Marcoux, France.

Attended Conferences and Workshops During Ph.D.

• CHES’14, Busan, South Korea (September 2014)

• FDTC’14, Busan, South Korea (September 2014)

• ARC’14, Vilamoura, Portugal (April 2014)

• ICICS’13, Beijing, China (November 2013)

• CHES’13, Santa Barbara, CA, US (August 2013)

• CRYPTO’13, Santa Barbara, CA, US (August 2013)

• Keccak & SHA-3 Day, Brussels, Belgium (March 2013)

• ECRYPT II Workshop: Crypto for 2020, Tenerife, Spain (January 2013)

• CHES’12, Leuven, Belgium (September 2012)

• ECRYPT II Workshop: Challenges in Security Engineering, Bochum, Germany (September 2012)

• CryptArchi’12, Marcoux, France (June 2012)

• HOST’12, San Francisco, CA, USA (June 2012)

• ReConFig’11, Cancún, Mexico (December 2011)

• CHES’11, Nara, Japan (September 2011)

• CryptArchi’11, Bochum, Germany (June 2011)

• HOST’11, San Diego, CA, USA (June 2011)
