PHYSICAL ATTACKS AND COUNTERMEASURES ON THE ADVANCED ENCRYPTION STANDARD
DISSERTATION
for the degree of Doktor-Ingenieur of the Faculty of Electrical Engineering and Information Technology at the Ruhr-Universität Bochum, Germany
by Oliver Marc Mischke Bochum, April 2015 Copyright © 2015 by Oliver Marc Mischke. All rights reserved. Printed in Germany. To Elfi, Norbert, and Melanie
Oliver Marc Mischke Place of birth: Frankfurt am Main, Germany Author’s contact information: [email protected] http://www.sha.rub.de/
Thesis Advisor: Prof. Dr.-Ing. Tim Güneysu, Ruhr-Universität Bochum, Germany. Secondary Referee: Prof. Dr.-Ing. Stefan Mangard, Graz University of Technology, Austria. Tertiary Referee: Dr. Amir Moradi, Ruhr-Universität Bochum, Germany. Thesis submitted: April 14, 2015. Thesis defense: May 18, 2015. Last revision: May 2, 2016.
Abstract
With the increasing pervasion of embedded computing devices in our everyday life, there arises also the need to protect these devices by means of strong cryptography. This may either be required to protect the intellectual property of a vendor, secure confidentiality of sensitive data, or to establish secure means of communication. The preferred cryptographic algorithm in many – especially commercial – applications is the Advanced Encryption Standard (AES). It was selected in 2001 by the National Institute of Standards and Technology (NIST) in a public competition, whose aim was to find the most suitable successor to the outdated Data Encryption Standard (DES) algorithm. Due to the short key size and low performance in software implementations, DES could no longer satisfy the requirements imposed by many applications. While AES remains a very secure algorithm considering a black-box attack scenario, care has to be taken when designing a physical implementation for embedded devices. Since these devices are in the field and must therefore be considered as operating in a hostile environment, they are susceptible to a multitude of physical attacks. This includes passive attacks like measuring the data-dependent power consumption while computing on sensitive data (so-called power analysis), and also active attacks where the device is forced into faulty behavior by being operated outside the defined operating conditions (e.g., clock or voltage spikes). Many countermeasures have been proposed to protect implementations of AES against those attacks, but the resistance of these countermeasures when deployed on actual hardware is seldom evaluated in sufficient detail. For example, even recently, some countermeasures were proposed claiming resistance to power analysis purely considering a Hamming Distance (HD) leakage metric on registers.
Considering that glitches in underlying hardware gates are a major reason for the leakage of supposedly masked data, designs based on such a pure HD metric can never provide a sufficient level of protection when implemented in hardware. This dissertation aims to address the problems arising from the practical utilization of the theoretical countermeasures in hardware implementations. We have evaluated the suitability of several countermeasure proposals for achieving a high level of resistance when implemented on FPGAs. Using collision attacks, we are able to detect leakages without relying on hypothetical power models, which are usually not able to adequately capture real device behavior. We also propose a new technique on how to implement a Boolean masking scheme in a glitch-free manner making use of special FPGA resources and characteristics. In addition, we present new variants of an active fault attack. They allow the recovery of data-dependent timing behavior of S-boxes and can thereby extract the secrets. It is also shown how a Zero-Value vulnerability in S-boxes implemented using a composite field approach can be exploited to break implementations even if they are equipped with sophisticated error detection schemes.
Keywords.
Physical Attacks, Side-Channel Attacks, Side-Channel Countermeasures, Power Analysis, Fault Analysis, Advanced Encryption Standard (AES), Masking Schemes, Concurrent Error Detection (CED), Collision Attacks, Fault Sensitivity Analysis (FSA), Glitches.
Kurzfassung
Physical Attacks and Countermeasures on the Advanced Encryption Standard Block Cipher
With the ongoing spread of embedded processors in everyday devices, the need to protect them by means of strong cryptography is growing as well. This may be required to protect intellectual property, to preserve the confidentiality of sensitive user data, or to establish secure communication channels. The preferred algorithm, especially for commercial applications, is the Advanced Encryption Standard (AES). AES was selected in 2001 by the National Institute of Standards and Technology (NIST), following a public competition, as the most suitable successor to the Data Encryption Standard (DES). Because of its insufficient key length and inadequate execution speed in software implementations, DES no longer met current and future requirements. Although AES can be regarded as mathematically highly secure, the picture is different for physical realizations of the algorithm in hardware. Since the devices are in the hands of the user, there are numerous opportunities to carry out physical attacks. One example of a passive attack is measuring the data-dependent power consumption while encryption and decryption operations are performed (power analysis). Active attacks, such as forcing a faulty computation in the device by means of voltage spikes (fault injection attacks), are also feasible in these application areas. Numerous countermeasures have been presented in the past to make physical attacks on AES implementations harder, but their effectiveness in practice has often been examined only insufficiently. Only recently, a countermeasure was presented whose security rests on the assumption that only the dynamic power consumption caused by overwriting registers in the circuit needs to be protected.
Given that physical effects at the gate level are one of the main reasons for the insecurity of supposedly protected implementations, such a countermeasure cannot meet expectations. The focus of this dissertation is the practical investigation of countermeasures that often have only a theoretical foundation. The achievable security level of a variety of countermeasures is evaluated on reconfigurable hardware platforms (FPGAs). With the help of collision attacks, it was possible to find even those information leaks that do not correspond to known theoretical models. One of the research contributions of this dissertation is a new implementation technique with which cryptographic circuits can be realized securely in FPGAs. In addition, two novel variants of an active fault injection attack are presented, which make it possible to determine the data-dependent propagation delay of signals in cryptographic S-boxes and thereby break the implementation. It is also demonstrated how, by means of a specific vulnerability of S-boxes implemented in an extension field, key extraction succeeds even when implementations are protected with dedicated error detection algorithms.
Keywords.
Physical Attacks, Side-Channel Attacks, Side-Channel Countermeasures, Power Analysis, Fault Analysis, Advanced Encryption Standard (AES), Masking Schemes, Error Detection, Collision Attacks, Fault Sensitivity Analysis (FSA), Glitches.
Acknowledgements
This thesis is the outcome of three and a half years in the Hardware Security group (SHA) of the Horst Görtz Institute for IT-Security (HGI), Ruhr University Bochum (RUB). I found not only colleagues and co-authors, but also close friends who made sure that my time in and outside the university was always enjoyable. The same is true for everyone in the Embedded Security group (EMSEC), with whom we shared the office space. Thanks to all of you for a wonderful time, I will not forget you! Special thanks go to my advisor Tim Güneysu, who accepted me as a PhD student when I wanted to leave industry to pursue an academic career. A big shout-out also to Amir Moradi, who took me with him on his exciting journey of side-channel research. Thanks a lot to both of you, without your guidance and support I would not be who I am today. I would also like to thank Stefan Mangard for taking the time to be my secondary referee and for providing me with excellent feedback on my thesis. Our groups would not have been the same without our non-scientific staff, Irmgard Kühn and Horst Edelmann, who kept a lot of administrative and technical issues away from us so that we could focus on the science. Thanks for all your support and for always providing kind words when needed. Another special shout-out goes to Elif Kavun and Alexander Wild, who endured sharing offices with me and made sure that we always had fun no matter how close the next deadline was. This thesis would also not have been possible without the help of all my co-authors (in alphabetical order): Georg Becker, Wayne Burleson, Benedikt Driessen, Thomas Eisenbarth, Tim Güneysu, Markus Kasper, Elif Kavun, Yang Li, Amir Moradi, Kazuo Ohta, Christof Paar, Thomas Pöppelmann, Christopher Pöpper, Kazuo Sakiyama, Pascal Sasdrich, Michal Varchola. Thanks to all of you for the fruitful discussions and collaborations!
I would also like to express my gratitude to all the people who proof-read this thesis, and especially Gesine Hinterwälder for keeping me sane during the last days of thesis writing and defense. All the loved ones, my family and friends: you provided me the support I needed all the way along... This work would not be possible without you, thank you for being there! Last but not least, to all other people I forgot to mention here: You know I owe you, thanks!
Table of Contents
Imprint
Abstract
Kurzfassung
Acknowledgements

1 Introduction
    1.1 Physical Attacks
    1.2 Motivation
    1.3 Research Contributions and Outline

I Side-Channel Analysis of AES

2 Preliminaries
    2.1 Introduction
    2.2 Glitches in Hardware Circuits
    2.3 Side-Channel Attacks
        2.3.1 Differential Power Analysis (DPA)
        2.3.2 Template Attacks
        2.3.3 Correlation-Enhanced Collision Attack (CECA)
    2.4 Side-Channel Countermeasures
        2.4.1 Hiding
        2.4.2 Masking
    2.5 Contributions

3 Analysis of Boolean Masking & Hiding
    3.1 Introduction
    3.2 Architecture
        3.2.1 32-bit Architecture
        3.2.2 Unrolled Architecture
        3.2.3 Target Platform and Measurement Setup
    3.3 Evaluation of Masking & Shuffling
        3.3.1 Profile A: No Countermeasures
        3.3.2 Profile B: Column-wise Shuffling Only
        3.3.3 Profile C: Masking Only
        3.3.4 Profile D: Masking and Column-wise Shuffling
        3.3.5 Profile E: Masking, Column-wise and Instance Shuffling
    3.4 Evaluation of Masking & Pipelining
        3.4.1 One Round per Clock Cycle
        3.4.2 Two Rounds per Clock Cycle
        3.4.3 Three Rounds per Clock Cycle
        3.4.4 Four Rounds per Clock Cycle
    3.5 Conclusion

4 Glitch-free Implementation of Masking
    4.1 Introduction
    4.2 Preliminaries
        4.2.1 Masked AES S-box
        4.2.2 Xilinx FPGA Resources
    4.3 Design
    4.4 Evaluation
    4.5 Conclusion

5 Implementation of a Glitch-resistant Masking Scheme
    5.1 Introduction
    5.2 Target Scheme
    5.3 Implementation & Caveats
    5.4 Conclusion

6 Are Dual Ciphers a Side-Channel Countermeasure?
    6.1 Introduction
    6.2 Dual Cipher Concept
    6.3 Design
    6.4 Evaluation
        6.4.1 Mask Reuse
        6.4.2 Concurrent Processing of Mask and the Masked Data
        6.4.3 Unbalance
        6.4.4 Zero Value
        6.4.5 Practical Investigations
    6.5 Conclusion

II Fault Analysis of AES

7 Preliminaries
    7.1 Introduction
    7.2 Fault Injection Attacks
        7.2.1 Differential Fault Analysis (DFA)
        7.2.2 Fault Sensitivity Analysis (FSA)
    7.3 Countermeasures
        7.3.1 Sensors
        7.3.2 Concurrent Error Detection (CED) Schemes
    7.4 Joint Motivation and Contributions

8 Correlation Timing Analysis
    8.1 Introduction
    8.2 Notations
        8.2.1 How to Measure the Timing
        8.2.2 Definitions
    8.3 Attack Description
    8.4 Evaluation
    8.5 Conclusion

9 Zero-Value Fault Sensitivity Analysis
    9.1 Introduction
    9.2 Designing an Architecture for Evaluation
    9.3 Zero-Value Fault Sensitivity Analysis (ZV-FSA)
        9.3.1 Zero-Value Vulnerability
        9.3.2 Attack Methodology
    9.4 Evaluation of the Attack
        9.4.1 Profile 1: Time redundancy-based CED
        9.4.2 Profile 2: Invariance-based CED
        9.4.3 Profile 3: Randomized permutations
    9.5 Conclusion

III Conclusion

10 Conclusion

IV Appendix

Bibliography
List of Abbreviations
List of Figures
List of Tables
About the Author
Author’s Publications
Chapter 1
Introduction
This chapter briefly introduces physical attacks, states the motivation, the research contributions, and outlines the structure of the complete dissertation.
Contents of this Chapter
1.1 Physical Attacks
1.2 Motivation
1.3 Research Contributions and Outline
We are surrounded by an increasing number of embedded computing platforms. Besides traditional smart cards used in the banking sector, mobile phones have also grown into powerful processing platforms and are often accompanied by other electronic gadgets like smart watches or fitness trackers. Home electronics are also upgraded with more and more computing and remote control capabilities, creating the so-called Smart Home. All those devices handle sensitive data and therefore require cryptographic solutions to protect the privacy of users and the confidentiality of data, and to allow secure means of communication.
There exists a multitude of cryptographic algorithms capable of providing the required level of protection considering a traditional black-box attack scenario, where only the input and output of a cipher can be observed. However, with the discovery of physical attacks in the late 1990s [Koc96,KJJ99], implementations without dedicated countermeasures can nowadays be broken easily. Instead of relying on mathematical cryptanalysis, attackers can measure the data-dependent timing behavior or power consumption of a device during the execution of a cryptographic algorithm and extract secret data.
This chapter provides a broader overview on physical attacks, the research motivation, and outlines the structure of the dissertation and the research contributions.
1.1 Physical Attacks
Physical attacks can be categorized in different ways. The main distinction is whether an attack is invasive or non-invasive. In addition, attacks can be further classified as active or passive. In an invasive attack a chip package is opened to allow probing or modifications. If the chip’s functionality is not altered but, e.g., micro-probes are used to read out confidential data from a bus, this constitutes a passive invasive attack. Forcing signals on bus lines or rerouting signals by means of a Focused Ion Beam (FIB) are examples of active invasive attacks. Often a third category – semi-invasive attacks – is used, in which the package is removed but the passivation layer remains intact. Faults can then be introduced using, e.g., lasers, UV light, X-rays, or alpha radiation. For further reading on (semi-)invasive attacks we refer to [Sko05]. Non-invasive attacks are harder to detect since no observable damage is done to the chip. This makes them especially dangerous: if an outside attack cannot be detected by the owner, mitigation steps like revocation or exchange of compromised keys might not be performed. Examples of active non-invasive attacks are the use of clock glitches or voltage spikes to introduce faulty behavior into the chip. Faults might make the chip accept wrong passwords as correct or skip a number of encryption rounds, thereby weakening the used cryptographic primitives and allowing mathematical cryptanalysis. Passive non-invasive attacks require neither chip manipulations nor fault injections. In those so-called side-channel attacks physical and logical characteristics of a device are measured. Examples include the timing [Koc96], power consumption [KJJ99], or electromagnetic emanation.
If the execution time, i.e., the instruction flow, depends on some secret information, or if the power consumption is dependent on the processed data, as is the case in CMOS technology, gathering this information is often enough to deduce the secrets. For more information on passive non-invasive attacks (especially power analysis), we refer to [MOP07] as a starting point. Chapter 2 and Chapter 7 also provide more insights into relevant non-invasive passive and active attacks.
1.2 Motivation
The AES algorithm is the de facto industry standard for symmetric encryption solutions. While many countermeasures have been proposed to protect hardware implementations of the AES from physical attacks, they often lack a sufficient practical security evaluation, i.e., an evaluation performed on a real hardware platform. The problem lies in the fact that many countermeasures are designed based on specific assumptions on the leakage of the intermediate states of an algorithm in hardware. Even if an implementation should in theory be perfectly secure against a specific attack by application of one of those countermeasures, it cannot provide the promised level of security if the practical leakage behavior of the target platform does not perfectly match the model.
In this dissertation we therefore aim to analyze different ways to secure hardware implementations of AES against non-invasive physical attacks. Our evaluations of countermeasures are performed on different Field Programmable Gate Array (FPGA) platforms, sold under the name SASEBO [sas], which were specifically designed to facilitate power analysis attacks. By utilizing collision attacks, we avoid relying on hypothetical power models which usually do not adequately cover real device behavior.
Using FPGA platforms has several advantages, not only for an evaluation of the considered algorithms, but also in commercial settings. Their reconfiguration capabilities enable the implementation and subsequent analysis of various configurations of a target countermeasure in a timely manner. The same reconfigurability in FPGA-based commercial products also makes it possible to upgrade any integrated cryptographic engines. This could be required in case flaws are found in the implementations of countermeasures or if new attacks make more robust security measures necessary. Because of the reduced time-to-market and the lower development costs, FPGAs are often a preferred solution compared to Application Specific Integrated Circuits (ASICs), if the product volume does not exceed a certain threshold.
1.3 Research Contributions and Outline
Besides this introductory chapter, this dissertation consists of three parts. Part I, Side-Channel Analysis (SCA) of AES, presents the relevant work we have performed on countermeasures to protect AES. It includes evaluations of externally proposed countermeasures as well as an own proposal. Part II, Fault Analysis (FA) of AES, showcases some advancements on FSA, namely Collision Timing Attack (CTA) and ZV-FSA. Part III, Conclusion, concludes the dissertation and outlines future directions.
Note that the thesis author has partnered with Amir Moradi during all the practical evaluations presented in this thesis. With the exception of Chapter 9, where the thesis author was solely responsible, the workload was split with the thesis author being responsible for all implementation aspects while the evaluations were led by Amir Moradi. To keep the focus on the thesis author’s contributions, the evaluation aspects of some chapters are therefore shortened or cut in this thesis, with references being given to the complete publications when required.
In the following, we list the parts and their chapters together with the respective contributions and the corresponding publications.
Chapter 1: A general introduction to this dissertation is given. This includes the motivation, the general structure, and a summary of the research contributions.
Part I
Chapter 2: This chapter provides an introduction into the field of side-channel attacks and countermeasures. Techniques relevant to this dissertation are discussed, which helps to sort the dissertation into the larger field of SCA research. In particular, the Correlation-Enhanced Collision Attack (CECA) is introduced, which will be used to evaluate the resistance of the analyzed countermeasures.
Chapter 3: First a Boolean masking scheme is evaluated, then different hiding techniques are also considered in combination to strengthen the resistance of the masking scheme under test. Results of this work were published at the IEEE International Symposium on Hardware-Oriented Security and Trust (HOST) in 2011 [MMP11a].
Chapter 4: As a follow-up to the previous chapter, the problem of glitches in hardware gates is addressed by manually mapping the masking scheme to the available FPGA resources and by using special enable signals. Results of this work were accepted to HOST in 2012 [MM12a].
Chapter 5: A masking scheme based on Shamir’s secret sharing and Multi-Party Computation (MPC) is presented with a focus on the intricate implementation aspects. Results from this research were presented at the International Workshop on Cryptographic Hardware and Embedded Systems (CHES) in 2013 [MM13b].
Chapter 6: This chapter analyzes whether the idea of using dual ciphers of AES as an SCA countermeasure has any merit. Detailed reasoning is provided on the infeasibility of this idea. Results of this work were published at the International Conference on Information and Communications Security (ICICS) in 2013 [MM13a].
Part II
Chapter 7: The purpose of this chapter is to provide an introduction into the field of non-invasive fault injection attacks and countermeasures. In particular, Fault Sensitivity Analysis (FSA) is described, which is the basis for the advanced attacks in the following chapters.
Chapter 8: The Collision Timing Attack (CTA) is presented, which merges the benefits of the CECA and FSA. Results from this work were first accepted to CHES in 2011 [MMP+11b]. An extended version was published in IEEE Transactions on Computers in 2013 [MMP13].
Chapter 9: Taking a closer look at the Zero-Value vulnerability of composite-field S-boxes, we present another advancement to FSA which is able to recover the secret from an AES implementation protected by a state-of-the-art CED scheme. Results of this work were published at the Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC) in 2014 [MMG14].
Part III
Chapter 10: We summarize our results and discuss open topics and guidelines for future work.
Note that the author of this thesis has contributed to other topics, such as Elliptic Curve Cryptography (ECC) accelerators; however, they are not included in this thesis due to its focus on physical attacks and countermeasures on AES. A detailed list of all publications can be found at the end of this thesis.
Part I
Side-Channel Analysis of AES
Chapter 2
Preliminaries
This chapter provides a brief introduction into side-channel attacks and countermeasures. In particular the Correlation-Enhanced Collision Attack (CECA) is described, which is used as a means to evaluate the security of our implemented countermeasures in the following chapters. Lastly, a brief outline of the remaining work presented in this SCA part of the dissertation is given.
Contents of this Chapter
2.1 Introduction
2.2 Glitches in Hardware Circuits
2.3 Side-Channel Attacks
2.4 Side-Channel Countermeasures
2.5 Contributions
2.1 Introduction
With the increasing pervasion of cryptography in more and more embedded systems, whether to protect the intellectual property of a vendor or to preserve privacy by allowing secure communications, the need for secure implementations of cryptographic primitives like AES is at an all-time high. These implementations should not only be resistant to classical attacks but also be protected against side-channel attacks like power analysis [KJJ99,MOP07]. Many different kinds of countermeasures have been proposed for the protection of either software implementations or hardware platforms (see [MOP07] for instance). Masking of sensitive values is one of the most considered solutions, and the scientific community has shown a huge interest in different aspects of masking countermeasures, e.g., [AG01,GT02,BGK04,OMPR05,NRR06,CB08,RP10,GPQ10,GPQ11,NRS11,PR11]. These schemes have been presented to the community in an arms race against the likewise evolving side-channel attacks. Implementations of most of these earlier masking schemes, while considered secure under the security model used at that time, still exhibit a detectable univariate first-order leakage which is caused by glitches in the combinatorial circuits of hardware. For instance, we can mention the schemes presented in [OMPR05] and [CB08], which have later been shown to be vulnerable in [MPO05] and [MME10], respectively. Taking these glitches into account, new masking schemes have been developed claiming glitch resistance, e.g., Threshold Implementation (TI) [NRR06,NRS08,NRS11]. In the following we briefly explain the problem of glitches in hardware and give an overview of the side-channel attacks and countermeasures which are relevant to this dissertation.
2.2 Glitches in Hardware Circuits
When the inputs of a hardware gate change, the output can change its value more than once. If the inputs do not arrive at exactly the same time, the first arriving input can toggle the circuit output and the next arriving input can create another output transition. Since combinatorial functions usually consist of many underlying hardware gates, glitches in one gate propagate and cause even more glitches in the following gates, since the inputs of those gates now switch multiple times. An illustrative example for this phenomenon is given in Fig. 2.1. As can be seen, when the input of the AES S-box changes, the circuit output toggles multiple times before reaching a stable state. Also note that the amount of activity and the critical path length appear to be data-dependent, which indicates that information can leak if the input depends on a secret.

Figure 2.1: Glitches at the AES S-box output after the input has changed from 0xaa to 0x00, to 0x55, or to 0xff (one panel per transition).

In addition to being aware of glitches in combinatorial functions, a designer should also make sure that control signals are hazardless. Otherwise, glitches on, e.g., multiplexer lines, can create additional sources of exploitable leakage.
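The effect of staggered input arrival can be illustrated with a toy model. The following is a minimal sketch, assuming an idealized gate that reacts instantly to each input change; the function name `gate_waveform` and the arrival times are illustrative assumptions and are not taken from the evaluated hardware.

```python
from operator import xor

def gate_waveform(gate, old, new, arrival):
    """Return the sequence of distinct output values of `gate` while its
    inputs switch from `old` to `new` at the given arrival times.
    Toy model (assumption): the gate itself has no delay."""
    inputs = list(old)
    wave = [gate(*inputs)]
    for t in sorted(set(arrival)):
        # Apply every input change that arrives at time t.
        for i, t_i in enumerate(arrival):
            if t_i == t:
                inputs[i] = new[i]
        value = gate(*inputs)
        if value != wave[-1]:
            wave.append(value)
    return wave

# Both inputs flip 0 -> 1, but input b arrives later than input a.
late = gate_waveform(xor, old=(0, 0), new=(1, 1), arrival=(1, 3))
# Same transition with perfectly aligned inputs.
aligned = gate_waveform(xor, old=(0, 0), new=(1, 1), arrival=(1, 1))
print(late, aligned)  # [0, 1, 0] [0]
```

In the staggered case the output toggles twice although its final value equals the initial one. In a real circuit each such transition consumes power, and the number of transitions depends on the processed values.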
2.3 Side-Channel Attacks
In the following, we explain the basic ideas of Differential Power Analysis (DPA), Correlation Power Analysis (CPA), template attacks, as well as the Correlation-Enhanced Collision Attack (CECA). For an in-depth mathematical explanation we refer to [MOP07] as well as the referenced original publications.
2.3.1 Differential Power Analysis (DPA)
The concept of Differential Power Analysis (DPA) was presented at CRYPTO’99 [KJJ99] by Kocher et al. It was the first work demonstrating how the secret key of a DES implementation can be recovered exploiting the data-dependent power consumption of a device implemented in CMOS technology, and was a main driving factor boosting the interest in research on side-channel attacks and countermeasures.

DPA aims to recover the complete secret key by employing a divide-and-conquer approach. An attacker must gather a large amount of power consumption traces for variable plaintexts and a fixed secret key, and be able to predict intermediate values of the algorithm, which requires access to either the plaintext or ciphertext of each computation. A choice for the intermediate values $V$ in case of AES could be the output of a first round S-box $i$, since it relies only on the corresponding plaintext byte $p_i$ and unknown secret key byte $k_i$: $V_{k,i} = \mathrm{Sbox}(p_i \oplus k_i)$.
Difference of Means
For each key candidate and power trace an attacker computes the hypothetical intermediate value and applies a selection function to sort each trace into one of two sets $S_0$ and $S_1$. As selection function usually the value of a single bit of the intermediate value $V_{k,i}$, for example the LSB, is used. Therefore, if it is 0, the trace is added to the set $S_0$, otherwise it belongs to $S_1$. When computing the difference of means between the two sets for each key candidate, the key candidate which yields the highest difference should be the correct one. The intuition is that if a wrong key candidate is chosen, the traces are sorted based on wrongly computed intermediate values. Therefore the difference between the two sets should be close to zero. On the other hand, if the correct key candidate is used to compute the intermediate values, then all traces in set $S_0$ have a certain bit set to 0 while it is 1 in set $S_1$, and a slight difference in the power consumption traces should be visible.
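The difference-of-means procedure can be sketched on simulated data. This is a minimal sketch under stated assumptions, not the measurement setup of this thesis: each trace is reduced to a single sample, the hypothetical device leaks the Hamming weight of the first-round S-box output plus Gaussian noise, and all function names, the key byte, and the noise level are chosen for illustration only.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: field inversion (0 maps to 0) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def simulate(key_byte, n_traces, noise_sigma, rng):
    """Simulated device (assumption): HW of the S-box output plus noise."""
    plaintexts = [rng.randrange(256) for _ in range(n_traces)]
    traces = [bin(SBOX[p ^ key_byte]).count("1") + rng.gauss(0.0, noise_sigma)
              for p in plaintexts]
    return plaintexts, traces

def difference_of_means(plaintexts, traces, bit=0):
    """Absolute difference of means between S1 and S0 for every key candidate."""
    scores = []
    for k in range(256):
        sums, counts = [0.0, 0.0], [0, 0]
        for p, t in zip(plaintexts, traces):
            b = (SBOX[p ^ k] >> bit) & 1  # selection function: one bit of V
            sums[b] += t
            counts[b] += 1
        scores.append(abs(sums[1] / counts[1] - sums[0] / counts[0]))
    return scores

rng = random.Random(2015)
SECRET = 0x2B  # hypothetical key byte, used only to generate the simulation
pts, trs = simulate(SECRET, 6000, 0.5, rng)
scores = difference_of_means(pts, trs)
recovered = max(range(256), key=scores.__getitem__)
print(hex(recovered))
```

Real traces contain many sample points, so the difference of means would be computed for every point and the largest peak over time would be taken per key candidate.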
Correlation Power Analysis (CPA)
Instead of using the difference of means between two sets of power traces, in a Correlation Power Analysis (CPA) [BCO04] the attacker computes Pearson’s correlation coefficient of each point of the real power measurement traces and hypothetical power values $H$. The hypothetical power values are generated by applying a power model to the computed intermediate values $V$. The strength of the attack therefore depends on finding a suitable power model which closely approximates the real power consumption of the target platform.

Choices for a suitable power model often are the Hamming Weight (HW) of an intermediate value, the Hamming Distance (HD) of two intermediate values, e.g., if one overwrites the other in a register, and the zero value power model. The zero value model works on the assumption that a combinatorial circuit will consume the least amount of power if all input bits are zero. This is for example the case in the AES S-box, since the zero input is a special case in the inversion part of the function and is mapped to the zero output.
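A CPA with the Hamming Weight model can be sketched in the same simulated setting. The sketch below assumes single-sample traces and a device that actually leaks the HW of the S-box output, which is exactly the situation in which the HW model fits; the helper names and parameters are illustrative assumptions.

```python
import math
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: field inversion (0 maps to 0) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
HW = [bin(v).count("1") for v in range(256)]  # Hamming Weight power model

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def cpa(plaintexts, traces):
    """Correlate the hypothetical power values H with the measured samples."""
    return [pearson([HW[SBOX[p ^ k]] for p in plaintexts], traces)
            for k in range(256)]

rng = random.Random(42)
SECRET = 0xC5  # hypothetical key byte, used only to generate the simulation
pts = [rng.randrange(256) for _ in range(1500)]
trs = [HW[SBOX[p ^ SECRET]] + rng.gauss(0.0, 1.0) for p in pts]
corr = cpa(pts, trs)
recovered = max(range(256), key=corr.__getitem__)
print(hex(recovered))
```

If the device leaked according to a different function, for example a Hamming Distance between register contents, the same code would only succeed after replacing the `HW` table with a matching model, which illustrates the dependence on the power model described above.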
2.3.2 Template Attacks
Template Attacks, which were first discussed in [CRR02], do not rely on a hypothetical power model. An attacker obtains a device similar to the target, on which he has more direct influence on the cryptographic operations. By setting a known key he can analyze the real-world power consumption of different operations and create power consumption templates in a profiling stage. During an attack, the measured power consumption is not compared to a hypothetical model, but to the created templates.
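The two stages can be sketched in a strongly simplified form. The sketch assumes single-sample traces and templates that consist only of a mean value per intermediate value, matched by least squares, which corresponds to maximum likelihood under equal-variance Gaussian noise; full template attacks typically use multivariate Gaussian templates. The leakage table `LEAK` stands for an arbitrary device behavior unknown to the attacker, and all names are illustrative assumptions.

```python
import random

def build_templates(inputs, traces, known_key):
    """Profiling stage: mean leakage for each value x = p XOR known_key."""
    sums, counts = [0.0] * 256, [0] * 256
    for p, t in zip(inputs, traces):
        x = p ^ known_key
        sums[x] += t
        counts[x] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def match_templates(templates, inputs, traces):
    """Attack stage: key candidate whose templates fit the traces best."""
    errors = []
    for k in range(256):
        errors.append(sum((t - templates[p ^ k]) ** 2
                          for p, t in zip(inputs, traces)))
    return min(range(256), key=errors.__getitem__)

rng = random.Random(7)
# Simulated device behavior (assumption): arbitrary leakage per processed value.
LEAK = [rng.gauss(0.0, 1.0) for _ in range(256)]

def measure(key, n_traces, noise_sigma=0.5):
    """Collect single-sample traces from the simulated device."""
    inputs = [rng.randrange(256) for _ in range(n_traces)]
    traces = [LEAK[p ^ key] + rng.gauss(0.0, noise_sigma) for p in inputs]
    return inputs, traces

# Profiling on a device under the attacker's control with a known key.
prof_in, prof_tr = measure(0x00, 10000)
templates = build_templates(prof_in, prof_tr, 0x00)

# Attack on the target device with an unknown key and far fewer traces.
SECRET = 0x3C  # hypothetical key byte, used only to generate the simulation
att_in, att_tr = measure(SECRET, 200)
recovered = match_templates(templates, att_in, att_tr)
print(hex(recovered))
```

The sketch reflects the typical cost distribution of the attack: the profiling stage needs many traces, while the attack stage gets by with comparatively few.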
2.3.3 Correlation-Enhanced Collision Attack (CECA)
The Correlation-Enhanced Collision Attack (CECA) has been introduced in [MME10]. In contrast to classical power analysis attacks, it requires neither a hypothetical power model –as in CPA–, nor an offline profiling phase –as in template attacks–. Also, unlike other collision attacks, it works well against certain masked implementations which still have some kind of first order leakage.
12 2.3. Side-Channel Attacks
Similar to other collision attacks [SWP03], it recovers the differences between parts of the secret, e.g., the XOR of key bytes in the case of AES, which finally allows an easy key recovery. Since big combinatorial circuits, e.g., AES S-boxes, are usually shared in the computation of a cipher round because of area constraints, collision attacks can compare the side-channel leakage of the same instance of the circuit at two different instants of time. Due to the bijectivity of the AES S-box, output collisions, i.e., two different S-box computations at times t1 and t2 taking the same value, require input collisions. That is, ∆ = i1 ⊕ i2 = k1 ⊕ k2 when Sbox(i1 ⊕ k1) = Sbox(i2 ⊕ k2), which is known as a linear collision attack on AES [Bog08]. The advantage of the correlation collision attack over other collision attacks is that it tries to detect the case in which the maximum number of collisions occurs when the correct ∆ is selected.
In order to perform the correlation collision attack, the power values corresponding to time t1 are sorted based on the values of input byte i1 such that all traces where i1 = α are summed and averaged to an average power consumption M1(α). Hence, due to the 256 possible values that α can take for AES, we get 256 different M1(α). Repeating the same procedure for input byte i2 on power values at time t2 leads to 256 different M2(α). The attack now assumes that the power consumption of the S-box computation for two different bytes at t1 and t2 has the same leakage if the same values are processed. For a known key difference ∆ = k1 ⊕ k2, the S-box inputs are the same if i1 = i2 ⊕ ∆, and hence the average power consumptions M1(α) ≈ M2(α ⊕ ∆) should be highly similar. Such a similarity can be detected by computing the correlation between all (M1, M2)-pairs for every possible key difference ∆. The correlation for the correct key difference ∆ is very high, as each (M1, M2)-pair is a direct collision, while for false ∆’s the correlation is low, as unrelated computations are correlated. Repeating the same scheme for different S-box computations corresponding to different input bytes recovers sufficient ∆’s to reveal the complete secret.
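The averaging and correlation steps described above can be sketched on simulated data as follows. This is a hedged illustration only: the leakage of the shared S-box instance is modeled by a hypothetical random table indexed by the S-box input (the attacker never uses this table), each trace is reduced to one sample at t1 and one at t2, and the key bytes, noise level, and function names are assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def mean_by_value(byte_values, samples):
    """M(alpha): average sample over all traces whose input byte equals alpha."""
    sums, counts = [0.0] * 256, [0] * 256
    for value, sample in zip(byte_values, samples):
        sums[value] += sample
        counts[value] += 1
    return [sums[a] / counts[a] if counts[a] else 0.0 for a in range(256)]

def ceca(i1, samples_t1, i2, samples_t2):
    """Correlate M1(alpha) with M2(alpha XOR delta) for every candidate delta."""
    m1 = mean_by_value(i1, samples_t1)
    m2 = mean_by_value(i2, samples_t2)
    return [pearson(m1, [m2[a ^ delta] for a in range(256)])
            for delta in range(256)]

rng = random.Random(2)
K1, K2 = 0x4F, 0xA3                      # secret key bytes (simulation only)
leak = [rng.gauss(0.0, 1.0) for _ in range(256)]  # unknown leakage, assumed

n_traces = 20000
i1 = [rng.randrange(256) for _ in range(n_traces)]
i2 = [rng.randrange(256) for _ in range(n_traces)]
samples_t1 = [leak[a ^ K1] + rng.gauss(0.0, 1.0) for a in i1]
samples_t2 = [leak[b ^ K2] + rng.gauss(0.0, 1.0) for b in i2]

scores = ceca(i1, samples_t1, i2, samples_t2)
recovered_delta = max(range(256), key=scores.__getitem__)
print(hex(recovered_delta), hex(K1 ^ K2))
```

Note that the attack code itself never evaluates a power model; it only relies on the same circuit leaking the same way for the same input.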
Since this attack compares the power consumption characteristics of two combinatorial circuits, as illustrated in [MME10], the best result is achieved by comparing the power consumption of one instance of the target combinatorial circuit, e.g., an S-box, in different clock cycles. Therefore, the best target for this attack is when only one instance of the S-box is implemented and shared in all S-box computations. If there are more instances of the S-box, e.g., a 32-bit architecture where four instances of the S-box are implemented and four S-box computations are performed in a clock cycle, the attack should compare the power consumption of each instance of the S-box in different clock cycles. If the architecture does not share any S-box for a round computation, and comparing the leakages of the same instance of the circuit in different clock cycles is not possible any more, the effectiveness of such an attack depends strongly on the similarity of power consumption characteristics of different instances of the S-box whose leakages are compared.
2.4 Side-Channel Countermeasures
The two main classes of side-channel countermeasures are hiding and masking. A good overview is given in [MOP07], and the following explanations closely follow the descriptions of that publication.
2.4.1 Hiding
In implementations applying purely hiding countermeasures, the same intermediate values as in unprotected implementations are computed. The goal of the countermeasure is to still have a power consumption which is independent of the intermediate values, and in theory this could be achieved by either having the same computation consume random amounts of power in different clock cycles, or make every possible computation consume exactly the same amount of power.
Time Domain
One practical solution is to apply hiding in the time domain. By inserting dummy clock cycles, the power consumption traces can be de-synchronized, making it very hard for an attacker to properly align the traces to attack a specific operation. Care has to be taken that the dummy operations cannot be easily distinguished from real operations, since this would render the countermeasure ineffective. Another solution is to shuffle the order of executions. Let us consider an AES design employing a single S-box to compute the whole SubBytes transformation on 16 input bytes in 16 consecutive clock cycles. If the order in which the different inputs are processed by the S-box is randomized, then an attacker does not know in which cycle the target S-box of his attack was processed, which makes performing a successful attack more difficult.
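The shuffling example above can be illustrated with a small simulation. The sketch assumes a uniformly random processing order drawn independently for every run; the function name, seed, and number of runs are hypothetical. It counts in which of the 16 clock cycles a fixed target byte ends up.

```python
import random

def shuffled_schedule(rng, n_ops=16):
    """Random processing order: schedule[c] is the byte index handled in cycle c."""
    order = list(range(n_ops))
    rng.shuffle(order)
    return order

rng = random.Random(7)
RUNS = 16000
TARGET_BYTE = 5

hits = [0] * 16
for _ in range(RUNS):
    schedule = shuffled_schedule(rng)
    hits[schedule.index(TARGET_BYTE)] += 1  # cycle in which the target was processed

frequencies = [h / RUNS for h in hits]
print([round(f, 3) for f in frequencies])   # each value is close to 1/16
```

With a uniform order the target operation appears in any given cycle in only about one out of 16 runs, so an attacker who looks at a fixed cycle sees the targeted leakage correspondingly less often.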
Amplitude Domain
Instead of applying hiding in the time domain, one can also try to influence the power consumption in the amplitude domain. By adding noise generators, the Signal to Noise Ratio (SNR) of the real cryptographic operations can be reduced. Also some special logic styles were proposed [PM05, TV04, TAV02] to reduce the leakage of a signal or to consume the same amount of power for any given input.
2.4.2 Masking
In designs implementing masking, the goal is to perform all computations only on masked intermediate states xm, where x denotes an unmasked intermediate value and m is a mask: xm = x ∗ m. In Boolean masking schemes, and considering the AES algorithm, the ∗ operation is the exclusive-OR ⊕, while in multiplicative masking schemes it denotes a finite field multiplication ×. The idea is that if a device performs operations only on masked data, which is randomized for each computation, then the data-dependent power consumption no longer leaks information about the unmasked intermediate states. Care has to be taken that all intermediate states are masked at all times, and that operations with two masked values still lead to a masked result and do not leak additional information during storage. As an example, let us consider a Boolean masked value xm which is stored in a register. If it is overwritten in a later clock cycle by another masked value ym which uses the same mask m, then the Hamming distance of the two unmasked values will leak:
HD(xm, ym) = HW(xm ⊕ ym) = HW(x ⊕ m ⊕ y ⊕ m) = HW(x ⊕ y) = HD(x, y).
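The register-overwrite leak can be checked exhaustively for 8-bit values. The following sketch uses hypothetical helper names and two arbitrarily chosen example values; it compares the Hamming distance observed when the same mask is reused with the one observed when two independent masks are used.

```python
def hamming_weight(value):
    """Number of set bits."""
    return bin(value).count("1")

def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return hamming_weight(a ^ b)

def register_leak(x, y, mask_x, mask_y):
    """HD seen when masked value x^mask_x is overwritten by y^mask_y."""
    return hamming_distance(x ^ mask_x, y ^ mask_y)

x, y = 0x3A, 0xC5  # example unmasked values (x ^ y = 0xFF)

# Same mask for both values: the leak is constant and equals HD(x, y).
reused = {register_leak(x, y, m, m) for m in range(256)}

# Independent masks: every Hamming distance from 0 to 8 can occur.
fresh = {register_leak(x, y, mx, my) for mx in range(256) for my in range(256)}

print(reused, hamming_distance(x, y))
print(sorted(fresh))
```

With a reused mask the observable value does not depend on the mask at all, which is exactly the situation the masking scheme was meant to prevent.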
Boolean and Multiplicative Masking
In a cryptographic algorithm like AES, some circuit parts perform linear operations, e.g., key addition or MixColumns, and some perform non-linear operations, e.g., SubBytes. Masking a linear function L(x) is trivial using a Boolean masking scheme, since the relation L(x ⊕ m) = L(x) ⊕ L(m) holds. However, this does not work with non-linear functions NL(x), since NL(x ⊕ m) ≠ NL(x) ⊕ NL(m). A non-linear function like the inversion part of the AES S-box could be masked by a multiplicative scheme, since NL(x × m) = NL(x) × NL(m). Using a multiplicative masking scheme, however, has one important weakness: for all possible values of the mask, the zero input is not masked, since multiplication with zero always yields a zero result [AG01, GT02]. There are two options for implementing a Boolean masking scheme for S-boxes. The first is to implement the S-box as a masked table lookup [GM11b, SMMG15], using independent input masks m and output masks n, which is recomputed with fresh masks before every computation:

yn = Sboxmasked(xm), with Sboxmasked(xm) = Sbox(xm ⊕ m) ⊕ n = Sbox(x) ⊕ n.

For the AES S-box, the second implementation option is using a tower-field approach [Paa94, SMTM01]. If the inversion is performed in GF(2^2) instead of GF(2^8), it becomes linear, since inversion in GF(2^2) is equivalent to squaring, which can be implemented as a bit swap [OMPR05, CB08]. Note that it has been demonstrated that naive implementations of those schemes are still susceptible to first-order SCA attacks due to glitches in the circuit (see Section 2.2).
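As an illustration of the first option, the following sketch recomputes a masked lookup table for given masks m and n and checks that a lookup with a masked input returns the correctly masked output. It is a software model under stated assumptions, not the hardware design discussed in this thesis: the AES S-box is generated from its definition (inversion in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 followed by the affine transformation), and all function names are hypothetical. The last lines also show the zero-value problem of multiplicative masking.

```python
import random

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Inversion in GF(2^8) as a^254; zero is mapped to zero."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def rotl8(value, shift):
    return ((value << shift) | (value >> (8 - shift))) & 0xFF

def aes_sbox(x):
    """AES S-box: field inversion followed by the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

SBOX = [aes_sbox(x) for x in range(256)]

def recompute_masked_table(m, n):
    """Table T with T[x ^ m] = Sbox(x) ^ n for input mask m and output mask n."""
    return [SBOX[a ^ m] ^ n for a in range(256)]

rng = random.Random(3)
m, n = rng.randrange(256), rng.randrange(256)   # fresh masks for this run
table = recompute_masked_table(m, n)

x = 0x53
masked_output = table[x ^ m]        # computed without ever handling plain x
print(hex(masked_output ^ n))       # unmasking yields Sbox(0x53) = 0xed

# Zero-value problem of multiplicative masking: 0 * mask is always 0.
zero_products = {gf_mul(0, mask) for mask in range(1, 256)}
print(zero_products)
```

The recomputation touches all 256 table entries for every new pair of masks, which is the cost of this approach in both time and memory.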
Glitch-Resistant Masking
Several masking schemes have been proposed to deal with the issue of glitches in hardware, among them Threshold Implementation (TI), which utilizes Boolean masking and multi-party computation, and another scheme using Shamir’s secret sharing and multi-party computation [PR11]. Creating a TI-based implementation of the AES S-box is very challenging, and so far only proposals masking certain parts of the AES S-box exist [MPL+11, BGN+14]. Note that smaller S-boxes used in schemes other than AES, e.g., 4-bit S-boxes, can be efficiently masked using TI [BNN+12, BNN+15]. The scheme of [PR11] is presented in great detail in Section 5.2.
2.5 Contributions
The practical evaluation of different side-channel countermeasures on our targeted FPGA evaluation platform is the topic of this part of the dissertation. Since we are unable to make modifications to the hardware platform, e.g., adding analog sensors or implementing the actual hardware resources in a special logic style, we kept our focus on algorithmic countermeasures, i.e., hiding in the time domain and different masking schemes. Since hiding in the time domain mainly reduces the SNR, thereby requiring an attacker to gather a larger amount of power traces and/or use sophisticated alignment techniques, we have (with one exception in Chapter 3) avoided using hiding countermeasures when evaluating masking schemes in order to create a best-case scenario for an attacker. This includes reducing the noise by making sure that only the design under test is active on the target FPGA during our practical experiments. Using the Correlation-Enhanced Collision Attack (CECA) we are also independent of a hypothetical power model and do not require a profiling stage, which leads to, in our eyes, very strong but at the same time realistic assumptions on the capabilities of an attacker.

In Chapter 3 we analyze the robustness of a Boolean masking scheme [CB08] versus the CECA. We evaluate the achievable resistance of the countermeasure either alone or in combination with time-randomization or noise-addition hiding techniques (shuffling, unrolling). We show that the masking scheme, while perfectly secure under the theoretical model for which it was developed, in practice exhibits an exploitable first-order leakage because of glitches in the circuit. While neither the masking nor the hiding countermeasure is able to provide a sufficient level of protection, combining both weak techniques leads to a considerably higher number (>1 million) of required traces to break the implementation.
As alternative to implementing a glitch-resistant masking scheme, Chapter 4 proposes a new technique to implement the Boolean masking scheme used in the previous Chapter 3 in a glitch-free manner. This is achieved by manually mapping the masking circuit to the
available resources of the FPGA evaluation platform and avoiding any glitches by means of special enable signals. The resulting implementation showed no vulnerability to the CECA even using 50 million power measurements. As a comparison, we have also implemented the glitch-resistant scheme of [PR11] in Chapter 5. While the security claims have been confirmed during practical evaluation, the extremely high area demand and low throughput of the design, even in the simplest setting, hinder a practical utilization of the scheme. The question about the validity of using AES dual ciphers as an SCA countermeasure was raised as early as 2002 [BB02a]. However, a sufficient practical evaluation of this idea has never been performed. In Chapter 6 we provide detailed reasoning why no variant of AES dual ciphers is able to provide a meaningful level of SCA resistance.
Chapter 3
Analysis of Boolean Masking & Hiding
This chapter examines the effectiveness of well-known DPA countermeasures versus the Correlation-Enhanced Collision Attack. The considered countermeasures include masking, shuffling, and noise addition, when applied in hardware. Practical evaluations, which have all been performed using power traces acquired from a SASEBO platform, show an increase in the number of required traces, e.g., from 10,000 to 1,500,000, when combining different countermeasures. This study allows for a fair comparison between the hardware countermeasures and helps identify an appropriate key lifetime.
Contents of this Chapter
3.1 Introduction
3.2 Architecture
3.3 Evaluation of Masking & Shuffling
3.4 Evaluation of Masking & Pipelining
3.5 Conclusion
3.1 Introduction
When the Correlation-Enhanced Collision Attack (CECA) was first published at CHES 2010 [MME10], its target was an 8-bit serialized masked implementation of the AES. Since this architecture computes only one S-box per clock cycle, it provides a suitable situation for a collision attack. This raises the question of how efficient such an attack is against different architectures and different countermeasures. Although some notes about a 32-bit architecture have been given in [MME10], a comprehensive study of the influence of different randomizing and noise-adding countermeasures seems necessary. Two different architectures are considered in this chapter. The first one,
which has a 32-bit data path, does not employ any shift register, in contrast to the architecture of [MME10], and implements four S-box instances. Both masking and shuffling are options which can be enabled in this architecture. The second architecture is based on the unrolling scheme proposed in [BGSD10] as a DPA countermeasure. Using this approach, we are able to execute a whole AES encryption in 10, 5, 4, or 3 clock cycles.
We investigate the efficiency of the CECA when masking, shuffling, unrolling, and their possible combinations are enabled in our target architectures. The practical evaluations are performed using power consumption traces measured from the same platform as in [MME10], i.e., a Virtex-II Pro FPGA. The result of these investigations can be summarized as follows: none of the previously mentioned countermeasures provides perfect resistance against the considered attack. The reasons, which are well known to the community, are that (i) implementing masking in hardware still leads to a kind of first-order leakage caused by glitches in the circuit that is detectable by our considered attack, (ii) shuffling, which randomizes in the time domain, is also defeated by increasing the number of traces or using a “windowing” scheme, and (iii) unrolling, which seems to have the most effect on collision-like attacks, adds noise to the measurements and is overcome by averaging, which is done inherently by the Correlation-Enhanced Collision Attack. However, enabling each (or a combination) of these countermeasures leads to an increase in the number of required traces. Depending on the target application this can be considered an important parameter helping to define the key lifetime of the device under evaluation.
The remainder of this chapter is organized as follows: The different implemented designs and countermeasures are presented in Section 3.2. The evaluation results of the attack on the 32-bit architecture employing masking and shuffling are shown in Section 3.3, while the results of the attack on the unrolled architectures are depicted in Section 3.4. A conclusion is finally given in Section 3.5. Note that the Correlation-Enhanced Collision Attack (CECA) has already been introduced in Section 2.3.3.
Results of this work were published at HOST 2011 [MMP11a] in a joint work with Amir Moradi.
3.2 Architecture
This section gives an overview of the architectures used to evaluate the countermeasures. We also describe characteristics of our implementation platform and the setup used for side-channel measurements.
Figure 3.1: Architecture of the 32-bit implementation, allowing masking as well as column-wise and S-box instance shuffling. [Block diagram omitted; its labels include the state, key, mask m, and mask n registers, ShiftRows, KeySchedule, the Switching Matrix blocks with the Sel_Col and Instance Shuffling controls, four masked S-boxes, masked MixColumns, AddRoundKey, Add Mask m, and Remove Mask n on 32-bit connections.]
3.2.1 32-bit Architecture
The objective of choosing a 32-bit architecture was to use a real-world scenario during our measurements. While choosing an 8-bit architecture would be the best choice from an attacker’s point-of-view because of the reduced amount of noise and the option to observe each processed byte independently, selecting a 32-bit architecture provides a good
compromise between size and throughput while still enabling us to use countermeasures like shuffling.
32-bit architecture in this context means that all module interconnections are using a 32-bit datapath including the outside ports. Module internals are not bound to this restriction, so the ShiftRows transformation and the key scheduling are performed in a single clock cycle using an internal 128-bit datapath.
The complete datapath of the AES engine, excluding the key schedule and key registers, is masked. Similarly to the scheme used in [MME10], we apply the additively masked AES S-box by Canright and Batina [CB08] that uses a tower-field approach [Paa94] to implement the inversion in GF(2^2). Each S-box operation needs two mask bytes, one for the input masking and one to mask the output byte. These mask values are independent of each other, and are generated by a PRNG with uniform distribution. The general order of mask switches is as follows: at the beginning of an encryption each input byte is masked by the corresponding input mask (let us call it m for one input byte)¹. While passing through the inversion part of the S-box, the input mask is replaced by the output mask n. This process is reversed by the masked MixColumns module, while in the last round, where no MixColumns operation is performed, the S-box output mask n is removed after the final AddRoundKey operation, and the final result of the AES operation is stored unmasked in the state register. An overview of this architecture is depicted in Fig. 3.1.
Since the S-boxes are shared between the normal data operation and the key schedule and four instances of the S-box are implemented in our target architecture, each round of the AES needs five clock cycles. During the first clock cycle the S-boxes are used by the key schedule unit. Since the key schedule is not masked, the input and output masks of the S-boxes are set to all zero by means of AND gates. Simultaneously, while the S-boxes are utilized by the key schedule, the ShiftRows transformation is performed both on the data state and the mask m registers. This is necessary because the data state is masked by the m masks which therefore have to be transformed as well to keep the relation between the mask bytes and the data state. In the next four clock cycles the SubBytes, MixColumns, and AddRoundkey transformations are applied on one column at a time. This allows implementing the second countermeasure, i.e., shuffling, which needs each column of the data, mask, or key registers to be selected and stored independently. It is therefore possible to switch the processing order of the columns during each encryption (we call this option column-wise shuffling). We also implemented another option to switch which byte of each column is processed by which S-box instance (we call this option instance shuffling). It should be noted that the same procedure and options have been considered for the decryption operation which shares some building blocks with the encryption unit.
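The two shuffling options described above can be modeled as a per-round processing schedule. The sketch below is a behavioral illustration only: it covers the four clock cycles in which the columns are processed (not the preceding key schedule cycle), and it assumes that a fresh permutation is drawn for every call, which the text does not specify. The function and parameter names are hypothetical.

```python
import random

def round_schedule(rng, column_shuffling=True, instance_shuffling=True):
    """Return four (column, rows) pairs, one per clock cycle.

    rows[s] is the row index of the byte that S-box instance s processes
    in that cycle. Without shuffling, columns are taken in order 0..3 and
    instance s always handles row s.
    """
    columns = [0, 1, 2, 3]
    if column_shuffling:
        rng.shuffle(columns)            # column-wise shuffling
    schedule = []
    for column in columns:
        rows = [0, 1, 2, 3]
        if instance_shuffling:
            rng.shuffle(rows)           # instance shuffling within the column
        schedule.append((column, rows))
    return schedule

rng = random.Random(11)
plain = round_schedule(rng, column_shuffling=False, instance_shuffling=False)
shuffled = round_schedule(rng)

for cycle, (column, rows) in enumerate(shuffled):
    print(f"cycle {cycle}: column {column}, rows per S-box instance {rows}")
```

Column-wise shuffling alone changes when a byte is processed but not which S-box instance handles it; instance shuffling additionally changes the circuit that processes it.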
¹ No mask reuse is applied in the computation of a cipher round; two 128-bit masks are required for each encryption or decryption run.
Our final architecture has different options, i.e., masking, column-wise shuffling, and instance shuffling, which can be selected during the operation of the target device. Based on these options we define five different profiles, and later investigate how well each of them withstands a Correlation-Enhanced Collision Attack. These profiles are as follows:
Profile A: no countermeasure, using always zero for all the masks and turning off both shuffling options
Profile B: column-wise shuffling only
Profile C: masking only
Profile D: masking and column-wise shuffling
Profile E: masking, column-wise shuffling and instance shuffling
3.2.2 Unrolled Architecture
In addition to the masking and shuffling schemes we have tried to examine the effectiveness of unrolling, which has been explained in [BGSD10], in counteracting Correlation-Enhanced Collision Attacks. An overview of an unrolled design is shown in Fig. 3.2. In
order to reduce the required area of our unrolled architecture, we chose the very compact unmasked S-box by Canright [Can05] in an encryption-only scenario. Since the key scheduling is unrolled as well, twenty S-boxes are needed to implement each round function. We implemented four different designs, varying the number of rounds which are computed per clock cycle. In the smallest design only one complete round is computed in each clock cycle. The second design features two complete rounds, the third computes three, and the last design computes four complete rounds of the AES in each clock cycle, creating a highly glitching circuit.

Figure 3.2: Architecture of the unrolled designs. [Block diagram omitted; its labels include the plaintext, key, and ciphertext ports, the data and key state registers, and repeated round functions consisting of ShiftRows, S-boxes, MixCol, AddRoundkey, and the round key computation.]
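The relation between the number of unrolled rounds and the number of clock cycles per encryption can be written down as simple arithmetic. The sketch assumes the ten rounds of AES-128 and that a final, partially used clock cycle still counts as a full cycle; how the designs handle that last cycle is not stated here. The S-box count per design is an inference from the twenty S-boxes per round function (16 for SubBytes and 4 for the key schedule), and the function names are hypothetical.

```python
import math

AES128_ROUNDS = 10
SBOXES_PER_ROUND = 16 + 4   # SubBytes plus key schedule, as stated in the text

def cycles_per_encryption(rounds_per_cycle):
    """Clock cycles needed when several rounds are computed per cycle."""
    return math.ceil(AES128_ROUNDS / rounds_per_cycle)

def sbox_instances(rounds_per_cycle):
    """S-box instances in the combinatorial circuit (inferred, not measured)."""
    return SBOXES_PER_ROUND * rounds_per_cycle

for rounds in (1, 2, 3, 4):
    print(rounds, cycles_per_encryption(rounds), sbox_instances(rounds))
```

The four designs therefore need 10, 5, 4, and 3 clock cycles, while the amount of simultaneously switching logic grows with every additional unrolled round.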
3.2.3 Target Platform and Measurement Setup
All designs have been implemented on a Xilinx Virtex-II Pro FPGA (xc2vp7) of a SASEBO circuit board, which is particularly designed for side-channel attack experiments [sas]. All experiments are performed on the power consumption of the Virtex FPGA containing our implementation. Measurements are performed using a LeCroy WP715Zi 1.5 GHz oscilloscope at a sampling rate of 1 GS/s and by means of a differential probe which captures the voltage drop over a 1 Ω resistor in the VDD (1.6 V) supply path of the FPGA. In all the experiments the clock signal of our cryptographic engine is supplied by a stable oscillator at a frequency of 3 MHz.
3.3 Evaluation of Masking & Shuffling
The later parts of this section deal with evaluating the resistance/vulnerability of different profiles of the 32-bit architecture addressed in Section 3.2.1 to Correlation-Enhanced Collision Attacks. Note that the evaluation described in this section was performed by Amir Moradi, not the dissertation author.
3.3.1 Profile A: No Countermeasures
Performing the Correlation-Enhanced Collision Attack as described in Section 2.3.3, we start by creating two sets of 256 mean traces according to the plaintext byte values corresponding to the target S-box instances. As explained in [MME10], the best situation for a collision attack is when the side-channel leakages of an S-box instance in two different clock cycles are compared. We therefore have selected two plaintext bytes which are processed by the same S-box instance. Looking at the variance traces computed over a set of mean traces, e.g., Fig. 3.3(b), clarifies in which clock cycle the corresponding plaintext byte is processed.
Figure 3.3: Profile A: sample power trace (a), variance of the mean traces (b), the result of a CECA (c) using 1,000,000 traces and (d) over the number of traces. [Plots omitted; (a) shows voltage in mV and (b) variance in mV² over time in µs.]
In order to perform the attack, aiming at recovering the relation between two key bytes, the mean traces must first be aligned to have both S-box executions virtually at the same instant of time. Though 1,000,000 traces have been used for the attack result depicted in Fig. 3.3(c), plotting the result over the number of traces, i.e., Fig. 3.3(d), shows that in this case even 10,000 traces are sufficient to mount a successful attack.
3.3.2 Profile B: Column-wise Shuffling Only
When the column-wise shuffling is enabled, the target S-box computation does not take place at a specific clock cycle. But the computation will always be performed by the same S-box instance. Therefore, as can be seen in Fig. 3.4(b), the variance over the mean traces shows high values in all four clock cycles when the S-boxes are computed. This can in fact be considered as evidence of the existing time-randomization countermeasure. Without taking this countermeasure into account and just performing the last attack², as depicted in Fig. 3.4(c), the correct difference between the target key bytes is still detectable and appears in all four mentioned clock cycles. However, it requires a higher
number of traces, i.e., 50,000, since on average only one fourth of the mean traces are aligned to each other.

² It is not necessary to shift the mean traces and align them in this case.

Figure 3.4: Profile B: sample power trace (a), variance of the mean traces (b), the result of a CECA (c) using 1,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing. [Plots omitted; (a) shows voltage in mV and (b) variance in mV² over time in µs.]
As mentioned in [MME10], one can divide a trace into clock cycles and sum them up to have a sum trace with a length of one clock cycle, which is known as “windowing” (integration over a sliding comb) [CCD00]. Doing so on the traces of this profile, consid- ering only those four clock cycles where the SubBytes transformations of the first round are performed, guarantees that the mean traces, which now are as long as a clock cy- cle, are aligned and contain the desired information. Performing the same attack on the
combined traces reveals the correct secret and decreases the required number of traces to 20,000, as depicted in Fig. 3.4(e) and Fig. 3.4(f).

Figure 3.5: Profile C: sample power trace (a), variance of the mean traces (b), the result of a CECA (c) using 5,000,000 traces and (d) over the number of traces. [Plots omitted; (a) shows voltage in mV and (b) variance in mV² over time in µs.]
3.3.3 Profile C: Masking Only
While column-wise shuffling had low area and power overheads, implementing the masked S-boxes (as explained in Section 3.2.1) needs significantly more area and leads to a high power consumption. This can be seen when comparing the sample power traces of these two architectures in Fig. 3.4(a) and Fig. 3.5(a).
Since no shuffling is enabled in this profile, the mean traces must be aligned according to the clock cycles reported by the variance traces, e.g., Fig. 3.5(b). It should be noted that, as expressed in [MME10] and as can be seen in the variance trace, the masked S-box implementation still has a first-order leakage which is due to the glitches that occur in the combinatorial functions. The fact that the variance is lower than that of the previous profiles implies that a higher number of traces is necessary to reveal the correct key relation. The result of the attack using 5,000,000 traces is shown in Fig. 3.5(c), but based on Fig. 3.5(d) 150,000 measurements are sufficient to reveal the desired secret.
Figure 3.6: Profile D: sample power trace (a), variance of the mean traces (b), the result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing. [Plots omitted; (a) shows voltage in mV and (b) variance in mV² over time in µs.]
3.3.4 Profile D: Masking and Column-wise Shuffling
An implementation that combines both of the previously applied countermeasures proves to be highly resistant against the CECA. Observing the variance traces, e.g., Fig. 3.6(b), shows that the dependency of the mean traces on the plaintext byte values is decreased because of the used masking scheme, and is spread over four clock cycles because of the column-wise shuffling. Fig. 3.6(c) shows the result of the attack using 10,000,000 traces. As expected after comparing the variance traces to those of the previous profiles, Fig. 3.6(d) reports around 4,500,000 as the number of traces we
required, which is considerably higher than in the previous cases. If the same windowing scheme is used to overcome the time-randomization effect of the shuffling, the number of traces decreases to 700,000, as depicted in Fig. 3.6(e) and Fig. 3.6(f).

Figure 3.7: Profile E: sample power trace (a), variance of the mean traces (b), the result of a CECA (c) using 10,000,000 traces, (d) over the number of traces, (e) using windowing, and (f) over the number of traces using windowing. [Plots omitted; (a) shows voltage in mV and (b) variance in mV² over time in µs.]
3.3.5 Profile E: Masking, Column-wise and Instance Shuffling
As stated before, the Correlation-Enhanced Collision Attack works best if the target plaintext bytes are processed by the same S-box instance. Randomizing which of the
four S-box instances computes which byte of a column (called instance shuffling in Section 3.2.1) should further increase the resistance of the implementation. This is confirmed by comparing a variance trace over the mean traces of this profile (Fig. 3.7(b)) with that of Profile D. The results of the attack using 10,000,000 traces, both with and without windowing, together with the required number of traces, are shown in Fig. 3.7. While around 5,500,000 traces are necessary to distinguish the correct guess when performing the considered attack in a straightforward manner, employing windowing reduces this number to 1,500,000, which is significantly higher than for the previous profiles.
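The windowing step used against the shuffled profiles in this section can be sketched as follows. The example is a simplified model: it assumes an integer number of samples per clock cycle (the measurement setup with 1 GS/s and a 3 MHz clock gives about 333.3 samples per cycle, so real traces need additional care), a constant baseline, and a single leaking sample at a fixed offset inside the cycle. The function names and values are hypothetical.

```python
def window(trace, samples_per_cycle, cycles):
    """Sum the selected clock cycles into one trace of one cycle length."""
    combined = [0.0] * samples_per_cycle
    for cycle in cycles:
        start = cycle * samples_per_cycle
        for offset in range(samples_per_cycle):
            combined[offset] += trace[start + offset]
    return combined

def toy_trace(target_cycle, samples_per_cycle=4, n_cycles=4):
    """Baseline of 1.0 everywhere plus a leak of 5.0 in the target cycle."""
    trace = [1.0] * (samples_per_cycle * n_cycles)
    trace[target_cycle * samples_per_cycle + 2] += 5.0
    return trace

# Column-wise shuffling moves the leak to any of the four cycles; after
# windowing it always appears at the same offset of the combined trace.
combined = [window(toy_trace(c), 4, range(4)) for c in range(4)]
for trace in combined:
    print(trace)
```

The price of this alignment is that the combined trace also contains the summed activity of the three cycles that do not process the target byte, which acts as additional noise.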
3.4 Evaluation of Masking & Pipelining
The same attacks which have been performed on the 32-bit architecture are performed on the traces measured from the unrolled implementations. Since in our smallest unrolled architecture one round of the cipher encryption is performed per clock cycle, there is no shared hardware unit during the computation of a round. Therefore, one cannot compare the side-channel leakage of a unit in different clock cycles, and needs to consider different S-box instances to perform a collision attack. This, of course, decreases the efficiency of the attack, since two different circuits are compared which, even with the same netlist, are differently placed and routed by the hardware design tool. Moreover, the switching noise level generated by the other parts of the circuit, e.g., S-boxes, which are not considered in the attack is considerably higher than in the case of the 32-bit architecture. We therefore expected a higher number of required traces and collected more traces compared to the 32-bit cases. The results of this attack on different unrolled implementations are given in this section. Again, the practical side-channel measurements in this section have been performed by Amir Moradi.
3.4.1 One Round per Clock Cycle
Observing a sample power trace of a whole encryption run by our smallest unrolled implementation, depicted in Fig. 3.8(a), verifies the execution of one round per clock cycle.3 As before, two S-box instances and hence their corresponding plaintext bytes have been selected to make two sets of 256 mean traces. Since both selected S-boxes are executed in the same clock cycle, their corresponding traces are already aligned, and we do not need to shift the mean traces when performing the Correlation-Enhanced Collision Attack. Fig. 3.8(c) shows the attack result using 1,000,000 traces. As depicted in Fig. 3.8(d), the attack is still successful using 100,000 measurements. We should emphasize that the difference between the side-channel leakage of the 16 implemented S-box instances varies because of their different similarity in placement and routing. Therefore, the efficiency of the attack also varies when selecting different S-box instances. The result shown in Fig. 3.8 is one of the best cases.
3 In Fig. 3.8(a), 11 clock cycles with high power consumption are detectable. The last one is due to the case when the ciphertext is saved into the state register and appears at the input of the combinatorial circuit again.
Figure 3.8: One round unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 1,000,000 traces, and (d) over the number of traces
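The attack procedure used in this subsection can be illustrated with a minimal Python sketch. It is a toy simulation under assumptions that are ours and not taken from the measurements: a single leakage sample per trace, two hypothetical leakage tables `LEAK_I` and `LEAK_J` that differ slightly (standing in for two differently placed and routed S-box instances active in the same clock cycle), Gaussian noise, and arbitrarily chosen key bytes and trace counts.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def mean_traces(plain_bytes, samples):
    """One mean leakage value per plaintext byte value (256 entries)."""
    sums, counts = [0.0] * 256, [0] * 256
    for p, s in zip(plain_bytes, samples):
        sums[p] += s
        counts[p] += 1
    return [sums[v] / counts[v] if counts[v] else 0.0 for v in range(256)]

def ceca(means_i, means_j):
    """Correlation of the two sets of mean traces for every key difference."""
    return [pearson(means_i, [means_j[a ^ delta] for a in range(256)])
            for delta in range(256)]

# Toy simulation; every constant below is a hypothetical choice.
rng = random.Random(2015)
LEAK_I = [rng.random() for _ in range(256)]             # leakage of instance i
LEAK_J = [0.8 * v + 0.2 * rng.random() for v in LEAK_I]  # similar instance j
K_I, K_J = 0x3A, 0xC5                                    # secret key bytes
N_TRACES, NOISE_SIGMA = 20000, 0.5

p_i = [rng.randrange(256) for _ in range(N_TRACES)]
p_j = [rng.randrange(256) for _ in range(N_TRACES)]
# Both S-boxes are active in the same clock cycle, so they share one sample.
t = [LEAK_I[a ^ K_I] + LEAK_J[b ^ K_J] + rng.gauss(0, NOISE_SIGMA)
     for a, b in zip(p_i, p_j)]

corr = ceca(mean_traces(p_i, t), mean_traces(p_j, t))
best_delta = max(range(256), key=lambda d: corr[d])
print(hex(best_delta), hex(K_I ^ K_J))
```

The sketch recovers only the linear difference between the two key bytes, which matches what the attack delivers in the text. No shifting of the mean traces is modeled because both instances contribute to the same sample.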
3.4.2 Two Rounds per Clock Cycle
When two rounds of the cipher encryption are unrolled, as can be seen in Fig. 3.9(a), the whole encryption is performed in 5 clock cycles. Comparing the variance traces of one and two unrolled rounds, shown in Fig. 3.8(b) and Fig. 3.9(b) respectively, reveals a significant increase of the noise level. We repeated the same attack procedure as before on 7,500,000 traces collected from the two-round unrolled implementation. This led to the result shown in Fig. 3.9(c), which is amongst the most successful cases. Also, Fig. 3.9(d) reports around 300,000 as the number of traces required to recover the correct relation between the selected key bytes. Although the evaluation of the later rounds knowing their inputs, when more than one round is unrolled, has been included in the original proposal of the unrolling countermeasure [BGSD10], we have not reported the results of the corresponding collision attacks because of their similarity to the case of attacking the first round. Moreover, knowing the input of, e.g., the second round of the AES reveals, contrary to the DES case, all the secrets used in the first round, and one does not need to perform the attack on the second round.
Figure 3.9: Two rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 7,500,000 traces, and (d) over the number of traces
3.4.3 Three Rounds per Clock Cycle
Fig. 3.10 shows the results of a similar attack on the first round when three unrolled rounds are implemented and the whole encryption process is completed in 4 clock cycles. As a reference, Fig. 3.10(a) and Fig. 3.10(b) show a sample power trace of this implementation and a variance trace over a set of 256 mean traces. Given the low variance (Fig. 3.10(b)) and the unclear distinguishability of the correct hypothesis amongst the others (Fig. 3.10(c)), a high number of required traces is expected, e.g., around 3,000,000 as shown in Fig. 3.10(d).
Figure 3.10: Three rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 7,500,000 traces, and (d) over the number of traces
3.4.4 Four Rounds per Clock Cycle
In our last unrolled architecture, where four unrolled encryption rounds are implemented, every encryption run needs 3 clock cycles (see Fig. 3.11(a) as an example), and the switching noise has the highest level of all examined architectures (see Fig. 3.11(b) for a variance trace over the mean traces). The results of the attack, which are shown in Fig. 3.11, are practical evidence that the CECA defeats unrolling as a countermeasure using around 3,500,000 traces. Although the required number of traces is higher than in the previous cases, the correlation collision attack employs mean traces, so increasing the number of traces helps to remove the effect of the switching noise and finally recovers the relation between the key bytes.
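The effect of averaging can be made explicit under a simplifying assumption that is ours and not stated in the text: the switching noise is independent between traces and has variance \sigma^2 at the considered sample. With N traces and uniformly distributed plaintext bytes, each of the 256 mean traces averages roughly N/256 measurements:

```latex
\bar{t}_a = \frac{1}{n_a} \sum_{i \,:\, p_i = a} t_i ,
\qquad n_a \approx \frac{N}{256} ,
\qquad
\operatorname{Var}\!\left( \bar{t}_a - \mathrm{E}[\, t \mid p = a \,] \right)
  = \frac{\sigma^2}{n_a} \approx \frac{256\, \sigma^2}{N} .
```

Under this assumption the residual noise variance in the mean traces shrinks in proportion to 1/N, so a higher switching noise level raises the required number of traces but does not prevent the attack.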
3.5 Conclusion
The results of the Correlation-Enhanced Collision Attack (CECA) on different hardware countermeasures have been presented. We have chosen this attack scheme for our investigations since no hypothetical power model is required and its efficiency does not rely on the leakage model of the target device, which allows for a fair comparison. It is not surprising that no single countermeasure is able to prevent the attacks on its own, since even (theoretically) perfectly Boolean masked implementations still exhibit a slight first-order leakage in practice due to glitches in the circuit. Similarly, time randomization or noise addition countermeasures, which diminish the SNR, can be overcome by increasing the number of measurements.
Figure 3.11: Four rounds unrolled: (a) sample power trace (voltage [mV] over time [µs]), (b) variance of the mean traces (variance [mV²] over time [µs]), (c) the result of a CECA using 30,000,000 traces, and (d) over the number of traces
However, when different countermeasures are combined, it is possible to significantly strengthen the resistance against DPA attacks. Applying all implemented countermeasures of the 32-bit architecture increases the number of traces necessary for a successful CECA from 10,000 to 1,500,000. If area constraints are not an issue (e.g., unused FPGA resources are available), unrolling can further increase the resistance. Increasing the number of rounds per clock cycle from one to four increases the number of required traces from 100,000 to 3,500,000. The combination of unrolling and masking has not been considered because of the rather large and impractical area requirements.
Since our implementation platform has been specifically designed for side-channel purposes, considering an appropriate measurement setup and a well-defined trigger point, even more measurements are expected to be required in real-world scenarios, especially when the crypto cores are not the only circuits computing at a given point in time. However, knowing the number of traces required for an attack on an implementation in a low-noise environment helps to choose appropriate key lifetimes to further protect the secrets.
Chapter 4
Glitch-free Implementation of Masking
Due to the propagation of glitches in combinatorial circuits, side-channel leakage of most masked S-boxes realized in hardware is a known issue. Our contribution in this chapter is to adapt a masked AES S-box circuit to the FPGA resources in order to avoid the glitches. Our design is suitable for the 5, 6, and 7 FPGA series of Xilinx, although our practical investigations are performed using a Virtex-5 chip. In short, compared to the original design synthesized by automatic tools, our design requires the same area (slice count) while reducing power consumption, critical path delay, and, more importantly, the side-channel leakage. In our practical investigations we could not recover any first-order leakage of our design using up to 50 million traces. However, since the targeted S-box realizes a first-order Boolean masking, the second-order leakage could be revealed using around 25 million measurements.
Contents of this Chapter
4.1 Introduction
4.2 Preliminaries
4.3 Design
4.4 Evaluation
4.5 Conclusion
4.1 Introduction
Contrary to the goals of Threshold Implementation (TI) [NRS11] and the scheme of [PR11], in this work we do not try to create a glitch-resistant implementation but instead aim to avoid causing any glitches at all. The target of our implementation is the Virtex-5 LX-50 FPGA of the readily available side-channel evaluation platform SASEBO-GII [sas].
For this we take the very compact masked S-box by Canright-Batina [CB08] and manually map the combinatorial functions to the resources of our target platform. By efficiently using special enable signals in each FPGA Look-Up Table (LUT), we can suppress any glitches at the LUT outputs by enabling them only sequentially. We have evaluated different versions of our design, including a fully pipelined one achieving a very high clock frequency. Note that although our design was initially optimized for the 6-input LUT architecture of the Xilinx Virtex-5 FPGA, the same architecture is used in the newer Series 6 and 7 FPGAs, which allows using the same design on these recent platforms.
When evaluating the side-channel leakage of our final design, contrary to the original S-box implementation, our design did not show any first-order leakage when analyzing 50 million measurements. Since the scheme only implements a first-order masking, a second-order attack is expected to be successful, which is practically confirmed using a very high number of 25 million measurements.
In the next section we briefly describe the reasons why we have selected the Canright-Batina masked S-box as the basis of our implementation. Moreover, we introduce the Xilinx LUT architecture and how we have used it to eliminate glitches. Section 4.3 gives an overview of our S-box design and names the implementation profiles used in the evaluation, whose results are presented in Section 4.4. Finally, Section 4.5 concludes this chapter. Results from this chapter were generated as joint work with Amir Moradi, who led the SCA evaluation, and were accepted to HOST in 2012 [MM12a].
4.2 Preliminaries
In the following we will first give a short summary of the recent masked S-box designs and state why we have chosen the one of Canright and Batina as basis for our modifications to create a glitch-free version. Afterwards we will describe the architecture of the Xilinx 6-Input LUT and how we use it to minimize the possible leakage.
4.2.1 Masked AES S-box
As stated previously, the currently known glitch-resistant schemes come with some drawbacks. Threshold implementation has been shown to be quite effective when using small S-boxes [PMK+11], but because of the large S-box size of AES, up to now no expressions could be found to rewrite the AES S-box using this scheme. Note that the implementation reported in [MPL+11] has been made by masking the multipliers of a tower-field implementation of the AES S-box, which could not follow the requirements of the threshold implementation. At CHES 2011 a mixture of Shamir’s secret sharing scheme and multi-party computation was introduced [PR11]. Unfortunately, it is obvious that the hardware resource requirements are quite high (see Chapter 5). Furthermore, because of the sequential way of computing the inversion of the S-box, a large number of clock cycles is necessary to compute only one S-box output. All these predicted area and time overheads may hinder its practical feasibility.
Figure 4.1: Masked GF(2^8) Inverter by Canright-Batina (taken from [MME10])
Instead of focusing on glitch resistance, in this article we try to avoid any glitches at the FPGA LUTs altogether. Among the more traditional masking schemes currently known, the one of Canright-Batina [CB08] uses an additive masking and implements the S-box in a tower-field approach using carefully chosen normal bases to minimize the circuit size. It is based on the area-optimized S-box by Canright [Can05], and it is still supposed to be the most compact design available. While it was claimed to be perfectly secure by the definition of [BGK04], it was shown in [MME10] that, because of glitches in the circuit, there still exists an exploitable first-order leakage. Figure 4.1 shows an overview of the GF(2^8) inverter design, omitting the tower-field conversions. The GF(2^4) inverter is implemented using the same design, the only difference being that the inversion in GF(2^2) is also merged into this module. The authors of the original design were kind enough to supply the HDL source code online,1 which we used as the basis for our modifications detailed in the following.
1 http://faculty.nps.edu/drcanrig/pub/index.html
Figure 4.2: Two possible LUTs in Virtex-5: (a) 6-input LUT, (b) 32-bit shift-register [Xil09]
4.2.2 Xilinx FPGA Resources
When not using dedicated hardware blocks like Multipliers/DSPs, a combinatorial logic circuit in an FPGA is usually implemented by means of many-to-one Look-Up Tables. In general, they consist of a number of single-bit storage elements whose values are initialized by the bitstream during the configuration of the FPGA. The inputs of the LUT control the setting of internal multiplexers, thereby choosing which stored bit value is available at the output of the LUT. As an example, the 6-to-1 LUT of the Xilinx Series 5, 6, and 7 FPGAs is realized as two 5-to-1 LUTs and a multiplexer, as can be seen in Fig. 4.2(a). Each of these 5-to-1 LUTs can itself again be seen as two 4-to-1 LUTs and a multiplexer, and so on.
In our device under test, the Xilinx Virtex-5 LX50 FPGA mounted on a SASEBO-GII board, each slice consists of four LUT6 and four single-bit flip-flops. The LUT6, as depicted in Fig. 4.2(a), can be hard-instanced in two different configurations. As LUT6_1, any combinatorial function having up to 6 input signals and one output signal can be implemented. Using the LUT in a LUT5_2 configuration allows providing two output signals from 5 inputs, but only if these 5 inputs are the same for both internal 5-to-1 LUTs, i.e., the inputs must be shared.
Glitches at the output of a LUT happen because the input signals arrive at different points in time, depending on how they are routed in the device. In order to avoid this, the output of the LUT must be held stable until all input signals have arrived. We achieve this by using one of the input signals as an active-low enable signal, i.e., in our case, as long as this input signal is set to logic '1', the LUT output will always be logic '0' regardless of the values of the other input signals. It is important to carefully select which LUT input is used as the enable signal. Let us consider choosing the input I5 in Fig. 4.2(a) as the enable signal. While the output of the LUT6 will actually not change during the transition period of the other input signals, there will still be glitches at the output of one of the internal LUT5 instances. We therefore have to choose the input signal which controls the very first multiplexer stage, so that toggles at the select signals of the following multiplexers do not cause any glitches.
[Figure: LUT-level block diagram of the masked inverter. The labels visible in the diagram are GF_INV_8 (masked), GF_INV_4, GF_MUL_4x4, MUL 2x2, MUL.SCL 2x2, XOR, the primitives LUT 5x2 and LUT 6x1, and the stage enable signals en1 to en15.]
Figure 4.3: Design of our full-custom optimized S-box (inversion part only)
Although the details of the internal architecture of the FPGA resources are not publicly available, this input signal can be identified by looking at the architecture of the SRLC32E depicted in Fig. 4.2(b). It is a special mode of operation for LUTs in some slices of Xilinx FPGAs that realizes a shift register. In this mode the content of the LUT storage cells can be changed in a serial fashion during the operation of the FPGA. By using the LUT inputs as select lines, the length of the shift register can be set dynamically. Since the all-zero input sets the length to 1 bit, and switching the I0 input signal to logic '1' increases the length to 2 bits, i.e., selects the neighboring cell, the I0 signal must control the very first multiplexer stage. Therefore, I0 is the correct choice for the enable signal. Note that the synthesizer permutes the LUT input signals (and changes the LUT configuration accordingly) to optimize the routing, so one has to keep the pin positions of the hard-instanced LUTs locked by means of special constraints [Xil08].
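The argument for I0 can be checked on an idealized model. The following Python sketch treats a LUT6 as 64 storage cells followed by six multiplexer stages, with I0 selecting in the first stage and I5 in the last. This multiplexer tree is our simplification of Fig. 4.2(a), not a documented description of the silicon, and the helper names as well as the 5-input XOR payload are hypothetical. The sketch builds the LUT content for an active-low enable on a chosen pin and records which internal node states occur while the LUT is disabled and the five data inputs change.

```python
from itertools import product

def lut6_init(func, enable_pin):
    """64-bit LUT content computing func on five data pins, gated by an
    active-low enable on `enable_pin` (output is 0 while that pin is 1)."""
    init = 0
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]   # bits[i] is input Ii
        if bits[enable_pin]:
            continue                                 # disabled: cell stays 0
        data = bits[:enable_pin] + bits[enable_pin + 1:]
        if func(*data):
            init |= 1 << addr
    return init

def mux_stages(init, inputs):
    """Outputs of all six multiplexer stages; inputs[0] (I0) selects first."""
    level = [(init >> a) & 1 for a in range(64)]
    stages = []
    for sel in inputs:
        level = [level[2 * k + sel] for k in range(len(level) // 2)]
        stages.append(tuple(level))
    return tuple(stages)

def states_while_disabled(init, enable_pin):
    """Distinct internal states while disabled, over all data input values."""
    seen = set()
    for data in product((0, 1), repeat=5):
        inputs = list(data)
        inputs.insert(enable_pin, 1)                 # enable pin held at '1'
        seen.add(mux_stages(init, inputs))
    return seen

def xor5(a, b, c, d, e):
    """Example payload function on the five data inputs."""
    return a ^ b ^ c ^ d ^ e

init_i0 = lut6_init(xor5, 0)
init_i5 = lut6_init(xor5, 5)
states_i0 = states_while_disabled(init_i0, 0)
states_i5 = states_while_disabled(init_i5, 5)
print(len(states_i0), len(states_i5))
```

In this model, a single internal state with the enable on I0 means that no node can toggle while the data inputs settle. With the enable on I5 the final output is still constant, but the nodes of the lower 5-input half follow the data inputs, which corresponds to the glitches at the internal LUT5 output mentioned above.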
4.3 Design
The detailed structure of our design is given in Fig. 4.3. Omitting the tower-field conversion, 15 LUT stages are required to perform the full inversion in GF(2^8). We give performance figures for 6 different implementation profiles: from the original unmodified design to our optimized one, with or without pipelining stages, and with or without the special enable signals that minimize glitches in the circuit. The implementation profiles of the S-box are as follows:
(1) The original HDL code optimized by the ISE synthesizer
(2) The original HDL but avoiding any optimizations or trimming by the synthesizer, i.e., one LUT per gate to keep all hierarchy levels
(3) Our modified design using hard-instanced LUTs, all enable signals always '0', no pipeline registers
(4) Our modified design without pipelining but activating each stage sequentially by the enable signals
(5) Our modified design using pipelining to hinder glitch propagation, but all enable signals always '0'
(6) Our modified design using both pipelining to hinder the glitch propagation and the enable signals to avoid glitches in the circuit
In Profiles 1, 2, and 3 the implementations are purely combinatorial functions where the full S-box is computed at once in each clock cycle. Glitches in the first LUT stage are therefore passed through the whole S-box, generating a highly glitching circuit until all signals become stable. Since all three profiles share this behavior, we do not consider Profiles 1 and 2 in our side-channel evaluations (Section 4.4), but restrict the presentation of results to Profile 3.
Profile 4 avoids this issue. Here only one LUT stage is activated in each clock cycle, thereby not only hindering the propagation of glitches, but also not causing any glitches at all. This is because the input signals of the next LUT stage are already stable when that stage is activated in the following clock cycle. The downside of this profile is its apparent impracticality. One needs 15 clock cycles to compute a single S-box output while the inputs must be held stable. To make matters worse, one would need to spend another 15 clock cycles to deactivate each stage in reverse order before the next S-box computation can begin. In Profile 5 the pipelining stages hinder the glitch propagation. On the other hand, since all enable signals are kept at '0', glitches will still occur at the LUT outputs of each stage.
Finally, in Profile 6 we combine both the pipelining to avoid any glitch propagation and the use of the active-low enable signals to completely suppress glitches at the LUT outputs. To reach our goal in a straightforward way one would need to i) first disable all LUTs, ii) clock every second pipelining register after enabling its corresponding LUTs, iii) disable all LUTs again, iv) clock the other half of the pipelining registers with their corresponding LUTs enabled, and so on. This means that a new S-box input can be fed into the circuit only every four clock cycles, and it leads to a latency of 30 clock cycles from input to output. This is necessary because, if one simply merged clocking every second register and disabling the connected LUT stage at the same time, the routing of the signals would determine whether the disable signal arrives at the LUT first or other inputs arrive earlier, and the latter case would cause glitches at the LUT output.
To avoid this issue we can exploit the special way the clock signal is routed in the FPGA. The clock is routed on dedicated paths to each switch box separately to avoid race problems in synchronous circuits. The LUT output signals, however, first need to go back to the switch box of the corresponding slice and from there travel to the destination LUT inputs, possibly passing further switch boxes. Therefore, a transition, e.g., low-to-high, on the clock signal arrives at the registers and LUTs of each slice earlier than the other signals. By tying our active-low LUT enable signals to the clock signal, the LUT is hence deactivated at each rising clock edge before the new inputs arrive. At the falling edge of the clock the LUT becomes active and provides the output signal to the next flip-flop stage, where it will be stored at the next rising edge. This way the pipelining registers can be active in every clock cycle and no glitches will occur. Please note that the clock period in this case cannot be shorter than twice the longest critical path delay of the S-box circuit, i.e., the maximum clock frequency is halved (compare Profiles 5 and 6 in Table 4.1). To provide a better understanding, Fig. 4.4 showcases the different signal timings. Also, the performance results of each implementation profile for only the inversion module of the S-box are given in Table 4.1.
[Timing diagram with three signal rows. clk/LUT en: the clock, which also drives the LUT enable. LUT output: '0', output (i), '0', output (ii), '0'. LUT data in: inputs (i), inputs (ii), inputs (iii).]
Figure 4.4: Signal timings on LUT inputs and outputs
Table 4.1: Synthesis results for all profiles (inversion only)
Profile  Max. Freq.    #LUTs  #FFs  Latency (#clocks)  Throughput (16 Inv./s)
1        105.519 MHz      99     0  0                   6 594 937
2         56.504 MHz     244     0  0                   3 531 500
3         88.300 MHz     100     0  0                   5 518 750
4        641.026 MHz     100     0  30                  1 335 471
5        641.026 MHz     100   649  15 (pipe'd)        20 678 258
6        320.513 MHz     100   649  15 (pipe'd)        10 339 129
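The throughput column can be reproduced from the maximum frequency and a cycle count per block of 16 inversions. The cycle counts in the following sketch are our reading of the text and are not stated in the table itself: 16 cycles for the combinatorial Profiles 1 to 3, 16 × 30 cycles for Profile 4 (15 cycles to activate and 15 to deactivate the stages per inversion), and 16 + 15 = 31 cycles for the pipelined Profiles 5 and 6.

```python
# (maximum frequency in Hz, clock cycles for 16 inversions) per profile.
# The cycle counts are assumptions inferred from the text, see above.
PROFILES = {
    1: (105.519e6, 16),
    2: (56.504e6, 16),
    3: (88.300e6, 16),
    4: (641.026e6, 16 * 30),
    5: (641.026e6, 16 + 15),
    6: (320.513e6, 16 + 15),
}

# Throughput values as printed in Table 4.1 (blocks of 16 inversions per second).
TABLE_THROUGHPUT = {
    1: 6594937, 2: 3531500, 3: 5518750,
    4: 1335471, 5: 20678258, 6: 10339129,
}

def throughput(freq_hz, cycles):
    """Blocks of 16 inversions per second at the given clock frequency."""
    return freq_hz / cycles

computed = {p: throughput(f, c) for p, (f, c) in PROFILES.items()}
for p in sorted(computed):
    print(p, round(computed[p]), TABLE_THROUGHPUT[p])
```

With these cycle counts the computed values agree with the table to within rounding. The frequency of Profile 6 is exactly half that of Profile 5, which reflects that the LUTs are only active during one half of each clock period.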
4.4 Evaluation
We used a SASEBO-GII [sas] board as the target platform to examine the side-channel leakage of our designs. Different profiles of our design were implemented on the Virtex-5 (XC5VLX50) FPGA embedded on the target platform, and the power consumption traces were gathered using a LeCroy WP715Zi 1.5 GHz oscilloscope with the sampling rate set to 1 GS/s. Since the aim of some of our designs is to minimize the number of toggles in each clock cycle, and only a single S-box instance was implemented, the peak-to-peak amplitude of the signal in the power traces was very low. Therefore, we utilized a DC blocker and an amplifier while measuring the power traces by means of a passive probe with a 1 Ω resistor in the VDD path. Furthermore, a bandwidth limit of 20 MHz was set on the oscilloscope to reduce the electrical noise. All measurements were performed while our designs were run by a 3 MHz external clock signal to avoid any overlap of power traces. Note that the actual evaluation was not performed by the dissertation author but by his co-author Amir Moradi.
In order to evaluate the resistance of our different profiles in a low-noise environment, we created an exemplary architecture in which only the AddRoundKey module and one instance of the targeted S-box exist. After the initial key addition with the already masked data, the result is sequentially given to the S-box module, one byte per clock cycle. The method we used to examine the side-channel leakage of our targeted designs is the Correlation-Enhanced Collision Attack (CECA). Since the best case for the attack is a circuit that is reused in multiple clock cycles, it perfectly suits our implemented architecture, where the targeted S-box instance is shared for all SubBytes transformations. Note that the target masked S-box [CB08] requires two different random mask bytes per input byte, i.e., one to mask the input byte and another one to mask the S-box output. We have therefore provided two different 128-bit masks for each computation of our exemplary architecture.
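The role of the two mask bytes can be illustrated with a functional sketch in Python. It deliberately uses a different technique than the tower-field circuit of [CB08], namely a recomputed lookup table as known from software masking, and only models the input/output relation: a masked input x ⊕ m_in is mapped to the masked output S(x) ⊕ m_out. The function names are hypothetical, and the sketch says nothing about glitches or hardware leakage.

```python
import random

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(x):
    """Inverse in GF(2^8) computed as x^254; maps 0 to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r

def rotl8(b, n):
    """Rotate an 8-bit value left by n positions."""
    return ((b << n) | (b >> (8 - n))) & 0xFF

def sbox(x):
    """AES S-box: field inversion followed by the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]

def masked_table(m_in, m_out):
    """Table T with T[x ^ m_in] == S(x) ^ m_out for every byte x."""
    return [SBOX[a ^ m_in] ^ m_out for a in range(256)]

rng = random.Random(0)
x = 0x53                                   # unmasked value, never looked up
m_in, m_out = rng.randrange(256), rng.randrange(256)
masked_in = x ^ m_in                       # what the S-box module receives
masked_out = masked_table(m_in, m_out)[masked_in]
print(hex(masked_out ^ m_out), hex(SBOX[x]))
```

Sixteen such mask pairs per encryption correspond to the two 128-bit masks mentioned above.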
To have a baseline for comparison purposes, we start our evaluations by analyzing Profile 3. It corresponds to a naive implementation where glitches are not controlled and can be propagated. Please note that the same is true for Profiles 1 and 2. We have, however, omitted their evaluation results since they exhibit the same side-channel leakage as that of Profile 3.
A sample power trace of this design is depicted in Fig. 4.5(a). The sixteen S-box computations of the SubBytes transformation can be clearly distinguished. We measured 50 000 traces and performed a CECA on two plaintext bytes which are consecutively processed by the targeted S-box instance. Fig. 4.5(b) depicts the successful attack results, and Fig. 4.5(c) demonstrates the simplicity of recovering the linear key difference if the glitches are not controlled in the circuit. A very low number of 5 000 measurements is sufficient to extract the secret.
The S-box design of Profile 5 was our next evaluation target. Compared to the previous profile, the difference of this design is that, while glitches at each LUT stage still occur, their propagation is hindered by the pipeline stages. Because of the 15 register stages in the S-box we require a total of 31 clock cycles to compute the full SubBytes transformation. An example power trace is shown in Fig. 4.6(a). Interestingly, the power consumption of Profile 5 is reduced compared to that of Profile 3 (Fig. 4.5(a)). This is due to the heavily reduced number of glitches in the S-box, which more than offsets the energy consumed by the very high number of additional register flip-flops.
We had to collect a significantly higher number of power traces compared to Profile 3, i.e., 20 000 000, to successfully recover the relations between key bytes. The attack results are depicted in Fig. 4.6(b) and indicate that a first-order leakage still remains. This demonstrates that controlling the propagation of the glitches is an effective method to significantly reduce the side-channel leakage. However, the leakage is not completely prevented, and we were able to exploit it using 8 000 000 power measurements (Fig. 4.6(c)).
The last design we considered for evaluation is Profile 6, where glitches in the circuit are not only not propagated, but completely prevented by a sophisticated control over the LUT enable signals. The level of power consumption of this design, as shown in Fig. 4.7(a), is basically equal to that of Profile 5. We measured 50 000 000 traces of this design, but even with this high number of measurements we were unable to perform a successful attack, as depicted in Fig. 4.7(b). This demonstrates that preventing the occurrence of glitches significantly increased the resistance of the target S-box design against first-order attacks.
Figure 4.5: Profile 3: evaluation results (a) a sample trace (voltage [mV] over time [µs]), (b) attack result using 50 000 traces, (c) over the number of traces.
Since the underlying masking scheme only realizes a first-order masking, we should be able to recover the secret key using a second-order attack. As is illustrated in [Mor12], we can employ the variance traces of our measurements instead of the mean traces and perform a similar CECA to examine the second-order moments. The result of this attack is presented in Fig. 4.7(c). While the second-order leakage can be exploited, it still requires around 25 000 000 measurements to reveal the secret (see Fig. 4.7(d)).
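The change from first-order to second-order analysis amounts to replacing the per-class means by per-class variances before correlating. The following Python sketch shows this on a toy model whose ingredients are our assumptions and not properties of the measured device: a randomly generated bijection as a stand-in for the S-box, a first-order Boolean masked value whose two shares both leak their Hamming weight in the same sample, and arbitrarily chosen key bytes and trace counts.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def class_moments(plain_bytes, samples):
    """Mean and variance of the leakage per plaintext byte value."""
    groups = [[] for _ in range(256)]
    for p, s in zip(plain_bytes, samples):
        groups[p].append(s)
    means, variances = [], []
    for g in groups:
        n = max(len(g), 1)
        mu = sum(g) / n
        means.append(mu)
        variances.append(sum((s - mu) ** 2 for s in g) / n)
    return means, variances

def ceca(stat_i, stat_j):
    """Correlate two sets of 256 class statistics for every key difference."""
    return [pearson(stat_i, [stat_j[a ^ d] for a in range(256)])
            for d in range(256)]

# Toy model; every constant below is a hypothetical choice.
rng = random.Random(7)
TOY_SBOX = list(range(256))
rng.shuffle(TOY_SBOX)                      # stand-in bijection, not the AES S-box
HW = [bin(v).count("1") for v in range(256)]
K_I, K_J, N_TRACES = 0x1F, 0xA2, 40000

def masked_leak(p, k):
    """Both shares of the masked S-box output leak in the same sample."""
    m = rng.randrange(256)
    return HW[TOY_SBOX[p ^ k] ^ m] + HW[m]

p_i = [rng.randrange(256) for _ in range(N_TRACES)]
p_j = [rng.randrange(256) for _ in range(N_TRACES)]
mean_i, var_i = class_moments(p_i, [masked_leak(p, K_I) for p in p_i])
mean_j, var_j = class_moments(p_j, [masked_leak(p, K_J) for p in p_j])

first_order = ceca(mean_i, mean_j)         # means carry no information here
second_order = ceca(var_i, var_j)          # variances depend on the data
best_second = max(range(256), key=lambda d: second_order[d])
print(hex(best_second), hex(K_I ^ K_J))
```

In this model the class means are constant up to estimation noise, so the mean-based correlation stays close to zero, while the variance-based correlation singles out the correct key difference.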
Figure 4.6: Profile 5: evaluation results (a) a sample trace (voltage [mV] over time [µs]), (b) attack result using 20 000 000 traces, (c) over the number of traces.
4.5 Conclusion
In this work we have taken the highly optimized for ASICs very compact masked S-box by Canright and Batina, and ported it to use the available resources of the current Xilinx FPGA Series (Virtex-5 onward) in a size-optimized manner. Compared to a design created by an automatic synthesizer, this led to the same number of LUTs and a slight decrease of the operation frequency. We could also, as already pointed out in [MME10], confirm the still available first-order leakage of this S-box design when implemented in a straightforward manner. Since this leakage was caused by glitches in the circuit, we have first eliminated the glitches by placing enable signals in each used LUT, so that no output is propagated
47 Chapter 4. Glitch-free Implementation of Masking
6 Voltage [mV] −5 0 5 Time [µs] 13 18 (a)
(b)
(c)
(d)
Figure 4.7: Profile 6: evaluation results (a) a sample trace, (b) attack result using 50 000 000 traces, (c) attack result on 50 000 000 squared mean-free traces, (d) over the number of traces. while the inputs are not stable. By combining this solution together with pipelining stages and utilizing the special way how the clock signals are routed for the LUT enable
48 4.5. Conclusion signals, we could create an implementation which operates at an extremely high clock frequency while showing no first-order leakage by means of 50 million power consumption measurements. While not specifically focusing on this, we also achieved a quite high resistance against univariate second-order attacks. In this case 25 million traces is the threshold after which the secrets become slowly distinguishable using the attacks of [Mor12]. We should emphasize a comparison between our results and those of a threshold implementation of AES reported in [MPL+11] and [Mor12]. Although their implementation platform is different to ours, their scheme required roughly the same number of traces to exploit the second-order leakage while the area overhead of their design – excluding all the internal PRNGs – is much higher than our optimized one. In order to allow further study of our design and to use it in real applications the HDL source code of our masked S-box design is available online at http://www.sha.rub.de/research/publications.
Chapter 5
Implementation of a Glitch-resistant Masking Scheme
Only a few of the masking schemes proposed to protect cryptographic implementations against side-channel attacks also consider glitches in the logic circuits. One which is based on multi-party computation protocols and utilizes Shamir’s secret sharing scheme was presented at CHES 2011. It aims at providing security for hardware implementations – mainly of AES – against those sophisticated side-channel attacks that also take glitches into account. This chapter deals with the practical issues and relevance of the aforementioned masking scheme. Following the recommendations given in the extended version of the mentioned article, we provide a guideline on how to implement the scheme for the simplest settings. Constructing an exemplary design of the scheme reveals the impractical area requirements and low throughput, which will most likely prevent it from being used in practice.
Contents of this Chapter
5.1 Introduction
5.2 Target Scheme
5.3 Implementation & Caveats
5.4 Conclusion
5.1 Introduction
A new masking countermeasure was proposed at CHES 2011 [PR11]. Similar to TI, it is based on multi-party computation, but instead of Boolean masking it utilizes Shamir’s secret sharing scheme [Sha79]. It claims security not only against 1st-order attacks but also, depending on the number of shares, against higher-order multivariate ones.¹ The contribution of this chapter is to give guidelines on how to implement the scheme, thereby allowing its practical realization on a hardware platform. More details on its practicability as well as its ambiguous points were later also given in [RP12] by the original authors. In order to make an exemplary architecture of this scheme we have chosen a parameter set based on the minimum number of shares that supposedly provides protection against any univariate attack. We address a couple of challenges that arise in its practical realization because of the very high time and area overheads.
This chapter is joint work with Amir Moradi and was also published as part of [MM13b] at CHES 2013, which additionally contains a detailed evaluation and proposals for how the scheme can still be overcome by changing the measurement setup so that the remaining multivariate leakage is detected by a univariate attack.
5.2 Target Scheme
In [PR11] a very general description of the scheme is given. We use the most simplified settings given in the original publication, i.e., a polynomial of degree 1, and adjust it to the AES algorithm. The number of shares and accordingly the number of Players is fixed to 3, which should still provide protection against any univariate attack. Multiplication in GF(2^8) using the AES irreducible polynomial is denoted as ⊗, and finite-field addition as ⊕.
Before starting the shared operations, one needs to select 3 distinct non-zero elements, so-called public points, α1, α2, α3 in GF(2^8). Moreover, it is required to precompute the first row (λ1, λ2, λ3) of the inverse of the Vandermonde (3 × 3)-matrix (αi^j), 1 ≤ i, j ≤ 3, as
λ1 = α2 ⊗ α3 ⊗ (α1 ⊕ α2)^{-1} ⊗ (α1 ⊕ α3)^{-1}
λ2 = α1 ⊗ α3 ⊗ (α1 ⊕ α2)^{-1} ⊗ (α2 ⊕ α3)^{-1}
λ3 = α1 ⊗ α2 ⊗ (α1 ⊕ α3)^{-1} ⊗ (α2 ⊕ α3)^{-1},
where x^{-1} denotes the multiplicative inverse of x in GF(2^8) (again using the Rijndael irreducible polynomial). These elements, α1, α2, α3 and λ1, λ2, λ3, are available to all 3 Players.
¹ A similar masking scheme using Shamir’s secret sharing with a software platform as target has also been presented at CHES 2011 [GM11a].
Sharing a secret x is done by randomly selecting a secret coefficient a and computing 3 shares x1, x2, x3 as
x1 = x ⊕ (a ⊗ α1), x2 = x ⊕ (a ⊗ α2), x3 = x ⊕ (a ⊗ α3).
Each Player i receives only one share xi without having any information about the other shares.
Reconstructing the secret x from the 3 shares x1, x2, x3 is possible knowing the precomputed values λ1, λ2, λ3 as
x = (x1 ⊗ λ1) ⊕ (x2 ⊗ λ2) ⊕ (x3 ⊗ λ3).
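The sharing and reconstruction steps above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design of this chapter; the public points in `ALPHAS` and all function names are hypothetical choices made for the example.

```python
import secrets

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
ALPHAS = (0x01, 0x02, 0x04)  # assumed public points: any 3 distinct non-zero elements


def gf_mul(a, b):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inverse computed as a^254 (maps 0 to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def lagrange_coeffs(alphas):
    """lambda_i = product over j != i of alpha_j * (alpha_i + alpha_j)^-1."""
    lams = []
    for i, ai in enumerate(alphas):
        c = 1
        for j, aj in enumerate(alphas):
            if j != i:
                c = gf_mul(c, gf_mul(aj, gf_inv(ai ^ aj)))
        lams.append(c)
    return tuple(lams)


def share(x, a=None, alphas=ALPHAS):
    """Split x into shares x_i = x + a * alpha_i with a random coefficient a."""
    if a is None:
        a = secrets.randbelow(256)
    return tuple(x ^ gf_mul(a, al) for al in alphas)


def reconstruct(shares, lambdas):
    """Recombine the shares as x = sum of x_i * lambda_i."""
    x = 0
    for s, lam in zip(shares, lambdas):
        x ^= gf_mul(s, lam)
    return x
```

Calling `reconstruct(share(x), lagrange_coeffs(ALPHAS))` should return `x` for every byte `x`, independent of the random coefficient.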
In the following we consider the essential operations required for an AES S-box computation, and discuss the role of each Player. Hence, let us suppose an (unshared) constant c and two secrets x and y, which are each represented by 3 shares, i.e., x1, x2, x3 and y1, y2, y3. The shares were constructed using the same public points α1, α2, α3 and the secret coefficients a and b, respectively.
Addition with a constant, i.e., z = c ⊕ x, in the shared mode can be performed by each Player as
Player 1 : z1 = x1 ⊕ c = x ⊕ (a ⊗ α1) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α1)
Player 2 : z2 = x2 ⊕ c = x ⊕ (a ⊗ α2) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α2)
Player 3 : z3 = x3 ⊕ c = x ⊕ (a ⊗ α3) ⊕ c = (x ⊕ c) ⊕ (a ⊗ α3).
Therefore, z1, z2, z3 correctly provide the shared representation of z without requiring interaction between the Players.
Multiplication with a constant, i.e., z = c ⊗ x, c ≠ 0, can be executed in a similar way as
Player 1 : z1 = x1 ⊗ c = (x ⊕ (a ⊗ α1)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α1)
Player 2 : z2 = x2 ⊗ c = (x ⊕ (a ⊗ α2)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α2)
Player 3 : z3 = x3 ⊗ c = (x ⊕ (a ⊗ α3)) ⊗ c = (x ⊗ c) ⊕ (a ⊗ c ⊗ α3),
since z1, z2, z3 represent a correctly shared z, if we consider a ⊗ c as the secret coefficient.
Addition of two shared secrets, i.e., z = x ⊕ y, is easily performed by
Player 1 : z1 = x1 ⊕ y1 = x ⊕ (a ⊗ α1) ⊕ y ⊕ (b ⊗ α1) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α1)
Player 2 : z2 = x2 ⊕ y2 = x ⊕ (a ⊗ α2) ⊕ y ⊕ (b ⊗ α2) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α2)
Player 3 : z3 = x3 ⊕ y3 = x ⊕ (a ⊗ α3) ⊕ y ⊕ (b ⊗ α3) = (x ⊕ y) ⊕ ((a ⊕ b) ⊗ α3).
z1, z2, z3 provide the shared representation of z, again considering a ⊕ b as the secret coefficient.
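The three share-wise operations described so far (addition of a constant, multiplication with a constant, and addition of two shared secrets) need no communication between the Players. A minimal Python sketch, with hypothetical public points and helper names, illustrates that each Player only touches its own share:

```python
import secrets

ALPHAS = (0x01, 0x02, 0x04)  # assumed public points, chosen for illustration only


def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(a):
    """Inverse as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def lagrange_coeffs(alphas):
    """First row of the inverse Vandermonde matrix (lambda_1..lambda_3)."""
    out = []
    for i, ai in enumerate(alphas):
        c = 1
        for j, aj in enumerate(alphas):
            if j != i:
                c = gf_mul(c, gf_mul(aj, gf_inv(ai ^ aj)))
        out.append(c)
    return tuple(out)


LAMBDAS = lagrange_coeffs(ALPHAS)


def share(x):
    a = secrets.randbelow(256)  # fresh secret coefficient
    return tuple(x ^ gf_mul(a, al) for al in ALPHAS)


def reconstruct(shares):
    x = 0
    for s, lam in zip(shares, LAMBDAS):
        x ^= gf_mul(s, lam)
    return x


def add_const(xs, c):
    """z = c + x: every Player adds c to its own share."""
    return tuple(xi ^ c for xi in xs)


def mul_const(xs, c):
    """z = c * x: every Player multiplies its own share by c."""
    return tuple(gf_mul(xi, c) for xi in xs)


def add_shared(xs, ys):
    """z = x + y: Player i adds its shares x_i and y_i."""
    return tuple(xi ^ yi for xi, yi in zip(xs, ys))
```

For example, `reconstruct(add_shared(share(x), share(y)))` yields `x ^ y`, since the two secret coefficients simply add up.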
Multiplication of two shared secrets, i.e., z = x ⊗ y, is the challenging part. If each Player computes the multiplication of two shares as
Player 1 : t1 = x1 ⊗ y1 = (x ⊗ y) ⊕ (((a ⊗ y) ⊕ (b ⊗ x)) ⊗ α1) ⊕ (a ⊗ b ⊗ α1^2)
Player 2 : t2 = x2 ⊗ y2 = (x ⊗ y) ⊕ (((a ⊗ y) ⊕ (b ⊗ x)) ⊗ α2) ⊕ (a ⊗ b ⊗ α2^2)
Player 3 : t3 = x3 ⊗ y3 = (x ⊗ y) ⊕ (((a ⊗ y) ⊕ (b ⊗ x)) ⊗ α3) ⊕ (a ⊗ b ⊗ α3^2),
t1, t2, t3 are not a correctly shared representation of z because – according to [PR11] – the underlying polynomial is of a higher degree and does not have a uniform distribution. The solution given in [PR11] is as follows:
(1) Each Player i after computing ti, randomly selects a coefficient ai, remasks ti as
qi,1 = ti ⊕ (ai ⊗ α1), qi,2 = ti ⊕ (ai ⊗ α2), qi,3 = ti ⊕ (ai ⊗ α3),
and sends each qi,j with j ≠ i to the corresponding Player j.
(2) Now each Player i has three elements q1,i, q2,i, q3,i, and reconstructs zi as
zi = (q1,i ⊗ λ1) ⊕ (q2,i ⊗ λ2) ⊕ (q3,i ⊗ λ3).
Indeed, z1, z2, z3 now form a correctly shared representation of z considering (a1 ⊗ λ1) ⊕ (a2 ⊗ λ2) ⊕ (a3 ⊗ λ3) as the secret coefficient.
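The two-step shared multiplication can be modelled in software as follows. This is a hedged sketch of the data flow only: it ignores the timing separation that the hardware needs, and the public points as well as all names are assumptions made for the example.

```python
import secrets

ALPHAS = (0x01, 0x02, 0x04)  # assumed public points, for illustration only


def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(a):
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def lagrange_coeffs(alphas):
    out = []
    for i, ai in enumerate(alphas):
        c = 1
        for j, aj in enumerate(alphas):
            if j != i:
                c = gf_mul(c, gf_mul(aj, gf_inv(ai ^ aj)))
        out.append(c)
    return tuple(out)


LAMBDAS = lagrange_coeffs(ALPHAS)


def share(x):
    a = secrets.randbelow(256)
    return tuple(x ^ gf_mul(a, al) for al in ALPHAS)


def reconstruct(shares):
    x = 0
    for s, lam in zip(shares, LAMBDAS):
        x ^= gf_mul(s, lam)
    return x


def shared_mul(xs, ys):
    """Shared multiplication z = x * y following steps (1) and (2)."""
    # Local products t_i = x_i * y_i (a degree-2 sharing of x * y).
    t = [gf_mul(xi, yi) for xi, yi in zip(xs, ys)]
    # Step (1): Player i remasks t_i with a fresh coefficient a_i,
    # producing q[i][j] = t_i + a_i * alpha_j, and sends q[i][j] to Player j.
    q = []
    for ti in t:
        ai = secrets.randbelow(256)
        q.append([ti ^ gf_mul(ai, al) for al in ALPHAS])
    # Step (2): Player i combines q[0][i], q[1][i], q[2][i] with the lambdas.
    z = []
    for i in range(3):
        zi = 0
        for j in range(3):
            zi ^= gf_mul(q[j][i], LAMBDAS[j])
        z.append(zi)
    return tuple(z)
```

Reconstructing the output of `shared_mul(share(x), share(y))` gives the field product of `x` and `y`; the same function can be called with both arguments set to the same sharing to model a squaring by multiplication reuse.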
Figure 5.1: Block diagram of sequential operations necessary for an AES S-box and a fourth of MixColumns: (a) S-box, with intermediate values x^2, x^3, x^6, x^12, x^15, x^30, x^60, x^120, x^240, x^252; (b) MixColumns, with coefficients c1, c2, c3, c4.
Square of a shared secret, i.e., z = x^2, cannot be computed in a straightforward way in contrast to what is stated in [PR11]. If each Player i squares its share xi as
Player 1 : z1 = x1^2 = x^2 ⊕ (a^2 ⊗ α1^2)
Player 2 : z2 = x2^2 = x^2 ⊕ (a^2 ⊗ α2^2)
Player 3 : z3 = x3^2 = x^2 ⊕ (a^2 ⊗ α3^2),
z1, z2, z3 do not provide a correctly shared representation of z unless – as also stated in [GM11a] – the public points α1, α2, α3 as well as λ1, λ2, λ3 are squared. If the result of the squaring z1, z2, z3 is used in later computations where other secrets shared by the original public points α1, α2, α3 are involved, z1, z2, z3 must be remasked to provide a correctly shared representation of z using the original public points. To do so a FreshMasks scheme is proposed in [GM11a]. Moreover, in [RP12], the extended version of the original scheme, a specific condition is defined for the public points to simplify the square operation. In the simplest settings α1 is set to 1 and the other public points have to satisfy the conditions (α2)^2 = α3 and (α3)^2 = α2. Therefore, after each Player has squared its share, two Players must exchange their secrets, which is called reordering in [RP12].
However, we instead realize the squaring operation by simply reusing our shared multiplication circuit. This creates a correct representation of z without requiring specific public points or a reordering operation. Indeed, following the conditions for the public points given in [RP12] would lead to less computation overhead and higher performance compared to our considered solution. On the other hand, since we are implementing the scheme as a hardware circuit, our multiplication reuse solution reduces the area requirements and leads to a simpler control structure.
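The effect of share-wise squaring, and of the reordering idea attributed to [RP12], can be checked with a small Python model. The choice of public points below (1 and the two roots of w^2 = w + 1 in the AES field) is an assumption made to satisfy the stated conditions; it is not taken from the original design.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def share(x, a, alphas):
    """Deterministic sharing x_i = x + a * alpha_i for a given coefficient a."""
    return tuple(x ^ gf_mul(a, al) for al in alphas)


def square_shares(xs):
    """Every Player squares only its own share."""
    return tuple(gf_mul(xi, xi) for xi in xs)


# Find w with w^2 = w + 1; then alpha2 = w and alpha3 = w + 1 satisfy
# alpha2^2 = alpha3 and alpha3^2 = alpha2 (hypothetical parameter choice).
OMEGA = next(w for w in range(2, 256) if gf_mul(w, w) == w ^ 1)
ALPHAS = (0x01, OMEGA, OMEGA ^ 1)


def square_with_reordering(xs):
    """Square locally, then let Players 2 and 3 swap their results."""
    z1, z2, z3 = square_shares(xs)
    return (z1, z3, z2)
```

For general public points, `square_shares(share(x, a, alphas))` equals a sharing of x^2 with coefficient a^2 over the squared points. With the special points above, `square_with_reordering` returns a sharing of x^2 over the original points.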
The inversion part of the AES S-box can be computed by the scheme presented in [RP10] as
x^{-1} = x^{254} = ((x^2 ⊗ x)^4 ⊗ (x^2 ⊗ x))^{16} ⊗ (x^2 ⊗ x)^4 ⊗ x^2.
Since this scheme contains only a couple of square and multiply operations, the inversion can be realized by our shared multiplication scheme. In contrast to what is stated in both [PR11] and [GM11a], the affine transformation following the inversion cannot be performed in a straightforward manner. The reason is, as also addressed in [epr11], that the linear part of the affine transformation of the AES S-box is a linear function over GF(2), not over GF(2^8). The solution for this problem, as also stated in [RP12], is to represent the affine transformation over GF(2^8) (using the Rijndael irreducible polynomial). This actually has been presented before in [MR02] and [DR02] as
Affine(x) = 63 ⊕ (05 ⊗ x) ⊕ (09 ⊗ x^2) ⊕ (f9 ⊗ x^4) ⊕ (25 ⊗ x^8) ⊕ (f4 ⊗ x^16) ⊕ (01 ⊗ x^32) ⊕ (b5 ⊗ x^64) ⊕ (8f ⊗ x^128).
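As a sanity check, the polynomial representation of the affine transformation can be compared with the usual bit-level definition. The following Python sketch does this for all 256 inputs; the function names are hypothetical, and the bit-level formula is the standard one from the AES specification rather than something defined in this chapter.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


# Coefficients of x, x^2, x^4, ..., x^128 as given in the equation above.
COEFFS = (0x05, 0x09, 0xF9, 0x25, 0xF4, 0x01, 0xB5, 0x8F)


def affine_poly(x):
    """Affine transformation as a linearized polynomial over GF(2^8)."""
    acc = 0x63
    power = x  # holds x^(2^k) in iteration k
    for c in COEFFS:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)
    return acc


def affine_bits(x):
    """Bit-level affine transformation: b'_i = b_i + b_{i+4} + ... + b_{i+7} + c_i."""
    out = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
        bit ^= (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        out |= (bit & 1) << i
    return out
```

Because only squarings, constant multiplications, and additions occur in `affine_poly`, every term can be evaluated with the shared operations described in this section.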
Fig. 5.1(a) displays the sequence of operations for a complete S-box computation considering the secret sharing restated above. Note that the modules denoted by a black ⊗ indicate the shared multiplication, and those denoted by a gray ⊗ the multiplication with a constant.
5.3 Implementation & Caveats
In order to implement the aforementioned scheme one needs to follow the requirements addressed in [PR11]. The goal of the scheme is to separate the side-channel leakage of the computations done by each Player in order to prevent any univariate leakage. As stated in [PR11], there are two possible ways to separate the leakage. Either the circuit of each Player is realized by dedicated hardware, e.g., one FPGA per Player, which does not seem to be practical, or the operations of each Player are separated in time. We follow the second option and have tried to fit the whole scheme into one FPGA – with the goal of a global minimum area overhead – by the design shown in Fig. 5.2.
By means of a dedicated and carefully designed control unit we made sure that the Players become active sequentially. In other words, no computation or activity is done by the other Players when one Player is active. The design of the shared multiplication module is slightly different from the other modules. In contrast to the others, where the computation on each share by the corresponding Player is independent of that of the other shares, the Players in the shared multiplication module need to communicate with each other. Therefore, we had to divide the computations of each share in this module into two parts by inserting a register between the two steps as explained in Section 5.2 (see registers marked by qi,j in Fig. 5.2).
Figure 5.2: Our design of the shared multiplication and addition to realize the AES S-box
Another important issue regarding our design is the way that the multiplexers are controlled. Since the shared multiplication module needs to get different inputs in order to realize a multiplication or a square, there should be a multiplexer to switch between different inputs. That is because – considering Fig. 5.1(a) – the shared multiplication module always performs squaring except in steps 2, 5, 10, and 11. Control signals which select the appropriate multiplexer input must be hazardless². Otherwise, as an example, glitches on select signals of Player 1 while Player 2 is active will lead to concurrent side-channel leakage of two shares. Therefore, as a solution we provided some registers to control which input is given to the target module.
² In the area of digital logic a dynamic hazard means undesirable transient changes in the output as a result of a single input change.
For simplicity, we first explain how the shared multiplication module works:
In the first clock cycle, by activating the enable signal em1, the first shares of both appropriate inputs are saved into their corresponding registers, get selected by select signal selm1, and are therefore multiplied. At the same time the remasking process using a new random a1 and public points α1, α2, α3 is performed. Note that the results of these computations are not saved in this clock cycle.
The same procedure as in the first clock cycle is performed on the second and the third shares one after the other in the second and the third clock cycles by activating enable signals em2 and em3 respectively.
The results of the remasking for Player 1 (indeed provided by all 3 Players), which are available at the input of registers q1,1, q2,1, q3,1, are stored in the fourth clock cycle by enabling signal em4. Therefore, the second step of the module becomes active and performs the unmasking using λ1, λ2, λ3 to provide the first share of the multiplication output. Note that again the result is not saved in this clock cycle.
In the next two clock cycles (fifth and sixth) the same operation as in the previous clock cycle is performed for Player 2 and Player 3 consecutively by enable signals em5 and em6.
Note that to save x^2, x^3, and x^12 (see Fig. 5.1(a)) in the appropriate step, one of the signals es^2_i, es^3_i, and es^12_i (i ∈ {1,2,3}) gets enabled at the same time as the corresponding em_i signal. In fact, we need six clock cycles to completely perform a shared multiplication or a square. Since we use only one shared multiplication module in our design, this means that the inverse of the given shared input is computed in 6 × 11 = 66 clock cycles.
Afterwards, in order to realize the affine transformation the multiplication-addition module (modules AFF1, AFF2, and AFF3 in Fig. 5.2) must also contribute to the computations. The Players in this module do not need to establish any communication and their computation is restricted to their own shares. Therefore, by appropriately selecting sela_i (i ∈ {1,2,3}) and enabling the ea_i signal, the multiplication with a constant and the shared addition can both be done in one clock cycle per share, i.e., three clock cycles in sum. Note that the same techniques as before to make hazardless control signals are used in the design of the multiplication-addition module. Also, the sequence of operations is similar to what is described for the first three clock cycles of the shared multiplication module. According to Fig. 5.1(a), during the affine transformation a multiplication-addition operation must be performed prior to each and after the last square. Therefore, after 3 × 8 + 6 × 7 = 66 clock cycles the operations of an affine transformation are completed, resulting in 132 clock cycles in sum to compute an S-box shared output. One optimization option is to perform the multiplication-addition and the first three clock cycles of the squaring at the same time to save 24 clock cycles per S-box computation.
According to the definition and the requirements of the scheme, this optimization should not cause any security loss. However, since our main goal is to practically examine the side-channel leakage of this scheme, we ignored this optimization to be able to separately localize the side-channel leakage of each operation.
Table 5.1: Area and time overhead of our design based on XC5VLX50 Virtex-5 FPGA (excluding state register, KeySchedule, PRNGs, initial masking, and final unmasking)
Design      FF #    FF %    LUT #     LUT %    Slice #   Slice %   SB [CLK]   MC+ARK [CLK]   Encryption [CLK]
1 SB MC     315     1%      1387      5%       859       12%       2112       192            22 896
16 SB MC    4275    15%     21 328    74%      no fit    no fit    132        12             1431
Though an optimized scenario to perform MixColumns is proposed in [PR11], by adding more multiplexers (and select registers) to the multiplication-addition module our presented design can also realize MixColumns and AddRoundKey. This can be done according to the diagram given by Fig. 5.1(b) and selecting the appropriate coefficients c1, c2, c3, c4 corresponding to the rows of the matrix representation of MixColumns. After finishing all SubBytes transformations of one encryption round, i.e., 132 × 16 = 2112 clock cycles, every output byte of the MixColumns transformation in addition to the corresponding AddRoundKey can be computed in 3 × 4 = 12 clock cycles. That is, 12 × 16 = 192 clock cycles for the whole of the MixColumns and AddRoundKey transformations. In sum, ignoring the required time for initial masking of the input and the key and for (pre)computing the round keys, a whole encryption process takes 2112 × 10 + 192 × 9 + 3 × 16 = 22 896 clock cycles.³
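The clock-cycle figures of this section can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical variable names; the formula for the design with 16 parallel instances is an assumption that happens to match the value listed in Table 5.1, not a derivation given in the text.

```python
# Per-operation costs as described in the text.
CYCLES_PER_SHARED_MUL = 6  # three multiply/remask cycles plus three recombination cycles
CYCLES_PER_MUL_ADD = 3     # one cycle per share in the multiplication-addition module

# One shared S-box: 11 shared multiplications/squarings for the inversion,
# then 8 multiplication-additions and 7 squarings for the affine part.
inversion = CYCLES_PER_SHARED_MUL * 11
affine = CYCLES_PER_MUL_ADD * 8 + CYCLES_PER_SHARED_MUL * 7
sbox = inversion + affine

# One round with a single S-box instance processing 16 state bytes.
subbytes_round = sbox * 16
mc_ark_byte = CYCLES_PER_MUL_ADD * 4
mc_ark_round = mc_ark_byte * 16
final_ark = CYCLES_PER_MUL_ADD * 16  # last round: AddRoundKey only

encryption_single = subbytes_round * 10 + mc_ark_round * 9 + final_ark

# Assumed model for 16 parallel instances: all bytes are processed concurrently.
encryption_parallel = sbox * 10 + mc_ark_byte * 9 + CYCLES_PER_MUL_ADD

print(inversion, affine, sbox)        # 66 66 132
print(subbytes_round, mc_ark_round)   # 2112 192
print(encryption_single)              # 22896
print(encryption_parallel)            # 1431
```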
We should stress that – except the mentioned one – no time-optimization option exists for our single-S-box design since no more than one share is allowed to be processed at the same time. It is possible to reach a higher throughput by making multiple, e.g., 16, instances of our design inside the target FPGA and processing all SubBytes and later all MixColumns in parallel. This, in fact, leads to a very high area overhead (see Table 5.1) that cannot even fit into the slices available in our target FPGA, which belongs to the medium-size modern series. We should emphasize that the GF(2^8) multiplier we employed here is a highly optimized and purely combinatorial circuit, and the design is made for arbitrary public values αi and λi (i ∈ {1,2,3}).
³ In the last round MixColumns is ignored and each separate AddRoundKey on one shared state value takes 3 clock cycles.
5.4 Conclusion
In this work we have demonstrated how to correctly implement the provably secure glitch-resistant masking scheme of [PR11]. By making certain that at each point in time only operations on a single share are performed, there should in theory exist no exploitable univariate leakage, which was also confirmed by practical evaluations when using a low operation frequency and a basic measurement setup. For details on the evaluation we refer to [MM13b]. While the countermeasure is valuable from a theoretical point of view, its use in real-world implementations is unlikely because of the large overheads in area and performance.
Chapter 6
Are Dual Ciphers a Side-Channel Countermeasure?
While keeping the in- and outputs of a dual cipher equal to the original AES, all the intermediate values and operations can be different from those of the original one. A comprehensive list of these dual ciphers is given by an article presented at ASIACRYPT 2002, where it is mentioned that they might be used as a kind of side-channel attack countermeasure if the dual cipher is randomly selected. Later, in a couple of works, performance figures and overhead penalties of hardware implementations of this scheme were reported. However, the suitability of using randomly selected dual ciphers as a power analysis countermeasure has never been thoroughly evaluated in practice. In this chapter we address the pitfalls and flaws of this scheme when used as a side-channel countermeasure. As evidence of our claims, we provide practical evaluation results based on a Virtex-5 FPGA platform. We realized a design which randomly selects between the 240 different dual ciphers at each AES computation and examined its vulnerability to SCA attack models. As a result, we show that the protection provided by the scheme is negligible considering the increased costs in terms of area and lower throughput.
Contents of this Chapter
6.1 Introduction
6.2 Dual Cipher Concept
6.3 Design
6.4 Evaluation
6.5 Conclusion
6.1 Introduction
In the early 2000s, there were attempts to better understand the algebraic specification of AES-Rijndael. One is about how to make dual ciphers which are equivalent to the original Rijndael in all aspects [BB02a]. By replacing all the constants in Rijndael, including the replacement of the irreducible polynomial, the coefficients of the MixColumns, the affine transformation in the S-box, etc., the idea is to make different ciphers which generate the same ciphertext as the original Rijndael for the given plaintext and key. As explained in [BB02a], there exist 240 non-trivial Rijndael dual ciphers, and a comprehensive list of the matrices and coefficients is given in [BB02b]. Later in [Rad04], it has been shown that one can include field mappings from GF(2^8) to GF(2)^8 as well as intermediate isomorphic mappings to GF(2^2) and GF(2^4) to build 61 200 similar Rijndael dual ciphers.
This idea was taken up by the authors of [WLL04], who investigated by means of the gate count which of those 240 dual ciphers can be implemented in hardware using smaller area, and which ones can speed up the implementation. Since the intermediate values of the dual ciphers during encryption are different from Rijndael’s, it is mentioned in [BB02a] that one can randomly change the constants of the cipher, thereby realizing different dual ciphers, and provide security against power analysis attacks. This led to other contributions. For instance, a hardware-software co-design of a system based on an Altera FPGA, where the content of the lookup tables is dynamically changed according to the randomly chosen parameters, is presented in [JCCC07a, JCCC07b]¹. Moreover, the authors of [GL08] and [GL09] presented a hardware implementation which can realize every selected dual cipher amongst those 240 ones. They reported the performance and area loss when the scheme is realized in order to increase the security against side-channel attacks.
In this work we examine this scheme, i.e., random selection of constants to choose a dual cipher out of 240, from a side-channel point of view. We address its flaws and weaknesses which can lead to easily breaking the corresponding implementation. In order to examine our findings in practice, we implemented the scheme on a Virtex-5 FPGA by means of precomputed matrices and constants and – in contrast to [JCCC07b] – by avoiding the use of any lookup table. Our practical side-channel evaluations confirm our claims indicating that the protection provided by the scheme is negligible while having high area and performance overheads. We show that the implementation can be easily broken when a suitable attack model is taken by the adversary.
The next section restates the concept of Rijndael dual ciphers with respect to the original work [BB02a]. Our design of the scheme considering our targeted FPGA platform, together with its performance and area overhead figures, is presented in Section 6.3. Our discussions about the side-channel resistance of the scheme and practical investigations are given in Section 6.4, while Section 6.5 concludes our research on dual ciphers.
¹ In fact, the cipher which is realized by their design is not always equivalent to the original AES-Rijndael.
Results of this chapter were published at ICICS in 2013 [MM13a] as joint work with Amir Moradi.
6.2 Dual Cipher Concept
Two ciphers E and E′ are called dual ciphers, if they are isomorphic, i.e., if there exist invertible transformations f(·), g(·) and h(·) such that
∀P, K : E_K(P) = f(E′_{g(K)}(h(P))),
where plaintext and key are denoted by P and K respectively.
The concept of dual ciphers for AES-Rijndael was first published in 2002 [BB02a]. The authors demonstrate how to build a square dual cipher of the original AES and show that it is possible to again iterate this process multiple times creating more square dual ciphers. This way 8 dual ciphers for each possible irreducible polynomial in GF(2^8) can be derived. Since it is also shown how to create dual ciphers by porting the cipher to use one of the other 30 irreducible polynomials in GF(2^8), a total of 240 non-trivial dual ciphers for AES exist. Here non-trivial means that we are only considering those dual ciphers which actually change the inner core of AES and not only consist of invertible transformations of the input and output of the cipher.
As an example, closely following the explanation in [BB02a], let us consider a square dual cipher of the original AES-Rijndael. In order to create this dual cipher one first has to multiply all AES constants by a matrix which performs the squaring operation under the original AES-Rijndael polynomial 0x11b. These constants include the round constant of the key schedule, the coefficients of the MixColumns transformation, as well as the input data and the key. In this special example this matrix is generated by taking a generator a, in this case the polynomial x^2 in GF(2^8), and building a matrix of the form R = (a^0, a^1, a^2, a^3, a^4, a^5, a^6, a^7), where each of these elements represents a column of the matrix and the result of the exponentiation is reduced by the original AES reduction polynomial. The resulting matrix is
R =
1 0 0 0 1 0 1 0
0 0 0 0 1 0 1 1
0 1 0 0 0 1 0 0
0 0 0 0 1 1 1 1
0 0 1 0 1 0 0 1
0 0 0 0 0 1 1 0
0 0 0 1 0 1 0 0
0 0 0 0 0 0 1 1
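The construction of R can be reproduced with a short Python sketch. It assumes that the first matrix row corresponds to the least significant bit of each column, which is the convention under which the matrix above is obtained; the helper names are hypothetical.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the given reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def matrix_columns(generator=0x04, poly=0x11B):
    """Columns a^0 .. a^7 of R, each stored as one byte (generator a = x^2)."""
    cols, value = [], 1
    for _ in range(8):
        cols.append(value)
        value = gf_mul(value, generator, poly)
    return cols


def matrix_rows(cols):
    """Row r holds bit r of every column (row 0 = least significant bit)."""
    return [[(c >> r) & 1 for c in cols] for r in range(8)]


def apply_matrix(cols, v):
    """Matrix-vector product over GF(2): XOR the columns selected by the bits of v."""
    out = 0
    for i in range(8):
        if (v >> i) & 1:
            out ^= cols[i]
    return out


R_COLS = matrix_columns()
for row in matrix_rows(R_COLS):
    print(" ".join(str(bit) for bit in row))
```

Applying `apply_matrix(R_COLS, v)` to any byte `v` gives the same result as squaring `v` in the AES field, which is why the matrix realizes the squaring operation.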
Furthermore, we also need to make changes to the SubBytes transformation. If we consider SubBytes to be purely a table look-up of constants S(x), we can compute a new look-up table S^2 by applying the R matrix and its inverse R^{-1} as follows: S^2(x) = R S(R^{-1} x). If we consider the SubBytes transformation as inversion in GF(2^8) followed by a multiplication by the affine matrix A and addition of the constant b, then the inversion stays unchanged while the new affine matrix A^2 is computed as A^2 = R A R^{-1} and the new constant b^2 is computed (similar to the other constants, i.e., those of MixColumns and key schedule) by multiplying it with the transformation matrix R: b^2 = R b. Note that in the case of S^2 or A^2 no actual squaring is taking place; the superscript only denotes the square dual cipher.
If we consider the original cipher as E and the above described square dual cipher as E^2, by applying the same squaring routines again we can create a total of 8 square dual ciphers (up to E^128 since E^256 is equal to E in GF(2^8)). These square dual ciphers all use different constants and SubBytes transformations. According to the dual cipher concept, if the R matrices are multiplied with all input data bytes and key bytes and the result is transformed back by multiplying each output byte with the inverse matrix R^{-1}, the results of all ciphers when given the same input data and key will be equal. The differences in the internal structure, like the different S-box in SubBytes or the different coefficients in the MixColumns, also translate into, e.g., different power consumption and EM emanations of a circuit implementing this technique. As denoted in [BB02a], these differences in the internal structure of the dual ciphers might be usable as some kind of side-channel countermeasure. If the used dual cipher is randomly chosen, this could be comparable to a normal masking countermeasure.
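The two descriptions of the square dual S-box (table-based and inversion plus transformed affine part) can be checked against each other in software. The sketch below represents 8 × 8 matrices over GF(2) as lists of column bytes and obtains R^{-1} from the field square root x^128; these representation choices and all names are assumptions made for the illustration.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(2^8) elements modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def apply_matrix(cols, v):
    """GF(2) matrix-vector product; cols[i] is column i stored as a byte."""
    out = 0
    for i in range(8):
        if (v >> i) & 1:
            out ^= cols[i]
    return out


def compose(m, n):
    """Matrix product m * n in column representation."""
    return [apply_matrix(m, c) for c in n]


def rotl8(v, k):
    return ((v << k) | (v >> (8 - k))) & 0xFF


A = [rotl8(0x1F, i) for i in range(8)]           # AES affine matrix, column i
B = 0x63                                         # AES affine constant
R = [gf_pow(0x04, i) for i in range(8)]          # squaring matrix, columns (x^2)^i
R_INV = [gf_pow(1 << i, 128) for i in range(8)]  # square root is x -> x^128


def sbox(x):
    """Original AES S-box: inversion followed by the affine transformation."""
    return apply_matrix(A, gf_pow(x, 254)) ^ B


A2 = compose(R, compose(A, R_INV))  # A^2 = R A R^-1
B2 = apply_matrix(R, B)             # b^2 = R b


def sbox2_table(x):
    """Square dual S-box as S^2(x) = R S(R^-1 x)."""
    return apply_matrix(R, sbox(apply_matrix(R_INV, x)))


def sbox2_algebraic(x):
    """Square dual S-box as unchanged inversion plus transformed affine part."""
    return apply_matrix(A2, gf_pow(x, 254)) ^ B2
```

Both functions should agree on all 256 inputs, and feeding a squared input into the dual S-box should give the square of the original S-box output, which is the byte-level version of the duality property.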
Besides using square dual ciphers of the original AES-Rijndael, one can use the same transformation techniques as above to change all constants by using different generators a and reducing the a^i by the new irreducible polynomial. If the SubBytes transformation is not implemented as table look-up but as inversion plus affine transformation, the inversion as well as all field multiplications as in MixColumns are then also performed using the new irreducible polynomial, not the original one. This works for all 30 irreducible polynomials in GF(2^8). Since there exist 8 generators for all irreducible polynomials representing the 8 square dual ciphers, as stated previously a total of 240 different non-trivial dual ciphers in GF(2^8) exist. All generators, polynomials and constants of each of the 240 dual ciphers can be found in [BB02b]. Note that we consider only dual ciphers using mappings in GF(2^8), not those where other composite field representations are utilized, e.g., those presented in [Rad04].
Figure 6.1: Inversion circuit in GF(2^8) (operation sequence SQ, MUL, SQ, SQ, MUL, SQ, SQ, SQ, SQ, MUL, MUL with intermediate values x^2, x^3, x^6, x^12, x^15, x^30, x^60, x^120, x^240, x^252 and output x^{-1})
6.3 Design
The first design decision one has to make is whether to implement the SubBytes transformation purely based on look-up tables or if a general inversion circuit is used together with the affine matrix multiplication and constant addition. Since the area overhead to store 240 different complete S-boxes is massive, similar to [GL08] and [GL09] we opted to implement a general inversion circuit. Because we want to analyze the side-channel resistance of the original submission of dual ciphers [BB02a], this requires the inversion to be implemented in GF(2^8) without the option to save on resources by utilizing inversions in composite fields or using a tower field approach [Paa94, SMTM01]. In other words, the inversion circuit must be general and valid for all the 30 irreducible polynomials mentioned in Section 6.2.
In order to prevent leakage through the timing channel during the inversion it is important to make the circuit time invariant. To achieve this one can make use of the fact that in GF(2^8) x^256 is equivalent to x, which leads to x^{-1} ≡ x^254. Using addition chains this exponentiation can be implemented by a low number of modular multipliers and squaring circuits as depicted in Fig. 6.1. Note that the squaring step itself is free in GF(2^8) and only requires hardware resources for the modular reduction.
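The addition chain of Fig. 6.1 can be written down as a fixed sequence of squarings and multiplications that works for any reduction polynomial. The Python sketch below is a software illustration with hypothetical names; the second polynomial, 0x11d, is merely one further irreducible polynomial picked for the example.

```python
def gf_mul(a, b, poly):
    """Multiply two GF(2^8) elements modulo the given irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv_chain(x, poly=0x11B):
    """Compute x^254 with 7 squarings and 4 multiplications (fixed sequence)."""
    def sq(v):
        return gf_mul(v, v, poly)

    x2 = sq(x)                      # SQ
    x3 = gf_mul(x2, x, poly)        # MUL with x
    x6 = sq(x3)                     # SQ
    x12 = sq(x6)                    # SQ
    x15 = gf_mul(x12, x3, poly)     # MUL with x^3
    x30 = sq(x15)                   # SQ
    x60 = sq(x30)                   # SQ
    x120 = sq(x60)                  # SQ
    x240 = sq(x120)                 # SQ
    x252 = gf_mul(x240, x12, poly)  # MUL with x^12
    return gf_mul(x252, x2, poly)   # MUL with x^2 gives x^254
```

The sequence of operations does not depend on the value of `x`, which mirrors the time-invariance argument made above. Note that the bit-serial `gf_mul` used here is only a functional model and says nothing about timing behavior in software.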
For each possible dual cipher one needs to store the following parameters:
(1) Initial transformation matrix R (64 bits), which is required to transform the original input data and key to the dual cipher representation.
(2) Inverse transformation matrix R^{-1} (64 bits), required to transform the output of the AES computation from the dual cipher representation back to the original AES representation, precomputed as normal matrix inversion of R in GF(2).
(3) Modular reduction polynomial p̂ (8 bits), to be used during all field multiplications (MixColumns) and the inversion steps (SubBytes).
(4) MixColumns coefficients mĉ (2 × 8 bits). While the MixColumns coefficients originally are 8-bit elements of a 4 × 4 matrix, because the coefficients of each row are only a rotated variant of the first row and only two are not 0x01 (in GF(2^8)), it is sufficient to store only these two transformed coefficients (R(0x02), R(0x03)).
(5) Affine matrix of SubBytes Â (64 bits), to apply the affine matrix multiplication step of the affine transformation. The matrix is computed as Â = R A R^{-1}, where A is the original affine matrix of the AES.
(6) Affine constant b̂ of SubBytes (8 bits), final addition step of the affine transformation. As for every other constant transformation this can be computed as b̂ = R b, where b is the original affine constant, i.e., 0x63.
(7) Round constants (rcon) of the key scheduling rĉ (10 × 8 bits). The rcons are constructed as rĉ(r) = (R 0x02)^r mod p̂, with r starting from 1 for the first round, p̂ being the used irreducible polynomial, 0x02 the initial element, and R the transformation matrix. The rcons could also be computed on-the-fly, which would only require the storage of the transformed rĉ_init = R 0x02 (8 bits). Since this would require another modular multiplier, we have opted to store all the precomputed rcons for each of the 240 dual ciphers.
The overall architecture of our evaluation circuit is depicted in Fig. 6.2. The initial transformations of the input data and key are performed prior to the general AES/dual cipher computation. After the full encryption is complete, the inverse transformation moves the result back to the original AES representation as described previously. The AES/dual cipher computation itself is implemented as a round-based design, i.e., every round of AES requires one clock cycle and the computation is finished after ten clock cycles excluding the initial and final transformations and data loading.
We chose to implement a round-based design because this is very common in real-world implementations when a hardware platform is targeted. On-the-fly key scheduling seems to be the most suitable option, since the round keys, which are different for each dual cipher, would otherwise require 41 kBytes of storage. We have implemented the whole design on a Virtex-5 LX50 FPGA mounted on a SASEBO-GII [sas] (Side-channel Attack Standard Evaluation Board). In our implementation all the aforementioned parameters and constants are stored in Block RAMs (BRAMs) and are preloaded before every complete AES computation. The resource utilization is shown in Table 6.1. Compared to an unprotected design utilizing a more common S-box implementation based on look-up tables, we require significantly more LUT resources. This is due to the 20 large general inversion circuits implemented in parallel (16 for the round function and 4 for the key scheduling), which are required to perform the inversion in every selectable dual cipher representation. The number of LUTs could be heavily reduced by using a composite field or tower field approach in the S-box design which, as stated previously, we have not implemented at this point in order to enable a side-channel evaluation of the original dual cipher proposal. We should also highlight the very low maximum operating frequency of our design. It is due to the very long critical path of the inversion unit. Since it has to be general for any given irreducible polynomial, it could not be optimized with respect to both delay and area.

Table 6.1: Performance figures (excluding the PRNG)
Version                 #LUTs    #FFs   #BRAMs   Max. frequency
Random Dual Cipher      13 481   651    6        21 MHz
General AES Enc Only    503      154    6        202 MHz

Figure 6.2: Overall architecture of the AES dual ciphers circuit
6.4 Evaluation
As explained in the previous section, at the beginning of each encryption process a PRNG determines which of the 240 dual ciphers is selected. This dual cipher index i is provided as the address to the BRAM, which in turn outputs all the constants and coefficients for the whole circuit. If the dual cipher index is unknown to the adversary, he cannot predict the intermediate values. This can be seen as a kind of masking scheme providing some level of resistance against side-channel attacks. However, a few issues, which we address below, significantly affect the robustness of the scheme. Note that the practical evaluations whose results are described in this section were performed by Amir Moradi.
6.4.1 Mask Reuse
All intermediate values and inputs are transformed using the same transformation matrix $R_i$ of the corresponding dual-cipher. Therefore, all plaintext bytes are transformed using the same transformation, which leads to an issue similar to mask reuse in masking schemes. Considering a Boolean masking scheme, if the inputs of two S-boxes are masked by the same mask value, the corresponding key byte difference might be recovered by a classical linear collision attack [Bog08, CFG+11]. The same holds true for dual-ciphers, since the inputs are transformed by the same matrix and all S-boxes use the same dual-cipher parameters. In case a collision in the side-channel leakage between two S-boxes is detected,
$$S_i(R_i(x^{(1)} \oplus k^{(1)})) = S_i(R_i(x^{(2)} \oplus k^{(2)})),$$
the linear key difference $k^{(1)} \oplus k^{(2)}$ can be recovered as $x^{(1)} \oplus x^{(2)}$, where the superscript $(j)$ denotes the byte index of the given plaintext and key. However, since our design is realized as a round-based architecture, separating the side-channel leakage of different S-boxes and detecting collisions is infeasible.
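The collision argument only relies on the combined mapping of transformation and S-box being a bijection that is shared by both S-box instances. The following sketch illustrates this with a random byte permutation as a stand-in for that combined mapping (an assumption made for brevity; the key bytes and the function name are likewise only example values).

```python
import random

rng = random.Random(2015)

# Stand-in for the combined bijection S_i(R_i(.)) of one dual cipher. The
# argument only needs bijectivity, so a random byte permutation is used here.
combined = list(range(256))
rng.shuffle(combined)

k1, k2 = 0x3A, 0xC5  # two secret key bytes (arbitrary example values)


def colliding_differences(k_a, k_b):
    """Collect x1 ^ x2 over all plaintext byte pairs whose S-box outputs collide."""
    differences = set()
    for x1 in range(256):
        for x2 in range(256):
            if combined[x1 ^ k_a] == combined[x2 ^ k_b]:
                differences.add(x1 ^ x2)
    return differences


recovered = colliding_differences(k1, k2)
```

Every colliding pair yields the same value `k1 ^ k2`, independent of which bijection is in use, which is why a randomly selected dual cipher does not hide the linear key difference once collisions can be detected.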
6.4.2 Concurrent Processing of Mask and the Masked Data
Contrary to implementations of masking in software, preventing univariate leakages is a very challenging task on hardware platforms. When a masked S-box is processing both the mask and the masked data at the same time, glitches in the circuit (see Section 2.2) can cause exploitable leakage. This issue has been observed in many realizations of masking schemes [MPO05, MME10, MM12b], and the dual ciphers scheme suffers from the same issue. The SubBytes unit receives the transformed key-whitened input as well as the irreducible polynomial $\hat{p}$, the affine matrix coefficients $\hat{A}$, etc. Due to the glitches, the side-channel leakage of the S-box is therefore not independent of its untransformed input. A univariate attack, e.g., a CPA [BCO04], should be able to recover the secrets if an appropriate power model is used.
Figure 6.3: Distributions of the S-box output for (top) $11_x$ and (bottom) $44_x$ as original input over all 240 dual ciphers (horizontal axis: S-box output value, vertical axis: frequency)
6.4.3 Unbalance
Considering the lemmas and properties of [NRR06, NRS11], we explain this issue as follows. Let us assume a masking scheme where an input value $x$ is transformed into its masked representation $x_m$ with a mask $m$: $x_m = x * m$. The conditional probability
$$\Pr(x_m = X_M \mid x)$$
must be constant for all $x$ to ensure the balance of the distributions (by $X_M$ we mean a realization of $x_m$). In other words, if $f_x(x_m)$ represents the probability density function of $x_m$ for a given $x$, the probability distributions for two different realizations of $x$ must be the same. If the distributions were different, their corresponding side-channel leakages could be distinguished from each other, making it possible to detect which value of $x$ was processed. For a scheme to be considered secure, this property has to hold for all intermediate values, which is not the case in our dual cipher design. Taking two distinct values for the S-box output and computing the corresponding probability distributions of their inputs for all 240 dual-cipher cases clearly indicates that those intermediate values are unbalanced (see Fig. 6.3). Again, a univariate side-channel attack should therefore be able to extract the secrets, since leakages for different inputs can be distinguished.
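The balance property can be checked numerically for the transformation step itself. The sketch below is an illustration, not the analysis behind Fig. 6.3: it assumes that the 240 representations are obtained from the 30 irreducible polynomials of degree 8 over GF(2) with 8 isomorphisms each, looks only at the transformed value $R_i(x)$ rather than at the S-box output, and uses hypothetical function names.

```python
from collections import Counter


def poly_mod(a, b):
    """Remainder of GF(2) polynomial a divided by b (bit i = coefficient of x^i)."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a


def is_irreducible_deg8(p):
    """Trial division by every polynomial of degree 1 to 4."""
    return all(poly_mod(p, d) != 0 for d in range(2, 32))


def gf_mul(a, b, poly):
    """Multiply two bytes in GF(2^8) modulo the degree-8 polynomial `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(a, e, poly):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a, poly)
    return result


def all_isomorphisms():
    """Column lists of every R mapping the AES field to a GF(2^8) representation."""
    maps = []
    for poly in range(0x100, 0x200):
        if not is_irreducible_deg8(poly):
            continue
        for beta in range(2, 256):
            # beta must be a root of the AES polynomial x^8 + x^4 + x^3 + x + 1.
            value = (gf_pow(beta, 8, poly) ^ gf_pow(beta, 4, poly)
                     ^ gf_pow(beta, 3, poly) ^ beta ^ 1)
            if value == 0:
                maps.append([gf_pow(beta, i, poly) for i in range(8)])
    return maps


def transform(columns, v):
    """Multiply the GF(2) matrix given by its columns with the bit vector v."""
    out = 0
    for i in range(8):
        if (v >> i) & 1:
            out ^= columns[i]
    return out


ISOMORPHISMS = all_isomorphisms()


def distribution(x):
    """Histogram of R_i(x) over all enumerated transformations."""
    return Counter(transform(cols, x) for cols in ISOMORPHISMS)
```

Comparing, for example, `distribution(0x00)`, `distribution(0x01)` and `distribution(0x11)` shows that the histograms depend on x: the values 0 and 1 are fixed points of every transformation, so their distributions are concentrated on a single value.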
6.4.4 Zero Value
A general problem in multiplicative masking schemes is the masking of the zero value. Regardless of the mask m, the x = 0 input will always be mapped to itself. A CPA attack utilizing the zero-value power model [GT02] can therefore usually extract the secrets. This also holds true for the dual cipher approach. Since the transformation step
consists of a linear matrix multiplication with $R$ in GF(2), the zero input remains zero in all 240 dual-cipher cases. Therefore, a zero-value CPA attack targeting the S-box input should easily be able to overcome the countermeasure.
Since we have so far only considered the 240 dual-ciphers of the original publication in our design, one might argue that some of those issues might be mitigated if one can select the set of dual-ciphers from the 61 200 cases of [Rad04]. Leaving the zero-value issue aside, even if one could find a set of dual ciphers which satisfies the balance property on the S-box input, the balance property cannot be guaranteed to hold for the S-box output at the same time, because each dual cipher employs a different S-box. Therefore, the aforementioned problems remain valid for any selection of dual-ciphers, making implementations employing this scheme vulnerable to certain attacks.
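The effect of the zero value can be reproduced in a small simulation. The following sketch does not model the measured FPGA design; it assumes a generic multiplicative masking in the AES field as a stand-in, a noise-free Hamming weight leakage of the masked S-box input, and an arbitrary example key byte. All names are hypothetical.

```python
import random

AES_POLY = 0x11B


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def hamming_weight(v):
    return bin(v).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(7)
SECRET_KEY = 0xA7  # example value

# Each plaintext byte value occurs 20 times; the mask is a fresh nonzero byte.
plaintexts = [t % 256 for t in range(256 * 20)]
leakage = [hamming_weight(gf_mul(x ^ SECRET_KEY, rng.randrange(1, 256)))
           for x in plaintexts]


def zero_value_cpa(plaintexts, leakage):
    """Return the key guess whose zero-value model correlates best with the leakage."""
    scores = {}
    for guess in range(256):
        model = [1 if (x ^ guess) == 0 else 0 for x in plaintexts]
        scores[guess] = abs(pearson(model, leakage))
    return max(scores, key=scores.get)


recovered = zero_value_cpa(plaintexts, leakage)
```

Because the product with any nonzero mask is zero exactly when the unmasked S-box input is zero, the traces with zero leakage single out the plaintexts equal to the key byte, and the correct guess stands out despite the random masks.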
6.4.5 Practical Investigations
As stated before, our practical evaluations are based on experiments performed on a SASEBO-GII [sas] FPGA platform. The design was implemented on the crypto FPGA of the board, a Xilinx Virtex-5 LX50. For each encryption operation the dual cipher index i was randomly chosen by means of an internal PRNG. A LeCroy HRO66Zi 600 MHz digital oscilloscope was used to measure the power consumption of the crypto core at a sampling rate of 1 GS/s with a 1 Ω resistor in the VDD path. All experiments were performed while running the crypto core at a low clock frequency of 1.5 MHz. The bandwidth of the oscilloscope was limited to 20 MHz to further reduce electrical noise. A sample power trace of a complete AES computation is shown in Fig. 6.4. The first peak between 0 and 1 µs is caused by the selection of the dual cipher, where the corresponding parameters of the selected dual cipher appear at the BRAMs' output and propagate through the combinatorial circuit. The following peaks are due to the 10 AES rounds and the final storage of the ciphertext in the state register. Note that the peak-to-peak voltage of more than 300 mV is quite high, which is due to the required general purpose inversion circuits (cf. Section 6.3).
Figure 6.4: A sample power trace, PRNG ON (horizontal axis: time in µs, vertical axis: voltage in units of 100 mV)
Figure 6.5: Correlation-Enhanced Collision Attack results, (a) PRNG ON, using 500 000 traces, (b) PRNG ON and (c) PRNG OFF, over the number of traces
First we mounted a Correlation-Enhanced Collision Attack (CECA) (cf. Section 2.3), targeting two S-boxes of the first round. While attacking two different instantiations of an S-box is not the optimal case for the CECA, we were still able to recover the linear key difference, as depicted in Fig. 6.5. Compared to the unprotected implementation, where the PRNG is switched off, thereby selecting the original AES parameters, the number of required traces increases from 5000 to 100 000.
Next, we examined the feasibility of a zero-value attack, whose evaluation results are presented in Fig. 6.6. They confirm our theoretical claim that the zero value is amongst the weakest points of the scheme. As few as 10 000 traces are sufficient to overcome the provided protection.
6.5 Conclusion
In this work we have taken an in-depth look at the AES-Rijndael dual cipher concept from a side-channel point of view. We have implemented an evaluation circuit which is able to perform AES computations using randomly chosen dual ciphers. The inversion part of the circuit operates in GF($2^8$), as in the original dual cipher contribution [BB02a], giving a total choice of 240 different internal computations with correspondingly different side-channel leakage characteristics.
Besides providing practical evidence of the vulnerability of this original dual cipher implementation to several side-channel attacks, we have also described some of the general flaws of the scheme when considered as a side-channel countermeasure. This includes the mask reuse, the concurrent operations on both the mask and the masked data, the violation of the balance property, and the inability to mask the zero value. Because of these properties, the vulnerability of dual cipher implementations is not limited to those which are restricted to a small number of possible transformations by focusing on mappings in GF($2^8$). Even if one were able to select between several thousand dual ciphers using composite fields, as given in [Rad04], the described weaknesses would still exist and would enable an attacker to successfully extract the secret key. In conclusion, even when ignoring the large area overhead of the circuit in comparison to other, lighter masking schemes, AES-Rijndael dual ciphers are unsuitable as a side-channel countermeasure and can be broken using modest efforts and simple attack models.

Figure 6.6: Zero-value attack results, (a) PRNG ON, using 100 000 traces, (b) PRNG ON and (c) PRNG OFF, over the number of traces
Part II
Fault Analysis of AES
Chapter 7
Preliminaries
This chapter provides an introduction to fault injection attacks and countermeasures. It also outlines the remaining chapters of this part of the dissertation.
Contents of this Chapter
7.1 Introduction
7.2 Fault Injection Attacks
7.3 Countermeasures
7.4 Joint Motivation and Contributions
7.1 Introduction
Fault Injection Analysis (FIA) belongs to the active branch of physical attacks. For an FIA attack a device has to be forced into faulty behavior, which can be achieved by (semi-)invasive or non-invasive means. Examples of invasive attacks are the use of laser light to destroy a memory cell, UV strobes to create electric charges, cutting a wire, or forcing a signal to a constant value by use of a probing station. Operating the device outside the defined environmental conditions (too high or too low operating voltage or temperature), or using voltage or clock glitches, are examples of non-invasive fault injection. The fault injection itself might already lead to a successful outcome, e.g., if a certain instruction is skipped by the target device, but more often the output of the faulty computation is used to recover the secret. The following sections present certain fault attack techniques, in particular Fault Sensitivity Analysis (FSA), which is the basis for our proposed attack improvements in later chapters. We also provide an overview of common countermeasures and outline the remaining chapters of this dissertation.
7.2 Fault Injection Attacks
We introduce both Differential Fault Analysis (DFA), which is the most dominant fault-based attack in the literature, and Fault Sensitivity Analysis (FSA), a more recent technique presented at CHES in 2010.
7.2.1 Differential Fault Analysis (DFA)
Differential Fault Analysis [BS97] was proposed at CRYPTO in 1997. While no practical results were presented at that time, it was shown that, considering a certain fault model, the secret key of a DES computation could be recovered in simulation using just a few faulty and non-faulty ciphertext pairs. The attack requires an attacker who is able to cause a single or at least a low number of bit flips during an encryption operation. He must also be able to collect the resulting faulty ciphertext C′ as well as the non-faulty ciphertext C belonging to the same plaintext. Based on a key guess, an intermediate state of the cipher is then computed for both C and C′. If the relation of both intermediate states fits the assumed fault characteristic, e.g., a single bit flip in the targeted round, the key guess is kept as a candidate. The secret key can be recovered after repeating this procedure for several ciphertext pairs.
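The key filtering step of a DFA can be sketched for a single byte. The example below is an illustration under assumptions that are not taken from [BS97]: it targets one byte of the last AES round instead of DES, assumes a single bit flip at the S-box input as fault model, and simulates the faulty device in software. The key byte and all function names are example choices.

```python
import random

AES_POLY = 0x11B


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def sbox(x):
    """AES S-box: inversion (x^254) followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    y = 0
    for i in range(8):
        bit = 0
        for shift in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + shift) % 8)) & 1
        y |= bit << i
    return y ^ 0x63


SBOX = [sbox(x) for x in range(256)]
INV_SBOX = [0] * 256
for value, image in enumerate(SBOX):
    INV_SBOX[image] = value


def faulty_pair(rng, key_byte):
    """Simulate one last-round byte without and with a single-bit fault."""
    state = rng.randrange(256)
    fault = 1 << rng.randrange(8)
    return SBOX[state] ^ key_byte, SBOX[state ^ fault] ^ key_byte


def consistent_guesses(c, c_faulty):
    """Key guesses for which the two computed S-box inputs differ in one bit."""
    return {g for g in range(256)
            if bin(INV_SBOX[c ^ g] ^ INV_SBOX[c_faulty ^ g]).count("1") == 1}


def recover_key_byte(rng, key_byte, max_pairs=100):
    """Intersect the candidate sets of successive pairs until one guess remains."""
    candidates = set(range(256))
    pairs_used = 0
    while len(candidates) > 1 and pairs_used < max_pairs:
        candidates &= consistent_guesses(*faulty_pair(rng, key_byte))
        pairs_used += 1
    return candidates, pairs_used


candidates, pairs_used = recover_key_byte(random.Random(1997), 0x5C)
```

The correct key byte is consistent with every pair by construction, whereas a wrong guess survives a given pair only by chance, so the candidate set shrinks quickly.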
7.2.2 Fault Sensitivity Analysis (FSA)
Unlike DFA [BS97], no faulty ciphertexts are required by a Fault Sensitivity Analysis (FSA) [LSG+10] attack. Instead, the attack works by increasing the fault intensity until a distinguishable characteristic can be observed, e.g., the first appearance of a faulty output. Therefore the value of the faulty output is not required, only the fact that a fault occurred under the used operating conditions. It was practically demonstrated that this attack is able to completely break the AES PPRM1 core of the SASEBO LSI2 (fabricated in 130 nm technology) using 200 faulty operations for each of 50 randomly selected plaintexts. It could also reveal three key bytes of the AES WDDL implementation of the same ASIC using its fault sensitivity leakage obtained from 1200 plaintexts. The method used to increase the fault intensity in [LSG+10] is based on the shortening of clock glitches. Two normal clock cycles are replaced by a short and a longer one, whereby the length of the short one can be gradually decreased until a faulty output occurs or the fault becomes stable. Since the critical path of some gates, e.g., AND and OR gates, is data dependent, the knowledge of the underlying model for this data dependency helps to reveal the secret. For example, by simulation it could be
ascertained that the timing delay of computing the output of a PPRM S-box correlates to the Hamming Weight (HW) of its input [LSG+10]. While WDDL should in theory be immune to set-up time violation attacks, by creating templates with a known key it was shown that at least some bits correlate to the timing delay, which led to the aforementioned recovery of three key bytes. However, for an S-box implemented by an inversion circuit followed by an affine transformation, it could not be shown how to map the timing information of the faults to input values of the S-box. Since no faulty ciphertexts are required, the attack might also be applicable to implementations which apply DFA countermeasures. Therefore, even if a DFA countermeasure is implemented which hinders the propagation of the faulty ciphertext, just knowing that a fault occurred might be enough to break the implementation, as we will demonstrate in Chapter 9. Another difference to DFA attacks is that in the case of FSA the faults do not need to be restricted to a small sub-space. On the contrary, when attacking, for example, the last round of the AES PPRM1 implementation, each faulty output byte can be independently observed and therefore the same complete faulty output can be used to attack all key bytes simultaneously. Moreover, as stated in [LSG+10], while countermeasures like masking are only of limited use against DFA attacks, they may have a larger impact on FSA attacks since the critical path is affected by the random mask bits.
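The principle of an FSA can be sketched in simulation. The timing model below is purely hypothetical (a delay that grows linearly with the Hamming weight of the S-box input plus Gaussian noise, with made-up constants); it only serves to show how the glitch length at which the first fault appears can be correlated with a key-dependent prediction. It does not reproduce the experiments of [LSG+10], and all names and parameters are assumptions.

```python
import random

AES_POLY = 0x11B


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def sbox(x):
    """AES S-box: inversion (x^254) followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    y = 0
    for i in range(8):
        bit = 0
        for shift in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + shift) % 8)) & 1
        y |= bit << i
    return y ^ 0x63


SBOX = [sbox(x) for x in range(256)]
INV_SBOX = [0] * 256
for value, image in enumerate(SBOX):
    INV_SBOX[image] = value


def hamming_weight(v):
    return bin(v).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fault_sensitivity(sbox_input, rng):
    """Shorten the glitch period stepwise until the (modeled) first fault occurs."""
    delay = 10.0 + 0.5 * hamming_weight(sbox_input) + rng.gauss(0.0, 0.3)
    period = 20.0
    while period > delay:
        period -= 0.1
    return period


def fsa_attack(ciphertexts, sensitivities):
    """Return the key guess whose HW prediction correlates best with the sensitivity."""
    best_guess, best_score = None, -1.0
    for guess in range(256):
        model = [hamming_weight(INV_SBOX[c ^ guess]) for c in ciphertexts]
        score = abs(pearson(model, sensitivities))
        if score > best_score:
            best_guess, best_score = guess, score
    return best_guess


rng = random.Random(2010)
SECRET_KEY = 0xB4  # example last-round key byte
inputs = [rng.randrange(256) for _ in range(1000)]
ciphertexts = [SBOX[x] ^ SECRET_KEY for x in inputs]
sensitivities = [fault_sensitivity(x, rng) for x in inputs]
recovered = fsa_attack(ciphertexts, sensitivities)
```

Note that the simulated attacker only uses the correct ciphertexts and the glitch length at which a fault first appeared; the values of the faulty outputs never enter the computation.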
7.3 Countermeasures
One can distinguish between two different types of fault attack countermeasures. By embedding active sensors in a circuit, it is possible to detect the fault injection itself and shut down the running cryptographic operation. A fault injection can also be detected by algorithmic means, e.g., by inserting redundancy into the circuit and verifying that no mismatch occurs.
7.3.1 Sensors
If a designer is in complete control of the fabrication, as is the case for cryptographic algorithms embedded in ASICs, active countermeasures can be employed. External power and clock lines can be monitored and, in case an anomaly is detected, an alarm can be raised, usually resulting in a reset of the whole chip. Important circuit parts can be covered by active shielding to detect, or at least increase the difficulty of, modification attacks, e.g., rewiring or cutting with a focused ion beam (FIB).