TOWARDS EFFICIENT PRACTICAL SIDE-CHANNEL CRYPTANALYSIS
Improved Implementations, Novel Methods, Applications, and Real-world Attacks

DISSERTATION

zur Erlangung des Grades eines Doktor-Ingenieurs der Fakultät für Elektrotechnik und Informationstechnik an der Ruhr-Universität Bochum

Timo Bartkewitz
Bochum, September 2016
Copyright © 2016 by Timo Bartkewitz. All rights reserved. Printed in Germany.
To my parents Claudia and Ralf, and Miriam my wife

Timo Bartkewitz
Place of birth: Bochum, Germany
Author's contact information: [email protected]
www.rub.de

Thesis Advisor: Prof. Dr.-Ing. Christof Paar, Ruhr University Bochum, Germany
Secondary Referees: Prof. Dr.-Ing. Kerstin Lemke-Rust, Bonn-Rhine-Sieg University of Applied Sciences, Germany; Prof. Dr. rer. nat. Jörg Schwenk, Ruhr University Bochum, Germany
Thesis submitted: September 23, 2016
Thesis defense: June 9, 2017
Last revision: July 22, 2017


[As HAL does not open the pod bay doors]
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Dave: What are you talking about, HAL?
HAL: This mission is too important for me to allow you to jeopardize it.
Dave: I don't know what you're talking about, HAL.
HAL: I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen.
Dave: Where the hell did you get that idea, HAL?
HAL: Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

HAL 9000 and Commander Dr. David Bowman in 2001: A Space Odyssey (1968).


Abstract

Cryptography, the art of keeping the written word secret, is one of the most integral parts of today's world; its existence is indispensable for the economy and for political interaction. It is applied to protect assets which are, of course, manifold. Ranging from military documents over financial transactions to the citizen's electronic identity, their need for protection involves confidentiality, integrity, and authenticity, amongst others. Circumventing cryptography is therefore one of the primary goals in all of these domains; a business, however, which is also conducted in secret.

From the mathematical point of view, approved cryptography is strong; from the technical point of view, however, it is weak. What makes the difference? It is the technical implementation of the mathematical construct that makes cryptography applicable in the first place. The implementation, however, relies on some information processing system which is, at the same time, the weak point. But why? For one thing, there is the paradox that confidentiality-maintaining cryptography itself relies on the secrecy of a key. For another thing, such information processing systems absorb, preserve, and emit information through many intentional, but also immanent unintentional channels; and so they do with the key. What is cryptography good for when everybody can get the key? Nothing. It might sound strange, but with the implementation one primarily aims at the protection of the cryptography, not of the assets. Ideally, the implementation is tailor-made such that it prevents the emission of information through the system's unintentional channels, the side-channels.

In practice the threat of side-channels is well known and many applications are subjected to a thorough assessment, e.g., according to the Common Criteria for Information Technology Security Evaluation (CC). Sounds good? Not really. Commercial products, and Information Technology (IT) security applications are no exception, are driven by competition, increasingly shorter time-to-market, and ultimately by profit. Of course, this also applies to the assessment of such products; or does every German citizen really believe that their electronic passport is the safest document on this planet? Side-channel cryptanalysis, the scientific field of research behind IT security assessments, must be continuously taken forward in order to identify potential weaknesses at an early stage of the product. In the course of this, time is one of the key aspects: it is seriously limited and should hence be used efficiently.

In this thesis we place importance on the efficiency of side-channel cryptanalysis. Firstly, we demonstrate how to drastically speed up important side-channel analysis primitives, namely Correlation Power Analysis (CPA) and profiled side-channel analysis, better known as Template Attacks (TAs), by means of the low-cost parallel computing platform CUDA. Surprisingly, only little effort has been spent on high-efficiency tools in the scientific community, although there is an urgent need in practice. With Part II of this thesis we aim to fill this gap and evaluate our proposed implementations, whereby runtimes can be reduced from several hours to a few seconds. Secondly, we investigate new approaches with respect to profiled side-channel analysis. This kind of analysis is certainly the most powerful since it works on real instead of modeled data.
The scientific branch of machine learning can be adopted for large improvements but has been little examined so far. In Part III of this thesis we present two novel approaches which share some similarities but also possess differences that make them complement each other well in some assessment scenarios. In detail, one approach facilitates separating the profiling information whilst disregarding counterproductive information at the same time. The other approach makes optimal use of the profiling information by reducing its uncertainty to a minimum. Both approaches have in common that they provide an inherent dimensionality reduction which is of utmost importance in profiled side-channel analysis. In Part IV of this thesis we propose an application of side-channel analysis to protect Intellectual Property (IP). Surely, this is not yet a subject of security assessments; however, there is a growing need for tools that are able to reveal counterfeits of IT security applications. Finally, in Part V we turn to practical side-channel cryptanalysis and demonstrate a complete key recovery of the DES and AES co-processors implemented within a hardened security Microcontroller (µC) that finds application in a widely used Electronic Cash (EC) card in Germany.

Keywords. AES, Correlation Power Analysis, Cryptography, CUDA, DES, EC Cards, Graphics Cards, IP Protection, Leakage Prototype Learning, Machine Learning, NIR Backside Microscopy, Real- world Key Recovery, Side-channel Analysis, Support Vector Machines, Template Attacks.

Kurzfassung

Über Effiziente Praktische Seitenkanal-Kryptanalyse
Verbesserte Implementierungen, Neue Methoden, Anwendungen und Praxisrelevante Angriffe

Kryptographie — die Kunst, das geschriebene Wort zu verbergen — ist einer der integralen Bestandteile auf der Welt, der unerlässlich für die Wirtschaft und das politische Zusammenwirken ist. Kryptographie wird auf schützenswerte Güter angewandt, die natürlicherweise in verschiedensten Ausprägungen vorliegen. Angefangen vom militärischen Dokument über Finanztransaktionen bis hin zu der elektronischen Identität eines jeden Bürgers beinhaltet deren Schutzbedarf unter anderem Vertraulichkeit, Integrität und Authentizität. Das Aushebeln der Kryptographie ist daher eines der obersten Ziele in allen Bereichen; ein Geschäft, welches auch im Verborgenen ausgeübt wird. Aus mathematischer Sicht ist erprobte Kryptographie stark, aus technischer jedoch schwach. Worin liegt der Unterschied? Es ist die technische Implementierung, die Kryptographie überhaupt erst nutzbar macht. Die Implementierung jedoch ist auf ein Informationsverarbeitungssystem angewiesen, das gleichzeitig einen Schwachpunkt darstellt. Aber warum? Einerseits ist es die Paradoxie vertraulichkeitswahrender Kryptographie, die selbst auf die Vertraulichkeit eines Schlüssels angewiesen ist. Andererseits absorbiert, verwahrt und emittiert solch ein Informationsverarbeitungssystem die Information über viele nutzbare Kanäle, aber auch immanente unbeabsichtigte Kanäle; und so verfährt das System auch mit dem Schlüssel. Was nützt Kryptographie, wenn jeder den Schlüssel erhalten kann? Nichts. Es mag merkwürdig klingen, aber mit der Implementierung steht in erster Linie der Schutz der Kryptographie im Vordergrund, nicht der der Güter. Idealerweise ist die Implementierung maßgeschneidert, sodass keine Informationen über die unbeabsichtigten Kanäle — die Seitenkanäle — preisgegeben werden. In der Praxis ist die Gefährdung durch Seitenkanäle wohlbekannt und viele Anwendungen werden umfangreichen Prüfungen unterzogen, z.B. Common Criteria for Information Technology Security Evaluation (CC). Hört sich gut an? Nicht wirklich. Kommerzielle Produkte — und IT-Sicherheitsprodukte stellen keine Ausnahme dar — sind getrieben von Wettbewerb, immer kürzer werdenden Entwicklungszeiten und letztendlich durch die Rendite. Natürlich gilt dies auch für die Prüfungen solcher Produkte; oder glaubt etwa jeder deutsche Bürger, dass sein elektronischer Ausweis das sicherste Dokument auf diesem Planeten ist? Die Seitenkanal-Kryptanalyse, die als Forschungsgebiet hinter den Prüfungen steht, muss kontinuierlich vorangetrieben werden, um Schwachstellen bereits in einer frühen Phase identifizieren zu können. Dabei ist die Zeit ein Schlüsselaspekt. Sie ist stark limitiert und sollte daher effizient genutzt werden. In dieser Arbeit legen wir besonderes Gewicht auf die Effizienz der Seitenkanal-Kryptanalyse. Zuerst demonstrieren wir, wie die Laufzeit der wichtigsten Analysewerkzeuge, der korrelationsbasierten Analyse (CPA) und der profilierenden Seitenkanalanalyse, besser bekannt als Template-Angriffe (TA), mit Hilfe der kosteneffizienten CUDA-Plattform erheblich verkürzt werden kann. Überraschenderweise hat man sich wissenschaftlich wenig mit hocheffizienten Werkzeugen auseinandergesetzt, obwohl es dringenden Bedarf in der Praxis gibt. Mit Teil II dieser Arbeit möchten wir den Fokus auf die Untersuchungen unserer vorgeschlagenen Implementierungen legen, mit welchen die Laufzeiten von mehreren Stunden auf wenige Sekunden reduziert werden können.
Zweitens untersuchen wir neue Ansätze der profilierenden Seitenkanalanalyse. Diese Art der Analyse ist sicherlich die mächtigste, da sie auf realen statt modellbasierten Daten arbeitet. Der Forschungszweig des maschinellen Lernens kann für deutliche Verbesserungen adaptiert werden, wurde jedoch wenig dahingehend untersucht. In Teil III dieser Arbeit präsentieren wir zwei neue Methoden, die einige Gemeinsamkeiten, jedoch auch einige Unterschiede aufbieten, sodass sich Prüfergebnisse in einem vollständigeren Bild zeigen lassen. Im Einzelnen ermöglicht die eine Methode die Trennung der zu profilierenden Information bei gleichzeitigem Verwerfen von kontraproduktiven Informationen. Die andere Methode nutzt die zu profilierende Information optimal aus, indem sie deren Unbestimmtheit auf ein Minimum reduziert. Beiden Methoden ist gemein, dass sie eine eigene Datenraumreduktion bereitstellen, der eine wichtige Rolle bei der profilierenden Seitenkanalanalyse zukommt. Darüber hinaus schlagen wir in Teil IV eine Seitenkanalanwendung zum Schutz geistigen Eigentums (IP) vor. Sicherlich hat dies keinen Anteil an Prüfungen, jedoch steigt der Bedarf an Werkzeugen, die mögliche Fälschungen von IT-Sicherheitsprodukten entlarven können. In Teil V beschäftigen wir uns tiefergehend mit praktischer Seitenkanal-Kryptanalyse, indem wir praktische Attacken durchführen, die uns die kompletten kryptographischen Schlüssel eines AES- und eines DES-Koprozessors gewinnen lassen. Diese Koprozessoren sind auf einem gehärteten Sicherheitsmikrokontroller implementiert, der Anwendung in einer in Deutschland weit verbreiteten EC-Karte findet.

Schlagworte. AES, Korrelationsbasierte Stromprofilanalyse, Kryptographie, CUDA, DES, EC Karten, Grafikkarten, Schutz Geistigen Eigentums, Leakage Prototype Learning, Maschinelles Lernen, Rückseitige Nahinfrarot Mikroskopie, Praxisrelevante Angriffe auf Kryptographische Schlüssel, Seitenkanalanalyse, Support Vector Machines, Template Angriffe.

Table of Contents

Imprint ...... v
Preface ...... vii
Abstract ...... ix
Kurzfassung ...... xi

I Preliminaries 1

1 Introduction 3
1.1 Motivation ...... 3
1.2 Mathematical Background ...... 7
1.2.1 Random Variables ...... 7
1.2.2 Expected Value, Variance, Covariance, and their Estimation ...... 7
1.2.3 Data Encryption Standard and Advanced Encryption Standard ...... 9
1.3 Structure of this Thesis ...... 9

2 Side-channel Cryptanalysis 11
2.1 Introduction ...... 11
2.2 Side-channel Origins ...... 12
2.2.1 Software Originated Side-channels ...... 12
2.2.2 Hardware Originated Side-channels ...... 12
2.3 Measurement Setups ...... 16
2.4 Side-channel Leakage Modeling ...... 16
2.5 Side-channel Cryptanalysis Techniques ...... 17
2.5.1 Non-profiled Side-channel Cryptanalysis ...... 18
2.5.2 Profiled Side-channel Cryptanalysis ...... 21
2.5.3 Approaches Based on Machine Learning ...... 24
2.6 Conclusion and Outlook for the Thesis ...... 24

3 Parallel Computing 27
3.1 Introduction ...... 27
3.2 Fundamentals of Parallel Computing ...... 28
3.3 Parallel Computing on Graphics Processing Units ...... 30
3.4 Compute Unified Device Architecture (CUDA) ...... 30
3.4.1 Hardware Overview ...... 30
3.4.2 CUDA Programming Model ...... 31
3.4.3 CUDA Programming Interface ...... 34
3.4.4 CUDA Performance Guidelines ...... 36

3.5 Conclusion and Outlook for the Thesis ...... 38

II Improved Implementations 39

4 Enlarging Bottlenecks: A High-performance Implementation of Correlation Power Analysis 41
4.1 Introduction ...... 41
4.1.1 Contribution ...... 42
4.2 Correlation Power Analysis ...... 42
4.2.1 Recording of Traces ...... 43
4.2.2 Analysis ...... 44
4.3 Correlation Power Analysis on Graphics Cards ...... 44
4.3.1 Creation of the Leakage Model ...... 45
4.3.2 Computation of the Correlation Coefficient Sums ...... 45
4.3.3 Computation of the Correlation Coefficient Matrix ...... 47
4.3.4 Estimation of Leakage Model Sums ...... 48
4.4 Higher Order Preprocessing on Graphics Cards ...... 49
4.5 Experimental Results ...... 49
4.6 Conclusion ...... 52

5 Yet Another Robust and Fast Implementation of Template Based Differential Power Analysis 53
5.1 Introduction ...... 53
5.1.1 Contribution ...... 53
5.2 Template Attacks Based on the Multivariate Normal Distribution ...... 54
5.3 Graphics Cards Accelerated Template Attacks Based on the Multivariate Normal Distribution ...... 55
5.4 Conclusion ...... 59

III New Methods 61

6 Separating in Favor of Matching: Profiled Side-channel Attacks Through Support Vector Machines 63
6.1 Introduction ...... 63
6.1.1 Contribution ...... 64
6.2 Binary Support Vector Machines ...... 64
6.2.1 Mathematical Background of Binary Support Vector Machines (SVMs) ...... 64
6.2.2 Non-linear Classification: Introduction of a Kernel ...... 65
6.2.3 Non-separable Case: Introduction of a Soft-margin ...... 66
6.2.4 SVM Training and Classification ...... 66
6.3 Template Attacks using Support Vector Machines ...... 66
6.3.1 Probabilistic Support Vector Machines ...... 66
6.3.2 Our probabilistic SVM Multi-class Approach ...... 67
6.3.3 SVM based Templates ...... 69
6.3.4 A Dedicated Feature Selection ...... 69
6.4 Experimental Results ...... 70
6.5 Conclusion ...... 73

7 Leakage Prototype Learning — Tailor-made Machine Learning Side-channel Cryptanalysis 75
7.1 Introduction ...... 75
7.1.1 Contribution ...... 76
7.2 Discriminating Side-channel Leakage via Prototype Learning ...... 76
7.2.1 Basics of Leakage Prototypes ...... 76
7.2.2 Locating and Selecting Side-channel Leakage Dependent Time-instants via Prototype Learning ...... 79
7.2.3 Considering the Variation's Distribution ...... 79
7.2.4 Side-channel Leakage Profiling Phase ...... 80
7.2.5 Side-channel Leakage Characterization and Key Recovery Phase ...... 84
7.2.6 Algorithmic Description of LPL ...... 85
7.3 Empirical Results ...... 85
7.3.1 Profiling Accuracy ...... 86
7.3.2 Locating and Selection of Leakage Dependent Time-instants — Results and Comparison ...... 91
7.3.3 Attack Performance and Comparison ...... 95
7.4 Conclusion ...... 98

IV Applications 101

8 Salted Side-channel Based Perceptual Hashing to Protect Intellectual Property 103
8.1 Introduction ...... 103
8.1.1 Contribution ...... 104
8.2 Perceptual Hashing ...... 104
8.2.1 Our Perceptual Hashing Approach ...... 105
8.3 The Wavelet Transform ...... 107
8.4 Salted Side-Channel Based Perceptual Hashing ...... 108
8.4.1 Assumed Types of Plagiarism ...... 108
8.4.2 Our Proposal using the Wavelet Transform ...... 110
8.4.3 Purpose and Limitations of Salted Perceptual Hashing ...... 111
8.5 Experimental Results ...... 111
8.6 Conclusion ...... 116

V A Case Study: Real World Side-channel Cryptanalysis 117

9 From Preparing to Attacking: Analysis of a Smartcard Using a Secure Microcontroller with Cryptographic Hardware Support 119
9.1 Introduction ...... 119
9.1.1 Contribution ...... 120
9.2 Near-infrared Backside Microscopy ...... 120
9.3 Instrumenting the Smartcard ...... 122
9.4 Preparing and Comparing Smartcard Integrated Circuit (IC) Dies ...... 122
9.5 Side-channel Cryptanalysis of the Usage of the Symmetric Cipher Co-processors within the Smartcard ...... 124
9.5.1 Measurement Setup ...... 124
9.5.2 Localization, Profiling and Re-alignment of the Co-processors' Activities ...... 124
9.5.3 Full Key Recovery Attacks on the Smartcard Usage of the Co-processors ...... 129
9.6 Conclusion ...... 136

VI Conclusion 139

Conclusion 141

VII Appendix 145

Appendix A 147

Appendix B 149

Appendix C 151

Bibliography 165

List of Abbreviations 167

List of Figures 174

List of Tables 175

List of Algorithms 177

About the Author 179

Part I

Preliminaries

Chapter 1 Introduction

The motivation for our research goes hand in hand with the question whether security-providing products can be improved by improving the analysis tools. In this chapter we thus introduce the influence of security assessments on the actual security offered by devices supporting cryptography. Besides establishing the most important mathematical background, we further provide an overview of implementation attacks, which embody the tools for such assessments, before we outline the structure of this thesis.

Contents of this Chapter

1.1 Motivation ...... 3
1.2 Mathematical Background ...... 7
1.3 Structure of this Thesis ...... 9

1.1 Motivation

No matter whether we are driving our cars, taking a plane for our holiday trips, watching TV, playing around with our smartphones, withdrawing money with our Electronic Cash (EC) cards, paying for food in the canteen, entering the fitness club with our member cards, visiting the doctor, or traveling with our passports, we are using embedded electronic systems in the shape of Microcontrollers (µCs), Field Programmable Gate Arrays (FPGAs), or Application Specific Integrated Circuits (ASICs) without even recognizing it. The list of applications can, of course, be vastly extended, especially in the emerging world of the Internet Of Things (IOT). Everyone is aware that devices used in such applications require a certain level of security. A simple everyday example is the usage of the EC card. The customer holds a four-digit number, the Personal Identification Number (PIN), to get access to his banking account, and he is responsible for keeping the PIN secret. If the wallet gets lost containing the card together with the PIN (noted down on a piece of paper, for instance), then the customer gets blamed because of his gross negligence. Even if only the card gets lost, a criminal can simply try out three numbers before the card is blocked; a chance of only 0.03%, but the customer still carries the burden of proof. The criminal's chances are even higher when the verification of card ownership is done by means of a signature, because in this case the shop assistant must be attentive, too. The highest risk is therefore shifted to the customer, since firstly, on the banking side of the infrastructure, access by criminals is much harder to achieve, and secondly, its security is extensively evaluated during security assessments.

In order to one day be able to eliminate the risk on the customer's side, with new proof-of-ownership methods, the security of the underlying devices must be investigated very deeply to exclude attacks committed by criminals in practice. Similar examples can be found in all of the above mentioned application scenarios. An important security assessment scheme is the Common Criteria for Information Technology Security Evaluation (CC) [Cri12]. Another important certification scheme, which however most widely relies on CC in practice, has been established by the Europay, MasterCard, and Visa Consortium (EMVCo) which manages the worldwide employed standard for electronic payment. In the CC scheme developers of security products can apply for an evaluation that is conducted by a licensed laboratory and supervised by a certification authority. Worldwide there exist so-called Certificate Authorizing Members, countries that issue and accept CC certificates, and Certificate Consuming Members, countries that only accept certificates. Basically, an evaluation involves two processes, a formal examination of accompanying documents which fully describe the product's security related functionalities, and a vulnerability assessment in which certain attack techniques are conducted on the product. To this end, current research topics and results from the discipline of cryptology are involved (Fig. 1.1).

Figure 1.1: The scientific discipline cryptology subdivided into the common research fields. In this thesis we are concerned with the parts highlighted in green.

Cryptology is divided into cryptography, which is occupied with designing cryptographic systems, and cryptanalysis, which analyzes them. In vulnerability assessments, and also in this thesis, we are concerned with implementation attacks within cryptanalysis. This research field concentrates on the security of a concrete implementation of a certain cryptographic system on a certain device. Mathematical attacks, by contrast, address the mathematical security of the cryptography itself. Besides these, we have the fields of social engineering and exhaustive key search. Social engineering takes the human behavior as a further link in the security chain into account. For instance, in its simplest form, a criminal could spy out or force a victim to reveal his banking PIN.


The exhaustive key search is a field that employs massive computational power to try out each key, completely independent of the actual cryptographic security. In order to address problems with the everyday use of the CC, the Joint Interpretation Library (JIL) has been established to facilitate the mutual recognition of certificates [Gro98]. The JIL breaks down implementation attacks into several attack potentials with respect to smart cards and similar devices, which cover the mentioned µCs, FPGAs, and ASICs [Gro13]. These attack potentials are oriented towards the common research topics which are depicted in Figure 1.2. Following this subdivision we can distinguish two directions of implementation attacks.


Figure 1.2: Implementation attacks split up into the specialized research topics. Many of them can have large overlaps with others, e.g., side-channel cryptanalysis and fault injection. In this thesis we are concerned with parts highlighted in green.

Logically based attacks make use of the established communication channels. Accordingly, one can attack the communication protocol (e.g., man-in-the-middle attack or replay attack), exploit flaws in the implementation (e.g., buffer overflow attack), discover hidden backdoors that enable access to internal variables or circumvent authentication barriers, or write malicious applications to manipulate legitimate applications on a shared resource system. Physically based attacks also use the established communication channels, but mainly employ physical signals and physical effects. Side-channel cryptanalysis is the topic which we address in this thesis. The name originates from unintended channels that carry information about internal security relevant variables. Three such channels are of major importance during security evaluations, namely the timing side-channel, the power side-channel, and the electro-magnetic side-channel.

We refer to Chapter 2 where we thoroughly introduce the topic. Fault injection is concerned with the insertion of charge carriers into the device to provoke errors in the computation of cryptographic systems. For many systems one single fault can be sufficient to recover a cryptographic secret. Probing attacks aim at obtaining internal variables by directly tapping signal lines within the device. Preparative attacks summarize those techniques that enable or support other attacks. A Focused Ion Beam (FIB) is a typical technical means that allows for arbitrary alterations within the device, e.g., cutting signal lines or implanting pads for probing, but its application alone does not lead to a successful attack. One could also include reverse engineering in the preparative attacks because it serves to gain knowledge about the device without actually conducting an attack. Nevertheless, the device is destroyed in most cases since it is ground down layer by layer. Finally, we want to mention the test mode recovery. It is only meaningful during the production phase when each device is carefully tested to guarantee that it is fully functional. Afterwards the test mode is deactivated. However, if an attacker manages to activate it again he might gain comprehensive access to the device. The statement of a security evaluation according to CC is not "this device is secure", but "this device shows no vulnerability indication with respect to the evaluation depth". The evaluation depth is selected by the product developer and subdivided into Evaluation Assurance Levels (EALs) 1 to 7. The more trust the developer puts into his product, the higher the EAL. More interesting for us, however, is the fact that from EAL 5 onwards the vulnerability assessment, denoted AVA in CC, reaches its highest level 5 which refers to an attack potential beyond high. According to [Gro13] the attack potential is converted into a score that a certain vulnerability must at least be assessed with in order to fulfill the EAL requirements. The score spreads over two phases, the identification and the exploitation. Within both, six factors contribute to the final score as depicted in Table 1.1. Herein, the Target Of Evaluation (TOE) defines the logical scope of the device during the evaluation.

Table 1.1: Exemplary vulnerability assessment of an attack breaking a device's security

  Factor                  Identification       Exploitation
  Elapsed time            > 1 month: 5         > 1 month: 8
  Expertise               Expert: 5            Expert: 4
  Knowledge of the TOE    Restricted: 2        Public: 0
  Access to TOE           < 10 samples: 0      < 10 samples: 0
  Equipment               Specialized: 3       Specialized: 4
  Open samples            Public: 0            N/A
  Intermediate scores     15                   16
  Final score             31

The elapsed time simply indicates the time needed during both phases. The factor expertise presumes, in our example, an attacker who is an expert in the required research topic. Knowledge of the TOE includes documents that become necessary during the identification, e.g., a datasheet which might not be publicly available. In the exploitation, however, it is assumed that all the information has been published. The circumstance that only one sample of the device is mandatory for the attack is assessed with zero. Specialized equipment usually means costly equipment that is available in a laboratory, e.g., an oscilloscope.

A Personal Computer (PC), for instance, is cheap and thus standard equipment. Finally, one speaks of open samples if the same device that is attacked is publicly for sale and can be used to become familiar with it in terms of an attack preparation. In our above given example a final score of 31 can be considered as the magic number stating that the TOE is resistant against attackers with high attack potential. A closer look at Table 1.1, however, reveals the actual truth, namely that the security of the device can be broken anyway if the attacker simply invests more time, let us say one and a half months. Now, although the device may have passed the deepest security evaluation, it is not secure at all since there might be an attack that is not fully reflected. One can quickly deduce from our example that the only opportunity to further enhance the security of products in practice is to reduce the elapsed time, leading to a score below 31. Driven by the aim of obtaining an evaluation certificate, which is a strong statement from a marketing point of view, the developer is willing to invest more time and money to achieve better security if necessary. We therefore want to contribute to a reduction of the elapsed time with this thesis on efficient side-channel cryptanalysis.
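How such a rating adds up can be made concrete with a small sketch. The following Python snippet merely re-computes the example of Table 1.1; the factor names and point values are the hypothetical ratings from the table, not part of any official evaluation tool.

```python
# Minimal sketch: summing the JIL-style factor ratings from Table 1.1.
# All point values are the hypothetical example ratings used above.
identification = {
    "elapsed_time": 5,        # > 1 month
    "expertise": 5,           # expert
    "knowledge_of_toe": 2,    # restricted
    "access_to_toe": 0,       # < 10 samples
    "equipment": 3,           # specialized
    "open_samples": 0,        # public
}
exploitation = {
    "elapsed_time": 8,        # > 1 month
    "expertise": 4,           # expert
    "knowledge_of_toe": 0,    # public
    "access_to_toe": 0,       # < 10 samples
    "equipment": 4,           # specialized
}

identification_score = sum(identification.values())   # 15
exploitation_score = sum(exploitation.values())       # 16
final_score = identification_score + exploitation_score

# A final score of at least 31 is read as "resistant against attackers
# with high attack potential".
print(final_score, final_score >= 31)                 # 31 True
```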

1.2 Mathematical Background

Throughout the scientific field of side-channel cryptanalysis we are often confronted with recurrent mathematical notions from statistics, and so we are in this thesis. At this point we therefore briefly introduce the most important concepts and notations. For a deeper introduction to statistics we refer to [LM12]. Beyond that, we sketch the Data Encryption Standard (DES) and the Advanced Encryption Standard (AES). Especially the latter will be frequently discussed.

1.2.1 Random Variables

By X : Ω → E we denote a measurable variable that originates from a sample space Ω mapping into the measurable space E. The sample space Ω, together with a set of events F and an assigned probability function Prob(•) : F → [0, 1], defines the probability space (Ω, F, Prob(•)). According to that, X assigns a number in E to each outcome, whereas the concrete assignment is called the realization x after having performed a single measurement of X. Suppose a fair dice roll where we have Ω = {1, 2, 3, 4, 5, 6}, F = Ω because we roll one dice only, and Prob(X = x) = 1/|F| = 1/6 which is the probability that we achieve a certain realization (dice value). If we include a second dice we achieve F = Ω × Ω (Cartesian product) and Prob(X = x) = 1/|F| = 1/36; Ω does not change because the elementary samples do not change.

1.2.2 Expected Value, Variance, Covariance, and their Estimation

The expected value of random variable X is the long-term average of its realizations presumed before a measurement. It is given through

E(X) = \sum_{\forall i} \mathrm{Prob}(\omega_i) \cdot \omega_i \qquad (1.1)


for all ωi ∈ Ω. The variance is the expected value of the squared deviation from the expected value of X and can be stated with

\mathrm{Var}(X) = E\left((X - E(X))^2\right) = E(X^2) - E(X)^2
\Leftrightarrow \quad \mathrm{Var}(X) = \sum_{\forall i} \mathrm{Prob}(\omega_i) \cdot \omega_i^2 - \left(\sum_{\forall i} \mathrm{Prob}(\omega_i) \cdot \omega_i\right)^2 \qquad (1.2)

for all ω_i ∈ Ω. The standard deviation arises from the variance as σ(X) = √(Var(X)). The covariance between two random variables X and Y is a measure for their linear dependency, i.e., how much they change together. It can be written as

Cov(X,Y ) = E((X − E(X)) · (Y − E(Y ))) = E(X · Y ) − E(X) · E(Y ). (1.3)

It should be noted that the covariance, when normalized by the two standard deviations, yields the correlation coefficient ρ(X, Y) = Cov(X, Y)/(σ(X) · σ(Y)), whose outcome lies within the range [−1, 1], where ±1 indicates a strong linear dependency, meaning that one variable increases or decreases while the second increases or decreases in the same manner. If Cov(X, Y) = 0 we can deduce that both variables are not linearly dependent, but they might still be dependent to some extent. A further useful notation is the covariance matrix Σ = [Cov(X_i, X_j)] for all i, j ∈ {1, ..., n}, containing all covariances between the random variables X_1, ..., X_n. Since Cov(X_i, X_j) = Cov(X_j, X_i) the matrix is symmetric and, moreover, it is positive-semidefinite. In side-channel cryptanalysis we are very rarely a priori aware of Prob(ω_i), and further we just want to figure out whether variables X and Y are dependent. Hence, we need to estimate the above mentioned measures. Suppose we obtained M realizations of variable X, i.e., we obtain x_1, ..., x_M; then we can formulate the sample mean as

\mu_x = \frac{1}{M} \sum_{i=1}^{M} x_i \qquad (1.4)

which approaches the expected value of X, i.e., µ_x ≈ E(X), as M grows. Based on the mean, the biased sample variance reads

\sigma_x^2 = \frac{1}{M} \sum_{i=1}^{M} (x_i - \mu_x)^2 = \left(\frac{1}{M} \sum_{i=1}^{M} x_i^2\right) - \mu_x^2 \qquad (1.5)

from which follows the biased sample standard deviation as σ_x = √(σ_x²). The bias can be corrected with Bessel's correction factor M/(M − 1) to achieve the unbiased variance and standard deviation. In the same way we get the biased sample covariance as

\sigma_{x,y} = \frac{1}{M} \sum_{i=1}^{M} (x_i - \mu_x)(y_i - \mu_y) = \left(\frac{1}{M} \sum_{i=1}^{M} x_i y_i\right) - \mu_x \mu_y \qquad (1.6)

which can also be corrected. Nevertheless, in practice, having very large M, we can neglect the correction.
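As a minimal illustration of the estimators in Equations (1.4) to (1.6), the following Python sketch computes them on synthetic data; the chosen distribution parameters are arbitrary example values and stand for nothing in particular.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10_000
x = rng.normal(loc=2.0, scale=1.5, size=M)    # M realizations of X
y = 0.8 * x + rng.normal(scale=0.5, size=M)   # Y, linearly dependent on X

mu_x = x.mean()                                # sample mean, Eq. (1.4)
var_biased = ((x - mu_x) ** 2).mean()          # biased sample variance, Eq. (1.5)
var_unbiased = var_biased * M / (M - 1)        # Bessel's correction
cov_biased = ((x - x.mean()) * (y - y.mean())).mean()   # Eq. (1.6)

# Normalizing the covariance by both standard deviations yields the
# correlation coefficient, which is bounded to [-1, 1].
rho = cov_biased / (x.std() * y.std())
print(mu_x, var_biased, var_unbiased, cov_biased, rho)
```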


1.2.3 Data Encryption Standard and Advanced Encryption Standard

The DES was invented by IBM and published in 1975. The plaintext with a block length of 64 bit is processed during 16 Feistel network based rounds together with a 56 bit key (cf. Fig. 1.3(a)). Because of the relatively small key size — a full exhaustive key search can be done in less than a day with RIVYERA (see http://www.sciengines.com/company/news-a-events/74-des-in-1-day.html) — only Triple DES (3DES) is used in practice today [Nat12]. The AES [Nat01] was intended as the successor of DES and was announced in 2000 as the winner of a competition initiated by the National Institute of Standards and Technology (NIST). Designed as a substitution-permutation network, AES performs 10, 12, or 14 rounds, depending on the key size of 128 bit, 192 bit, or 256 bit, to process the 128 bit plaintext (cf. Fig. 1.3(b)). The algorithm is, even with the smallest key size, considered to be computationally secure — a key space of only approximately 53 bit can be searched in a day with RIVYERA (see http://www.sciengines.com/solutions/crypto.html).


Figure 1.3: Block diagram of (a) DES and (b) AES. Solid lines represent paths that are repeat- edly passed during the round function, whereas dashed lines are passed once before or after the round function.
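For illustration only, the following sketch encrypts one block with DES and one with AES-128; it assumes the PyCryptodome package is available, the keys and plaintexts are arbitrary example values, and ECB mode is used solely to expose the raw block operation of Figure 1.3.

```python
# Single-block DES and AES-128 encryption; keys/plaintexts are example values.
from Crypto.Cipher import AES, DES

des_key = b"\x01\x23\x45\x67\x89\xab\xcd\xef"   # 64 bit, 56 bit effective
des_pt = b"8bytemsg"                            # DES block length: 64 bit
des_ct = DES.new(des_key, DES.MODE_ECB).encrypt(des_pt)

aes_key = bytes(range(16))                      # 128 bit key -> 10 rounds (192: 12, 256: 14)
aes_pt = bytes(16)                              # AES block length: 128 bit
aes_ct = AES.new(aes_key, AES.MODE_ECB).encrypt(aes_pt)

print(des_ct.hex(), aes_ct.hex())
```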

1.3 Structure of this Thesis

This thesis is structured as follows:

• Preliminaries. Here we illustrate the origin of the side-channels and the common analysis techniques to make them exploitable. Further we introduce parallel computing, especially the framework we are concerned with in this thesis to vastly speed up basic analysis primitives.



• Improved Implementations. Taking advantage of parallel computing we are enabled to accelerate selected side-channel analyses by a maximum factor of 100. This directly impacts the available time during security evaluations.

• New Methods. Promoting side-channel analyses with respect to their efficiency is not only desired but mandatory, for instance, to verify security evaluation results. We propose two new directions in profiled side-channel cryptanalysis whose methods are much more efficient in terms of being successful more quickly.

• Applications. Side-channels can also satisfy the demand for the detection of plagiarism. A few approaches have been proposed so far; however, most of them can be rendered harmless with little effort. We propose another approach that is more robust under any circumstances.

• A Case Study: Real World Side-channel Cryptanalysis. We demonstrate the effectiveness of side-channel cryptanalysis in practice against a security microcontroller that finds wide application in German EC cards today. To this end, low-cost tools support the identification of open samples which can be utilized in the course of preparing attacks.

• Conclusion. Here we briefly summarize and discuss our contributions presented throughout this thesis.

Chapter 2 Side-channel Cryptanalysis

Today, the issue of side-channels of electronic devices drives various research groups around the world, and companies put a lot of effort into hardening their products against this threat. Two important discoveries have been made, in 1996 and 1999, which were followed by numerous publications on how to exploit and how to prevent various kinds of side-channels. In this chapter we are concerned with the origins of side-channels, especially those caused by the underlying technology, and introduce the most important analysis techniques that allow for exploitation, here meant in the sense of recovering cryptographic secrets.

Contents of this Chapter

2.1 Introduction ...... 11
2.2 Side-channel Origins ...... 12
2.3 Measurement Setups ...... 16
2.4 Side-channel Leakage Modeling ...... 16
2.5 Side-channel Cryptanalysis Techniques ...... 17
2.6 Conclusion and Outlook for the Thesis ...... 24

2.1 Introduction

Practical cryptography has two aspects, the mathematically constructed cryptographic system and the implementation that enables its use in the first place. The implementation, in turn, can exist in two flavors, namely software or hardware. By software we mean a piece of program code that is executable on a general purpose Central Processing Unit (CPU), and by hardware we mean a dedicated IC only serving the special purpose of executing the cryptographic system. Side-channels can arise from both flavors. A small portion directly originates from the piece of program code, but only if the implied side-channels are hardware independent, i.e., when they can be proven by program code analysis already. By contrast, the largest portion originates from the underlying hardware, caused by technological effects. Hardware based side-channels, however, fall into two main categories, forced and unforced. Generally, side-channels are always threatening in the case when they reveal information about cryptographic secrets, e.g., an encryption key. Therefore we detail the origin of side-channels in Section 2.2, and the most common analysis techniques to exploit them in Section 2.5. Due to the nature of forced and unforced side-channels, analyses are differentiated between active and passive techniques.


2.2 Side-channel Origins

The origins with respect to both flavors are introduced hereafter. The level of detail is appropriate to get started with the remaining chapters of this work.

2.2.1 Software Originated Side-channels

In 1996 Paul Kocher demonstrated attacks on various asymmetric cryptographic methods, like Diffie-Hellman, Rivest, Shamir, and Adleman (RSA), DSS, and others [Koc96]. The weakness he exploited was a time variance within the implemented algorithms. For instance, RSA mainly requires modular exponentiations with large integers. Instead of performing them in the naïve way, the exponent is fragmented into its binary representation, such that each exponentiation can be expressed by squarings and multiplications. Algorithm 2.1 highlights this procedure.

Algorithm 2.1 Basic Binary or Square-and-Multiply Algorithm
Input: (1) Base x, (2) exponent d, (3) modulus n
Output: y = x^d mod n

1: Create array of the binary representation of d: d_array[⌈log2(d)⌉]
2: y = 1
3: for i = ⌈log2(d)⌉ − 1 downto 0 do
4:   y = y^2 mod n
5:   if d_array[i] = 1 then
6:     y = y · x mod n
7:   end if
8: end for

As we can see, there exists a dependency on the exponent d, which serves as the cryptographic secret key. If we are able to measure the time the algorithm needs for each step (easy with an oscilloscope), we would recognize that a squaring alone consumes less time than a squaring followed by a multiplication. By this observation we can simply read off the exponent. This is the origin of the so-called timing side-channel, which is of algorithmic nature. In [Koc96], however, the runtime of the complete algorithm was exploited and not the runtime of single steps. Nevertheless, technological components also contribute to such timing attacks, namely the caches in a CPU. Actually, caches are meant to speed up frequent accesses to the same data by storing it temporarily close to the CPU instead of fetching it again and again across the long distance to the Random Access Memory (RAM). In practice, the timing regarding specific data can be profiled and employed to ultimately recover secret data, first described in [Ber05].
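The leak can be made tangible with a small simulation. The sketch below runs the square-and-multiply algorithm with a hypothetical cost model (one time unit per squaring, one additional unit per multiplication); it is not the attack of [Koc96] on the total runtime, but it shows why per-step timing directly reveals the exponent bits.

```python
# Simulated square-and-multiply with a per-step "timing" log; the cost model
# (1 unit per squaring, +1 per multiplication) is a deliberate simplification.
def square_and_multiply(x, d, n):
    step_times = []
    y = 1
    for bit in bin(d)[2:]:           # MSB-first binary representation of d
        t = 1                        # squaring
        y = (y * y) % n
        if bit == "1":
            t += 1                   # multiplication only for set exponent bits
            y = (y * x) % n
        step_times.append(t)
    return y, step_times

y, times = square_and_multiply(x=0x1234, d=0b101101, n=0xC0FFEE)
# An observer of the per-step runtimes can read the exponent bits directly:
recovered_bits = "".join("1" if t > 1 else "0" for t in times)
print(times, recovered_bits)         # recovered_bits == "101101"
```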

2.2.2 Hardware Originated Side-channels

Regarding the hardware flavor we need to go into the details of the most common technology for constructing integrated circuits, namely Complementary Metal Oxide Semiconductor (CMOS). The basic primitives are two complementary Metal-Oxide Semiconductor Field-Effect Transistor (MOSFET) types, in particular n-type and p-type. For the purposes of illustration consider Figure 2.1, which depicts a Negative-And (NAND) gate in CMOS technology, complemented by certain parasitic elements [AR07]. These parasitic elements, however, significantly determine the current flow within the gate, and therefore the power consumption of the whole circuit.

Figure 2.1: CMOS NAND gate, fulfilling Y = A & B, complemented by certain parasitic elements (D1, D2, C1, C2) which significantly determine the current flow. The gate itself is composed of the n-type transistors N1, N2, and the p-type transistors P1, P2.

The diodes D1 and D2 are formed due to reverse bias between diffusion regions and wells, respectively between wells and substrate. The capacitances are formed through wire, drain, and even source capacitances (C1), as well as gate capacitances of subsequent gates (C2). The temporal profile of the current flow into the device can be stated through

i(t) = I_{\mathrm{static}} + i_{\mathrm{dynamic}}(t), \qquad (2.1)

whereas the two fractions further split into

I_{\mathrm{static}} = I_{\mathrm{subthreshold}} + I_{\mathrm{diode}} + I_{\mathrm{tunneling}}, \qquad (2.2)

and

i_{\mathrm{dynamic}}(t) = i_{\mathrm{logic}}(t) + i_{\mathrm{glitch}}(t) + i_{\text{short-circuit}}(t). \qquad (2.3)

Following [AR07], the subthreshold current occurs during the transistor off-time due to (subthreshold) conduction from the transistor's gate to its source. Additionally, the parasitic diodes cause a small reverse leakage current, whereas the tunneling current grows while transistor gates become thinner. All these static current portions depend on the manufacturing process. Obviously, the most interesting fraction is the dynamic current since it varies with time. When a transition occurs, e.g., due to a change in the inputs A and B of the NAND gate, there will be a direct current path from VDD to ground whenever both transistor types switch from the on-state to the off-state or vice versa; this current is denoted as the short-circuit current because both transistor types are concurrently switched on for a short period. Last but not least, i_logic(t) + i_glitch(t) and i_short-circuit(t) are the very data-dependent portions of the dynamic fraction. Both arise from the charging and discharging of the capacitances during transitions. However, i_glitch(t) is actually undesired because of unbalanced signal propagation. Suppose that the NAND gate is subjected to a transition while input A changes from 0 → 1 and B from 1 → 0. Then i_logic(t) is not affected because the output node Y does not change, but if A, for instance, changes a bit earlier than B, the output is glitched to 0 for a short period.


Unforced Side-channels

The temporal profile i(t) is referred to as the power side-channel in the literature and in the remainder of this work. According to [Ver10], this side-channel's signal can be acquired by means of a Digital Storage Oscilloscope (DSO), however indirectly, via a proportional voltage across a shunt resistor. In Figure 2.2 we highlight the two opportunities, namely acquiring the signal in the VDD line or in the ground line. In both cases we can measure the short-circuit current.

Figure 2.2: Acquisition of the dynamic current (power side-channel) can be achieved across a resistor in the VDD line or the ground line. The discharging portion might be suppressed while acquiring in the ground line.

In the VDD line we can, however, only measure the charging portion but not the discharging portion, since we only observe what goes into the device (cf. paths 1 and 2 in Fig. 2.2). In the ground line it is exactly the other way round, since we only observe what comes out of the device. Nevertheless, in the latter case, depending on the resistor's value, it might prevent a full discharge, and thus we might only be able to measure the short-circuit current. Hence, i1 ≠ i2 depending on which resistor is present. The first successful exploitation of the power side-channel has been reported in 1999 in [KJJ99]. Another way to acquire the dynamic current is to measure the electro-magnetic field emanated by a current-carrying piece of conductor. In the following we concentrate on the magnetic component. Each IC comprises a vast number of short conductors to interconnect all the different parts, e.g., the logic. Each of these conductors generates a magnetic field according to the Biot-Savart law

B_r(t) = \frac{\mu_0}{4\pi} \, i(t) \int_l \frac{d\vec{l} \times \hat{r}}{r^2}. \qquad (2.4)

B_r(t) is the magnetic field component along the conductor length l at a distance r, µ_0 the magnetic field constant, and r̂ the unit vector in the direction of r. This emanated field is referred to as the electro-magnetic side-channel. We can acquire its signal with the help of an induction coil. The inductive coupling is achieved through the magnetic flux Φ(t) = \int_A B_r(t)\,dA through the cross-sectional area A of a coil within the field component B_r(t). See Figure 2.3 for an illustration of this relationship. By this means we can indirectly acquire the localized dynamic currents, in the shape of superposed B_r(t) components, under the covered partial IC surface with

u_{\mathrm{ind}}(t) = -n \cdot \frac{\partial \Phi(t)}{\partial t}. \qquad (2.5)

From the Equations (2.4) and (2.5) we learn that the distance is a crucial factor, the closer the coil the better the signal’s amplitude. Further we learn that we merely measure the dynamic current because voltage is only induced into the coil when the magnetic flux varies with time,

and hence the current needs to vary with time.

Figure 2.3: Acquisition of the localized dynamic current (EM side-channel) can be achieved by inductive coupling. The size of the coil decides the degree of localization.

The first successful exploitation of the EM side-channel has been reported in 2002 in [AAJR03]. In practice both side-channels are acquired simultaneously, but the EM side-channel offers some advantages. First of all, there is no need to modify the device power supply, i.e., to insert a resistor. Secondly, stabilizing capacitors, which can act as a severe low-pass filter, do not have to be removed. Finally, the localization property can be very beneficial since we are enabled to measure the dynamic current of a very specific part of the IC exactly. Additional noteworthy side-channels are the acoustic side-channel and the photon emission side-channel. In [GST14] it has been demonstrated that an RSA key with a bit length of 4,096 could be extracted using the sound generated by a computer during the decryption. This side-channel originates from electrical components, like capacitors, that vibrate at high frequencies under load. In [SNK+12] it has been investigated whether the emission of photons within a circuit can be employed to break cryptographic implementations. It turned out that for general purpose microcontrollers it was possible to observe AES S-box table look-ups in SRAM cells. Taking several photon emission images ultimately resulted in a successful exploitation. Originally discovered in [FH08], this side-channel originates from charge carriers that gain kinetic energy while subjected to the electric field within the source-drain channel. This energy is released in the shape of photons that can be made visible with corresponding detectors.

Forced Side-channels

As the name already suggests, these side-channels cannot be acquired by means of passive measurements. A forced side-channel only reveals its information if the circuit is appropriately stimulated by various kinds of energy. The main goal is to provoke errors while the data is passed through the logic. Thus, a forced side-channel only contains corrupted information, e.g., an erroneously computed ciphertext. The discussion on this topic started in 1997 with [BDL97]. The authors reported on very simple attacks on the RSA signature scheme. They showed that a single faulty signature is sufficient for a successful exploitation. The research was taken forward in [BS97] in which the authors presented a framework, called Differential Fault Analysis (DFA), for analyzing faulty outputs of a wide range of secret key cryptographic systems. Analysis methods that presume the possibility of forcing certain side-channels are denoted as active analysis techniques. It becomes clear that the appropriate induction of faults — or Fault Injection (FI) — into the logic circuitry is of utmost importance. There are plenty of possibilities to transport energy portions into focused spots with more or less precision. The most common method is certainly the

exposure to a laser beam, first reported in [SA03]. Numerous other technical means can be found in [AK96, BECN+06]. Throughout the remainder of this work we will not further deal with FI or active analysis techniques.

2.3 Measurement Setups

To make use of side-channels we need to record a temporal excerpt of their signals and store them in a time- and value-discrete representation. This can be achieved with the technical means listed below.

• Device Under Test (DUT), which is the target we want to investigate. It must be appropriately prepared as mentioned above, e.g., a resistor is inserted in the power supply or the surface is exposed to facilitate the acquisition of the electromagnetic emanation. These preparations might require considerable efforts like (de-)soldering or chemical etching.

• Digital Storage Oscilloscope (DSO), which allows the observation and storage of varying voltages. Therefore each side-channel signal must be quantifiable directly or indirectly via a voltage, at least for the power and Electro-Magnetic (EM) side-channel. After the recording the signals are available in a digitized form, i.e., value and time discrete.

• Voltage probes in the shape of small circuitries that basically connect the voltage taps at the prepared DUT with the DSO. There are various kinds of such probes. For side-channel measurements, passive probes (for measurements in the ground line) or active differential probes (for measurements in the VDD line) are best suited. Current probes can be employed if adding a resistor is not an option.

• EM probes to measure the electro-magnetic field. One has to decide between toroidal probes with a larger or smaller diameter. Both can possess advantages and disadvantages in specific scenarios. Beside the toroidal probes there exist several other shapes.

• A conventional PC to communicate with the DUT, i.e., to utilize the cryptographic implementation.

The quality of the setup can have a large impact on the signal quality; thus, improving the setup where possible is surely a prime necessity in side-channel cryptanalysis. In the end, the measurement procedure leads to a set of recorded signals, denoted by measurement in this work. A single recorded signal is commonly known as a trace, consisting of sample points, or samples for short.

2.4 Side-channel Leakage Modeling

In the beginning of Section 2.2.2 we saw that the processing within the logic circuitry is data dependent, but for now we do not know to what extent. Suppose an 8 bit deep logic circuitry, i.e., eight bits are processed in parallel, presuming that each logical component exists eight times. Further suppose that different logical components are interconnected with longer parallel wires — a bus — with correspondingly larger capacitances. For each bit that is subjected to a transition, either from 0 → 1 or 1 → 0, the respective bus line must be heavily charged or discharged due to its capacitance (cf. Sec. 2.2.2). Each remaining bus line whose bit stays at 0 or 1 is not charged or discharged at all. This means that the dynamic current corresponds to the number of differing bits of two subsequent data bytes b0 and b1, which is called the Hamming Distance (HD), defined as HD(b0, b1) = HW(b0 ⊕ b1), i.e., the Hamming Weight (HW) of their bit-wise XOR, merely counting the ones. The Hamming distance is our very basic side-channel leakage model that relates the data to the acquired signal. A common approach in microcontrollers is the application of a bus pre-charge, i.e., all bits are forced to 0 or 1 before new data is passed through the logic. As a consequence we simply assume that b0 = 0, hence HD(0, b1) = HW(b1). Nevertheless, both models might not be very accurate since they neither account for higher-order contributions of single bits, nor for differences between both transition types. A more relaxed model could be the identity model that directly relates the byte value to the signal. The most accurate model would surely be the transition count with regards to each byte value [MPO05]. Yet, the transition count would require very deep design knowledge which is normally not available. In summary, the choice of the optimal model is a decision that has to be made from case to case. A more convenient approach is provided by profiled methods. To this end, each data byte would be profiled by the exact signal value, thus replacing the model value.
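For reference, the two basic leakage models can be written down in a few lines; the byte values in the sketch below are arbitrary examples.

```python
# Minimal sketch of the Hamming weight / Hamming distance leakage models
# for byte values, as used throughout the later chapters.
def hamming_weight(b: int) -> int:
    return bin(b & 0xFF).count("1")

def hamming_distance(b0: int, b1: int) -> int:
    # HD(b0, b1) = HW(b0 XOR b1): the number of bus lines that toggle
    return hamming_weight(b0 ^ b1)

# With a bus pre-charge to zero, the distance degenerates to the weight:
assert hamming_distance(0x00, 0xA7) == hamming_weight(0xA7) == 5
print(hamming_distance(0x3C, 0xA7))   # 5 toggling bits between the two bytes
```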

2.5 Side-channel Cryptanalysis Techniques

In this section we deal with the most common passive side-channel analysis techniques. Basically, they fall into two distinct groups with respect to their assumed prior knowledge, namely profiled and non-profiled side-channel cryptanalysis. In contrast to the non-profiled analysis, we will see that profiled analysis actually requires two devices. In this regard, the first device is employed to profile the desired side-channel with respect to known data in order to be able to characterize the same side-channel of a second device with unknown data. For both we differentiate between Simple Power Analysis (SPA) and Differential Power Analysis (DPA) attacks. Let us assume in the following that the logic circuitry of a DUT processes some consecutive functions f_i while stimulated with data v_h ∈ V. The data is called inputs according to different runs. During each run, while the functions are processed, we record a trace and get an output back. The processing in the DUT is stated by the matrix

F = \begin{pmatrix}
f_1^{v_1} & f_2^{v_1} & \cdots & f_N^{v_1} \\
f_1^{v_2} & f_2^{v_2} & \cdots & f_N^{v_2} \\
\vdots & \vdots & & \vdots \\
f_1^{v_M} & f_2^{v_M} & \cdots & f_N^{v_M}
\end{pmatrix}; \qquad
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,S} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,S} \\
\vdots & \vdots & & \vdots \\
x_{M,1} & x_{M,2} & \cdots & x_{M,S}
\end{pmatrix}. \qquad (2.6)

Each column vector F_j of F contains the same function, yet evaluated with different inputs. Let the matrix X ∈ Z^{M×S} contain the corresponding recorded traces X_i (rows) with respect to the M runs. Each trace consists of S ≥ N samples, such that each function f_i is represented by at least one sample.


2.5.1 Non-profiled Side-channel Cryptanalysis

We start by covering analysis techniques that are performed in a non-profiled scenario. Such analyses have counted among the attacks from the very beginning.

Simple Power Analysis in a Non-profiled Scenario

With SPA, first reported by Kocher et al. in [KJJ99], one aims at retrieving information about the processing by means of one single trace or some averaged traces. That means we observe merely one run of F, i.e., a single row. Generally, SPA could be seen as a preparative step to interpret certain structures in the trace or to map them to processed functions. Correspondingly chosen intervals are then employed during further, more sophisticated analyses. Therefore this step is also known as visual inspection. Besides, the timing side-channel could be directly exploited by means of a single trace to distinguish between operations that allow for direct secret key extraction, as mentioned in Section 2.2.1.

Differential Power Analysis in a Non-profiled Scenario

DPA was also proposed by Kocher et al. in [KJJ99] on the basis of a specific statistical method. Today, DPA can be regarded as a class of attacks since several methods make use of distinct statistical tools but still rely on the same concept. This concept is dedicated to extracting secret keys and is quickly explained. With regard to the matrix $F$ we suppose that one function processes an input together with a secret key, thus we have $f_i(v, k)$ where $v$ is the input value, known to us, and $k$ the secret key, unknown to us. From now on the input $v$ is written as an argument of $f_i$ and not as a superscript. If we knew $k$ we could simply compute $f_i(v, k)$. If we further do so in the course of $M$ runs with different inputs, then our column vector $F_j$, containing the $f_i(v, k)$, would show some kind of dependency on the corresponding column vector $X_l$ within the measurement matrix $X$ that represents the side-channel signal during the computations at $F_j$. Clearly, any other pair $(F_o, X_p)$ for $(o, p) \neq (j, l)$ would not show a dependency. This simple fact opens the door to side-channel attacks. We now come back to the assumption that we are not aware of the key. In this case we can enumerate all possible values for $k$, called key hypotheses, and build just as many $F_j$ vectors. We can do so because we are aware of the inputs and the function, and further, because $k$ is, in most cases, merely a portion of the secret key, a so-called sub-key, that is much smaller than the entire key. The vector $F_j$ that was built with the correct hypothesis shows a dependency on $X_l$ because the side-channel signal contains the information about the correct key. The dependency can severely benefit from a suited side-channel leakage model (cf. Sec. 2.4), i.e., if it is applied to the elements of $F_j$. If the dependency can be proven we speak of side-channel leakage. In practice we do not know exactly at which point in time the targeted function is processed, therefore each hypothetical vector is examined for a dependency with each measurement column vector. The previously mentioned statistical method only serves the purpose of rating the dependency, respectively of distinguishing between the key hypotheses, which is why methods in this context are called side-channel distinguishers.


 Difference of Means was the first side-channel distinguisher in the literature [KJJ99]. The hypotheses are tested by means of two groups $G_0$ and $G_1$. Therefore the side-channel leakage model must be adapted accordingly, such that its application to the targeted function results in one of two values. This can be achieved by taking the Hamming weight of a single bit into account, e.g., the Least Significant Bit (LSB). That is, a hypothetical vector consists of $\mathrm{LSB}(f_{target}(v_i, \tilde{k}))$ wherein $v_i$ is the input during the $i$-th run and $\tilde{k}$ denotes a key hypothesis. After each hypothetical value has been assigned to one or the other group, the associated traces are averaged according to their group. We recall that we do this for all samples $x_{i,j}$ in a trace $X_i$ because we do not know exactly at which point in time the targeted function is executed. Afterwards a difference trace with respect to each key hypothesis is computed. That is, the hypothesis difference reads

$$
\Delta_{\tilde{k}} = \frac{1}{|G_0|} \sum_{X_i \in G_0} X_i \;-\; \frac{1}{|G_1|} \sum_{X_i \in G_1} X_i. \qquad (2.7)
$$

For the correct key hypothesis at the correct point in time the difference will be distinguishably greater than zero; in all other cases the difference will be approximately zero. Why is this the case? Only for the correct key and for the samples that indeed represent the target function do the side-channel signals show a difference of means. If the Hamming weight assumption holds, then the last bit of the side-channel information is zero in $G_0$ and one in $G_1$. The mean Hamming weights of the remaining bits are equal in both groups, thus the difference amounts to one because of the last bit. If the grouping is wrong, the values are distributed randomly over both groups, the mean Hamming weights are equal, and hence the difference is approximately zero.
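As a concrete illustration, the following plain C sketch (with a hypothetical target function f_target, assumed only for this example) computes the difference of means of Equation 2.7 for one key hypothesis at one sample index:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical target function, e.g., an S-box output; assumed for illustration. */
    extern uint8_t f_target(uint8_t v, uint8_t k);

    /* Difference of means at sample index j for key hypothesis k_hyp.
     * traces: M x S matrix of samples (row-major), inputs: M input bytes. */
    double dom_at_sample(const double *traces, const uint8_t *inputs,
                         size_t M, size_t S, size_t j, uint8_t k_hyp)
    {
        double sum0 = 0.0, sum1 = 0.0;
        size_t n0 = 0, n1 = 0;

        for (size_t i = 0; i < M; i++) {
            /* Group the trace by the least significant bit of the hypothetical value. */
            if (f_target(inputs[i], k_hyp) & 1u) {
                sum1 += traces[i * S + j];
                n1++;
            } else {
                sum0 += traces[i * S + j];
                n0++;
            }
        }
        if (n0 == 0 || n1 == 0)
            return 0.0;               /* degenerate grouping */
        return sum0 / (double)n0 - sum1 / (double)n1;
    }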

 Correlation Power Analysis overcomes the shortcoming of Difference Of Means (DOM), namely that only one bit of side-channel information is utilized. Proposed by Brier et al. in [BCO04], it makes use of the entire available side-channel information. CPA facilitates hypothesis testing in a more intuitive way by rating the dependency between $F_j$ and the corresponding measurement vector $X_j$ directly. To this end, each hypothetical column vector is first subjected to a leakage model $\mathcal{M}$ according to the targeted function and to each sub-key hypothesis $\tilde{k}$; hence we obtain $\mathcal{M}(f_{target}(v_i, \tilde{k}))$ for $i = 1, \ldots, M$. The side-channel distinguisher is embodied by the Pearson correlation coefficient

$$
\rho_{\tilde{k},j} = \frac{\mathrm{Cov}(F_{target,\tilde{k}}, X_j)}{\sqrt{\mathrm{Var}(F_{target,\tilde{k}}) \cdot \mathrm{Var}(X_j)}}. \qquad (2.8)
$$

The sub-key hypothesis reaching the maximum correlation along all $X_j$ is supposed to be the correct sub-key. CPA accounts for a linear dependency and assesses it within the bounds $[-1, 1]$, where ±1 means perfect linear dependency and zero means no linear dependency. Please note that a correlation of zero does not mean that there is no dependency at all. Because of noise it might be the case that an actually existing correlation does not show up for a small number of traces. The influence of the noise, however, reduces with the number of traces according to $|\rho_{noise}| \le \frac{4}{\sqrt{M}}$ [MOP08], which is denoted as the significance threshold throughout the rest of this thesis.
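The corresponding C sketch below (again with hypothetical helper functions; the Hamming weight of the target function's output is assumed as the leakage model) estimates the correlation of Equation 2.8 for one key hypothesis at one sample index; in a full attack this is repeated for all hypotheses and all samples, and the maximum absolute value is taken:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t f_target(uint8_t v, uint8_t k);   /* hypothetical target function */
    extern int     hamming_weight(uint8_t b);        /* leakage model M(.)           */

    /* Pearson correlation between the modeled hypotheses and the samples
     * at time index j; traces is an M x S matrix in row-major order. */
    double cpa_at_sample(const double *traces, const uint8_t *inputs,
                         size_t M, size_t S, size_t j, uint8_t k_hyp)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

        for (size_t i = 0; i < M; i++) {
            double x = traces[i * S + j];
            double y = (double)hamming_weight(f_target(inputs[i], k_hyp));
            sx += x;  sy += y;  sxx += x * x;  syy += y * y;  sxy += x * y;
        }
        double cov  = sxy - sx * sy / (double)M;
        double varx = sxx - sx * sx / (double)M;
        double vary = syy - sy * sy / (double)M;
        if (varx <= 0.0 || vary <= 0.0)
            return 0.0;               /* constant column: correlation undefined */
        return cov / sqrt(varx * vary);
    }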


 Mutual Information Analysis was introduced by Gierlichs et al. in [GBTP08]. Its name-giving side-channel distinguisher is the mutual information

$$
I_{\tilde{k},j} = H(X_j) - H(X_j \mid F_{target,\tilde{k}}) \qquad (2.9)
$$

where $H(\bullet)$ denotes the entropy, which has to be estimated in practice. The mutual information states the reduction of the information-theoretic uncertainty about $F_{target,\tilde{k}}$ under the knowledge of $X_j$. Here as well, the largest mutual information among all sub-key hypotheses highlights the correct one. In contrast to CPA, however, Mutual Information Analysis (MIA) accounts for any dependency, from which follows that a mutual information of zero in fact means that there is no dependency at all.
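A naive way to estimate Equation 2.9 is to bin the samples at a time index into a histogram and to condition on the modeled value. The C sketch below is purely illustrative (hypothetical helpers, a fixed bin count, and the Hamming-weight model as the class partition) and is not meant as a statistically sound estimator:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t f_target(uint8_t v, uint8_t k);
    extern int     hamming_weight(uint8_t b);

    #define BINS    16     /* histogram resolution for the samples */
    #define CLASSES 9      /* Hamming weights of a byte: 0..8      */

    static double entropy(const double *p, int n)
    {
        double h = 0.0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0.0)
                h -= p[i] * log2(p[i]);
        return h;
    }

    /* Mutual information I(X_j; F) for key hypothesis k_hyp at sample index j.
     * xmin/xmax bound the sample range used for binning. */
    double mia_at_sample(const double *traces, const uint8_t *inputs,
                         size_t M, size_t S, size_t j, uint8_t k_hyp,
                         double xmin, double xmax)
    {
        double px[BINS] = {0};               /* p(x)         */
        double pxc[CLASSES][BINS] = {{0}};   /* p(x | class) */
        double pc[CLASSES] = {0};            /* class counts */

        for (size_t i = 0; i < M; i++) {
            double x = traces[i * S + j];
            int b = (int)((x - xmin) / (xmax - xmin) * (BINS - 1));
            if (b < 0) b = 0;
            if (b >= BINS) b = BINS - 1;
            int c = hamming_weight(f_target(inputs[i], k_hyp));
            px[b] += 1.0;
            pxc[c][b] += 1.0;
            pc[c] += 1.0;
        }
        for (int b = 0; b < BINS; b++) px[b] /= (double)M;

        double h_cond = 0.0;                 /* H(X_j | F) */
        for (int c = 0; c < CLASSES; c++) {
            if (pc[c] == 0.0) continue;
            for (int b = 0; b < BINS; b++) pxc[c][b] /= pc[c];
            h_cond += (pc[c] / (double)M) * entropy(pxc[c], BINS);
        }
        return entropy(px, BINS) - h_cond;   /* I = H(X_j) - H(X_j | F) */
    }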

 Linear Regression Analysis according to [SLRP05, DPRS11] assumes that measurement vector $X_j$ has some algebraic property over a chosen subspace $\mathcal{F}_{u,j}$ such that each $x_{i,j}$ decomposes into the polynomial $x_{i,j} = \sum_{l=0}^{u-1} b_{l,j} \cdot x_{i,j}[l]$ plus an independent noise term, if we suppose that each processed bit $l$ of $F_j$ is leaked independently through $x_{i,j}[l]$ with respect to the coefficients $b_{l,j}$. The latter representation, however, does not account for higher-order relations between different $x_{i,j}[l]$. To include higher orders as well, additional coefficients need to be taken into account.

In order to establish a side-channel distinguisher, the targeted function $F_{target}$ is computed for each possible key candidate $\tilde{k}$ and with the $N$ known inputs $v_i$. $F_{target}$ is then decomposed in the same manner as measurement vector $X_j$ over $\mathcal{F}_{u,target}$, hence

$$
f_{target}(v_i, \tilde{k}) = \sum_{l=0}^{u-1} b_{l,i} \cdot f_{target}(v_i, \tilde{k})[l]. \qquad (2.10)
$$

The coefficients $b_{l,i}$ are to be estimated, which is realized by solving

$$
G_{\tilde{k}}^{\top} G_{\tilde{k}}\, b_{\tilde{k}} = G_{\tilde{k}}^{\top} X_j \qquad (2.11)
$$

with matrix $G_{\tilde{k}} = (f_{target}(v_i, \tilde{k})[l])$, $0 \le l < u$, $1 \le i \le N$. By definition, linear regression outputs coefficients that minimize the Euclidean distance between $G_{\tilde{k}} b_{\tilde{k}}$ and $X_j$, meaning that the key hypothesis reaching the minimum distance along all $X_j$ is supposed to be the correct sub-key. To this end, the coefficient of determination is applied, namely

$$
R^2_{\tilde{k},j} = \frac{\left\lVert X_j - G_{\tilde{k}}\, b_{\tilde{k}} \right\rVert^2}{\mathrm{Var}(X_j)}. \qquad (2.12)
$$

 Correlation-enhanced Collision Analysis as proposed by Moradi et al. in [MME10] has the advantage that it does not rely on a modeled side-channel leakage. Instead it exploits collisions within a cryptographic system that reveal information about interrelations of the secret key. We will explain this by means of AES since it is well suited for this analysis. Due to the bijective property of the AES S-box and its usage within the first (or last) round, collisions will occur in the shape of

$$
\text{S-box}(v_a \oplus k_a) = \text{S-box}(v_b \oplus k_b) \;\Leftrightarrow\; v_a \oplus k_a = v_b \oplus k_b \;\Leftrightarrow\; k_a \oplus k_b = v_a \oplus v_b = \delta_{a,b} \qquad (2.13)
$$


for arbitrarily chosen inputs $v_a, v_b$ and sub-keys $k_a, k_b$ at positions $a$ and $b$. Therefore, if we can find colliding S-box outputs we can obtain information about the difference between two sub-keys, simply by means of their respective inputs. Since the sub-key difference $\delta_{a,b}$ is fixed, we find a collision whenever $v_a \oplus v_b$ equals this difference. Instead of finding single collisions, the approach includes all possible collisions for the difference $\delta_{a,b}$. Initially, the traces are averaged according to the inputs $v_a$ and $v_b$; thus we obtain $2^8$ (since AES works on bytes) mean traces $X_{\alpha,a}$ and $X_{\alpha,b}$ with respect to each input value $\alpha$ at the input positions $a$ and $b$. Each possible collision and each possible difference can be covered by $v_a = \alpha$ and $v_b = \alpha \oplus \tilde{\delta}_{a,b}$, and thus we employ the Pearson correlation coefficient to rate

$$
\rho_{\tilde{\delta}_{a,b},j}\left( \left( (x_j)_{0,a}, \ldots, (x_j)_{255,a} \right)^{\top},\; \left( (x_j)_{0 \oplus \tilde{\delta}_{a,b},\,b}, \ldots, (x_j)_{255 \oplus \tilde{\delta}_{a,b},\,b} \right)^{\top} \right). \qquad (2.14)
$$

The difference hypothesis $\tilde{\delta}_{a,b}$ reaching the maximum correlation is assumed to be the correct sub-key difference. Nevertheless, in Equation 2.14 we implied that the collisions occur at the same point in time. If this does not hold, we need to shift the point in time of $b$ by $j + \epsilon$ where $\epsilon$ is the shifting offset.

2.5.2 Profiled Side-channel Cryptanalysis

In non-profiled side-channel analysis, the leakage information is exploited at the same point in time over different runs, so to speak vertically with respect to the measurement $X$. In profiled side-channel analysis, by contrast, several points in time of a trace are incorporated and the contained information is matched with another measurement of these points to state the similarity. Here, the leakage information is exploited in a horizontal fashion, i.e., per trace and not per time-instant. For this purpose, known information, e.g., a secret key, is profiled — during the profiling phase — such that we obtain traces that serve as a reference. In a second step we assume we are not aware of the information, i.e., the secret key; hence we characterize — during the characterization phase — these traces $X^{new}$ by means of the reference. Hypothetical function vectors based upon $F_j$ are not involved. The approach of profiling and characterization is very closely related to the Template Attack (TA), in which the name-giving templates are representatives of the reference traces. New traces are then matched to templates with an assigned probability stating the similarity.

Simple Power Analysis in a Profiled Scenario

Template attacks, being the first branch of profiled side-channel cryptanalysis, introduced in [CRR03], assume the measured side-channel leakage traces $X_i$ — or side-channel observations, to be more formal — to be drawn from a Multivariate Normal Distribution (MVND). A multivariate approach supposes that merely a certain subset of time-instants within the measured traces contains the crucial proportion of the side-channel leakage. Consequently, these time-instants, known as Points Of Interest (POIs), need to be figured out. To date, several methods have been proposed, amongst others a known-key CPA [RO05], a distance vector between observations and their overall mean [CRR03], or pair-wise T-differences [GLRP06], which output those time-instants that show a significant data-dependent variability. Generic approaches, like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), aim for transforming the

set of side-channel traces into a low-dimensional subspace, supposing that it basically retains the data-dependent variability [APSQ06, MBvLM12].

Based upon a subset of $P$ selected POIs, several templates $\tau_{v,k}$ are built by estimating the means $\mu_{v,k}$ and covariances $\Sigma_{v,k}$ of the corresponding observations $X_i^{v,k} \sim \mathcal{N}(\mu_{v,k}, \Sigma_{v,k})$. Therefore, each template correlates to $f_{target}(v, k)$. With regard to templates, we will call a single pair $v, k$ a side-channel leakage class hereafter. This approach of estimating parameters is referred to as template profiling. More precisely, the means actually profile the side-channel leakage whereas the covariances estimate the noise at each time-instant and the existing mutual dependencies.

A new observation $X_{v,k^\circ}^{new}$ involves a yet unknown sub-key $k^{\circ}$ that is to be characterized with the help of the MVND Probability Density Function (PDF)

$$
\mathrm{Prob}(X_{v,k^\circ}^{new} \mid \tau_{v,k}) = \frac{1}{\sqrt{(2\pi)^P\, |\Sigma_{v,k}|}} \cdot \exp\!\left( -\frac{1}{2}\, \mathcal{D}^2(X_{v,k^\circ}^{new}, \tau_{v,k}) \right). \qquad (2.15)
$$

The maximum likelihood states the best fit of a template $\tau_{v,k}$ to the correct key $k^{\circ}$, respectively to the new observation in which it is embedded. Here, $\mathcal{D}^2$ is a distance measure that can take three shapes.

 Mahalanobis Distance, proposed by the author of the same name [Mah36], is, in its squared form, the actual distance measure from the definition of the MVND PDF since it considers the full covariance matrix. In the sense of Equation 2.15 it is written as

$$
\mathcal{D}^2(X_{v,k^\circ}^{new}, \tau_{v,k}) = (X_{v,k^\circ}^{new} - \mu_{v,k})^{\top}\, \Sigma_{v,k}^{-1}\, (X_{v,k^\circ}^{new} - \mu_{v,k}). \qquad (2.16)
$$

If there is a good reason to believe that the covariances are equal for all side-channel leakage classes $v, k$, one can estimate a pooled covariance matrix $\Sigma$ replacing each $\Sigma_{v,k}$.

 Normalized Euclidean Distance arises from the Mahalanobis distance by neglecting the covariances (off-diagonal values) and thus considering the variances solely. We obtain

$$
\mathcal{D}^2(X_{v,k^\circ}^{new}, \tau_{v,k}) = \sum_{j=1}^{P} \frac{\left( x_j^{new} - (\mu_j)_{v,k} \right)^2}{(\sigma_j)_{v,k}^2}. \qquad (2.17)
$$

This leads, however, to a univariate approach since Equation 2.15 can then be rewritten as the product of the probability densities at each point of interest

$$
\mathrm{Prob}(X_{v,k^\circ}^{new} \mid \tau_{v,k}) = \prod_{j=1}^{P} \frac{1}{\sqrt{2\pi (\sigma_j)_{v,k}^2}} \exp\!\left( -\frac{1}{2}\, \frac{\left( x_j^{new} - (\mu_j)_{v,k} \right)^2}{(\sigma_j)_{v,k}^2} \right) \qquad (2.18)
$$

considering them as being independent and thus uncorrelated.

 Euclidean Distance is completely independent of the covariance matrix, i.e., it assumes unit variances, and merely computes the distance at each point of interest, that is

$$
\mathcal{D}^2(X_{v,k^\circ}^{new}, \tau_{v,k}) = \sum_{j=1}^{P} \left( x_j^{new} - (\mu_j)_{v,k} \right)^2. \qquad (2.19)
$$

As a consequence, the profiling phase then only includes the estimation of the means.
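For illustration, the following C sketch (hypothetical structures and names, not taken from the thesis) characterizes a new observation against a set of templates using the normalized Euclidean distance of Equation 2.17; the class with the smallest distance is reported, which coincides with the maximum likelihood of Equation 2.18 whenever the per-class variances are equal:

    #include <float.h>
    #include <stddef.h>

    #define P 5    /* number of points of interest (illustrative) */

    typedef struct {
        double mu[P];    /* profiled means at the POIs     */
        double var[P];   /* profiled variances at the POIs */
    } template_t;

    /* Returns the index of the template (leakage class) with the smallest
     * normalized Euclidean distance to the new observation x_new[P]. */
    size_t match_template(const template_t *tpl, size_t n_classes,
                          const double x_new[P])
    {
        size_t best = 0;
        double best_d = DBL_MAX;

        for (size_t c = 0; c < n_classes; c++) {
            double d = 0.0;
            for (size_t p = 0; p < P; p++) {
                double diff = x_new[p] - tpl[c].mu[p];
                d += (diff * diff) / tpl[c].var[p];   /* Eq. 2.17 */
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        return best;
    }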


In template-based SPA one has numerous traces available for the profiling phase, but merely a single trace in the characterization phase. Moreover, it is likely that both phases are conducted on distinct but very similar DUTs, the first one under full control for profiling and the second with an unknown secret key to be characterized, i.e., to be recovered.

Differential Power Analysis in a Profiled Scenario

Simply speaking, the template-based DPA [ARR03] extends the SPA with regard to the number of traces during the characterization. Since it might be difficult in practice to recover an unknown secret key correctly by means of one trace, one can, of course, include several traces and update the probability of the sub-key hypotheses. Because the templates actually profile $f_i(v, k)$, a new trace $X_{v,k^\circ}^{new}$ is matched to a template in accordance with $v$. During a second matching, another new trace embedding the same sub-key is matched to another template when $v$ was different. So, after a few characterizations the correct hypothesis should have always been assessed with a non-negligible probability, whereas all the other sub-key hypotheses should be assessed with lower probabilities and sometimes even with zero probability. Therefore, the probability of each hypothesis evolves by means of the application of Bayes' rule to Eq. 2.15.

$$
\mathrm{Prob}(\tilde{k} \mid X_{v,k^\circ}^{new}) = \frac{\mathrm{Prob}(X_{v,k^\circ}^{new} \mid \tau_{v,k}) \cdot \mathrm{Prob}(\tilde{k})}{\mathrm{Prob}(X_{v,k^\circ}^{new})} \qquad (2.20)
$$

where $\mathrm{Prob}(\tilde{k})$ is the prior and $\mathrm{Prob}(\tilde{k} \mid X_{v,k^\circ}^{new})$ the posterior (updated) key hypothesis probability. In the beginning all hypotheses are equally likely; in the end, on the other hand, the correct hypothesis should have reached a probability of one.
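A minimal C sketch of this iterative Bayesian update (hypothetical interface; the per-trace likelihoods are assumed to come from Equation 2.15, e.g., via a template matching step) could look as follows:

    #include <stddef.h>

    /* Update the key-hypothesis probabilities prob[0..n_keys-1] with the
     * likelihoods lik[0..n_keys-1] obtained for one new trace (Eq. 2.20).
     * prob acts as the prior on entry and as the posterior on return. */
    void bayes_update(double *prob, const double *lik, size_t n_keys)
    {
        double norm = 0.0;

        for (size_t k = 0; k < n_keys; k++) {
            prob[k] *= lik[k];       /* numerator: likelihood times prior */
            norm += prob[k];
        }
        if (norm > 0.0)
            for (size_t k = 0; k < n_keys; k++)
                prob[k] /= norm;     /* denominator: Prob(X_new)          */
    }

Starting from a uniform prior prob[k] = 1/n_keys, repeated calls with the likelihoods of further traces let the probability of the correct hypothesis accumulate.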

Stochastic Model or Linear Regression Based Profiling

The stochastic model [SLRP05] views the side-channel leakage at time-instant $t$ as the random variable

$$
I_t(v, k) = h_t(v, k) + R_t. \qquad (2.21)
$$

Herein, $h_t(v, k)$ is the deterministic side-channel leakage component and $R_t$ an independent noise component. As introduced above, $v, k$ are pairs of a known input and a key. In the profiling phase, $h_t(v, k)$ is estimated by $\tilde{h}_t^{*}(v, k)$ within an appropriately chosen subspace $\mathcal{F}_{u,t}$ that relates to a selection function, again involving $v$ and $k$. Therefore,

$$
\tilde{h}_t^{*}(v, k) = \sum_{j=0}^{u-1} b_{j,t}\, g_{j,t}(v, k) \qquad (2.22)
$$

where $b_{j,t}$ is a coefficient of the (partial) selection function $g_{j,t}(v, k)$. Hence, the coefficients $b_{j,t}$ are actually to be estimated, which is realized by solving

$$
G^{\top} G\, b = G^{\top} i_t \qquad (2.23)
$$

with the $(N_1 \times u)$ matrix $G = (g_{j,t}(v_i, k))$, $0 \le j < u$, $1 \le i \le N_1$. Further, $i_t(v_i, k)$ corresponds to a column vector $X_t$ possessing $N_1$ measured observations at time-instant $t$ for distinct inputs


$v_i$. Next, utilizing a further set of $N_2$ observations, the distribution of $R_t$ is characterized by a covariance matrix of $R_t = i_t(v_l, k) - \tilde{h}_t^{*}(v_l, k)$ for $1 \le l \le N_2$, according to Eq. 2.21. This leads to a multivariate normal probability density $\tilde{f}_0 : \mathbb{R}^{p} \to \mathbb{R}$. As a preparation step, however, a selection of $p$ side-channel leakage dependent time-instants, as explained above, needs to be carried out. During the characterization phase, a yet unknown key $k^{\circ}$ embedded in a set of $N_3$ new $p$-variate observations is decided by the key candidate $\tilde{k}$ that maximizes

$$
\prod_{j=1}^{N_3} \tilde{f}_0\!\left( X_{v_j,k^\circ}^{new} - \tilde{h}_T^{*}(v_j, \tilde{k}) \right). \qquad (2.24)
$$

$T$ denotes the interval comprising the $p$ selected time-instants. Here we apply Bayes' rule to recover $k^{\circ}$ by means of substituting the conditional density in Eq. 2.20 with $\tilde{f}_0\!\left( X_{v_j,k^\circ}^{new} - \tilde{h}_T^{*}(v_j, \tilde{k}) \right)$. We prefer Bayes' rule to prevent a degeneration of the results in case of small density values $\tilde{f}_0$.

2.5.3 Approaches Based on Machine Learning

The branch of machine learning can be employed to support or even enhance side-channel cryptanalysis. It can be applied to both non-profiled and profiled analyses. Generally speaking, each machine learning method considered here represents an artificial neural network (cf. Appendix A) that is trained with data to optimize a certain classification problem. Therefore machine learners are optimally suited to fit into profiled side-channel cryptanalysis. The most considered approach so far is certainly Support Vector Machines (SVMs) [Vap95], investigated in [LBM11, HMG+11, HGM+11, HZ12]. In short, the SVM approach has been extended from a single-bit distinguisher to a full multi-class profiled side-channel attack. Since SVMs are also examined in this work, we refer to Chapter 6 for more details. Besides, Self Organizing Maps (SOMs) [Koh01] and Random Forests (RFs) [Bre01] have been utilized in [LBM11] to recover bits of a DES implementation. The authors showed that both methods can have a predictive power similar to that of SVMs. In [YZLC12] a Back Propagation (BP) neural network [RHW88] is used to enhance Correlation Power Analysis and Mutual Information Analysis in the presence of extreme noise. There, the network supports both side-channel analyses to efficiently derive leakage-characterized models. Both BP-enhanced CPA and MIA are claimed to succeed where the usual approaches do not. As a consequence of the application of a neural network, CPA and MIA are turned into profiled side-channel attacks. In Chapter 7 we propose a further profiled side-channel analysis, based on Learning Vector Quantization (LVQ) [Koh01], that is related to [YZLC12] with respect to the kind of network, which aims at minimizing some defined error function.

2.6 Conclusion and Outlook for the Thesis

The side-channel cryptanalysis techniques presented in this chapter belong to the very basic tools a security evaluator or an attacker is given to examine potential vulnerabilities in the implementation of cryptographic systems on a wide variety of devices. However, an important practical aspect has not been touched yet, namely alignment. One simple countermeasure that finds application in all security-related products, at least in those which considered the


side-channel threat, is a de-alignment of the execution of the underlying functions $f_i$. Each and every attack presented here heavily relies on the fact that the traces are more or less perfectly aligned in time, and thus, in practice, we need to re-align. We are faced with re-alignment in Chapter 9. We will look at each technique in the coming chapters from different viewing angles. In Part II we deal with the acceleration of lattice basis reduction, CPA, and template attacks in terms of runtime. Further, template attacks can be conducted by means of the most diverse statistical methods. So far we got to know template attacks based on the evaluation of probability densities. In Part III we introduce two new approaches based on machine learning, such that each application of them leads to a tailor-made side-channel cryptanalysis that can be finely tuned to fit the attack scenario. A profiled side-channel scenario is also the background of Part IV. There, however, we are not concerned with an attack but with proving the existence of already profiled side-channel information to facilitate plagiarism detection. Ultimately, in Part V we make use of SPA, CPA, and template-based DPA to demonstrate the vulnerability of a real-world product by recovering the secret keys from a DES and an AES hardware implementation. The reader will find many aspects of this chapter sufficiently covered to compare theory and practice.


Chapter 3 Parallel Computing

In this chapter we summarize the fundamentals of parallel computing. We briefly answer the question of how to examine an algorithm with respect to its suitability for parallel processing. Furthermore we introduce our target platform offering cost-efficient massively parallel computing, which lays the basis for our improved side-channel analysis implementations.

Contents of this Chapter

3.1 Introduction
3.2 Fundamentals of Parallel Computing
3.3 Parallel Computing on Graphics Processing Units
3.4 Compute Unified Device Architecture (CUDA)
3.5 Conclusion and Outlook for the Thesis

3.1 Introduction

In general, problems can often be broken down into smaller sub-problems which are then solved concurrently, i.e., in parallel. However, algorithmic solvers of such problems consist of both a sequential (or serial) and a parallel processing proportion. By the term sequential we mean that the data to be processed depend on each other in some certain order and hence cannot be treated separately. The challenge to be faced is to carefully examine a given problem-solving algorithm and identify these two proportions. Afterwards the identified parallel proportion has to be adapted to fit the target system's parallel architecture. There are several distinct types of parallel systems, each of which has its right to exist because of unique properties that qualify the system for certain tasks or problems. Although the advantages of parallelization have been known for a long time, the attempts to gain a computational speed-up were long concentrated on improving sequential systems. The paradigm change came in 2004, along with Intel's disastrously wrong prediction that single-core processors, especially their deeply pipelined NetBurst architecture introduced with the Pentium 4, could reach up to 10 GHz clock speed. Virtually every successor architecture, including those on the mobile market, has a multi-core design.


3.2 Fundamentals of Parallel Computing

First of all we need to understand in how far one can benefit from a parallelization of a given algorithm. The question to ask is not whether the algorithm is worth a thorough examination, because there will always be some parallelizable proportion that may result in a small speed-up. The question is rather: how can we estimate the maximum gain in performance? As already implied, this is done by means of the speed-up ratio. The speed-up is the quotient between the serial and the parallel runtime whilst taking the number of available processors into account. The first idea goes back to Amdahl [Amd67], who formulated Amdahl's law, given by

$$
S = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (3.1)
$$

where $S$ denotes the speed-up ratio, $P$ the parallel proportion, and $N$ the number of available processors. Theoretically, the speed-up is infinite, but in reality an algorithm always contains sequential parts, e.g., initialization phases. Beyond a certain threshold with respect to $P$ the speed-up stagnates near its theoretical maximum, which can be utilized to formulate a useful upper bound on the processor count to involve. Figure 3.1 visualizes Amdahl's law for different $P$. Even an algorithm that has a parallel proportion of almost 95%, which is obviously an excellent value, gains a maximum speed-up of only 20. Clearly, the speed-up also depends on several other properties of the targeted parallel system.

Figure 3.1: Amdahl's law representing an algorithm's parallelization speed-up with respect to the available number of processors (shown for parallel proportions of 50%, 75%, 90%, and 95% together with their limits).

A second law that addresses the shortcomings of Amdahl's law, namely Gustafson's law [Gus88], considers that the sequential proportion benefits as well from an increasing number of processors. Gustafson proposed that the speed-up scales with the problem's size, in contrast to Amdahl who assumed a fixed problem size and hence a fixed sequential proportion. Gustafson's law is defined by

$$
S = P(N - 1) + 1. \qquad (3.2)
$$

As Figure 3.2 clarifies, there is no upper bound anymore and the speed-up nearly scales with the number of processors.

Figure 3.2: Gustafson's law representing an algorithm's parallelization speed-up with respect to the available number of processors (shown for parallel proportions of 50%, 75%, 90%, and 95%).

Consequently, the question arises: who is right? As is so often the case, the truth lies somewhere in between. If the problem size is scalable, by which means the sequential proportion benefits from further processors, then the speed-up is closer to Gustafson's law, otherwise closer to Amdahl's law; a small numeric sketch of both laws follows Flynn's taxonomy below.

We now want to introduce Flynn's taxonomy [Fly72], which is a classification based upon the number of concurrent instruction streams and data streams within the available computational architecture. Four classes are to be distinguished:

 Single Instruction, Single Data (SISD). This architecture contains exactly one processor that executes an instruction stream to operate on a data stream from a single memory. It is referred to as the classical sequential von Neumann architecture.

 Multiple Instruction, Single Data (MISD). This type of architecture contains many processors that perform different operations on the same data stream. It is an uncommon architecture which is merely used for fault-tolerant systems.

 Single Instruction, Multiple Data (SIMD). The same instructions are executed on different data streams. Hence, the data streams are processed concurrently (in parallel). A vector computer may serve as an example.


 Multiple Instruction, Multiple Data (MIMD). Multiple independent processors concurrently execute different instructions on different data streams. MIMD structures can work either on shared memory (the processors share one memory space) or on distributed memory (each processor owns a dedicated private memory space). Most of the TOP500 supercomputers (detailed information can be found on http://www.top500.org/) are based on this type.
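As referenced above, a small numeric sketch of both scaling laws in C (purely illustrative, not part of the thesis) makes the difference between the fixed-size and the scaled-size view tangible:

    #include <stdio.h>

    /* Amdahl's law (Eq. 3.1): fixed problem size. */
    static double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }

    /* Gustafson's law (Eq. 3.2): problem size scales with the processors. */
    static double gustafson(double p, double n) { return p * (n - 1.0) + 1.0; }

    int main(void)
    {
        double p = 0.95;                      /* 95% parallel proportion */
        for (int n = 1; n <= 4096; n *= 4)
            printf("N = %4d  Amdahl: %6.2f   Gustafson: %8.2f\n",
                   n, amdahl(p, n), gustafson(p, n));
        return 0;
    }

For P = 95% the Amdahl speed-up saturates near 20 while the Gustafson speed-up grows almost linearly with N, matching Figures 3.1 and 3.2.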

3.3 Parallel Computing on Graphics Processing Units

Graphics Processing Unit (GPU) computing, in general referred to as General-purpose Computing on Graphics Processing Units (GPGPU), is the shift of computations that are traditionally handled by the CPU — the host processor — to the GPU. A GPU is typically intended for handling computations of computer graphics, but with the growing demand for higher flexibility it became feasible to process other data, too. The GPU is a typical representative of the SIMD architecture and hence suitable for vector computing. Crucial for GPGPU is the core unit, commonly denoted as stream processor in most modern GPU architectures. For a long time, however, GPGPU was a cumbersome process since the programming interface was built specifically to implement the graphics pipeline. All data had to be inconveniently adapted to fit this interface. For example, the data had to be encoded in textures, arrays of pixels, which are unfortunately read-only objects. Finally, the processed data was written to the frame buffer, which contains the frames to be shown on the display, and looped back into the texture space. As of 2007 these shortcomings were properly addressed by the two leading GPU manufacturers nVidia and AMD, leading to a growing popularity of the GPGPU concept. AMD's framework Stream SDK [ATI09] is aimed at enabling GPGPU by granting direct access to the native instruction set and memory spaces of AMD graphics cards. It features a C language derivative optimized for stream computing. nVidia's framework is called Compute Unified Device Architecture (CUDA), which is introduced hereafter.

3.4 Compute Unified Device Architecture (CUDA)

In this thesis we decided to work with nVidia's CUDA since it enjoys wide acceptance in the GPGPU community and is the de facto standard framework in science. CUDA features C for CUDA, also a C language derivative, which consists of standard C with special extensions. Moreover, it is possible through third-party wrappers to use Python, Fortran, Java, or even MATLAB. All information on CUDA is taken from [nVi10a, nVi10b].

3.4.1 Hardware Overview

CUDA works with all nVidia graphics cards from the G80 series onwards. Besides common consumer graphics cards, the Tesla line-up is offered, which is a dedicated GPU computing solution. The most important difference between the latest and older generations is the number of stream processors and the so-called compute capability. Obviously the number of stream processors is crucial for the system's overall performance, but otherwise the compute capability


indicates the suitability regarding a certain problem. The compute capability is mainly subject to technological improvement, e.g., facilitating processing on double-precision floating-point data. Due to the need for double-precision floating-point capability and access to unrestricted hardware resources, we make use of a GTX 280 and a Tesla C2070 graphics card throughout this thesis. Table 3.1 summarizes the available resources.

Table 3.1: Available resources on nVidia GTX 280 and Tesla C2070

Resource                                     GTX 280                GTX... Tesla C2070
Compute capability                           1.3                    2.0
Global memory (VRAM)                         1 GiB                  6 GiB
Number of multiprocessors                    30                     14
Number of stream processors                  240                    448
Constant memory                              64 KiB                 64 KiB
Shared memory per block                      16 KiB                 48 KiB
Registers available per block                16,384                 32,768
Warp size                                    32                     32
Maximum number of threads per block          512                    1,024
Maximum size of each dimension of a block    512 × 512 × 64         1,024 × 1,024 × 64
Maximum size of each dimension of a grid     65,535 × 65,535 × 1    65,535 × 65,535 × 65,535

The device's main unit — the multiprocessor — is a set of eight stream processors which share memory, caches, and an instruction unit. The multiprocessor creates, manages, and executes threads in hardware with zero scheduling overhead and lightweight creation. A thread in CUDA is, as in other architectures, the smallest unit of parallelism, executed alongside other threads in hardware. Despite the fact that a GPU is a representative of the SIMD architecture, nVidia denotes the architecture as Single Instruction, Multiple Thread (SIMT), which is akin to SIMD. In contrast to SIMD, however, SIMT enables users to write thread-level parallel code for independent scalar threads, as well as data-parallel code for coordinated threads.

3.4.2 CUDA Programming Model

The CUDA programming model is described through three abstraction layers: firstly, a hierarchy of threads and thread groups; secondly, a hierarchy of different memory types; and thirdly, a barrier synchronization. We now introduce the model by covering each abstraction layer.

Abstraction Layer 1: Thread Hierarchy

 Thread. As mentioned above, a thread is the smallest unit of parallelism, executed alongside other threads in hardware. Threads can concurrently run different pieces of code or the same piece of code, the latter resulting in a parallel instruction execution on a data stream. Compared to threads within a multi-core CPU, threads on a GPU require considerably fewer resources for creation and switching. On the downside, GPU computing is efficient only when a large thread count is involved in order to ensure a constant workload.


 Thread Block. Threads are organized in a thread block, a group of threads in which the threads can communicate with each other and synchronize their state. A thread can be identified by its thread index, a scalar that is unique within the block but not beyond. It is possible to define two- and three-dimensional blocks.

 Thread Grid. A group of thread blocks is called a thread grid. A thread grid is the unit that can actually be launched in the CUDA model; hence it is not possible to execute a single thread or thread block on its own. Again, a block is identified within the grid by its block index, which is unique within the grid. Yet, the grid can only possess two dimensions; the third dimension is set to 1. Multiple grids need to be executed sequentially.

The complete hierarchy is depicted in Figure 3.3 for a two-dimensional thread block and grid.


Figure 3.3: Thread hierarchy within the CUDA programming model, for instance a two-dimensional thread block encompassed by a two-dimensional thread grid.

Abstraction Layer 2: Memory Model and Memory Hierarchy

Threads in the CUDA programming model can access data from various memory spaces differing in size and access time (latency). However, these memory spaces are not physically implemented in hardware but logically. Table 3.2 summarizes the most important features of the graphics card's memory. The memory model is illustrated in Figure 3.4. It shows the reachable memory spaces from a thread's point of view. Accordingly, at the lowest level a thread is granted read and write access to its own registers and its copy of local memory. Threads within the same block are granted read and write access to a shared memory at the next higher level. Beyond the block, all threads are granted access to the largest memory space, the global memory.


Table 3.2: Available memory spaces on nVidia GTX 280 and Tesla C2070

Memory     Location         Cached   Access   Scope                  Lifetime
Register   On chip          n/a      R/W      1 thread               Thread
Local      Off chip (RAM)   No       R/W      1 thread               Thread
Shared     On chip          n/a      R/W      All threads in block   Block
Global     Off chip (RAM)   No       R/W      All threads and host   Host allocation
Constant   Off chip (RAM)   Yes      R        All threads and host   Host allocation
Texture    Off chip (RAM)   Yes      R        All threads and host   Host allocation

Finally, two further read-only spaces are worth mentioning: the constant memory and the texture memory. The latter three memory spaces (global, constant, and texture) are also accessible from the host, i.e., from CPU main memory.


Figure 3.4: CUDA memory model.

Abstraction Layer 3: Thread Synchronization

As usual, memory spaces that are shared are subject to potential hazards such as read-after-write, write-after-read, or write-after-write conflicts. Therefore the programming model implements a barrier in the shape of a synchronization instruction. When a thread has reached this instruction it halts its execution until the last thread within the block has also reached the barrier. The barrier, however, is implemented on block level only, such that a user still has to take care of threads that access the same global memory address beyond block level.


3.4.3 CUDA Programming Interface

There are currently two supported interfaces to create CUDA program code. The program code must make use of either the already mentioned C for CUDA or the CUDA driver Application Programming Interface (API). The driver API is a low-level C API that provides functions to load binary code, respectively assembly code. The driver API is not considered further in this thesis.

Kernels

The piece of code that is executed on the thread level is called a kernel. A kernel is written in such a way that each thread within a block, respectively grid, performs the same operation stream on its dedicated data stream.

Host and Device

The CUDA programming model assumes two parties, the CPU and the GPU. The CPU acts as a host which manages the execution of grids and the data stream transfers. Thus the GPU is a slave — device in CUDA notation — that is given the instruction and data stream for processing. First, the CPU transfers the data from its host memory (RAM) to the GPU's global memory. The execution of such a heterogeneous CUDA program is shown in Figure 3.5.


Figure 3.5: Flow of a heterogeneous CUDA Program.


Execution on Device

Due to the multiprocessor's nature of providing shared memory, registers, and an instruction unit, a block will be maintained by exactly one multiprocessor. Prior to execution, a block is split up into units handled at once — the warps. Warps are equally sized block slices, with the first thread of the block being the first thread of the first warp. The warp size is 32. The scheduler unit of a multiprocessor constantly switches among these warps, but the order is not specified, i.e., warps execute asynchronously. Since a multiprocessor possesses eight stream processors, one warp will finish in four cycles.

Language Extensions, Compilation, and Limitations

C for CUDA provides a minimal set of extensions to the C language in order to enable GPU computing. We introduce the most common extensions that are crucial for CUDA applications. Function type qualifiers specify whether a function is callable or executed on the host, respectively the device:

 __device__ The function is executed on the device and callable from the device only.

 __global__ It declares the function as being a kernel executed on the device and callable from the host only.

 __host__ States the opposite of the device declaration, executed on the host and callable from the host only. This is a normal C function and thus it can be omitted.

Variable type qualifiers specify the memory location on the device:

 __device__ The variable resides in the global memory space and is accessible from all threads in the grid.

 __constant__ The variable resides in the constant memory space and is accessible from all threads in the grid but read-only.

 __shared__ The variable is a shared variable within the block. Initialization of such a variable is not allowed.

 {} Declared variables without having a type qualifier are stated to reside in the registers of the thread. A register variable is accessible from its dedicated thread only.

Built-in variables specify the grid and block dimensions, as well as the block and thread indices:

 gridDim This variable represents a three-dimensional array containing the grid dimensions.

 blockIdx This variable is a three-dimensional integer array and contains the block index within the grid.

 blockDim Like the block index, it is a three-dimensional integer array that contains the number of threads within the block.

 threadIdx A three-dimensional integer array containing the thread index.


 warpSize This variable contains the warp size in threads.

The dimensions are accessed via .{x,y,z} where x represents the first dimension. The z component is fixed to 1 for some devices (cf. Tab. 3.1). The barrier instruction which is used to synchronize the threads within their block is realized by the function __syncthreads(). C for CUDA provides a mathematical standard library in accordance with the C language, offering functions such as sqrt(x) or log(x). By default these functions work on single-precision arithmetic variables only, and even if the device offers double-precision support the functions are mapped to their single-precision equivalents. The compiler option -arch sm_13 needs to be added in order to use double precision. To distinguish both types, single-precision functions end with f, e.g., sqrtf(x). Additionally, there are some functions that have faster counterparts with reduced accuracy, prefixed with __, such as __logf(x). Linear memory is allocated using cudaMalloc(), which reserves memory in the device's global address space. Data is bidirectionally transferred with cudaMemcpy(). The allocated memory is ultimately freed through cudaFree(). The execution configuration specifies any call to a kernel declared with the __global__ function type qualifier. The execution configuration is an expression of the form <<< Dg, Db >>> where Dg is the dimension of the grid and Db the dimension of the block. For instance, <<< 30, 512 >>> calls a kernel executed by 30 blocks each containing 512 threads.
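As a minimal illustration of these extensions, the following C for CUDA sketch (a hypothetical element-wise vector addition, not taken from the thesis) combines a __global__ kernel, the built-in index variables, the memory management functions, and an execution configuration:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Minimal C for CUDA sketch: element-wise vector addition (illustrative only). */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        /* Global thread index derived from the built-in variables. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                              /* guard threads beyond the data size */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;                 /* device pointers in global memory */
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   /* host-to-device transfer */
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int threads = 512;                          /* block dimension               */
        int blocks  = (n + threads - 1) / threads;  /* grid dimension                */
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    /* fetch the result */
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }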

3.4.4 CUDA Performance Guidelines

To achieve optimal CUDA performance, many details have to be considered. Some of them have a crucial impact on performance, especially those dealing with the global memory space. Hereafter we want to raise awareness of how to deal with CUDA in order to avoid performance collapses.

Memory Transfers

Dealing correctly with the CUDA memory types is the alpha and omega. To start with GPU computing in the first place, the data must be transferred from the host memory into the device memory. Host-to-device transfers are limited by the bandwidth of the PCI-E interface, which offers a theoretical peak bandwidth of 4 GiB/s; in practice we obtain about 2.9 GiB/s with our GTX 280 device. A device-to-device transfer provides a theoretical peak bandwidth of 160 GiB/s and in practice an effective bandwidth of 129.2 GiB/s (both bandwidths have been measured by means of the Bandwidth Test from the SDK). Hence, it is of utmost importance to avoid host-to-device transfers except for the first one, which is mandatory.

Memory Access

Memory accesses within the device should be handled very carefully since one could otherwise give away much of the performance. Table 3.3 provides a glimpse of the memories' latencies for different actions.



Table 3.3: Latency for memory actions on the device

Type                        Latency           Condition
Instructions                hidden            due to caching
Registers                   single cycle
Shared memory               single cycle      no bank conflicts
                            n cycles          n-way bank conflict
Global memory               400-600 cycles
Constant / texture memory   1-100 cycles      cache hit
                            400-600 cycles    cache miss

As expected, the global memory exhibits the highest latency for read and write accesses. This comes as no surprise because the global memory space is the largest of all spaces and accessible by all threads. During such a global memory fetch, all further instructions of the thread are deferred. As a consequence, one should adhere to the techniques below (a short kernel sketch illustrating coalesced accesses and the use of shared memory follows this list):

 Use of Shared Memory. The shared memory is located on-chip and much faster than global memory, provided that there are no conflicts. The shared memory is organized in equally sized banks that can be accessed simultaneously. Therefore any read or write that addresses n distinct memory banks can be carried out concurrently, yielding a bandwidth that is n times higher than the bandwidth of a single bank. Nevertheless, multiple accesses to the same bank cause a bank conflict and hence enforce a serialization. Shared memory should always be used when the fetched data is processed more than once.

 Memory Access Coalescing. Memory coalescing has the highest priority in order to avoid performance issues. A coalesced memory access is a coordinated access that results in as few memory transactions (reads or writes) as possible. This is achieved by issuing global memory reads and writes through the threads of a half-warp. In particular, the threads must access words in sequence, meaning that the i-th thread of the half-warp accesses the i-th word. Secondly, the addressed words have to be located in the same segment. The size of the segment — the range that can be accessed at once — is determined by the size of the word.

 32 byte if all threads access 1-byte words

 64 byte if all threads access 2-byte words

 128 byte if all threads access 4-byte or 8-byte words

If a half-warp addresses words in n different segments then n memory transactions will be issued. Otherwise if all words are located within one segment then one memory transaction will be issued.

 Latency Hiding. Memory latency can be hidden by intensive computations that overlap with memory transactions. Threads should feature a high arithmetic workload and the multiprocessor should run a large number of threads. Nevertheless, this technique depends on the problem.


 Warp Divergence. Flow control is inevitable for sequential code but can significantly affect the overall performance when using warps. Due to data-dependent control flow, the threads of a warp may follow different execution paths; in such a case the warp is considered diverged. The different execution paths must be serialized and the total number of instructions for this warp increases. A suitable countermeasure is to make branching dependent on the thread identification such that different warps diverge but not the threads within a warp.

 Occupancy. To achieve the highest memory bandwidth possible one definitely wants to execute as many warps in parallel as possible. A single multiprocessor processes the warps of a block and therefore the block dimension should be a multiple of the warp size in order not to waste computational resources. The occupancy is a metric to determine how efficiently the device is kept busy. It is defined as the ratio between the warps concurrently running on a single multiprocessor and the number of warps that could possibly be run on a multiprocessor. How many warps can be executed at once by a single multiprocessor depends on the registers available. Consider, for example, a device that has 16,384 32-bit registers per multiprocessor and can have a maximum of 1,024 simultaneous threads. This means each thread could employ a maximum of 16,384/1,024 = 16 registers to reach an occupancy of 100%. If threads employ more than 16 registers the occupancy decreases, as fewer warps can be resident at once.
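As referenced before the list, the following kernel sketch (hypothetical, not taken from the thesis) combines several of these guidelines: each thread performs one coalesced global read, the data is then reused from on-chip shared memory, and __syncthreads() separates the phases. It assumes a launch with a block dimension of 512 threads, e.g., blockSum<<<blocks, 512>>>(d_in, d_out, n).

    /* Hypothetical kernel sketch: each block sums a contiguous chunk of the
     * input. Global loads are coalesced because the i-th thread of a
     * (half-)warp reads the i-th consecutive word; partial sums are then
     * combined in on-chip shared memory. */
    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float buf[512];               /* one word per thread, on chip */

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid; /* coalesced access pattern     */

        buf[tid] = (i < n) ? in[i] : 0.0f;       /* single global read per thread */
        __syncthreads();                         /* barrier before reusing buf    */

        /* Tree reduction in shared memory; the indexing keeps the active
         * threads contiguous, so warps drop out as a whole rather than
         * diverging internally. */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = buf[0];            /* one partial result per block  */
    }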

3.5 Conclusion and Outlook for the Thesis

In this chapter we introduced the Compute Unified Device Architecture (CUDA), the parallel platform we favor and make use of throughout Part II of this thesis. We laid the basis for examining our proposed implementations of side-channel analysis primitives in terms of kernel code and optimization strategies. Not only generating parallel program code, but particularly optimizations and strict compliance with the performance guidelines, are of utmost importance to benefit from GPU computing at all.

Part II

Improved Implementations

Chapter 4 Enlarging Bottlenecks: A High-performance Implementation of Correlation Power Analysis

With respect to security testing of smart cards with cryptographic software or hardware support, Correlation Power Analysis (CPA) plays a major role. Apart from the laboratory-typical side-channel measurement setups, the subsequently mounted analysis of large amounts of traces and samples is considerably time-consuming; and during security testing, time does indeed matter. We show that CUDA is well suited to enable a CPA implementation which could serve as a high-performance reference with a throughput of approximately 63 billion samples per GPU, model, and second. Parts of this chapter have been previously published in [BLR11], available at link.springer.com.

Contents of this Chapter

4.1 Introduction
4.2 Correlation Power Analysis
4.3 Correlation Power Analysis on Graphics Cards
4.4 Higher Order Preprocessing on Graphics Cards
4.5 Experimental Results
4.6 Conclusion

4.1 Introduction

The resilience of a cryptographic device against side-channel attacks is ultimately defined by the minimum number of traces required to recover some secret information embedded in the device. In commercial applications, many cryptographic devices — which can be based on a dedicated or general-purpose µC, an FPGA, or even an ASIC — are hardened to withstand side-channel attacks up to a certain order. In order to fulfill the high-attack-potential security requirements, side-channel testing with at least one million traces is nowadays common for security labs, e.g., when it comes to the evaluation of state-of-the-art DES or AES hardware implementations. Currently, DPA attacks in research are already carried out with up to a few hundred million traces [MPL+11]. Both processes involved in a side-channel analysis, namely the recording of traces and the computational analysis, can be highly time-consuming during a smart card evaluation. Contrary to the effort that needs to be spent on the trace recording,

normally constrained by the device and the setup, the computational analysis is iterated for different applicable attack scenarios. Hence, an acceleration of the analysis part is definitely desirable. DPA [KJJ99], improved through the use of the Pearson correlation coefficient and denoted as CPA [BCO04], is still the most common statistical tool to evaluate the side-channel resilience. Graphics cards provide a powerful parallel architecture which has become widely accepted in scientific computations, cf. Chapter 3. Even cryptographic implementations are well established on graphics cards, with several approaches that were taken forward during the last few years. For instance, cryptosystems like Elliptic Curve Cryptography (ECC), RSA, and AES were successfully implemented [SG08, BCC+09, HW07]. However, until now only little effort has been spent on speeding up analyses by means of graphics cards. The only other proposal we are aware of is [LSH+10]. Their approach concentrates on DOM as the statistical distinguisher and makes use of both the CPU and a CUDA-enabled GPU. All in all they achieve a reasonable speed-up factor of about two by parallelizing the summation of the time-coherent samples on the GPU side.

4.1.1 Contribution

We follow a different approach and make use of advanced techniques offered by CUDA to achieve key benefits for the runtime, shifting all computations to the GPU side. We adapt the algorithms for CPA to achieve optimal results on graphics cards. A major portion of the speed-up stems from the covariance, whose implementation is done through a matrix multiplication which scales almost perfectly on graphics cards as long as several pitfalls are avoided. In Section 4.3, we describe our approach in detail and stress requirements for CPA and other tools when being implemented on a graphics card. In Section 4.5 we report our results concerning the achieved efficiency in terms of runtime.

4.2 Correlation Power Analysis

CPA is a passive implementation attack aiming at key recovery from cryptographic implementations. The intrinsic physical side-channel of a DUT being exploited is usually the power consumption or the electromagnetic emanation of the device while data propagates through the cryptographic algorithm, whether implemented in software or hardware. Further, CPA is a divide-and-conquer attack, i.e., a cryptographic key is iteratively compromised via its sub-keys. We assume that the device processes a sensitive variable $z$ which is the conjunction of a known input $v$ to the cryptographic computation, i.e., a plaintext or ciphertext, and an unknown secret embedded in the device, i.e., a sub-key $k$, such that $z = f_{target}(v, k)$. We further assume that the physical leakage of the DUT can be expressed as

$$
L_t = \delta_t + \mathcal{L}(f_{target}(v, k)) + B_t. \qquad (4.1)
$$

Herein, $L_t$ is the leakage at time-instant $t$, depending on a constant portion $\delta_t$, a certain deterministic leakage function $\mathcal{L}(\bullet)$ that describes how the leaking signal is linked to the sensitive variable $z$, and a noise term $B_t$. $B_t$ is often assumed to be a random normally distributed variable with standard deviation $\sigma$ (Gaussian assumption). Note that in practice the leakage function $\mathcal{L}(\bullet)$ is usually unknown to the attacker; however, it is well known that often the Hamming weight or the identity, i.e., the value, of $z$ is a good approximation [MOP08].


The approximation of $\mathcal{L}(\bullet)$ is thus done through a model $\mathcal{M}(\bullet)$. Alternatively, the attacker may investigate single-bit or zero-value leakage of $z$, or a Hamming distance to a further sensitive variable $z'$, respectively. The most accurate model is certainly the transition count model [MPO05], which, however, can be obtained only through simulation of the hardware netlist.

4.2.1 Recording of Traces

As the first step of CPA the power consumption and/or the electromagnetic emanation of the DUT is measured with a DSO. A complete measurement is given by the $(M \times S)$-matrix

$$
X = \begin{pmatrix} X_1 & X_2 & X_3 & \cdots & X_S \end{pmatrix} =
\begin{pmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,S} \\
x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,S} \\
x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,S} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
x_{M,1} & x_{M,2} & x_{M,3} & \cdots & x_{M,S}
\end{pmatrix} \qquad (4.2)
$$

involving $M$ independently recorded traces containing $S$ samples each. Therefore $x_{i,j}$ denotes a sample taken from trace $i$ at time-instant $j$ (row-major order). There are usually different (randomly chosen and uniformly distributed) inputs for each trace $i$, such that

$$
V = \begin{pmatrix}
v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,U} \\
v_{2,1} & v_{2,2} & v_{2,3} & \cdots & v_{2,U} \\
v_{3,1} & v_{3,2} & v_{3,3} & \cdots & v_{3,U} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
v_{M,1} & v_{M,2} & v_{M,3} & \cdots & v_{M,U}
\end{pmatrix}, \qquad (4.3)
$$

where $v_{i,l}$ is the $l$-th input portion used for trace $i$ with $l \in \{1, 2, \ldots, U\}$ and $U$ is the portion count of the plaintext. Each plaintext portion is equal in size to a sub-key, e.g., usually both are subdivided into bytes.

CPA works with hypotheses on sub-keys derived from the assumed leakage model. Let $v_i$ be a plaintext portion in row $i$ of $V$ that is considered during the computation of $z_{i,\tilde{k}} = f_{target}(v_i, \tilde{k})$. That is, to recover a sub-key $\tilde{k}$ we get a hypothetical leakage matrix

$$
Y = \begin{pmatrix} Y_1 & Y_2 & Y_3 & \cdots & Y_P \end{pmatrix} =
\begin{pmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,P} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,P} \\
\vdots  & \vdots  & \ddots & \vdots  \\
y_{M,1} & y_{M,2} & \cdots & y_{M,P}
\end{pmatrix}
=
\begin{pmatrix}
\mathcal{L}(z_{1,1}) & \mathcal{L}(z_{1,2}) & \cdots & \mathcal{L}(z_{1,P}) \\
\mathcal{L}(z_{2,1}) & \mathcal{L}(z_{2,2}) & \cdots & \mathcal{L}(z_{2,P}) \\
\vdots  & \vdots  & \ddots & \vdots  \\
\mathcal{L}(z_{M,1}) & \mathcal{L}(z_{M,2}) & \cdots & \mathcal{L}(z_{M,P})
\end{pmatrix}, \qquad (4.4)
$$

covering each of the $P$ possible sub-key candidate values $\tilde{k} \in \{0, 1, \ldots, P-1\}$.


4.2.2 Analysis

CPA makes use of the Pearson correlation coefficient as the statistical distinguisher. It computes the linear dependency of each column of the hypothetical leakage matrix with each column of the measurement matrix. The sub-key hypothesis reaching the absolute maximum of correlation is assumed to be the correct sub-key. For completeness, we provide the explicit formula for the estimated Pearson correlation coefficient expressed by means of sums:

$$
\rho_{X,Y} = \frac{\sum_{i=1}^{M} x_i y_i - \frac{1}{M} \sum_{i=1}^{M} x_i \cdot \sum_{i=1}^{M} y_i}
{\sqrt{\sum_{i=1}^{M} x_i^2 - \frac{1}{M} \left( \sum_{i=1}^{M} x_i \right)^2} \cdot \sqrt{\sum_{i=1}^{M} y_i^2 - \frac{1}{M} \left( \sum_{i=1}^{M} y_i \right)^2}}. \qquad (4.5)
$$
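In code, Equation 4.5 allows the correlation to be assembled from five running sums without keeping centered copies of the data; the following host-side C sketch (illustrative of the arithmetic only) shows how such sums combine into the correlation coefficient:

    #include <math.h>
    #include <stddef.h>

    /* Pearson correlation assembled from the five running sums of Eq. 4.5. */
    double pearson_from_sums(double sum_xy, double sum_x, double sum_y,
                             double sum_xx, double sum_yy, size_t M)
    {
        double num   = sum_xy - (sum_x * sum_y) / (double)M;
        double var_x = sum_xx - (sum_x * sum_x) / (double)M;
        double var_y = sum_yy - (sum_y * sum_y) / (double)M;

        if (var_x <= 0.0 || var_y <= 0.0)
            return 0.0;               /* constant column: correlation undefined */
        return num / (sqrt(var_x) * sqrt(var_y));
    }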

4.3 Correlation Power Analysis on Graphics Cards

The implementation of the Pearson correlation coefficient according to its representation in Equation 4.5 requires us to compute five sums:

\[ \sum_{\forall i} x_i y_i, \quad \sum_{\forall i} x_i, \quad \sum_{\forall i} y_i, \quad \sum_{\forall i} x_i^2, \quad \text{and} \quad \sum_{\forall i} y_i^2. \qquad (4.6) \]

Taking both matrices X and Y into account, the first sum embodies the matrix multiplication

\[ Y^\top * X = \begin{pmatrix} Y_1 * X_1 & Y_1 * X_2 & Y_1 * X_3 & \dots & Y_1 * X_S \\ Y_2 * X_1 & Y_2 * X_2 & Y_2 * X_3 & \dots & Y_2 * X_S \\ Y_3 * X_1 & Y_3 * X_2 & Y_3 * X_3 & \dots & Y_3 * X_S \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Y_P * X_1 & Y_P * X_2 & Y_P * X_3 & \dots & Y_P * X_S \end{pmatrix}, \qquad (4.7) \]

where the X_i and Y_j are column vectors, respectively row vectors, of length M. Matrix multiplications perform very well on graphics cards, and hence the idea is to build an implementation of CPA around the matrix multiplication. The other sums can then be computed simultaneously. However, in this case a prerequisite is that the hypothetical leakage matrix Y is computed beforehand. Additionally, we have to face some other issues that arise when we aim for an implementation that can handle arbitrarily large measurement matrices. First, we recall that the global memory of a graphics card is a constrained resource. Second, we likely run into numerical problems with regard to the single dot products, respectively the sums over large vectors, due to overflows. Third, we aim to distribute the computation of the correlation coefficients over an arbitrary number of graphics cards; otherwise the computation has to be issuable iteratively if only one graphics card is available. Therefore, our implementation approach consists of three major steps, i.e., CUDA kernels, that are carried out subsequently:

(1) Initially, a kernel that computes the leakage model,

(2) a kernel that performs the computations of the sums, and

(3) finally a kernel that computes the correlation coefficients.


Apart from this approach it is, of course, possible to have one kernel that computes everything in one go, but whether this is feasible heavily depends on how large the matrices are. Throughout the following we assume measurements that are too large to be processed by one kernel at once; nevertheless, we will also briefly discuss this point later on.

4.3.1 Creation of the Leakage Model

The leakage model is created by a kernel that is given an input vector V ∈ V. Row vectors of Y^⊤, each of which is based on a copy of V, are distributed over different thread blocks; that is, each row is computed among several threads within a thread block as depicted by Figure 4.1. As usual, the computation of the leakage modeling function L(z_{i,k̃} = f_target(v_i, k̃)) is realized using a table, e.g., an AES Substitution-box (S-box) with precomputed output values (identity) and/or their Hamming weights. This table is copied into the constant memory of the graphics card prior to the execution of the kernel and referenced later on. Eventually, this kernel can be omitted if the model inputs are directly fed into the matrix multiplication kernel. This indeed saves global memory because the leakage model matrix does not need to be stored. However, in some cases it might be more convenient to have a separate kernel, for instance when the leakage model is more complex. Our straightforward approach is represented by Algorithm 4.1.

[Figure 4.1: Computation of Y^⊤ among different thread blocks (outlined by solid lines); each thread block computes one row (L(z_{1,k̃}), L(z_{2,k̃}), ..., L(z_{M,k̃})) of Y^⊤.]

Algorithm 4.1 Leakage Model Creation
Input: (1) Input vector V ∈ V, (2) precomputed L(f_target(•))
Output: Leakage model matrix Y^⊤
1: for each block parallel do
2:   for each thread parallel do
3:     Y[bIdx.x, tIdx.x] = L(f_{bIdx.x}(V[tIdx.x]))
4:   end for parallel
5: end for parallel

As stated in the algorithm, the integer values tIdx.x and bIdx.x represent the index of a single thread and thread block in the first dimension, in compliance with the CUDA model; a corresponding CUDA sketch is given below.
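To make the mapping to CUDA concrete, a minimal kernel along the lines of Algorithm 4.1 could look as follows. It assumes the common Hamming-weight-of-the-S-box-output model for the first AES round; the table d_model_table (indexed by v ⊕ k̃) and all other identifiers are our own illustrative naming, not the exact code used for the measurements in this chapter.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Precomputed table L(f_target(v, k~)), e.g. the Hamming weight of Sbox[v XOR k~],
// indexed by v XOR k~ and held in constant memory as described in the text.
__constant__ uint8_t d_model_table[256];

// One thread block per sub-key hypothesis (blockIdx.x = k~), threads stride over traces.
// Y_t is the P x M matrix Y^T in row-major order, V holds one plaintext byte per trace.
__global__ void leakage_model_kernel(const uint8_t* __restrict__ V,
                                     float* __restrict__ Y_t,
                                     int num_traces)
{
    int key = blockIdx.x;                       // sub-key hypothesis k~
    for (int i = threadIdx.x; i < num_traces; i += blockDim.x) {
        uint8_t idx = V[i] ^ (uint8_t)key;      // table index v_i XOR k~
        Y_t[key * num_traces + i] = (float)d_model_table[idx];
    }
}
```

A launch such as leakage_model_kernel<<<256, 256>>>(d_V, d_Yt, M) would then fill one row of Y^⊤ per key hypothesis.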

4.3.2 Computation of the Correlation Coefficient Sums

The computation of the correlation coefficient sums (Eq. 4.6) is, as already mentioned above, based upon the matrix multiplication Y^⊤ * X. In order to achieve the best performance in the sense of CPA some effort has to be spent. Regarding arbitrarily large matrices we have to keep

in mind that at some point the matrices can exceed the global memory of the graphics card. At this point we can follow two approaches: on the one hand, a kernel that is given the matrices and computes the correlation coefficients directly because the matrices fit the global memory; on the other hand, if the matrices do not fit the memory, a kernel that intermediately outputs the sums for later processing. In the remainder of this chapter we deal with the latter option. This has also led us to the next implementation decision. Each sum has to be stored in a single variable, and a 32-bit variable may not be sufficient in general because of a potential overflow. This is further complicated by the fact that the CUDA performance involving 64-bit variables is lower [nVi10b]. Besides, the framework is highly optimized for 32-bit floating-point arithmetic. To address these issues the correlation coefficients, respectively their involved sums, are computed as follows:

\[ \rho_{X,Y} = \frac{\sum_{i=1}^{M} \frac{1}{M} x_i y_i - \sum_{i=1}^{M} \frac{1}{M} x_i \cdot \sum_{i=1}^{M} \frac{1}{M} y_i}{\sqrt{\sum_{i=1}^{M} \frac{1}{M} x_i^2 - \left(\sum_{i=1}^{M} \frac{1}{M} x_i\right)^2} \cdot \sqrt{\sum_{i=1}^{M} \frac{1}{M} y_i^2 - \left(\sum_{i=1}^{M} \frac{1}{M} y_i\right)^2}}. \qquad (4.8) \]

Shifting the fractions into the sums does not necessarily cause a performance penalty thanks to latency hiding: the additional instructions are consumed by the global memory access time. Referring to [SK10], the arising matrix is computed by two-dimensional thread blocks, here called tiles, that are distributed as shown in Figure 4.2. Each tile consists of n² threads, where a single thread is responsible for computing a single dot product within the result matrix.

[Figure 4.2: Computation of Y^⊤ * X along different two-dimensional thread blocks, here called tiles. Only one tile is emphasized to show which portions of the matrices X and Y^⊤ are involved in computing the resultant sub-matrix covered by that tile; the whole resultant matrix is covered by several tiles.]

In the beginning a tile loads portions, such that n² elements of X and n² elements of Y^⊤ are deposited into the shared memory of the thread block. This strategy avoids loading every vector each time it is needed. With these portions a single thread can now compute the first n products of its dot product, since the first n elements of each row vector of Y^⊤ and each column vector of X are loaded. Afterwards, the tile fetches the next portions to compute the next n products, a procedure which is repeated until all elements have been passed through.

Obviously, we can exploit synergies and compute all the other correlation coefficient sums concurrently, as all necessary elements are already loaded. Figuratively speaking, the tile moves rightwards with regard to Y^⊤ and downwards with regard to X (Fig. 4.2). The kernel in its optimized version is depicted in Algorithm 4.2; a simplified CUDA sketch is given after the algorithm. Additionally, the algorithm reveals two mandatory optimizations that have not been mentioned so far.

Algorithm 4.2 Computation of Correlation Coefficient Sums
Input: (1) Measurement matrix X, (2) leakage model matrix Y^⊤
Output: Sums of correlation coefficients: Σ_{∀i} (1/M) x_i y_i, Σ_{∀i} (1/M) x_i, Σ_{∀i} (1/M) y_i, Σ_{∀i} (1/M) x_i², and Σ_{∀i} (1/M) y_i²
1: for each block parallel do
2:   for each thread parallel do
3:     i ← thread position within column vectors of X
4:     j ← thread position within row vectors of Y^⊤
5:     prefetch first tile of X and first horizontal tiles of Y^⊤ into registers: x_i, y_j
6:   end for parallel
7: end for parallel
8:
9: for r = 1 to M/n do
10:   for each block parallel do
11:     for each thread parallel do
12:       deposit prefetched tiles into shared memory: X^shared[tIdx.x, tIdx.y] = x_i/√M, Y^shared[tIdx.x, tIdx.y] = y_j/√M
13:       prefetch next tiles into registers: x_{i+r}, y_{j+r}
14:       for l = 1 to n do
15:         Σ_thread x_i y_i = Σ_thread x_i y_i + X^shared[tIdx.x, l] · Y^shared[l, tIdx.y]
16:         Σ_thread x_i = Σ_thread x_i + X^shared[tIdx.x, l] · 1/√M
17:         Σ_thread y_i = Σ_thread y_i + Y^shared[l, tIdx.y] · 1/√M
18:         Σ_thread x_i² = Σ_thread x_i² + X^shared[tIdx.x, l]²
19:         Σ_thread y_i² = Σ_thread y_i² + Y^shared[l, tIdx.y]²
20:       end for
21:     end for parallel
22:   end for parallel
23: end for

The portions are prefetched by the tiles' threads into their registers first and then deposited into shared memory, with the side effect that the sum calculations only consume already fetched tile elements while the next elements are already loading. This enables latency hiding. The second optimization considers the workload balance within the kernel: several tiles of matrix X, instead of one, are loaded horizontally to compute multiple dot products involving the single loaded tile of Y^⊤.
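The following sketch condenses Algorithm 4.2 into a single CUDA kernel. For readability it omits the register prefetching and the horizontal multi-tile optimization described above and uses a small fixed tile size; variable names and the memory layout are our own assumptions.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // tile edge length n; the thesis uses n = 28, we keep it small here

// Computes the scaled sums of Eq. 4.8.  X is the M x S measurement matrix, Yt the
// P x M leakage matrix Y^T (both row-major, float).  out_xy is P x S, out_x/out_xx
// are length S, out_y/out_yy are length P.
__global__ void corr_sums_kernel(const float* __restrict__ X,
                                 const float* __restrict__ Yt,
                                 float* out_xy, float* out_x, float* out_xx,
                                 float* out_y, float* out_yy,
                                 int M, int S, int P)
{
    __shared__ float Xs[TILE][TILE];   // tile of X   (traces x samples), scaled by 1/sqrt(M)
    __shared__ float Ys[TILE][TILE];   // tile of Y^T (hypotheses x traces), scaled by 1/sqrt(M)

    int s = blockIdx.x * TILE + threadIdx.x;   // sample index (column of X)
    int p = blockIdx.y * TILE + threadIdx.y;   // key hypothesis (row of Y^T)
    float invSqrtM = rsqrtf((float)M);

    float sxy = 0.f, sx = 0.f, sxx = 0.f, sy = 0.f, syy = 0.f;

    for (int t0 = 0; t0 < M; t0 += TILE) {
        int tx = t0 + threadIdx.y;             // trace index loaded for the X tile
        int ty = t0 + threadIdx.x;             // trace index loaded for the Y^T tile
        Xs[threadIdx.y][threadIdx.x] = (tx < M && s < S) ? X[tx * S + s] * invSqrtM : 0.f;
        Ys[threadIdx.y][threadIdx.x] = (ty < M && p < P) ? Yt[p * M + ty] * invSqrtM : 0.f;
        __syncthreads();

        for (int l = 0; l < TILE; ++l) {
            float xv = Xs[l][threadIdx.x];     // x_{t0+l, s} / sqrt(M)
            float yv = Ys[threadIdx.y][l];     // y_{t0+l, p} / sqrt(M)
            sxy += xv * yv;                    // -> (1/M) sum x*y
            sx  += xv * invSqrtM;              // -> (1/M) sum x
            sxx += xv * xv;                    // -> (1/M) sum x^2
            sy  += yv * invSqrtM;              // -> (1/M) sum y
            syy += yv * yv;                    // -> (1/M) sum y^2
        }
        __syncthreads();
    }

    if (p < P && s < S) out_xy[p * S + s] = sxy;
    if (blockIdx.y == 0 && threadIdx.y == 0 && s < S) { out_x[s] = sx; out_xx[s] = sxx; }
    if (blockIdx.x == 0 && threadIdx.x == 0 && p < P) { out_y[p] = sy; out_yy[p] = syy; }
}
```

It would be launched with a dim3 block(TILE, TILE) and a grid of ⌈S/TILE⌉ × ⌈P/TILE⌉ tiles, mirroring the tiling of Figure 4.2.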

4.3.3 Computation of the Correlation Coefficient Matrix

This kernel's implementation is similar to the kernel for the leakage model creation. The matrix of the correlation coefficients is segmented in the same way (Fig. 4.1). Every thread

block is responsible for the samples that relate to one key hypothesis. Hence a block is given the corresponding correlation coefficient sums that result from the measurement matrix, and each sum from the leakage model matrix. The implementation is shown in Algorithm 4.3 and sketched in CUDA C below.

Algorithm 4.3 Computation of Correlation Coefficient Matrix
Input: Sums of correlation coefficients: Σ_{∀i} (1/M) x_i y_i, Σ_{∀i} (1/M) x_i, Σ_{∀i} (1/M) y_i, Σ_{∀i} (1/M) x_i², and Σ_{∀i} (1/M) y_i²
Output: Correlation coefficient matrix P
1: for each block parallel do
2:   for each thread parallel do
3:     P[bIdx, tIdx] = ρ_{X_tIdx, Y_bIdx}, using either Eq. 4.8 or Eq. 4.15
4:   end for parallel
5: end for parallel
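A possible realization of Algorithm 4.3 for the non-estimated variant (Eq. 4.8) is sketched below: one thread block per key hypothesis combines the scaled sums into correlation coefficients. The buffer layout matches the sums-kernel sketch above and is our own assumption; degenerate (constant) columns are not handled.

```cuda
#include <cuda_runtime.h>

// One block per key hypothesis (blockIdx.x), threads stride over the samples,
// combining the five scaled sums into rho according to Eq. 4.8.
__global__ void corr_coeff_kernel(const float* __restrict__ sum_xy,  // P x S: (1/M) sum x*y
                                  const float* __restrict__ sum_x,   // S:     (1/M) sum x
                                  const float* __restrict__ sum_xx,  // S:     (1/M) sum x^2
                                  const float* __restrict__ sum_y,   // P:     (1/M) sum y
                                  const float* __restrict__ sum_yy,  // P:     (1/M) sum y^2
                                  float* __restrict__ rho,           // P x S
                                  int S)
{
    int p = blockIdx.x;
    float my = sum_y[p];
    float vy = sum_yy[p] - my * my;            // variance term of the model column
    for (int s = threadIdx.x; s < S; s += blockDim.x) {
        float mx  = sum_x[s];
        float vx  = sum_xx[s] - mx * mx;       // variance term of the sample column
        float cov = sum_xy[p * S + s] - my * mx;
        rho[p * S + s] = cov * rsqrtf(vx * vy);
    }
}
```

A launch of the form corr_coeff_kernel<<<P, 256>>>(...) corresponds to the one-block-per-hypothesis segmentation described above.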

4.3.4 Estimation of Leakage Model Sums

In case of the identity model, the Hamming weight or distance model, the bit model, etc. (except for the transition count model), a further approach to achieve an even better performance is to estimate the model's mean and standard deviation. This saves computational effort, above all divisions and square root calculations, which should be avoided: these operations only offer one fourth, respectively one eighth, of the performance of a multiplication [nVi10b]. In addition, the model estimation also saves global memory due to the avoided sums. Presuming that the inputs contained in V are randomly chosen and uniformly distributed, the mean Σ_{∀i} (1/M) y_i for the identity model, denoted Id(•), the zero-value model, denoted ZV(•), and the Hamming weight, Hamming distance, or bit model, all denoted M(•), can be estimated with

\[ \sum_{\forall i} \frac{1}{M} y_i = E[\mathrm{Id}(Z)] = 2^{-b} \sum_{z=0}^{2^b-1} z = \frac{(2^b-1) \cdot 2^b}{2 \cdot 2^b} = \frac{2^b-1}{2}, \quad \text{or} \qquad (4.9) \]

\[ \sum_{\forall i} \frac{1}{M} y_i = E[\mathrm{ZV}(Z)] = 2^{-b} \cdot 0 + \frac{2^b-1}{2^b} \cdot 1 = \frac{2^b-1}{2^b}, \quad \text{or} \qquad (4.10) \]

\[ \sum_{\forall i} \frac{1}{M} y_i = E[\mathrm{M}(Z)] = E\Bigl[\sum_{i=1}^{b} Z(i)\Bigr] = b \cdot E[Z(i)] = \frac{b}{2}, \qquad (4.11) \]

where Z(i) is the i-th bit of a random variable Z, i.e., Z is a b-bit variable, and z is a realization of Z. The models' mean of the squares Σ_{∀i} (1/M) y_i² can be estimated with

\[ \sum_{\forall i} \frac{1}{M} y_i^2 = E[\mathrm{Id}(Z)^2] = 2^{-b} \sum_{z=0}^{2^b-1} z^2 = \frac{(2^b-1) \cdot 2^b \cdot (2^{b+1}-1)}{6 \cdot 2^b} = \frac{(2^b-1)(2^{b+1}-1)}{6}, \quad \text{or} \qquad (4.12) \]

\[ \sum_{\forall i} \frac{1}{M} y_i^2 = E[\mathrm{ZV}(Z)^2] = 2^{-b} \cdot 0^2 + \frac{2^b-1}{2^b} \cdot 1^2 = \frac{2^b-1}{2^b}, \quad \text{or} \qquad (4.13) \]


\[ \sum_{\forall i} \frac{1}{M} y_i^2 = E[\mathrm{M}(Z)^2] = E\Bigl[\sum_{i,j=1}^{b} Z(i) Z(j)\Bigr] = \sum_{i \ne j} E[Z(i) Z(j)] + \sum_{i} E[Z(i)] = b(b-1) E[Z(i)Z(j)] + b\, E[Z(i)] = \frac{b(b-1)}{4} + \frac{b}{2} = \frac{b^2+b}{4}. \qquad (4.14) \]

Eventually, we obtain the correlation coefficient with leakage estimation expressed as

\[ \rho_{X,Y} = \frac{\sum_{i=1}^{M} \frac{1}{M} x_i y_i - S_Y \cdot \sum_{i=1}^{M} \frac{1}{M} x_i}{\sqrt{S_{Y^2} - (S_Y)^2} \cdot \sqrt{\sum_{i=1}^{M} \frac{1}{M} x_i^2 - \left(\sum_{i=1}^{M} \frac{1}{M} x_i\right)^2}}, \qquad (4.15) \]

with either

\[ S_Y = \begin{cases} \frac{2^b-1}{2} & \text{for Id}(\bullet) \\[4pt] \frac{2^b-1}{2^b} & \text{for ZV}(\bullet) \\[4pt] \frac{b}{2} & \text{for M}(\bullet) \end{cases} \quad \text{and} \quad \sqrt{S_{Y^2} - (S_Y)^2} = \begin{cases} \sqrt{\frac{2^{2b}-1}{12}} & \text{for Id}(\bullet) \\[4pt] \sqrt{\frac{2^b-1}{2^{2b}}} & \text{for ZV}(\bullet) \\[4pt] \sqrt{\frac{b}{4}} & \text{for M}(\bullet). \end{cases} \]

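For reference, the estimated model means S_Y and standard deviations sqrt(S_{Y²} − S_Y²) of Eqs. 4.9–4.14 can be tabulated by a small host-side helper such as the following sketch; the enum and function names are our own and not part of the thesis implementation.

```cuda
#include <cmath>

// Estimated model mean S_Y and standard deviation sqrt(S_{Y^2} - S_Y^2) for a b-bit
// sensitive variable, following Eqs. 4.9-4.14.
enum class Model { Identity, ZeroValue, HammingWeight };

struct ModelStats { double mean; double stddev; };

ModelStats estimate_model_stats(Model model, int b)
{
    double n = std::pow(2.0, b);               // 2^b
    switch (model) {
    case Model::Identity:                      // uniform on {0, ..., 2^b - 1}
        return { (n - 1.0) / 2.0, std::sqrt((n * n - 1.0) / 12.0) };
    case Model::ZeroValue:                     // 0 iff z = 0, else 1
        return { (n - 1.0) / n, std::sqrt((n - 1.0) / (n * n)) };
    case Model::HammingWeight:                 // sum of b independent fair bits
    default:
        return { b / 2.0, std::sqrt(b / 4.0) };
    }
}
```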
4.4 Higher Order Preprocessing on Graphics Cards

The higher-order preprocessing of traces requires the computation of a combination function comb(x_i, x_j). In the case of second order this can be seen as the vector-vector product

\[ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_S \end{pmatrix} \circ (x_1, x_2, x_3, \dots, x_S) = \begin{pmatrix} x_1 \circ x_1 & x_1 \circ x_2 & x_1 \circ x_3 & \dots & x_1 \circ x_S \\ x_2 \circ x_1 & x_2 \circ x_2 & x_2 \circ x_3 & \dots & x_2 \circ x_S \\ x_3 \circ x_1 & x_3 \circ x_2 & x_3 \circ x_3 & \dots & x_3 \circ x_S \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_S \circ x_1 & x_S \circ x_2 & x_S \circ x_3 & \dots & x_S \circ x_S \end{pmatrix}, \qquad (4.16) \]

replacing the inherent multiplication by the combination operation ◦. The kernel function behaves analogously to the leakage model creation kernel, cf. Section 4.3.1. Each row is computed among several threads within a thread block, where a single block loads its corresponding sample x_i and its threads combine this sample with all the other samples. Eventually, only the upper, respectively lower, triangular matrix needs to be stored in global memory; the computational arithmetic effort, however, is very low. In the special case that only comb(x_i, x_i) is reasonable, the computation can be straightforwardly shifted into the computation of the correlation coefficient sums (Alg. 4.2).
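A minimal kernel for this preprocessing step could look as follows. The concrete combination function is left open in the text, so the product used here is merely a placeholder; the packed upper-triangular output layout is likewise our own choice.

```cuda
#include <cuda_runtime.h>

// Example combination function; the concrete choice (product, absolute difference,
// centered product, ...) depends on the attack and is not fixed by the text.
__device__ __forceinline__ float comb(float a, float b) { return a * b; }

// Second-order preprocessing of one trace: block blockIdx.x owns sample x_i and its
// threads combine it with all samples x_j, j >= i, storing only the upper triangle
// in a packed row-major layout of size S*(S+1)/2.
__global__ void second_order_kernel(const float* __restrict__ trace,
                                    float* __restrict__ out_upper,
                                    int S)
{
    int i = blockIdx.x;
    if (i >= S) return;
    float xi = trace[i];
    // offset of row i in the packed upper-triangular storage
    long long row_off = (long long)i * S - (long long)i * (i - 1) / 2;
    for (int j = i + threadIdx.x; j < S; j += blockDim.x) {
        out_upper[row_off + (j - i)] = comb(xi, trace[j]);
    }
}
```

Launching with one block per sample, e.g. second_order_kernel<<<S, 256>>>(d_trace, d_out, S), mirrors the one-row-per-block scheme of Section 4.3.1.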

4.5 Experimental Results

For our experiments we used an nVidia Tesla C2070 graphics card with 6 GiB VRAM and an Intel Xeon X5660 at 2.8 GHz running a 64-bit Windows 7. The results were obtained using the CUDA toolkit and Software Development Kit (SDK) of version 3.2, the CUDA driver 270.61, and the Microsoft Visual C++ compiler. We presume attacking a sequential byte-granularity implementation of AES [Nat01] to recover one sub-key. The leakage model here does not really matter because all model values would be stored in a single-precision variable. In order to obtain meaningful results the inputs are composed of byte values that are randomly chosen and uniformly distributed. Since

in most cases employing a DSO with a vertical resolution of 8 bit suffices, we set X ∈_R [0, 2^8 − 1]^{M×S}. First of all the best kernel configuration has to be figured out, more precisely the thread and thread block numbers. For the leakage creation and the correlation coefficient kernel it is quite obvious that they have to be launched with P thread blocks (one block per key hypothesis). Regarding the correlation coefficient sum kernel two constraints show up: the maximum number of threads a thread block can take and the hence allocatable shared memory. Actually, we can have at most 1,024 threads per thread block on our graphics card. Thus the tile size could be n_max = ⌊√1024⌋ = 32, but due to the restricted shared memory and the horizontal tiles (one tile computes more than one portion of X) we need to find the optimal trade-off. Through empirical testing, it turned out that the kernel performs best with n = 28, that is n² = 784 threads per block and four horizontal tiles, while just not exceeding the available shared memory. The number of tiles (thread blocks) can be obtained by simply dividing both dimensions by n, namely the number of key hypotheses P and the number of given samples S. Another limitation is the total global memory M_global which accommodates the measurement matrix, the input vector, and the five correlation coefficient sums, respectively three in case of estimation of the leakage model sums. Presuming single-precision variables for the sums and 8-bit variables for the samples and inputs, the global memory usage consists of M · S bytes for the measurement matrix, M bytes for the inputs, P · S · 4 bytes for the covariance, and 2 · P · 4 bytes, respectively 2 · S · 4 bytes, for the variances and means. Eventually, we obtain the inequality

\[ MS + M + 4PS + 8P + 8S < M_{\mathrm{global}}. \qquad (4.17) \]

The runtime was then measured in steps of 10,000 traces with the number of samples fixed to 20,000. Figure 4.3 shows the results for the following flavors of the sums kernel:

• the kernel is given the precomputed leakage model (corr. coeffs. sums leakage model, cf. Sec. 4.3.1),

• the kernel is given the input vector directly (corr. coeffs. sums input vector, cf. Sec. 4.3.1),

• and further both variants with the leakage estimation (corr. coeffs. sums leakage model w/ estimation and corr. coeffs. sums input vector w/ estimation, cf. Sec. 4.3.4).

Additionally, we report the required time for the data transfer between the host memory and the device memory (CUDA memory copy). As can be seen, the kernel applying the leakage estimation generally performs better than the flavors without it. Further, it is evident that the effort in terms of runtime increases linearly with the number of traces. It is not worthwhile to compute the leakage model beforehand, which is most likely a penalty caused by the frequent global memory accesses. Also interesting is the fact that we get the same runtime results if we fix the number of traces and iteratively increase the number of samples instead. This in turn means that the measurement matrix can be cut at any point to make it fit the graphics card's global memory in the case of very large measurements. Thus we can state that the effort increases linearly with both dimensions, the number of traces and the number of samples, and therefore we conclude that the implementation scales perfectly with additional graphics cards.


[Plot: runtime in seconds versus number of traces (×1,000) for the flavors corr. coeffs. sums leakage model, corr. coeffs. sums input vector, both with and without estimation, and the CUDA memory copy.]

Figure 4.3: Runtime of the correlation coefficient sums kernel flavors. The kernel can either be given the leakage model or the input vector directly, and the leakage model sums can either be accurately calculated or estimated.

The runtimes of the other two mandatory kernels, namely leakage model creation and correlation coefficient computation, are absolutely negligible, being in the range of a few milliseconds. This comes as no surprise since the elements of the obtained matrices are independent of each other, in marked contrast to the sums kernel for which thread synchronization is vital due to the usage of shared memory. The same statement holds for the overhead, i.e., transferring data from the host memory to the device memory and vice versa, and, last but not least, the kernel launches. However, data transfers depend on the data, and thus the required time increases linearly with the byte count (cf. Fig. 4.3). The influence of Input/Output (I/O) transactions, i.e., loading the traces from a Hard Disk Drive (HDD), is not considered because it affects a pure CPU implementation in the same way. Ultimately, we provide a comparison between the CPU and GPU of our system. We implemented and optimized the sums kernel with leakage estimation on the CPU, which does exactly the same as its GPU counterpart. Table 4.1 summarizes the results.

Table 4.1: Runtime comparison between CPU and GPU where one thread runs the CPU implementation. The number of samples is fixed to 20,000.

        10k traces    20k traces    50k traces    100k traces
GPU     0.774 s       1.545 s       3.861 s       7.733 s
CPU     302.72 s      622.11 s      1531.69 s     3152.21 s

As expected, the CPU implementation also scales linearly with the number of traces. Hence, we can derive a speed-up of almost 100 when taking a processor with four cores into account and assuming that the performance ideally multiplies by four. The respective efficiency is calculated as 62.95 × 10⁹ Samples/(GPU · Model · s) for the GPU implementation, respectively 0.162 × 10⁹ Samples/(CPU · Model · s) for the CPU implementation.

4.6 Conclusion

We proposed a highly performant implementation of Correlation Power Analysis and preprocessing on graphics cards. Our implementation can handle arbitrarily large measurement matrices, which can be split up at any point to make them fit into the graphics card's memory, respectively into the memory of several graphics cards for further acceleration. As the most important result, very large measurements can be analyzed within a few minutes instead of hours or days. This directly gives rise to more thorough security evaluations, since the analysis time of the various CPA scenarios is for sure the largest bottleneck.

Chapter 5

Yet Another Robust and Fast Implementation of Template Based Differential Power Analysis

Profiled side-channel cryptanalysis has, besides CPA, a high standing during security assessments; especially template attacks are considered to be the most important branch. As originally proposed, template attacks based on the multivariate normal distribution are still very common. Problematic, however, is implementing them in a numerically stable way. We show how both can be achieved, a numerically robust and a fast implementation, making use of a parallelized computation exclusively performed on a graphics card.

Contents of this Chapter

5.1 Introduction . . . . . . 53
5.2 Template Attacks Based on the Multivariate Normal Distribution . . . . . . 54
5.3 Graphics Cards Accelerated Template Attacks Based on the Multivariate Normal Distribution . . . . . . 55
5.4 Conclusion . . . . . . 59

5.1 Introduction

Template based DPA attacks, or TAs for short, are de facto the second anchor in security evaluations. If an implementation shows no vulnerabilities against CPA, it does not mean that there exist no vulnerabilities against TAs. Originally proposed in [CRR03], TAs based on the Multivariate Normal Distribution (MVND) are nowadays conducted by means of millions of traces. The amount of traces is shared between the profiling and characterization phase, and both can benefit from an increasing number. Therefore both phases are highly time consuming; aggravating this is the fact that numerical instabilities may arise. The latter are likely to occur during the characterization phase, caused by the inversion of a not properly conditioned covariance matrix.

5.1.1 Contribution

Following the approach we proposed in Chapter 4 we make use of the advanced techniques offered by CUDA to obtain a highly performant implementation of MVND TAs. Our approach

which is depicted in Section 5.3 is manifold: the profiling phase is performed exclusively on the GPU side in a single run, while the characterization phase is split into three GPU portions, whereof the first deals with the robust inversion of the covariance matrix and the two further portions deal with the evaluation of the MVND PDF. Analogously to our proposed CPA implementation, the MVND TA implementation scales extremely well with arbitrarily sized measurement matrices.

5.2 Template Attacks Based on the Multivariate Normal Distribution

In this section we will shortly recall the template based DPA approach. The important first step is to select points in time, often called POIs or features, that are supposed to contain the entire or at least a considerable proportion of the side-channel leakage information. Afterwards, templates are built involving the power consumption or electro-magnetic emanation of a reference device, similar to the target device, that is under the full control of the attacker. Eventually, the attack is carried out through matching freshly acquired traces from the target device against the built templates.

Feature Selection

The selection of points of interest within power traces is the first issue in TAs one is concerned with. There are several methods to obtain a set of points that could lead to a successful attack. Primarily, the points are selected according to their key-dependent variability, including known-key DPA [RO05], pair-wise distance to the mean vectors [CRR03], or using the sum of squared pair-wise T-differences [GLRP06]. A more systematic approach is the principal subspace-based TA, where Principal Component Analysis (PCA) is applied to transform the recorded side-channel data into a low-dimensional subspace, figuring out the optimal linear combination of points in time which show maximum variance with respect to the side-channel leakage [APSQ06]. Since there exist several methods, not limited to the methods mentioned above, we assume for the rest of this chapter that feature selection has already been done and the set of POIs has been figured out adequately. Hence, a corresponding implementation is out of scope in this context.

Profiling Phase — Template Building

In the first phase templates are built according to P selected POIs from several recorded traces X_i^{v,k} ∈ X that are correlated to a function which involves both known input data v and a key k, respectively a sub-key thereof. The traces are assumed to be drawn from a Multivariate Normal Distribution (MVND) N(X_i^{v,k} | μ_{v,k}, Σ_{v,k}). Therefore, a single template, denoted τ_{v,k}, is equivalent to an estimation of the mean μ_{v,k} and the covariance matrix Σ_{v,k} based on the selected POIs, referring to different pairs of (v, k).


Characterization Phase — Template Matching

In the second phase one is given a new, yet uncharacterized trace X^{new}_{v,k°} which is evaluated through the MVND PDF, such that

\[ \mathrm{Prob}(X^{new}_{v,k^\circ} \mid \tau_{v,k}) = \frac{1}{\sqrt{(2\pi)^P \, |\Sigma_{v,k}|}} \exp\left( -\frac{1}{2} (X^{new}_{v,k^\circ} - \mu_{v,k})^\top \Sigma_{v,k}^{-1} (X^{new}_{v,k^\circ} - \mu_{v,k}) \right). \qquad (5.1) \]

The maximum likelihood approach provides the best fit, hence the higher the probability density the better the trace x fits the respective template. In order to avoid numerical problems in practice, mainly due to the inversion of Σv,k, one can omit the covariances (off-diagonal elements of Σv,k) to obtain so called reduced templates [MOP08].

Key Recovery Attack

In practice it is often not sufficient to recover the key, or sub-key, based on a single characterized trace; instead a few traces are needed. Usually, we apply Bayes' theorem [MOP08]

\[ \mathrm{Prob}(\tilde{k} \mid X^{new}_{v,k^\circ}) = \frac{\mathrm{Prob}(X^{new}_{v,k^\circ} \mid \tau_{v,\tilde{k}}) \cdot \mathrm{Prob}(\tilde{k})}{\mathrm{Prob}(X^{new}_{v,k^\circ})} \qquad (5.2) \]

in order to determine by means of which key, or sub-key, the traces to be characterized were generated. Here, Prob(k̃) is the prior probability and Prob(k̃ | X^{new}_{v,k°}) the posterior probability of each key candidate k̃.
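As a sketch of how Equation 5.2 is typically applied over several characterized traces, the per-trace log-likelihoods can be accumulated per key candidate; assuming a uniform prior, the prior and the normalization constant can be dropped. The container layout and function name are our own illustration.

```cuda
#include <vector>
#include <cmath>
#include <algorithm>

// Combine the per-trace template probabilities of Eq. 5.2 over several characterized
// traces in the log domain and return the key candidate with the highest accumulated
// posterior.  log_probs[t][k] = Ln(Prob(X_t^new | tau_k)).
int most_likely_subkey(const std::vector<std::vector<double>>& log_probs)
{
    if (log_probs.empty()) return -1;
    size_t num_candidates = log_probs.front().size();
    std::vector<double> acc(num_candidates, 0.0);         // uniform prior cancels out
    for (const auto& per_trace : log_probs)
        for (size_t k = 0; k < num_candidates; ++k)
            acc[k] += per_trace[k];                        // product of likelihoods in log domain
    return (int)(std::max_element(acc.begin(), acc.end()) - acc.begin());
}
```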

5.3 Graphics Cards Accelerated Template Attacks Based on the Multivariate Normal Distribution

In the template building phase we are required to compute the covariance matrices and mean vectors of the recorded (M × S) measurement matrix X. Here, X has already been subjected to a feature selection, by which means each trace only contains samples that have been figured out as POIs. The measurement matrix corresponds to an (M × U)-matrix V that consists of usually different (randomly chosen and uniformly distributed) inputs for each trace, where v_{i,l} is the l-th input used for trace i and U is the portion count of the plaintext. Each mean and covariance matrix is computed with respect to different pairs of (v, k). That is, we need to compute the sums

\[ \sum_{\forall (v,k), \forall i} x_{i,j_1} x_{i,j_2}, \quad \sum_{\forall (v,k), \forall i} x_{i,j_1}, \quad \text{and} \quad \sum_{\forall (v,k), \forall i} x_{i,j_2} \qquad (5.3) \]

that belong to the same pair (v, k). The indices j_1 and j_2 denote the time-instants within trace i in row-major order. In practice, the pair (v, k) is usually replaced by a leakage model function M(f_target(v, k)), denoted as a class in the following, that takes v and k as input and approximates the side-channel leakage of a sensitive variable f_target(v, k). We refer to Section 2.4 and Section 4.3.4 where we detail different leakage models. Clearly, the approach proposed for the computation

of the Pearson correlation coefficient as shown in Figure 4.2 can be reused by simply replacing the hypothetical leakage matrix Y with X^⊤. The adapted kernel is shown in Algorithm 5.1. Moreover, the arising covariance matrix is symmetric and positive definite, meaning that only either the upper or lower triangular matrix needs to be computed; thus also the tiles only need to cover one of the two triangles. In summary, the profiling phase, in the shape of estimating the mean vectors and covariance matrices with respect to different leakage classes, can be done very efficiently on graphics cards. The characterization phase on the other hand mainly consists of matrix-matrix-matrix products, namely the Mahalanobis distance (X^{new} − μ_c)^⊤ Σ_c^{−1} (X^{new} − μ_c) (cf. Eq. 5.1), where X^{new} is a measurement matrix with traces to be characterized and μ_c and Σ_c^{−1} are the mean vector, respectively the inverted covariance matrix, corresponding to a leakage class. The most important part here is surely the inversion of the matrices Σ_c. Besides that, the determinants |Σ_c| need to be computed additionally. It therefore makes sense to utilize an approach that provides both results at a time. The Gauss-Jordan elimination is a common approach used in practice; however, it is not optimal for several reasons. In general it needs twice the memory space that is demanded by the matrix to invert, although modified variants were proposed that work in-place [Das13] for constrained cases. Further, it always works on the entire matrix since it turns the original matrix into the identity matrix. The latter is a numerically unstable process.

The matrix inversion is, however, supposed to be inefficient on the GPU for smaller dimensions (smaller than 10,000 × 10,000). A potential inefficiency is mainly caused by the strict dependencies between the matrix's rows, which require frequent accesses to the global memory confronted with a low arithmetic processing effort. Admittedly, the covariance is a symmetric positive definite matrix, which is much easier to invert, but a recent work [BQOER11] indicates that even that kind of inversion is not efficient on the GPU. Nevertheless, we present our approach below, which makes use of a technique as proposed in [BG11], namely n-staged One-Block-Per-Associated-Rows (OPBA). This technique can be used whenever a more or less strict row dependency shows up during matrix operations. In this chapter we follow an approach to achieve the inversion that can be briefly summarized by the following steps.

(1) The covariance matrix is decomposed into a lower triangular matrix representation.

(2) The determinant is calculated from the triangle.

(3) The triangle is inverted.

(4) The inverted triangle is multiplied with its transposed to give the inverted covariance matrix.

First, we apply the Cholesky decomposition to represent the covariance matrix as the product of two lower triangular matrices, namely Σ_c = ∆_c × ∆_c^⊤. Please note that we previously used the term triangular matrix to denote the upper or lower portion of the covariance matrix; those are not equivalent to the decomposed factor ∆_c of the covariance matrix, although the same notion is used. The Cholesky decomposition only works on the lower triangular portion, which comes in favor of our profiling approach. Further, it works fully in-place, meaning that no additional memory space is needed.


Algorithm 5.1 MVND Based Template Building
Input: (1) Measurement matrix X, (2) input vector V ∈ V, (3) leakage model M(f_k(•))
Output: (1) Triangle of covariance matrices Σ_c, (2) mean vectors μ_c
1: for each block parallel do
2:   for each thread parallel do
3:     i ← thread position within column vectors of X
4:     j ← thread position within row vectors of X^⊤
5:     prefetch first tile of X and first horizontal tiles of X^⊤ into registers: x_i, x_j
6:   end for parallel
7: end for parallel
8:
9: m_c[#classes] ← initialize class counter array variable with zero
10: for r = 1 to M/n do
11:   for each block parallel do
12:     for each thread parallel do
13:       deposit prefetched tiles into shared memory: X^shared[tIdx.x, tIdx.y] = x_i, (X^⊤)^shared[tIdx.x, tIdx.y] = x_j
14:       prefetch next tiles into registers: x_{i+r}, x_{j+r}
15:       compute leakage class: c = M(f_target(v, k))
16:       increment class counter: m_c[c] = m_c[c] + 1
17:       for l = 1 to n do
18:         Σ_thread x_{i,j1} x_{i,j2}[c] = Σ_thread x_{i,j1} x_{i,j2}[c] + X^shared[tIdx.x, l] · (X^⊤)^shared[l, tIdx.x]
19:         Σ_thread x_{i,j1}[c] = Σ_thread x_{i,j1}[c] + X^shared[tIdx.x, l]
20:         Σ_thread x_{i,j2}[c] = Σ_thread x_{i,j2}[c] + (X^⊤)^shared[l, tIdx.x]
21:       end for
22:     end for parallel
23:   end for parallel
24: end for
25: ∀c: μ_{j1}[c] = (1/m_c[c]) Σ_thread x_{i,j1}[c],  μ_{j2}[c] = (1/m_c[c]) Σ_thread x_{i,j2}[c]
26: ∀c: Cov_{j1,j2}[c] = (1/m_c[c]) Σ_thread x_{i,j1} x_{i,j2}[c] − μ_{j1}[c] · μ_{j2}[c]

The determinant can then simply be obtained as the product of the squared diagonal elements, thus |Σ_c| = ∏_{∀i} (∆_c)_{i,i}². For the sake of robustness a sum of logarithmized values is to be preferred, therefore Ln(|Σ_c|) = 2 · Σ_{∀i} Ln((∆_c)_{i,i}). Now, having the decomposed lower triangle, we can efficiently establish its inverse ∆_c^{−1} by basically using dot products. In the final step we obtain Σ_c^{−1} by means of (∆_c^{−1})^⊤ × ∆_c^{−1}. Since Σ_c^{−1} is symmetric again, the latter process only needs to compute a triangular portion, but without in-place capability.

The matrix-matrix-matrix products (Mahalanobis distance) as mentioned above have to be processed by two consecutive kernels. The reason behind this is the circumstance that we can compute the first matrix-matrix product (X^{new} − μ_c)^⊤ × Σ_c^{−1} with our usual approach (cf. Alg. 5.1). The subsequent matrix-matrix product (X^{new} − μ_c)^⊤ Σ_c^{−1} × (X^{new} − μ_c), however, cannot be computed at the same time within the same kernel because it does not load the correct tiles.
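The Cholesky-based determinant and inversion steps can be illustrated by the following plain host-side sketch (the thesis performs them on the GPU with the OPBA technique); it assumes a positive definite input in row-major storage, and the function names are our own.

```cuda
#include <vector>
#include <cmath>

// In-place Cholesky decomposition Sigma = Delta * Delta^T of a symmetric positive
// definite P x P matrix (lower triangle used), returning Ln(|Sigma|) as the sum of
// the logarithmized squared diagonal elements, as described above.
double cholesky_logdet(std::vector<double>& A, int P)
{
    double logdet = 0.0;
    for (int j = 0; j < P; ++j) {
        double d = A[j * P + j];
        for (int k = 0; k < j; ++k) d -= A[j * P + k] * A[j * P + k];
        d = std::sqrt(d);                       // assumes positive definiteness
        A[j * P + j] = d;
        logdet += 2.0 * std::log(d);
        for (int i = j + 1; i < P; ++i) {
            double s = A[i * P + j];
            for (int k = 0; k < j; ++k) s -= A[i * P + k] * A[j * P + k];
            A[i * P + j] = s / d;
        }
    }
    return logdet;
}

// Inverts the lower triangular factor Delta in place (column-wise forward substitution);
// Sigma^{-1} then follows as (Delta^{-1})^T * Delta^{-1}.
void invert_lower_triangular(std::vector<double>& L, int P)
{
    for (int j = 0; j < P; ++j) {
        L[j * P + j] = 1.0 / L[j * P + j];
        for (int i = j + 1; i < P; ++i) {
            double s = 0.0;
            for (int k = j; k < i; ++k) s += L[i * P + k] * L[k * P + j];
            L[i * P + j] = -s / L[i * P + i];
        }
    }
}
```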


Algorithm 5.2 MVND Based Template Characterization I
Input: (1) Measurement X^new, (2) triangle of covariance matrices Σ_c^{−1}, (3) mean vectors μ_c
Output: Sums of the Mahalanobis distance (X^new − μ_c)^⊤ Σ_c^{−1} [(X^new − μ_c)]^{n×n}
1: for each block parallel do
2:   for each thread parallel do
3:     i ← thread position within column vectors of X^new
4:     j ← thread position within row vectors of Σ^{−1}
5:     prefetch first tile of X^new, first sub-vector of μ_c, and first horizontal tiles of Σ_c^{−1} into registers: x_i^new, (μ_c)_i, (σ_c^{−1})_j
6:   end for parallel
7: end for parallel
8:
9: for r = 1 to M/n do
10:   for each block parallel do
11:     for each thread parallel do
12:       deposit prefetched tiles into shared memory: (X^new)^shared[tIdx.x, tIdx.y] = x_i^new, μ_c^shared[tIdx.x] = (μ_c)_i, (Σ_c^{−1})^shared[tIdx.x, tIdx.y] = (σ_c^{−1})_j
13:       prefetch next tiles into registers: x_{i+r}^new, (μ_c)_{i+r}, (σ_c^{−1})_{j+r}
14:       store mean-free tile: (X_{μc}^new)^tile[tIdx.x, tIdx.y] = x_i^new − (μ_c)_i
15:       for l = 1 to n do
16:         Σ_thread (x_i^new − (μ_c)_i)(σ_c^{−1})_j = Σ_thread (x_i^new − (μ_c)_i)(σ_c^{−1})_j + ((X^new)^shared[tIdx.x, l] − μ_c^shared[tIdx.x]) · (Σ^{−1})^shared[l, tIdx.y]
17:       end for
18:     end for parallel
19:   end for parallel
20: end for
21: for each block parallel do
22:   for l = 1 to n do
23:     Σ_block (x_i^new − (μ_c)_i)(σ_c^{−1})_j (x_{μc}^new)^tile = Σ_l (x_i^new − (μ_c)_i)(σ_c^{−1})_j · (X_{μc}^new)^tile[tIdx.x, l]
24:   end for
25: end for

Algorithm 5.3 MVND Based Template Characterization II
Input: (1) Sums of the Mahalanobis distance (X^new − μ_c)^⊤ Σ_c^{−1} [(X^new − μ_c)]^{n×n}, (2) Ln(|Σ_c|)
Output: Logarithmized posterior probabilities Ln(Prob(X^new | c))
1: for each block parallel do
2:   for each thread parallel do
3:     (X^new − μ_c)^⊤ Σ_c^{−1} (X^new − μ_c) = Σ_{∀tiles} Σ_block (x_i^new − (μ_c)_i)(σ_c^{−1})_j (x_{μc}^new)^tile
4:   end for parallel
5: end for parallel
6: Ln(Prob(X^new | c)) = −(1/2) · (P · Ln(2π) + Ln(|Σ_c|) + (X^new − μ_c)^⊤ Σ_c^{−1} (X^new − μ_c))


If we recall Figure 4.2 and replace Y^⊤ with X^new and the profiling measurement matrix X with Σ_c^{−1}, which now exactly embodies the first matrix-matrix product, we recognize that the necessary lower rows, of X^new here, are not loaded for the exemplarily conducted tile processing. Luckily, we can already make use of the elements of X^new which are loaded by the tile, such that we can at least partially perform the second matrix-matrix product. More precisely, we are capable of computing intermediate sums that save us the multiplication, such that we only need to add up the intermediate sums in the mentioned second kernel. Going back to our example in Figure 4.2 and having a look at the tile stressed in X^new, we obtain the intermediate sums X_1^new(Σ_c^{−1})_1 x_{1,1}^new + X_1^new(Σ_c^{−1})_2 x_{2,1}^new and X_2^new(Σ_c^{−1})_1 x_{1,2}^new + X_2^new(Σ_c^{−1})_2 x_{2,2}^new. See Algorithm 5.2 for the kernel code. Since the template characterization can be conducted independently for each leakage class, we can extend the execution configuration (cf. Sec. 3.4.3) and use a three-dimensional grid wherein the z-component equals the leakage class count. The second kernel now adds up the intermediate sums and converts the resulting Mahalanobis distance measure into a probability according to Equation 5.1. Analogously to the former kernel, this kernel is also called through a three-dimensional grid. After the kernel has been called we are provided with the posterior probabilities, such that each leakage class is assigned to each of the new traces with a certain probability. The kernel's code is shown in Algorithm 5.3. To conduct a key recovery attack we can now iterate over the traces with respect to their leakage class affiliation and apply Bayes' theorem (Eq. 5.2). At this point we are not going to present performance numbers. It is obvious that the performance, or rather the preference over a pure CPU based implementation, depends on the number of traces but also on the number of points of interest. If the latter is very small then a large performance gain is unlikely.

5.4 Conclusion

Analogously to our approach in Chapter 4, we proposed a highly performant implementation of template based differential power analysis through the multivariate normal distribution on graphics cards. It can handle arbitrarily large measurement matrices, which can be split up at any point to make them fit into the graphics card's memory, respectively into the memory of several graphics cards for further acceleration. However, the performance gain is closely related to a larger number of points of interest.


Part III

New Methods

Chapter 6 Separating in Favor of Matching: Profiled Side-channel Attacks Through Support Vector Machines

Template attacks relying on the multivariate normal distribution characterize, i.e., estimate, the leaked information. Feature selection, better known as the selection of POIs, is a fundamental requirement to identify points in time that contribute most information to an attack. However, numerical problems are likely in practice due to the evaluation in higher dimensions and severe noise. We propose an approach which separates the leaked information, based on SVMs, with advanced feature selection and considerably reduced effort under the assumption of a strict order attack model. Parts of this chapter have been previously published in [BLR13] available at link.springer.com.

Contents of this Chapter

6.1 Introduction . . . . . . 63
6.2 Binary Support Vector Machines . . . . . . 64
6.3 Template Attacks using Support Vector Machines . . . . . . 66
6.4 Experimental Results . . . . . . 70
6.5 Conclusion . . . . . . 73

6.1 Introduction

Previous works investigated potential alternatives to the MVND based approach. Machine learning in the shape of SVMs is one of these promising alternatives. SVMs belong to the linear binary classifiers that decide to which of two classes an input vector belongs, based on classified training data. Further, the SVM does not depend on a certain noise distribution. [LBM11] focused on SVMs amongst other machine learners and presents attacks that predict key bits of a DES software implementation. In [HMG+11, HGM+11] the authors exclusively focused on SVMs, concentrating on the applicability for template attacks; however, they did not provide an attack. In [HZ12] the authors extended the SVM approach from a single-bit to a multi-bit model and consequently introduced a probabilistic multi-class SVM approach. They showed that the SVM based template attack outperforms the MVND approach in the presence of non-negligible noise. Nevertheless, their approach did not take account of characteristics that lead to a considerably reduced effort and missed the utilization of a strong inherent feature selection.


6.1.1 Contribution

In Section 6.3 we take the SVM template attack approach a leap forward by demonstrating how a tailored multi-class strategy can considerably reduce the effort during the profiling and characterization phase. Further, we show how to make TAs more powerful through the introduction of an SVM-dedicated feature selection. Ultimately, we compare our approach against several other TA approaches and show some benefits of separating side-channel leakage over estimating it.

6.2 Binary Support Vector Machines

Support Vector Machines (SVMs) are suitable for solving classification, regression, and pattern detection problems and belong to the category of sparse kernel machines. Originally described in [Vap95], SVMs are non-probabilistic, linear, binary-class decision-makers whose output is a class label. SVMs belong to the supervised learning methods, and the determination of their model parameters corresponds to a convex optimization problem. This section follows the explanations in [Bis06].

6.2.1 Mathematical Background of Binary SVMs

A binary classification problem can be described by a linear discriminant function of the form

\[ y(x) = w^\top x + b \qquad (6.1) \]

where w is a weight vector and b is a bias. An input vector x is assigned to class C₋ if y(x) < 0 and to C₊ otherwise. Hence, the decision boundary corresponds to the (D − 1)-dimensional hyperplane within the D-dimensional input space for which y(x) = 0. Since w^⊤(x₁ − x₂) = 0 for any two points x₁, x₂ lying on the decision boundary, w is orthogonal to every vector lying within the decision boundary and is therefore the normal vector of the hyperplane. By the same token, the bias satisfies b = −w^⊤x for any x on the decision boundary. Suppose the training set consists of n input vectors (row vectors) x₁, ..., xₙ (vectors with an index belong to the training set in the remainder of this chapter), where each vector is associated with a class label c_i ∈ {−1, 1}. Unlabeled vectors x are accordingly classified through the sign of y(x). For the moment it is assumed that the training set is linearly separable within the input space D, which means we can find a pair (w, b) such that (Eq. 6.1) satisfies c_i y(x_i) > 0 for all training vectors. That is, every training vector x_i is correctly classified. Obviously, we can find several pairs that separate the training set exactly, but not every solution will give the smallest generalization error [Bis06], which states the goodness of classifying unlabeled vectors. With SVMs this is solved by introducing the margin concept, which embodies the smallest distance between the decision boundary and any of the input vectors (Fig. 6.1). The best solution is given by the pair (w, b) for which this margin is maximized. The orthogonal distance of a vector x to the hyperplane is given by y(x)/‖w‖ (‖•‖ denotes the Euclidean norm), and under the general constraint c_i y(x) > 0 the maximum margin can be achieved by finding

\[ \arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_i \left[ c_i (w^\top x_i + b) \right] \right\}. \qquad (6.2) \]



Figure 6.1: Geometry of the separating hyperplane in support vector machines.

Finding a direct solution would be too complex. Nevertheless, rescaling of w and b does not affect the distance from any input vector to the hyperplane, thus we set

\[ c_i (w^\top x_i + b) \ge 1 \quad \text{for } i = 1, \dots, n, \qquad (6.3) \]

which means that for vectors located on the margin around the decision boundary the equality holds. Consequently, the optimization problem has been reduced to maximizing ‖w‖⁻¹, which is, without proof, equivalent to minimizing ‖w‖². The latter is a quadratic programming problem that can be solved by application of the method of Lagrange multipliers a_i with respect to the Karush-Kuhn-Tucker (KKT) conditions [Bis06]. The optimal solution of such a Lagrangian optimization problem yields a representation of (Eq. 6.1), s.t. w = Σ_{i=1}^{n} a_i c_i x_i, where the vectors x_i for which a_i > 0 are called support vectors. Hence, to classify unlabeled vectors x we obtain

\[ y(x) = \sum_{i \in S} a_i c_i x_i^\top x + b \quad \text{where} \quad b = \frac{1}{n_S} \sum_{i \in S} \Bigl( c_i - \sum_{j \in S} a_j c_j x_j^\top x_i \Bigr) \qquad (6.4) \]

and S is the set of indices of the support vectors, respectively n_S the number of support vectors.
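For the linear kernel used later in this chapter, evaluating Eq. 6.4 for an unlabeled trace reduces to a weighted sum of dot products with the support vectors. The following host-side sketch illustrates this; names and containers are our own and not the implementation referenced in Section 6.4.

```cuda
#include <vector>
#include <cstddef>

// Evaluation of the linear decision function of Eq. 6.4, given the support vectors,
// their labels c_i in {-1, +1}, the Lagrange multipliers a_i, and the bias b.
double svm_decision(const std::vector<std::vector<double>>& support_vectors,
                    const std::vector<int>& labels,
                    const std::vector<double>& alphas,
                    double bias,
                    const std::vector<double>& x)
{
    double y = bias;
    for (size_t i = 0; i < support_vectors.size(); ++i) {
        double dot = 0.0;                        // linear kernel: x_i^T x
        for (size_t t = 0; t < x.size(); ++t)
            dot += support_vectors[i][t] * x[t];
        y += alphas[i] * labels[i] * dot;
    }
    return y;                                    // sign(y) gives the class; Eq. 6.5 turns it into a probability
}
```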

6.2.2 Non-linear Classification: Introduction of a Kernel

So far we assumed that the training set is linearly separable within the input space D. If this assumption does not hold, the training set might be separable in a higher dimensional feature space F > D (not to be confused with feature selection, i.e., POI selection). Therefore, the input vectors are transformed into that feature space through Φ(x), which gives a vector product of the form Φ(x₁)^⊤Φ(x₂) for x₁, x₂ ∈ R^D. A direct solution is computationally very intensive. Hence, a kernel function is applied that behaves exactly like the vector product in F without even knowing the concrete feature space; on the downside, however, also without having influence on the arising dimension.


6.2.3 Non-separable Case: Introduction of a Soft-margin

A linear separation, no matter whether in the input space or in the feature space, can lead to a poor generalization (large generalization error) in the case of overlapping class distributions. In this case, there might be no getting around misclassifying some training vectors to achieve a separation at all. To do so, slack variables and a trade-off parameter γ, often denoted box constraint, are introduced to penalize misclassification, resulting in two kinds of support vectors [Bis06]: those for which a_i < γ, which lie on the margin, and those for which a_i = γ, which are either correctly classified but inside the margin or misclassified anyhow. For γ → ∞ the penalty prohibits misclassification and thus recovers the original strict margin.

6.2.4 SVM Training and Classification

Training a support vector machine means solving the Lagrangian optimization problem for the given training set. There exists a specific approach called Sequential Minimal Optimization (SMO) [Pla99a] that breaks down the optimization problem into many small problems. From a wider viewing angle, the Lagrange multipliers are jointly optimized under a linear equality constraint, but merely two Lagrange multipliers are considered at a time. For this purpose two multipliers are chosen heuristically. The SVM classification is afterwards done through the evaluation of Equation 6.4, making use of the parameters from the SMO training, namely the Lagrange multipliers a_i (and bias b).

6.3 Template Attacks using Support Vector Machines

The approach is similar to common template attacks. The posterior key probabilities are successively updated with each characterization trace applying Bayes' theorem. Since TAs are a multi-class classification problem, it is essential to turn the actual binary SVM into a probabilistic multi-class SVM. Contrary to previous work [HZ12], which follows a general probabilistic approach, we subsequently present probabilistic multi-class SVMs tailored to fit template attacks.

6.3.1 Probabilistic Support Vector Machines

In order to use SVMs in an aggregated probabilistic approach, probabilistic decisions on the class label c for unlabeled vectors are necessary. Maintaining the sparseness property of SVMs, it is proposed in [Pla99b] to fit a logistic sigmoid function to the outputs y(x) of an already trained binary SVM. This gives rise to the posterior conditional probability

\[ \mathrm{Prob}(c = 1 \mid x) = \frac{1}{1 + \exp(A \cdot y(x) + B)} \qquad (6.5) \]

that x belongs to the class c = 1. Clearly, Prob(c = −1 | x) = 1 − Prob(c = 1 | x). A second training set should be involved to avoid severe overfitting (this will be discussed in Section 6.4).


The parameters A and B are to be found through minimizing the cross-entropy error of the training set, that is

\[ \arg\min_{A,B} \; -\sum_{i=1}^{n} \Bigl[ t_i \log\bigl(\mathrm{Prob}(c = 1 \mid x_i)\bigr) + (1 - t_i) \log\bigl(1 - \mathrm{Prob}(c = 1 \mid x_i)\bigr) \Bigr] \qquad (6.6) \]

where t_i = (sign[y(x_i)] + 1)/2. A way to solve this optimization problem in a numerically stable manner is proposed in [LLW07].

6.3.2 Our probabilistic SVM Multi-class Approach

The related work in [DK05] investigates methods that are founded on two strategies to turn a binary SVM into a multi-class SVM. The one-versus-all strategy uses binary SVMs for separating one class from the joint set of all other classes. The one-versus-one strategy applies binary SVMs for pair-wise distinguishing the classes. In [HZ12] the authors preferred the one-versus-one strategy in order to mount their attacks. In this chapter, we suggest our own method tailored for TAs with regard to the attack model. Suppose that the TA classification scenario implies m distinct classes. Then, given an unlabeled input vector x, we aim for posterior conditional probabilities Prob(Ω_l | x) for l = 1, ..., m, where Ω_l is the l-th class label. The classification is enabled by a training set X̃ where for each training vector x̃_i the correct class label Ω_l is known. From now on, input or training vectors are referred to as traces.

Generally Applicable Assumption of a Side-channel Leakage with Strict Order

We assume that the side-channel leakage leads to splitting the measurement samples into a strict order according to the known multi-class labels at the leakage-dependent points in time.

More precisely, let x_{Ω_l,t} be an arbitrary scalar at point t of a trace that belongs to label Ω_l; then for each leakage-dependent point t we assume a strict ordering of the labels, i.e., either x_{Ω_1,t} < x_{Ω_2,t} < ... < x_{Ω_m,t} or, alternatively, x_{Ω_1,t} > x_{Ω_2,t} > ... > x_{Ω_m,t}. This assumption applies to any side-channel leakage except for the case that two classes are represented by equal scalars. For instance, this could happen where the DUT indeed leaks information in the shape of the Hamming weight; then separating the side-channel leakage by assuming the identity will, of course, lead to equal scalars. Hence, in a preparative step one needs to

(1) validate that each class is representable by a distinct scalar and

(2) arrange the Ω_l, at each leakage-dependent point t, in order of their scalars, s.t. a strict order is obtained.

With this assumption we train m − 1 SVMs using the training set X̃ and introduce binary class helper labels c_{i,j} to the training traces, such that

\[ c_{i,j} = \begin{cases} 1, & \tilde{x}_j \text{ belongs to } \Omega_l \text{ with } l \le i \\ -1, & \text{else} \end{cases} \qquad j = 1, \dots, n, \quad l = 1, \dots, m, \qquad (6.7) \]

whilst the i-th SVM is under training, i = 1, ..., m − 1. These helper labels convert the side-channel leakage classes into binary classes as requested by the SVM classification model. Using


the i-th SVM afterwards to classify a new trace x, we get the class overlapping probabilities Prob(∪_{l=1}^{i} Ω_l | x). That is, the i-th SVM gives the probability that x belongs to the classes before the i-th hyperplane. Figuratively, we construct the hyperplanes between the class labels from the left to the right as depicted in Figure 6.2. Recalling that the probabilities rely on a distance measure between x and the separating hyperplane, each consecutive probability Prob(∪_{l=1}^{i+1} Ω_l | x), Prob(∪_{l=1}^{i+2} Ω_l | x), ..., Prob(∪_{l=1}^{m−1} Ω_l | x) is even higher once x was classified to belong to a class before the i-th hyperplane with a non-negligible probability (cf. Fig. 6.2). This is due to the fact that the distance grows in a positive manner, and thus the probability grows since the binary class separation regarding x becomes even clearer. However, this is of course an undesired result. We can overcome this by training, again, m − 1 SVMs but this time starting with the last class label Ω_m, i.e., using Equation 6.7 with reversed signs. With this approach, going from the left to the right first and then vice versa, we surround the correct class label by two consecutive hyperplanes. In fact, we do not need to train new SVMs since the separating hyperplanes do not change at all. Instead we use the complementary probabilities, shifted by one class due to the surrounding.

[Figure 6.2: For instance, the third support vector machine is trained, which corresponds to the hyperplane separating ∪_{l=1}^{3} Ω_l and ∪_{l=4}^{m} Ω_l. The class labels Ω_1, Ω_2, ..., Ω_m are drawn in a row and the binary class helper labels c_{3,j} for the training traces x̃_j are given on top (1 for the classes before the hyperplane, −1 otherwise). Training traces that belong to the classes before the hyperplane are classified with helper label 1, all others with −1.]

Suppose that x indeed belongs to Ω_i; then x is right-bounded by hyperplane i and left-bounded by hyperplane i − 1 and thus related to the probabilities Prob(∪_{l=1}^{i} Ω_l | x) and 1 − Prob(∪_{l=1}^{i−1} Ω_l | x). Hence, we suggest using

\[ \mathrm{Prob}(\Omega_i \mid x) = \mathrm{Prob}\Bigl(\bigcup_{l=1}^{i} \Omega_l \;\Big|\; x\Bigr) \cdot \Bigl[ 1 - \mathrm{Prob}\Bigl(\bigcup_{l=1}^{i-1} \Omega_l \;\Big|\; x\Bigr) \Bigr] \qquad (6.8) \]
\[ \phantom{\mathrm{Prob}(\Omega_i \mid x)} = \frac{1 - \frac{1}{1 + \exp(A_{i-1} \cdot y_{i-1}(x) + B_{i-1})}}{1 + \exp(A_i \cdot y_i(x) + B_i)}, \quad 1 < i < m, \qquad (6.9) \]

\[ \mathrm{Prob}(\Omega_1 \mid x) = \mathrm{Prob}\Bigl(\bigcup_{l=1}^{1} \Omega_l \;\Big|\; x\Bigr) = \frac{1}{1 + \exp(A_1 \cdot y_1(x) + B_1)}, \qquad (6.10) \]
\[ \text{and} \quad \mathrm{Prob}(\Omega_m \mid x) = 1 - \mathrm{Prob}\Bigl(\bigcup_{l=1}^{m-1} \Omega_l \;\Big|\; x\Bigr) = 1 - \frac{1}{1 + \exp(A_{m-1} \cdot y_{m-1}(x) + B_{m-1})} \qquad (6.11) \]

being the posterior conditional class probabilities.
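Given the Platt-calibrated outputs of the m − 1 hyperplanes, Eqs. 6.8–6.11 combine into the class probabilities in a few lines. The following host-side sketch assumes sigma[i] already holds Prob(∪_{l≤i+1} Ω_l | x), i.e., the sigmoid output of the (i+1)-th SVM; the function name is our own.

```cuda
#include <vector>

// Posterior class probabilities of Eqs. 6.8-6.11 from the m-1 trained SVMs.
// sigma[i] = Prob(union_{l<=i+1} Omega_l | x) = 1 / (1 + exp(A_{i+1} * y_{i+1}(x) + B_{i+1})).
std::vector<double> class_probabilities(const std::vector<double>& sigma /* size m-1, m >= 2 */)
{
    size_t m = sigma.size() + 1;
    std::vector<double> prob(m);
    prob[0] = sigma[0];                                     // Eq. 6.10
    for (size_t i = 1; i + 1 < m; ++i)
        prob[i] = sigma[i] * (1.0 - sigma[i - 1]);          // Eqs. 6.8 / 6.9
    prob[m - 1] = 1.0 - sigma[m - 2];                       // Eq. 6.11
    return prob;                                            // not necessarily normalized
}
```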

Side-channel Leakage Without Strict Order

If a strict order is not applicable, i.e., several scalars that belong to different classes are equal, a separation is impossible no matter which method is used. Nevertheless, in practice one could try to combine (leave out) equivalent classes or use the one-versus-one strategy. The latter case


means separating each pair (Ω_i, Ω_j) for i < j ≤ m, which in turn results in training m(m − 1)/2 SVMs. The posterior conditional class probabilities are then given by

\[ \mathrm{Prob}(\Omega_i \mid x) = \prod_{j \ne i} \mathrm{Prob}\bigl[(\Omega_i, \Omega_j) \mid x \text{ belongs to } \Omega_i\bigr]. \qquad (6.12) \]

6.3.3 SVM based Templates

As in the MVND approach, templates are built on the profiling traces first, here contained in the training set, by training m − 1 or m(m − 1)/2 support vector machines, where m denotes the number of classes according to the assumed attack model. Afterwards, we fit the sigmoid function with classification values from the SVMs involving a second set of profiling traces. Hence, a single template contains the Lagrange multipliers a_i, the support vectors X_{S_i} ⊂ X̃, and the bias b_i, plus the values A_i and B_i. Please note that in the SVM approach templates are characterized by the class separators and not by the class representatives. This approach is summarized in Algorithm 6.1.

Algorithm 6.1 SVM Template Building

Input: (1) Training set X̃ with n traces related to labels Ω_1, ..., Ω_m, (2) constraint γ
Output: m − 1 templates (a_i, b_i, X_{S_i} ⊂ X̃, A_i, B_i)
1: for i = 1 to m − 1 do
2:   for j = 1 to n do
3:     if x̃_j belongs to Ω_l with l ≤ i then c_j ← 1 else c_j ← −1
4:   end for
5:   a_i, b_i, X_{S_i} ⊂ X̃ ← SMO training (App. C, Alg. C.1) with X̃, (c_1, ..., c_n), and γ
6:   A_i, B_i ← Sigmoid training (App. C, Alg. C.2) with a_i, b_i, X_{S_i} ⊂ X̃, X̃, and (c_1, ..., c_n)
7: end for

Algorithm 6.2 depicts the characterization of unlabeled traces.

Algorithm 6.2 SVM Template Classification

Input: (1) m − 1 templates (a_i, b_i, X_{S_i} ⊂ X̃, A_i, B_i), (2) unlabeled trace x
Output: Posterior probabilities Prob(Ω_1|x), ..., Prob(Ω_m|x)
1: Prob(Ω_1|x) by Eq. 6.10 with (a_1, b_1, X_{S_1} ⊂ X̃, A_1, B_1)
2: for i = 2 to m − 1 do
3:   Prob(Ω_i|x) by Eq. 6.9 with (a_j, b_j, X_{S_j} ⊂ X̃, A_j, B_j) for j ∈ {i − 1, i}
4: end for
5: Prob(Ω_m|x) by Eq. 6.11 with (a_{m−1}, b_{m−1}, X_{S_{m−1}} ⊂ X̃, A_{m−1}, B_{m−1})

6.3.4 A Dedicated Feature Selection

Basically, a prior feature selection is a dimensionality reduction that helps figuring out the most discriminative features in a given data set. Especially for machine learners this is a critical component, as the predictive power reduces while the dimensionality increases; the latter fact is known as the Hughes effect [Hug68]. Aside from that, for the MVND TA approach a feature

selection is certainly essential to avoid numerical problems in practice which would render the evaluation of probability densities impossible. It is anyhow assumed that the exploitable side-channel leakage is hidden locally in the variability of only a few points in time with respect to such probability densities [MOP08]. However, severe loss of information is likely in case of a too strong reduction, whereas loss of information that only marginally contributes to the attack is acceptable. In summary this means that the thin line is even thinner with respect to machine learning methods. On top of that, feature selection within the SVM context has a slightly different meaning since we aim for inter-class separation instead of intra-class density estimation. On the one hand, numerical problems due to high-dimensional data do not occur; on the other hand, dimensionality reduction methods such as PCA are likely to jeopardize the optimal performance of SVMs in other applications [ZS10]. In contrast to previous works [HZ12, HMG+11, HGM+11, LBM11] we recommend using the linear kernel with a dedicated subsequent feature selection, called normal-based feature selection. This is optimal while presuming a linear side-channel leakage in template attacks (cf. Sec. 6.3.2). The normal-based feature selection retains points in time according to the weight vector w (cf. Sec. 6.2.1). It has been shown in [BGMFM02] that a feature at point t which corresponds to a high absolute value |w_t| has great influence in determining the optimal margin and thus improves the classification performance. Because we train several SVMs according to a leakage model, we disregard features by setting the respective weights in w to zero instead of removing them from the data set. That is, the POIs employed for our SVM TA approach are those points in time for which |w_t| > 0. In Section 6.4 we answer the question of which features to disregard and which to retain. One may argue that a prior feature selection may have the same effect, but the Lagrange multipliers a_i that significantly determine the weight vector are still found using the complete data set, meaning that the entire information is available for inter-class separation. Therefore, and supported by the results of [ZS10] speaking against prior feature selection, we strongly recommend the normal-based approach.
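As an illustration of the selection just described, the following sketch keeps the k time-instants with the largest absolute weights and zeroes all remaining entries of w. The weight vector is assumed to come from a trained linear SVM; the function name and the choice of k are illustrative.

import numpy as np

def normal_based_selection(w, k):
    """Keep the k features with the largest |w_t| and set all other weights to zero."""
    w = np.asarray(w, dtype=float)
    keep = np.argsort(np.abs(w))[-k:]      # indices of the k largest absolute weights
    selected = np.zeros_like(w)
    selected[keep] = w[keep]
    return selected

In Section 6.4 we use k = 8, i.e., the eight weights that stand out in Figure 6.3 are retained.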

6.4 Experimental Results

For our experiments we used a Microchip PIC18F2520 microcontroller [Mic08] running at 3.68 MHz. The power traces were acquired by means of a PicoScope 5203 at a sampling rate of 125 MSamples/s. The signal was sampled across a 1 Ω resistor in the opened ground line. The recorded traces represent the full substitution layer (S-box) within the first round of an AES software implementation. As the side-channel leakage model we chose the Hamming weight of the S-box output since it has proven highly successful on this device. We verified that we indeed obtain nine classes with strict order (cf. Sec. 6.3.2). Before the analysis was conducted, the traces were compressed through peak extraction [MOP08], which significantly improved the exploitation of the side-channel information by means of CPA. To produce results comparable to [HZ12] we chose the Partial Guessing Entropy (PGE) [SMY09] to evaluate the attack's performance. It states the average position of the correct sub-key within a ranking in descending order of the probabilities of all possible sub-keys. However, in order to introduce our adaptations we used our own SVM implementation as described throughout this chapter instead of the C-Support Vector Classification (C-SVC) implementation

of the LIBSVM library (A Library for Support Vector Machines) [CL11]. C-SVC also applies SMO for training and the same probability model for classification, thus both implementations are comparable. In the following, we validate our SVM template attack against variants of it, the SVM TA from [HZ12], the MVND TA, and the MVND TA with prior PCA. Our approach comprises the proposed multi-class method and the linear kernel with the subsequent normal-based feature selection. As variants we replaced the normal-based feature selection with a prior feature selection, namely Known Key-Correlation Power Analysis (KK-CPA), where the points in time with the highest correlation were taken, respectively with the application of PCA. Further, we include the MVND template attack with both KK-CPA and prior PCA. The SVM TA from [HZ12] uses the Radial Basis Function (RBF) kernel and KK-CPA. They also suggest an empirically determined constraint γ = 10, respectively γ = 1 for noisy measurements, and an empirically determined termination criterion of 0.02 that states the fraction of traces that are allowed to violate the KKT conditions. In our experiments, however, we involve γ = 1 and a termination criterion of zero, as recommended in [CST00], except for the attack from [HZ12]. In order to obtain noisy measurements we add normal noise to the recorded traces, in particular normal noise with a standard deviation of σ_{n_g} = 5. We characterized the intrinsic noise of our recorded traces with σ_{n_0} ≈ 0.7.
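For reference, the partial guessing entropy used in the following comparison can be sketched as below. The ranking is in descending order of the sub-key probabilities, ties are ignored, and the function names are illustrative rather than part of our tooling.

import numpy as np

def subkey_rank(probabilities, correct_subkey):
    """Position of the correct sub-key in a descending ranking of candidate probabilities."""
    order = np.argsort(probabilities)[::-1]
    return int(np.where(order == correct_subkey)[0][0]) + 1

def partial_guessing_entropy(prob_matrix, correct_subkey):
    """Average rank over independent attack trials (one row of probabilities per trial)."""
    return np.mean([subkey_rank(row, correct_subkey) for row in prob_matrix])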

Initially, we want to show how to disregard features with the help of the normal-based feature selection.

Figure 6.3: Absolute values of the weights in ascending order. The weight vectors were obtained from training 8 SVMs (HW attack model).

As can be seen in Figure 6.3, the weights increase linearly except for the last few weights, which increase exponentially. These are the weights we retain and thus all the others

were set to zero. In our experiments we retain 8 weights. We also used 8 components from PCA and 8 points in time with the highest correlation from KK-CPA. Our performance comparison considers, for each template attack, the profiling, respectively characterization, traces required to reach a partial guessing entropy of one. Thus, it states the minimum effort to always recover the correct sub-key with respect to our experiments. Furthermore, we depict the partial guessing entropies of all other attacks at the point where our proposed attack reaches a partial guessing entropy of one. Table 6.1 shows the results considering the original traces. It is observable

Table 6.1: Comparison of template attacks depicting the required number of characterization traces along an increasing profiling base (traces per HW) to reach a guessing entropy of one. The lower table depicts the guessing entropies when our TA reaches a PGE of one.

The traces were involved with their intrinsic noise σ_{n_0} ≈ 0.7.

Required characterization traces to reach a PGE of one:

Profiling base   Our TA   SVM linear & KK-CPA   SVM linear & PCA   SVM [HZ12]   MVND KK-CPA   MVND PCA
 10                33           98                   71               289           n/a          n/a
 20                22           67                   33               147            92           53
 40                21           63                   26               139            63           37
 60                19           61                   21               121            59           23
 80                15           44                   17               116            55           19
100                13           39                   15               111            51           16

Guessing entropies when our TA reaches a PGE of one:

Profiling base   Our TA   SVM linear & KK-CPA   SVM linear & PCA   SVM [HZ12]   MVND KK-CPA   MVND PCA
 10                1            2.59                 1.17             18.54         n/a          n/a
 20                1            2.57                 1.15             16.75          7.51         1.62
 40                1            2.58                 1.15             18.72          6.58         1.15
 60                1            2.58                 1.08             18.10          4.91         1.10
 80                1            2.08                 1.08             17.53          3.46         1.10
100                1            2.08                 1.08             18.43          3.74         1.08

that SVM based template attacks perform better than the MVND TA. As expected, each attack requires fewer characterization traces with an increasing profiling base, where the best performance is almost reached with a profiling base containing 100 traces per Hamming weight. However, with a too small profiling base (10 traces per HW) the MVND based attacks fail due to numerical problems caused by the matrix inversion with the Gauss-Jordan algorithm. Our approach performs well, especially with a small profiling base, whereas the attack proposed in [HZ12] using the RBF kernel performs only suboptimally. This observation is confirmed when comparing the guessing entropies. When our TA reaches a guessing entropy of one, the other attacks reduce the sub-key space to at most four, except the RBF kernel based attack, which only reduces the sub-key space to about 18. Next, we evaluate the performance in the presence of higher noise. Table 6.2 depicts that in this case PCA is not the optimal choice for feature selection. Both TA approaches lead to inferior results when using PCA instead of KK-CPA. Our approach still performs well, but the results altogether are also a bit closer together now. The RBF kernel based attack performs slightly better and can finally reduce the sub-key space to eight. Eventually, our findings indicate that SVM based template attacks do not perform optimally with the RBF kernel. Admittedly, our results concerning the RBF kernel were obtained using our

multi-class strategy instead of the usual one-versus-one strategy, but we came to the conclusion that this has no crucial negative impact. Actually, the good performance of a linear kernel should not be surprising since template attacks usually imply a linear classification problem, whereas the RBF kernel is appropriate for non-linear problems.

Table 6.2: Comparison of template attacks depicting the required amount of characterization traces along an increasing profiling base (traces per HW) to reach a partial guessing entropy of one. The lower table depicts the partial guessing entropies while our TA

reaches a PGE of one. The traces were involved with added noise σ_{n_g}, thus σ_{n_1} ≈ 5.7.

Required characterization traces to reach a PGE of one:

Profiling base   Our TA   SVM linear & KK-CPA   SVM linear & PCA   SVM [HZ12]   MVND KK-CPA   MVND PCA
 10                81          121                  149             1,100          n/a          n/a
 20                62           93                  112               365           650          920
 40                56           84                   94               312           158          344
 60                51           73                   84               153           112          146
 80                46           62                   79               144            96          124
100                43           58                   73               138            92          121

Partial guessing entropies when our TA reaches a PGE of one:

Profiling base   Our TA   SVM linear & KK-CPA   SVM linear & PCA   SVM [HZ12]   MVND KK-CPA   MVND PCA
 10                1            1.18                 1.66             41.25         n/a          n/a
 20                1            1.14                 1.62             13.54         15.524       14.37
 40                1            1.12                 1.54              7.8          11.22        12.02
 60                1            1.12                 1.52              7.75          4.91         5.55
 80                1            1.12                 1.46              7.57          2.95         3.58
100                1            1.12                 1.36              7.58          2.96         2.81

The linear kernel also performs well with a prior feature selection, i.e., KK-CPA or PCA, but the normal-based feature selection is very simple and furthermore provides better results with small profiling bases. The computational effort of our SVM based attack is in the range of seconds and hence negligible when compared to the MVND based attack.

6.5 Conclusion

We showed how to significantly improve the performance of template attacks based on Support Vector Machines (SVMs). Although previous works already demonstrated the advantages of such template attacks, their approaches were not optimal in the sense of efficiency and performance. First of all, we proposed a multi-class method tailored for SVM based TAs which leads to training the minimum number of SVMs possible under a side-channel leakage model with strict order. Next, we showed that the subsequent feature selection after the training, called normal-based feature selection, together with the linear kernel leads to superior results compared to a prior feature selection, namely KK-CPA or PCA, respectively.


Chapter 7 Leakage Prototype Learning — Tailor-made Machine Learning Side-channel Cryptanalysis

We propose a new profiled side-channel attack called Leakage Prototype Learning (LPL) that aims for determining unbiased side-channel leakages instead of estimating them. Furthermore, it encompasses the locating, respectively selection, of leakage dependent time-instants with clear criteria. For one thing we provide a deep theoretical analysis by discussing mathematical foundations and properties, and for another thing a thorough analysis by practical means including performance comparisons to meaningful variants of several common profiled side-channel attacks.

Parts of this chapter have been previously published in [Bar16]. IEEE Xplore. © 2016 IEEE.

Contents of this Chapter

7.1 Introduction (p. 75)
7.2 Discriminating Side-channel Leakage via Prototype Learning (p. 76)
7.3 Empirical Results (p. 85)
7.4 Conclusion (p. 98)

7.1 Introduction

Although considerable effort has been spent on improving template attacks over the last years, e.g., [BLR13, HGM+11, LBM11, YZLC12, APSQ06, GLRP06, CK14], it is unclear how much better we can still get with future improvements. Examining the accompanying locating or selection of leakage dependent time-instants (POIs), or feature selection, is apparently an area from which profiled side-channel cryptanalysis could further benefit. [BLR13] has already demonstrated that a selection dedicated to a certain classification scheme can show superior results over generic methods like PCA as applied in principal subspace based template attacks [APSQ06]. Moreover, PCA was subjected to further investigations as a preprocessing method for CPA in [MBvLM12], where it is supposed that side-channel leakage may not be principally covered by the largest eigenvectors, as supposed so far, but by localized eigenvectors. Another question concerns the classifier type, namely whether it should be linear or non-linear. A common model in side-channel attacks is the assumption that each bit, processed within a device together with other bits, independently contributes to the leakage. Indeed, this is never the case since the leakage additionally consists of each interaction among these bits and

therefore quite a number of non-linear terms. For instance, the attack described in [SLRP05] is capable of handling both a linear and a non-linear bit model. Nevertheless, in [DPRS11] it is argued that these terms are different for each device and small anyway. That is even aggravated by the noise-inducing measurement chain [LCSL07], e.g., environmental influences, the carrier board layout, or the DSO resolution, which renders a reproducible capturing of small non-linearities most likely impossible. These observations might be different for future nano-scale devices [RSVC+11]. In this work we start from the same premise as in [SLRP05], namely that side-channel leakage can be seen as a multivariate stochastic variable. But instead of carrying out linear regression to estimate leakage coefficients we aim for determining the unbiased side-channel leakage as comprised in the device. Because of its multivariate nature it is, like other profiled attacks, applicable to higher-order leakage scenarios with no further effort, but this is out of scope in this work.

7.1.1 Contribution

Leakage prototype learning, abbreviated LPL in the remainder, belongs to profiled side-channel cryptanalysis and hence consists of the two common phases, namely leakage profiling and leakage characterization. In Section 7.2.1 we describe what is meant by a leakage prototype and build a bridge to machine learning (cf. Learning Vector Quantization (LVQ) [Koh01]). It is mainly motivated by Generalized Relevance Matrix Learning Vector Quantization (GRLVQ) [BHSV09], which is an extension to Generalized Learning Vector Quantization (GLVQ) as introduced in [SY96]. The main focus is put on a margin that allows for determining exact side-channel leakages within a device, denoted as unbiased leakages in the following, with minimal effort. This means that unbiased leakages can be obtained from a small number of traces instead of successively reducing the leakage uncertainty by recording ever more measurements. In Section 7.2.2 it is shown that the model can be enhanced to include the learning (POI selection) of leakage carrying time-instants. Therefore LPL does not rely on external methods which are executed beforehand but provides an inherent selection. Within Section 7.2.3 the model is further enhanced to take the distribution of inexplicable variation at those chosen time-instants into account. As before with leakage carrying time-instants, the variation's distribution is not estimated but learned to minimize the misclassification risk. For both model enhancements also consider [BHSV09]. Section 7.2.4 then elucidates how the model is solved, i.e., how the learning is applied, by means of an approach that fits best into LPL. The learning is equivalent to the profiling phase. Based upon a sound assumption, Section 7.2.5 completes the discussion with the characterization phase.

7.2 Discriminating Side-channel Leakage via Prototype Learning

In this section we describe the mathematical model for discriminating side-channel leakage via prototype learning. Please also see Appendix A.

7.2.1 Basics of Leakage Prototypes

Our adversarial model assumes multiple side-channel leakages at the time-instants T = {t_1, ..., t_l} of a cryptographic implementation, running on an electronic device. More precisely, the side-channel leakages correspond to device-immanent functions that determine the

observable electromagnetic measurands induced by the processing of a secret key, respectively secret sub-key, k over F_2^m and a known input V over F_2^n. A single function at time t is defined by h_t(V, k): F_2^n × F_2^m → R. The input V could be either a plaintext, ciphertext, or intermediate state accordingly. A measured side-channel leakage observation is seen as a realization of the |T|-variate random variable

I_T(V, k) = h_T(V, k) + R_T    (7.1)

where h_T(V, k) = (h_{t_1}(V, k), ..., h_{t_l}(V, k)) is a vector composed of deterministic side-channel leakages, i.e., Var(h_T(V, k)) = 0, in the time interval T. R_T is a component vector, inexplicable by any side-channel leakage model, that is dependent neither on V or k nor on their processing. In this sense, R_T accounts for the variation due to power management, concurrent process activities or technology, e.g., fabrication variations, that are inherently linked to the leakage. By means of a finite set of observations the adversary is thus merely able to realize an erroneous estimation of the side-channel leakage caused by R_T. In the following h̃_T(V, k) is used for the estimated leakage assigned to (V, k), whereas the estimation error is denoted as the side-channel leakage bias in the remainder. Taking this into account, we model a leakage prototype that represents the bias-minimized leakage by means of the observations. It denotes

L_T^{V,k} = h̃_T(V, k) − b_T^{V,k}    (7.2)

with b_T^{V,k} being a vector that consists of density-biases (b_{t_1}^{V,k}, ..., b_{t_l}^{V,k}) in the wake of non-zero R_T. Take note that the induced bias b_T^{V,k} is different for each I_T(V, k), and so is L_T^{V,k}, and hence it is not a function of (V, k). Through such a modeling we aim at constructing an error-minimized side-channel leakage determination. The function h_T(V, k) is deterministic, generally unknown, and originates from the transformation of the side-channel leakage into the measurand, which is to be deduced from how the data is handled by the underlying specific device, e.g., transferred over a bus or processed by dedicated hardware circuitry. The bias is introduced into the model to account for the estimation uncertainty caused by a preliminary estimation. In the following we depict the relation between the bias and the introduced leakage prototype. The leakage prototype L_T^{V,k} is seen as the unbiased leakage and therefore equivalent to h_T(V, k) with respect to the uncertainty of b_T^{V,k}. Let {i_T^j(V, k)}_{j=1}^{N_l} denote a set of leakage class observations; then we have Var(h̃_T(V, k)) = Var((1/N_l) Σ_{j=1}^{N_l} i_T^j(V, k)), which simplifies to Equation 7.2 by

Var(h_T(V, k) + (1/N_l) Σ_{j=1}^{N_l} r_T^j) > 0  ⇒  Var(h̃_T(V, k) − (1/N_l) Σ_{j=1}^{N_l} r_T^j) = 0    (7.3)

where (1/N_l) Σ_{j=1}^{N_l} r_T^j = b_T^{V,k} are the density-biases with non-interrelated Var(b_T^{V,k}) > 0, and N_l being the quantity of leakage observations regarding (V, k).

Remark 7.1. We remark that an individual determination of unbiased leakage components at different time-instants, as assumed above, is not opposed to the inclusion of their potential joint distribution (see Sec. 7.2.3).


As a consequence of Equation 7.3, it is the central goal of leakage prototype learning to get rid of the typical bounds of sampling estimators that associate the decrease of the leakage uncertainty with the quantity N_l. Clearly, the exact biases cannot be inferred directly but are found with respect to an optimal prediction confidence by means of a distance measure. In this work we chose the inter-class-margin to be this prediction confidence. In the light of side-channel leakage it corresponds to the maximum distance the decision surface (or border) between adjacent leakage distributions can move before it changes the decision whether a yet unseen leakage observation is assigned to one or the other leakage. In other words, it basically states the clearance between adjacent unbiased leakages. For the sake of readability we use, for the remainder of this chapter, h_T, i_T and h'_T as shorthand for h_T(v, k), i_T(v, k), respectively h_T(v', k'), also when written with other indices or replaced by their estimated values.

Definition 7.1. Let {L_T^{V,k}} be a set of leakage prototypes and i_T a measured side-channel leakage observation. Presuming that Prob(V = v) > 0 for all v ∈ F_2^n, the inter-class-margin of {L_T^{V,k}} with regards to i_T is defined as

M_{i_T} = ‖L_T^C − i_T‖_2 − ‖L_T^D − i_T‖_2.    (7.4)

L_T^C is the closest prototype to i_T, such that indeed both L_T^C and i_T comprise the leakage h_T. Conversely, L_T^D is the closest prototype to i_T that comprises a different leakage h'_T ≠ h_T (cf. [Koh01]).

Remark 7.2. As defined above the margin consists of two interrelated distances. The first is theoretically zero if observations are correctly assigned (RT = 0). The second is large since it directly expresses the clearance between adjacent unbiased leakages and is hence subjected to further enlargement. However, due to its negative sign in case of a correct classification, the inter-class-margin is supposed to be minimized in order to reduce the risk of misclassification. This is discussed throughout the remainder of this section.

Finding the minima of the inter-class-margin needs to be carried out for the entire set of Ñ leakage observations, i.e., {i_T^j}_{j=1}^{Ñ}. However, so far the margin is only defined with regard to a single observation i_T^j, which calls for a marginal-loss function over the entire set. The sum of the margins is therefore an obvious choice [SY96]. The issue, though, is that margins are likely to differ between different leakage distributions and hence there might exist several local minima. To overcome this, the inter-class-margin is normalized, such that its loss is defined as

L_M = Σ_{j=1}^{Ñ} [d_C(i_T^j) − d_D(i_T^j)] / [d_C(i_T^j) + d_D(i_T^j)]    (7.5)

where d_{C,D}(i_T^j) = ‖L_T^{C,D} − i_T^j‖_2. By this normalization each summand is lower bounded by −1 (and upper bounded by 1), which in turn results in a function L_M with a global minimum. This fact can easily be verified. For the unbiased leakage h_T we obtain d_C(h_T) = 0 and d_D(h_T) = ‖h_T − h'_T‖_2, which is the margin's minimal value, and thus each summand equals −1.
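A minimal sketch of the normalized marginal loss of Equation 7.5 is given below, assuming one prototype per leakage class and the plain L2-norm (the weighting introduced later is omitted); all names are illustrative.

import numpy as np

def marginal_loss(prototypes, proto_labels, observations, obs_labels):
    """Normalized marginal loss L_M of Eq. 7.5."""
    proto_labels = np.asarray(proto_labels)
    loss = 0.0
    for x, c in zip(observations, obs_labels):
        dists = np.linalg.norm(prototypes - x, axis=1)
        same = (proto_labels == c)
        d_C = dists[same].min()     # closest prototype comprising the same leakage
        d_D = dists[~same].min()    # closest prototype with a different leakage
        loss += (d_C - d_D) / (d_C + d_D)
    return loss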


7.2.2 Locating and Selecting Side-channel Leakage Dependent Time-instants via Prototype Learning

We partition an observation considering two disjoint time-intervals. As before, interval T denotes the time-instants that rely on a function h_t(V, k), whereas T̄ denotes the interval containing leakage distributions at corresponding time-instants that are not distinguishable from a constant, i.e., providing no key-dependent variance. The complete measurement interval containing all recorded samples is S = T ∪ T̄. Also consider [BHSV09].

Lemma 7.1. A leakage distribution at time-instant s ∈ S is distinguishable from a constant, and therefore relies on a function h_t(V, k) ⇒ t = s ∈ T, if it has a non-zero contribution to the global minimum of L_M.

Proof. Clearly, if we complement the set T with a function at time-instant s that does not have a key-dependent variance, i.e., it yields h_s = h'_s for any (v, k) ≠ (v', k'), then it has no contribution to Equation 7.4 since ‖h_T − h'_T‖_2 = ‖(h_T, h_s) − (h'_T, h'_s)‖_2.

Consequently, Lemma 7.1 suggests employing the margin-contributing univariate minima to locate the side-channel leakage dependent time-instants. The loss function (Eq. 7.5) can be adapted by introducing the values with reversed signs of the univariate minima as weights λ_s, so that

L_M^Λ = Σ_{j=1}^{Ñ} [d_C^Λ(i_S^j) − d_D^Λ(i_S^j)] / [d_C^Λ(i_S^j) + d_D^Λ(i_S^j)]    (7.6)

where d_{C,D}^Λ(i_S^j) = √(‖Λ^{1/2}(L_S^{C,D} − i_S^j)‖_2^2), with a diagonal matrix Λ = diag(λ_s), expecting that λ_s ≤ 0 if s ∈ T̄. Otherwise, if t = s ∈ T, then the value of the univariate minimum depends on the function h_t(V, k). For instance, when it holds that h_{t_i}(V, k) > h_{t_j}(V, k) while both are linearly dependent for all k ∈ F_2^n and t_i, t_j ∈ T, i.e., the leakage dependence has a measurably stronger indication at time-instant t_i, we notice that also each minimum at t_i has a greater absolute value than at t_j.

7.2.3 Considering the Variation’s Distribution

The characterization of yet unseen leakage observations by means of the L2-norm is only meaningful if the observations are spherically jointly distributed. A spherical joint distribution implies that observations lying on the same spherical contour have the same numerical association with the enclosed leakage. Therefore, it is ensured that actually equivalent leakage observations are equally characterized. However, this kind of distribution requires the leakage observation components to be mutually independent, which is generally not the case because of the vector R_T; also cf. [MOP08].

We take the general case into account by assuming that the components (R_{t_1}, ..., R_{t_l}) are correlated. Please note that uncorrelatedness does not imply independence. Up to this point we did not suppose any concrete probability distribution; however, by support of the central limit theorem we assume that R_T is approximately jointly normally distributed, even for smaller Ñ and although R_T is not necessarily zero. Also consider [BHSV09].


Lemma 7.2. Let r be a jointly normally distributed variance vector with covariance matrix Σ and Φ a symmetric matrix; then ‖Φr‖_2 = √(r^⊤ Φ^⊤ Φ r) is minimal if Φ transforms r into a vector with uncorrelated variables and a spherical joint distribution.

Proof. We can write ‖r‖_2 = √(u^⊤ Σ u) where u consists of random normally distributed and uncorrelated components. The matrix Σ' = Σ − e_1 · I, with e_1 being the smallest eigenvalue of Σ and I the identity matrix, still fulfills u^⊤ Σ' u ≥ 0 due to positive semi-definiteness and therefore

√(u^⊤ Σ' u) ≥ 0  ⇒  √(u^⊤ Σ u) ≥ √(u^⊤ (e_1 · I) u).    (7.7)

That is, ‖r‖_2 is minimal if Φ^⊤ Σ Φ = e_1 · I, from which uncorrelatedness follows if the variance components possess a spherical joint distribution.

Similar to the leakage prototype, we can thus establish a global distribution prototype D_T := Φ^⊤ Φ and achieve a distance of the form

d_{C,D}^{Λ,Φ}(i_T^j) = √( (Λ^{1/2}(L_T^{C,D} − i_T^j))^⊤ D_T (Λ^{1/2}(L_T^{C,D} − i_T^j)) ).    (7.8)

Although quite similar to the Mahalanobis distance, both are not equal since D_T may only be proportional to the inverse covariance matrix Σ^{−1}. The notion global here signifies a distribution characterization over all leakages (similar to a pooled covariance). Further, its decomposed form enforces it to be symmetric and positive semi-definite to yield a non-negative distance. The loss function is adapted once again to find a global minimum by means of D_T, accordingly

L_M^Φ = Σ_{j=1}^{Ñ} [d_C^{Λ,Φ}(i_T^j) − d_D^{Λ,Φ}(i_T^j)] / [d_C^{Λ,Φ}(i_T^j) + d_D^{Λ,Φ}(i_T^j)].    (7.9)

We remark that Lemma 7.2 is applicable only provided that Σ does not contain any eigenvalue e_i = 0. However, if we restrict the distribution prototype to the set T, then Σ is positive definite, i.e., all eigenvalues are greater than zero, because of a non-zero key-dependent variance. As a consequence, distribution prototype learning is performed subsequent to leakage prototype learning.
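The distance of Equation 7.8 can be written compactly as in the following sketch, assuming Λ is given by its diagonal (the weights λ_s) and the distribution prototype by its decomposition Φ; the names are illustrative.

import numpy as np

def lpl_distance(prototype, observation, lam, phi):
    """Distance of Eq. 7.8 with Lambda = diag(lam) and D_T = Phi^T Phi."""
    diff = np.sqrt(lam) * (prototype - observation)   # Lambda^(1/2) (L - i)
    return np.sqrt(diff @ (phi.T @ phi) @ diff)

# With lam = ones and phi = identity this reduces to the plain L2 distance of Eq. 7.5.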

7.2.4 Side-channel Leakage Profiling Phase

In the side-channel leakage profiling phase the adversary is first required to find the global minimum of Equation 7.6 to obtain the biases b_T^{V,k}, respectively the leakage prototypes L_T^{V,k}, where the leakage time-interval is T = {s ∈ S | λ_s > 0}. Secondly, the adversary achieves the global distribution prototype D_T in the wake of minimizing Equation 7.9. In what follows we propose the mathematical framework of gradient minimization to be considered for the profiling approach. It is briefly introduced and then applied to LPL. For an algorithmic description of this section we refer to Algorithm 7.1.


Basics of Gradient Minimization

With the help of a gradient method, the loss of a multivariate objective function F(x) that is differentiable at a point x_i decreases iteratively with

x_{i+1} = x_i − α_i ∇F(x_i) = x_i − (α_i/n) Σ_{j=1}^{n} ∇f_j(x_i).    (7.10)

The step-size α_i determines the portion of the gradient that is used for minimizing the loss in each iteration and has to be selected appropriately. In this work we make use of the stochastic average gradient [RSB12] to find a minimum for both loss functions. As opposed to the original batch gradient method, a stochastic method incorporates merely a single observation (f_j) to calculate the gradient instead of the entire observation set. This comes in favor of the very large observation sets in side-channel analysis due to very low iteration costs. Furthermore, the choice of the step-size is less critical; it can generally be kept constant and be straightforwardly determined for stochastic average gradient minimization. The latter circumstance is the main reason why we chose this method for LPL. In the following, intermediate variables subject to iteration are denoted by [·]_i. Three major steps are considered per iteration, namely

(1) choose j_R ∈ {1, ..., n} randomly,

(2) decide [y_j]_i = ∇f_j([x]_i) if j = j_R, and [y_j]_i = [y_j]_{i−1} otherwise,

(3) and compute [x]_{i+1} = [x]_i − (α/n) Σ_{j=1}^{n} [y_j]_i.

That is, the gradient of a single randomly chosen observation is averaged together with the previous gradients of all other observations that belong to the same L_T^{C,D}. Here, α is applied as a constant step-size.
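The three steps translate into a generic stochastic average gradient loop like the following sketch; grad_fn and the variable names are placeholders and not part of the LPL implementation.

import numpy as np

def sag_minimize(grad_fn, x0, n, alpha, iterations, rng=None):
    """Stochastic average gradient descent as in steps (1)-(3).

    grad_fn(j, x) returns the gradient of the j-th summand f_j at x;
    y keeps one previously computed gradient per observation.
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    y = np.zeros((n,) + x.shape)           # previous per-observation gradients
    for _ in range(iterations):
        j = rng.integers(n)                # (1) pick a random observation
        y[j] = grad_fn(j, x)               # (2) refresh only its gradient
        x = x - alpha * y.sum(axis=0) / n  # (3) step with the averaged gradient
    return x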

Gradient Representation of Prototypes

If we apply Equation 7.10 to our adversarial model we have

[L_T^{C,D}]_{i+1} = [L_T^{C,D}]_i − α ∇_{L_T^{C,D}} L_M^{Λ,Φ}.    (7.11)

Please remember that both L_T^C and L_T^D are place-holders for the actual leakage prototypes determined by each i_S^j, i.e., a certain leakage prototype is only considered by observations for which this prototype is the closest. Thus, during an iteration each i_S^j is added to a set Ω^C and to a set Ω^D according to its L_T^C and L_T^D. From each Ω^C and each Ω^D an observation is randomly chosen, so that ideally any L_T^{V,k} is updated with equal probability. Regarding L_M^Λ, replacing the nabla operator with its partial derivatives in Equation 7.11 for a single i_S^j, denoted |_{i_S^j}, yields

∇_{L_T^C} L_M^Λ |_{i_S^j} = 4 n_C Λ · ([L_T^C]_i − i_S^j)    (7.12)


and  ∇_{L_T^D} L_M^Λ |_{i_S^j} = −4 n_D Λ · ([L_T^D]_i − i_S^j)    (7.13)

associated with the normalizing quotients

n_C = d_D^Λ(i_S^j) / [ d_C^Λ(i_S^j) · (d_C^Λ(i_S^j) + d_D^Λ(i_S^j))^2 ],    (7.14)

as well as  n_D = d_C^Λ(i_S^j) / [ d_D^Λ(i_S^j) · (d_C^Λ(i_S^j) + d_D^Λ(i_S^j))^2 ].    (7.15)

Using L_T^C = h̃_T − b_T^C and L_T^D = h̃'_T − b_T^D (according to Eq. 7.2), we can write the leakage biases as

[b_T^C]_{i+1} = [b_T^C]_i + (4α/|Ω^C|) Σ_{i_S^j ∈ Ω^C} [y_j^C]_i,    (7.16)

[b_T^D]_{i+1} = [b_T^D]_i + (4α/|Ω^D|) Σ_{i_S^j ∈ Ω^D} [y_j^D]_i.    (7.17)

The gradients are either

[y_j^C]_i = n_C Λ · (h̃_T − [b_T^C]_i − i_S^j)    (7.18)

or  [y_j^D]_i = −n_D Λ · (h̃'_T − [b_T^D]_i − i_S^j),    (7.19)

respectively, if i_S^j was selected randomly, and [y_j^{C,D}]_i = [y_j^{C,D}]_{i−1} otherwise. Next, an iterative description of the diagonal matrix Λ, i.e., of the weights, is derived in the same manner, such that

[Λ]_{i+1} = [Λ]_i − β diag(∇_Λ L_M^Λ).    (7.20)

Replacing the nabla operator gives

[Λ]_{i+1} = [Λ]_i − (2β/|Ω^C|) Σ_{i_S^j ∈ Ω^C} [y_j^Λ]_i    (7.21)

with  [y_j^Λ]_i = diag(n_C H_C − n_D H_D)    (7.22)

if i_S^j was selected randomly, and [y_j^Λ]_i = [y_j^Λ]_{i−1} otherwise. H_C, H_D are the component-wise squares (Hadamard product) of (h̃_T − [b_T^C]_i − i_S^j) and (h̃'_T − [b_T^D]_i − i_S^j). Since the latter products use the current biases, the weight update should succeed the bias update. Furthermore, β is another step-size quantifier, not necessarily required to differ from α. Note that the sign reversal of λ_i is implicitly included in Equation 7.20 and Equation 7.21 because of the subtraction. After having performed J iterations we obtain both b_T^{C,D} = [b_T^{C,D}]_J and Λ = [Λ]_J, whereas [b_T^{C,D}]_0 = 0 and [Λ]_0 = I. Previous gradients are initialized with all zeros as well. Before utilizing Λ to update the biases in the next iteration we cancel out unusable weights, {λ_i = 0 | λ_i < 0}, and normalize Λ such that Σ_i λ_i = 1 to prevent it from becoming degenerate.
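For illustration, the per-observation gradients of Equations 7.14-7.19 and the diagonal entries of Equation 7.22 can be computed as in the sketch below; the inputs are assumed to be NumPy vectors and the names are illustrative. The bias and weight updates of Equations 7.16, 7.17, and 7.21 then average these gradients over Ω^C and Ω^D with the factors 4α/|Ω| and 2β/|Ω^C|.

import numpy as np

def lpl_gradients(h_C, h_D, b_C, b_D, obs, lam):
    """Per-observation gradients [y_j^C]_i, [y_j^D]_i and diag entries of [y_j^Lambda]_i.

    h_C, h_D: estimated leakages of the closest same/different-class prototypes,
    b_C, b_D: their current biases, obs: the selected observation i_S^j,
    lam: current weight vector (diagonal of Lambda).
    """
    diff_C = h_C - b_C - obs
    diff_D = h_D - b_D - obs
    d_C = np.linalg.norm(np.sqrt(lam) * diff_C)
    d_D = np.linalg.norm(np.sqrt(lam) * diff_D)
    n_C = d_D / (d_C * (d_C + d_D) ** 2)            # Eq. 7.14
    n_D = d_C / (d_D * (d_C + d_D) ** 2)            # Eq. 7.15
    y_C = n_C * lam * diff_C                        # Eq. 7.18
    y_D = -n_D * lam * diff_D                       # Eq. 7.19
    y_lam = n_C * diff_C ** 2 - n_D * diff_D ** 2   # Eq. 7.22 (component-wise squares)
    return y_C, y_D, y_lam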


Thereafter, the adversary proceeds analogously to minimize L_M^Φ. The distribution prototype in the shape of Φ is iteratively updated by

[Φ]_{i+1} = [Φ]_i − γ ∇_Φ L_M^Φ    (7.23)

which leads to

[Φ]_{i+1} = [Φ]_i − (4γ/|Ω^C|) Σ_{i_S^j ∈ Ω^C} [y_j^Φ]_i    (7.24)

with  [y_j^Φ]_i = n_C^Φ P_C − n_D^Φ P_D.    (7.25)

Again, the random selection is analogous to [y_j^{C,D}]_i or [y_j^Λ]_i, and γ is the associated step-size quantifier. Note that n_C^Φ and n_D^Φ, the normalizing quotients in the latter equation, utilize d_{C,D}^{Λ,Φ} this time. The update is completed by P_{C,D} = [Φ]_i (Λ^{1/2}(L_T^{C,D} − i_S^j)) (Λ^{1/2}(L_T^{C,D} − i_S^j))^⊤. With another J' iterations passed, [Φ]_{J'} represents the distribution by D_T = [Φ]_{J'}^⊤ [Φ]_{J'}, while [Φ]_0 = I.

Remark 7.3. Since Λ and Φ are globally defined variables they are updated several times during a single iteration, depending on how many sets Ω^C exist.

Determining the Step-size

In [RSB12] a step-size of the form

α = 1 / (2 N_l K)    (7.26)

is suggested to prove convergence. N_l is the number of profiling observations per leakage class, e.g., |Ω^C| if equally sized across all Ω^C. K is the Lipschitz constant of the loss function and states how fast the loss function can change. It is defined as the upper bound of the gradient as long as it is finite. For the marginal-loss functions we are facing in this work, the maximum gradient increases while the distance between two adjacent leakages, i.e., between L_T^C and L_T^D, decreases. To verify this we recall that each summand of a loss function is bounded by [−1, 1]. Each value is taken if either of the two distances becomes zero, and thus the gradient within this interval must increase while the leakages approach each other, and vice versa. Consequently, the step-sizes should initially be guided by the biased leakages. An estimation of K_min, and therefore α_max (as well as β_max and γ_max), involves the averaged pair-wise distance of directly adjacent biased leakages at that time-instant for which the distance is largest. We denote it as the scalar d̄_h̃. By means of the linear slope formula we get

K_min = (1 − (−1)) / d̄_h̃ = 2 / d̄_h̃.    (7.27)

Additionally, it is possible to define a step-size with decay. In this context decay means a non-linear decrease of the step-size over the iteration count. With such a decay, variables are refined with a certain granularity in later iterations. We use a decay by means of simply dividing the step-size by the current iteration number.
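A small sketch of Equations 7.26 and 7.27 (names are illustrative):

def initial_step_size(n_class_obs, d_bar):
    """alpha_max = 1/(2 N_l K_min) with K_min = 2/d_bar (Eqs. 7.26 and 7.27)."""
    k_min = 2.0 / d_bar
    return 1.0 / (2.0 * n_class_obs * k_min)

def decayed_step_size(alpha, iteration):
    """Simple decay: divide the constant step-size by the current iteration number."""
    return alpha / iteration

# Example: 3 profiling class-observations and d_bar = 4 give alpha = 1/3, i.e. about 0.33,
# the value the simulated experiments in Section 7.3.1 start from.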


7.2.5 Side-channel Leakage Characterization and Key Recovery Phase

In the final phase the adversary characterizes yet unseen leakage observations that possess an a priori unknown key which is to be recovered. To facilitate a key recovery in the Bayesian sense, the distance d_{V,k}^{Λ,Φ}(i_S^new) between an unseen observation i_S^new and a leakage prototype L_T^{V,k} needs to be transformed into a density value, i.e., a posterior probability. This can be achieved by observing the occurrences of the distances for each i_S^new. For an algorithmic description of this section we refer to Algorithm 7.2.

Assumption 7.1. Without loss of generality we assume that the occurrence of distances can be modeled by means of an inverse logistic function with its maximum defined at zero.

Due to the vast similarities between an inverse logistic function and the right-hand tail of the normal distribution's PDF we suggest fitting the latter instead of the former. Note that we will nevertheless employ the notions of the normal distribution. In the course of that we suggest a fitting function f_{V,k} such that

Prob(L_T^{V,k} | i_S^new) = exp(−(d_{V,k}^{Λ,Φ}(i_S^new))^{B_{V,k}} / A_{V,k}).    (7.28)

The scaling factor A_{V,k} is used to approximate the function to the mean, whereas the moment B_{V,k} rewards distances on the left tail instead of penalizing them, as would be the case for the usual normal distribution. In contrast, the right tail fits the right tail of the assumed normal distribution. In more detail, the fitting function f_{V,k} can only fit the right tail of the normal probability density function g_{V,k} with respect to each (V, k) up to its inflection point, at which the slope severely decreases towards the mean. Hence, we initially set f_{V,k} = g_{V,k} at the latter's inflection point

exp(−d^{B_{V,k}} / A_{V,k}) = exp(−(d − μ_{V,k})² / (2σ²_{V,k})) / √(2πσ²_{V,k})  evaluated at ∂²g_{V,k}/∂d² = 0,    (7.29)

for d being a distance, which gives

A_{V,k} = 2(μ_{V,k} + σ_{V,k})^{B_{V,k}} / (log(2πσ²_{V,k}) + 1)    (7.30)

with σ²_{V,k} > exp(−1)/(2π), such that f_{V,k} intersects g_{V,k} at its right-tail inflection point. Next, we heuristically find both parameters by increasing B_{V,k} in order to minimize the mean squared error between f_{V,k} and g_{V,k} in the interval [μ_{V,k} + σ_{V,k}, ∞). It can be expressed as E(f_{V,k}, g_{V,k}) = ∫_{μ_{V,k}+σ_{V,k}}^{∞} (f_{V,k}(d) − g_{V,k}(d))² dd. Instead of finding parameters for each (V, k), two global parameters can be determined over all (V, k), involving all leakage prototypes and leakage observations used for profiling. Ultimately, the key decision is realized with Bayes' rule, such that the probability of a key candidate k̃ evolves with

Prob(k̃ | i_S^new) = Prob(L_T^{V,k} | i_S^new) · Prob(k̃) / Prob(i_S^new).    (7.31)


The prior probability Prob(k̃) is equal for each candidate in the beginning and takes the value of Prob(k̃ | i_S^new) while being further evaluated.
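The characterization step of Equations 7.28-7.31 can be sketched as follows; mu and sigma are assumed to be the empirical mean and standard deviation of the profiling distances, B is assumed to have been found heuristically, and all names are illustrative.

import numpy as np

def scaling_factor(mu, sigma, B):
    """A_{V,k} of Eq. 7.30, valid for sigma^2 > exp(-1)/(2*pi)."""
    return 2.0 * (mu + sigma) ** B / (np.log(2.0 * np.pi * sigma ** 2) + 1.0)

def prototype_probability(distance, A, B):
    """Posterior of Eq. 7.28 for the distance between observation and prototype."""
    return np.exp(-(distance ** B) / A)

def bayes_update(prior, likelihoods):
    """Key-candidate update of Eq. 7.31; normalization replaces Prob(i_S^new)."""
    posterior = prior * likelihoods
    return posterior / posterior.sum()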

7.2.6 Algorithmic Description of LPL

The profiling phase is summarized in Algorithm 7.1, the characterization phase in Algorithm 7.2.

Algorithm 7.1 LPL Profiling
Input: (1) Leakage observations {i_S^j}_{j=1}^{Ñ} with known (V, k), (2) iteration number I
Output: (1) Biases b_T^{V,k}, (2) A_{V,k}, B_{V,k}, Λ, Φ
1: Determine step-sizes α, β, γ acc. to Equations 7.26 and 7.27
2: Initialize prototypes L_T^{V,k} with empirical means and Λ and Φ with the identity matrix
3: Initialize previous biases [b_T^{V,k}]_0 and gradients [y_j]_0, [y_j^Λ]_0, [y_j^Φ]_0 with zero
4: for i = 1 to I do
5:   randomly select j ∈_R {1, ..., Ñ}
6:   determine L_T^C and L_T^D through the minimum of Eq. 7.8, where i_S^j has h_T and L_T^D has h'_T
7:   update biases b_T^C and b_T^D acc. to Equations 7.16, 7.17, 7.18, 7.19
8:   update Λ acc. to Equations 7.21 and 7.22
9: end for
10: Repeat steps 4–9 but update only Φ acc. to Equations 7.24 and 7.25
11: Find parameters A_{V,k} and B_{V,k} acc. to Equations 7.29 and 7.30

Algorithm 7.2 LPL Characterization
Input: (1) Leakage prototypes L_T^{V,k} with biases b_T^{V,k}, (2) A_{V,k}, B_{V,k}, Λ, Φ, (3) observation i_S^new
Output: Posterior probabilities Prob(L_T^{V,k} | i_S^new)
1: For each class (V, k) determine Prob(L_T^{V,k} | i_S^new) acc. to Eqns. 7.8 and 7.28

7.3 Empirical Results

In this section we report the performance of side-channel leakage prototyping and proceed as follows. We first investigate the leakage profiling on the basis of simulated observations. Simulated observations enable us to directly quantify the profiling accuracy since we have control over the leakage. Furthermore, we want to determine appropriate step-sizes for which the accuracy can reach its optimum (cf. Sec. 7.3.1). Afterwards we switch from simulated to measured side-channel observations to compare leakage prototype learning to other methods concerning the selection of leakage dependent time-instants (cf. Sec. 7.3.2) and the attack performance in a reasonable attack scenario against embedded microcontrollers (cf. Sec. 7.3.3). The attack performance will be stated with respect to the success rate. This metric provides the ratio between the successful key recoveries and all key recovery trials. Each empirical result has been determined by averaging over 1,000 trials.


In all cases we include distinct variation levels for R_T, generated by means of a normal distribution according to σ² ∈ {1, 9, 64} and subsequent rounding to the nearest integer. Unless otherwise stated, we apply 500 iterations to determine the biases, Equation 7.16 and Equation 7.17, respectively to locate the side-channel dependent time-instants, Equation 7.21. We then apply another 500 iterations to Equation 7.16 and Equation 7.17 to prevent negative effects due to potential fluctuations that arise from altering weights during the first iterations. Eventually, we spend 500 iterations on obtaining a distribution prototype according to Equation 7.23.

Remark 7.4. Actually, determining unbiased leakages demands making no assumptions on the device leakage but rather profiling each pair of v and k. However, to account for several research papers that consider a linear Hamming-weight model, and since this assumption has been demonstrated suitable several times in similar side-channel analysis setups, cf. [MOP08], we also include it in our experiments in the shape of a smart-card-like target platform. For the latter a Hamming weight might fit better since the side-channel leakage is rather caused by transferring the data bits on the bus than by higher-order dependent switching transitions as in hardware S-box implementations. Though, this does not concern the experiments in Section 7.3.1 because of the simulated side-channel leakage. These experiments should serve as a proof of concept to illustrate that a leakage prototype can be a better estimator of side-channel leakage than the mean.

Below, the term n profiling class-observations ("profiling" is omitted in the figures) signifies the number of observations employed per leakage class. Thus, the size of the profiling set is n times the leakage class count, e.g., for the Hamming-weight model the set size reads 9n. In contrast, new measurements to be characterized are called characterizing observations (or class-observations to characterize in the figures).

7.3.1 Profiling Accuracy

Leakage prototyping considers each time-instant independently (cf. Sec. 7.2.1); merely the weights determine whether a time-instant is involved or completely disregarded. Thus, our simulated observations exclusively contain leakage dependent time-instants to focus on the profiling accuracy. We simulated a linearly increasing Hamming-weight leakage, with the largest power consumption value at each of 10 generated time-instants being a random integer within [0, ..., 255] (corresponding to an 8 bit vertical DSO resolution). Thus, there are 9 leakage classes. We choose the distance between them to be d̄_h̃ = 4 to reasonably estimate the minimal Lipschitz constant K_min (cf. Sec. 7.2.4). Further, by means of the 10 time-instants with varying power consumption level we can assure a uniform profiling, meaning that a certain power consumption level is neither preferred nor disadvantaged. For the same reason we maintain an equal leakage distance, as well as equal weights, to cancel out weighting influences on time-instants. Finally, we added variance according to the levels from above and applied a rounding to the nearest integer to increase the effect of R_T.
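The simulated observations can be reproduced in spirit with a sketch like the following: nine Hamming-weight classes, ten leakage dependent time-instants with a class distance of 4, added normal noise, and rounding to the nearest integer. The assignment of the largest level to HW = 8 is an assumption for illustration, and all names are illustrative.

import numpy as np

def simulate_observations(n_per_class, sigma2, n_instants=10, n_classes=9,
                          class_distance=4, rng=None):
    """Simulated HW-style leakage: linear class levels plus rounded normal noise R_t."""
    rng = rng or np.random.default_rng()
    base = rng.integers(0, 256, size=n_instants)        # largest consumption per time-instant
    traces, labels = [], []
    for hw in range(n_classes):                          # Hamming weights 0..8
        level = base - (n_classes - 1 - hw) * class_distance
        noise = rng.normal(0.0, np.sqrt(sigma2), size=(n_per_class, n_instants))
        traces.append(np.rint(level + noise))            # rounding increases the effect of R_t
        labels += [hw] * n_per_class
    return np.vstack(traces), np.array(labels)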



Figures 7.1, 7.2, and 7.3 plot the average squared L1-distance, denoted L1-error throughout the rest of this work, between leakage classes within the first (out of ten) leakage dependent time-instants and the vector of simulated unbiased leakages, i.e., including all variance-free Hamming weights, as a function of the step-size α. The choice of another time-instant would yield equivalent results, which already negates the question whether different power consumption levels have an impact on profiling.

Figure 7.1: Determining the unbiased leakage via leakage prototype learning including simulated traces with R_t according to σ² = 1. Various step-sizes, starting from α = 0.5, show the L1-error decrease for a single time-instant and different numbers of profiling class-observations. © 2016 IEEE.

The plots start with a slightly higher step-size of 0.5 rather than 0.33 as estimated by Equation 7.27 for 3 profiling class-observations. In order to enable a good visual inspection, each outcome of LPL profiling is contrasted with the corresponding mean estimation of the unbiased leakage. In Figure 7.1 we can make two relevant observations: for one thing, the L1-error regarding the unbiased leakage is subject to a large decrease, which expectably scales down the better the mean estimation gets. For another thing, the best decrease is reached for smaller step-sizes as the number of profiling class-observations increases. The latter thus conforms well with the prediction of Equation 7.27, however with slightly larger step-sizes than expected. In Figure 7.2, respectively Figure 7.3, we observe the same relative decrease, although the L1-errors are obviously higher because of the higher variation. Roughly speaking, the overall error can be halved across all cases by the application of LPL profiling. On the other side we see that step-sizes continue decreasing with increasing variation. Increasing variation causes a convergence of adjacent leakages and thus narrows d̄_h̃. This fact naturally raises the assumption that there might exist an upper variance limit beneath which LPL profiling remains useful. It is obvious that a too small step-size can only cause a negligible improvement, as opposed to the mean estimation, which is quite evident from these figures. So the answer is yes, an upper limit exists. Nevertheless, by means of additional experiments we found that we could still achieve an improvement with a variation level according to σ² = 400, but with more profiling class-observations included.



Figure 7.2: Same representation as Figure 7.1 but with R_t according to σ² = 9. © 2016 IEEE.


Figure 7.3: Same representation as Figure 7.1 but with R_t according to σ² = 64. © 2016 IEEE.



Figure 7.4: L1-errors for a single leakage class are depicted along different profiling class-observations. The left-hand bars exhibit the largest error decrease, whereas the right-hand bars the largest induced error. Obtained from simulated traces with R_t according to σ² = 1. © 2016 IEEE.

Moreover, up to the variation levels shown here it did not make any difference if we applied (much) more iterations than 500, but in cases where the variation level is high, incrementing the iteration number is advisable. Now that we have figured out the best step-size in our simulated profiling scenario we look into the profiling at prototype level. All profiling class-observations are affected by variation, which also limits the performance of the LPL profiling. For this reason Figures 7.4 – 7.6 depict, under the same conditions as before, the largest error decrease regarding a single prototype against the largest induced error. Herein, error is again meant with respect to the L1-error regarding the variance-free Hamming-weight value. We remark that this time the L1-errors due to mean estimation are kept fixed, i.e., the same simulated observations were used and only LPL profiling has been repeated to assess its probabilistic nature. This makes it easier to see how LPL profiling performs during different runs and what is to be expected when given a certain measurement. Figure 7.4 indicates that the best error decrease is achieved for prototypes where the mean estimation shows a large error. Further, the maximum possible error decrease drops as the amount of profiling class-observations increases, a fact that we could already recognize from Figures 7.1 – 7.3. The induced error, in contrast, behaves exactly the other way around. It is large when the mean estimation was quite accurate already or has even matched the variance-free leakage. Nevertheless, the maximum induced error is always clearly smaller than the largest error decrease and, in addition to this, it vanishes even faster. Figures 7.5 and 7.6 provide basically the same results. As one would expect, the maximum error decrease shrinks



Figure 7.5: Same representation as Figure 7.4 but with R_t according to σ² = 9. © 2016 IEEE.


Figure 7.6: Same representation as Figure 7.4 but with R_t according to σ² = 64. © 2016 IEEE.

down a little bit. So does the induced error, but it is subject to higher variation. It is almost zero for a few profiling class-observations already. That might be due to the fact that the mean estimation is not as accurate as before. Generally, we cannot specify beforehand how many prototypes show a decreased (or increased) error compared to their mean-estimated counterparts. During our experiments we noticed up to 3 prototypes with an increased error, however, within the aforementioned bounds.

7.3.2 Locating and Selection of Leakage Dependent Time-instants — Results and Comparison

From now on we utilize measured side-channel observations instead of simulated ones. We therefore set up a usual measurement environment to make our results comparable to most of the other publications on profiled side-channel cryptanalysis. The smart-card-like target platform was represented by an 8-bit PIC microcontroller clocked at 3.68 MHz, running an AES software implementation. We measured the power consumption via passive probes across a 1 Ω resistor in the ground line using a DSO at a sampling rate of 125 MSamples/s. In a preparative step the observations were subjected to compression by means of peak extraction [MOP08]. This facilitates a better visualization of the following results.


Figure 7.7: Selection of leakage dependent time-instants for various methods. All values are shown as absolute values. Acquired with 3 measured profiling class-observations and intrinsic variation R_t.


We then verified the intrinsic non-key-related variance to be approximately one, namely σ_i² ≈ 1. To continue the previous variation-level experiments we accordingly added variance, so that we achieve the variation levels according to σ² ∈ {1, 9, 64} as before, with subsequent rounding. The AES S-box outcome, denoted S(v_i ⊕ k) for a given input v_i and key k, serves as the targeted selection function, respectively the side-channel leakage providing function h_t. In this work the experiments are focused on the first round of AES. Furthermore, we apply the linear Hamming-weight model, such that the side-channel leakage is expected to be HW(S(v_i ⊕ k)). The latter is possible since the microcontroller's architecture exhibits such leakage. In this section we pay attention to the selection of leakage dependent time-instants affiliated with the LPL Univariate Minima (UM) while iterating Equation 7.21. UM is compared with various methods, that is, eigenvalue PCA [APSQ06] and localized PCA [MBvLM12], KK-CPA [RO05], SVM weighting (cf. Chapter 6), Sum Of Squared T-differences (SOST) [GLRP06], and Linear Regression (LR) coefficient weighting [SLRP05]. In the case of PCA we retain p = 1 and p = 3 principal components. The latter are summed up as absolute values for visualization. Similarly, to prepare reasonable results concerning SVM, the (absolute) weight vectors are summed up. Accordingly, CPA coefficients and UM weights are also displayed as absolute values. Regarding LR we employ the subspaces F_2 and F_9. If methods do not handle observations grouped into leakage classes, namely PCA and CPA, they are supplied with the entire set of profiling class-observations at once.
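For reference, the leakage-model labels HW(S(v_i ⊕ k)) can be computed as in the following sketch; sbox is assumed to be the 256-entry AES S-box lookup table (not reproduced here) and the names are illustrative.

def hamming_weight(x):
    """Number of set bits in a byte value."""
    return bin(x).count("1")

def hw_labels(plaintexts, key_byte, sbox):
    """Leakage-model labels HW(S(v_i XOR k)) for each known input byte v_i."""
    return [hamming_weight(sbox[v ^ key_byte]) for v in plaintexts]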


Figure 7.8: Same representation as Figure 7.7 but acquired with 10 measured profiling class-observations.


Figure 7.7 as well as Figure 7.8 plot sample-related weights for each method, initially governed by the intrinsic variation level, including 3 and 10 observations per class, respectively a set of 27 and 90 observations in the case of PCA and CPA. Please note that the PCA p = 1 results overlay the p = 3 results, where the latter are plotted in gray. Further, the LR results for subspace F_2 overlay the results for subspace F_9, which are also plotted in gray. LPL UM, and also LPL profiling, is applied with the step-sizes figured out in Section 7.3.1. Hence we set α = β ∈ {0.4, 0.25}. However, in contrast to the step-size α, which relates to LPL profiling, the step-size β, relating to UM, is subjected to decay (cf. Sec. 7.2.4). The rationale for this is that, without decay, UM does not offer a reasonably fine granularity, meaning it only provides the strongest leakage dependent time-instants, usually not more than two. Anyway, we make use of that fact when it comes to considerably higher variation experiments, e.g., if σ² = 64 given just 3 profiling class-observations.


Figure 7.9: Selection of leakage dependent time-instants for various methods. All values are shown as absolute values. Acquired with 10 measured profiling class-observations and increased variation R_t according to σ² ≈ 9.

In Figure 7.7 similarities between some of the methods are easy to recognize. At first sight, all methods basically locate the same leakage-dependent time-instants, although some results are more evident than others. At second sight, we see that SVM weighting produces results very similar to CPA, just as UM does to PCA. In both cases the advantage lies with the method that accepts grouped observations: the results are less noisy, which facilitates an unambiguous selection of time-instants. It is also obvious that UM should be preferred over all other methods since most of the weights are zero or at least negligibly small. We recall that UM differs from the other methods in that time-instants not leading to a minimum can be safely disregarded. Strictly speaking this also holds for negative correlation coefficients if a positive correlation is expected, but this neither applies to the majority of time-instants, nor is there a sound argument to exclude negative coefficients. Both PCA methods, eigenvector and localized, are largely identical. For p = 1 the same principal component is retained. The p = 3 cases, though, seem noisier and apparently retain different principal components. These findings do not change if more profiling class-observations are employed. LR with respect to F2 clearly outperforms LR with F9. SOST shows almost the same quality as LR with F2. In Figure 7.8 the highlighting becomes even clearer for all methods, but especially for UM where, as of now, all other weights are exactly zero.

[Figure: per-method sample weights over the 400 observed samples; same panel layout as before.]

Figure 7.10: Same representation as in Figure 7.9, but with increased variation Rₜ according to σ² ≈ 64.

This does not mean that solely UM provides the only correct time-instants, but there is no need for any additional selection mechanism. Still, with CPA and SVM weighting the residual noise is small enough to facilitate a peak-value based criterion for the final selection. The PCA results are twofold again. The first principal component is identical for both branches and their residual noise is very low, but for p = 3 there still exist many potentially useless, though adversely affecting, peaks. SOST is a little noisier than LR with F2, yet both methods also provide a clear selection of time-instants. In contrast to LPL UM it should be noted that SOST and LR select fewer time-instants. We assume that the two additional early peaks are residual dependencies on the S-box input that are stressed more strongly by LPL UM.


These early peaks could be verified using LR with F9 (but not with F2), however only when involving considerably more (> 60) profiling class-observations. The next two experiments incorporate side-channel observations subjected to higher (added) variance. The step-sizes were set to {0.2, 0.15, 0.1} and {0.1, 0.1, 0.1}, respectively (cf. Sec. 7.3.1). Figure 7.9 concerns a total variance of σ² = 9, yielding even slightly better results than σ² = 1. That is because the relatively moderate added variance most likely affects non-leakage-dependent time-instants more strongly and thus improves the overall results, even though the peak values have decreased for localized PCA and SVM weighting. The value decrease of UM, though, originates from the enforcement Σᵢ λᵢ = 1 (cf. Section 7.2.4), caused by additional peaks. Once more, the quality of SOST and LR with F2 is comparable to LPL UM except for the missing early peaks. In Figure 7.10 the picture is clearly different. The performance of localized PCA and eigenvalue PCA for p = 3 drops seriously since useful peaks are hardly distinguishable. This also extends to p = 1, although localized PCA performs perceptibly better this time. Unlike before, CPA also drops some useful peaks, as does SVM weighting, additionally accompanied by more adverse peaks. UM still performs best but shows noticeably more adverse peaks; we would argue that there is still no need for a further selection mechanism. The selection quality of SOST is clearly decreased, whereas LR with F2 does not show any large adverse peaks, although there is always a certain background noise level.

7.3.3 Attack Performance and Comparison

In this section we provide a thorough performance comparison, by means of the success rate, between profiled side-channel attacks and their variants. Note that [GLRP06] and [SMY09] were also taken into consideration when implementing the attacks. [CK14] deals with implementing some of the attacks in a numerically stable and efficient way and provides a good-practice guide on complexity-reducing methods like PCA, with which the attacks might perform better. Moreover, a PCA computed beforehand on the means of the class-observations could be used to transform the raw observations in such a way that the key-dependent variance is concentrated in the first few dimensions only. The attack naming scheme is as follows. The type, e.g., LPL, is followed by the time-instant locating method, e.g., UM, separated by a slash. An appended number signifies p, i.e., the actual selection according to the number of retained principal components (PCA) or correlation peaks (CPA). In contrast, there is no further selection applied to UM or SVM weighting. The profiling class-observations have been arranged randomly in sets but are equal across all variants. Characterization observations have been arranged randomly anew for each variant and profiling set size. The abbreviation "w/D" means without considering the variation distribution, that is, omitting the distribution prototype, respectively setting the covariance matrix to the identity. In more detail, MVND accompanied by PCA represents the principal-subspace based template attack [APSQ06] rather than utilizing PCA as a pure time-instant selection method. Finally, the Stochastic Method (SM) attack is involved with the subspaces F2 and F9. The F9 subspace does not target the Hamming weights but the bit-wise estimation of the S-box outcome; the Hamming weight belongs to F2, which is claimed in [SLRP05] to be less sufficient for their experiments. In the previous section, however, we have seen that, at least for the selection of time-instants, F2 is much better suited. Thus, for the SM attack with F2 we use the LR weighting as the time-instant selection method.


[Figure: individual success rates of the attack variants (MVN/lPCA, MVN/ePCA, MVN/CPA, MVN/UM, each with and without distribution consideration, SVM, SM with F9 and F2, and LPL/UM) over the number of class-observations to characterize (5 to 200).]

Figure 7.11: Comparison of attack performances measuring the success rate. The sub-plots include (top-down) 3, 5, 10, and 15 measured profiling class-observations. Intrinsic variation Rₜ.

All attack performances conform with our previous variation levels σ² ∈ {1, 9, 64}. The success rate is then a function of the number of characterizing observations. In Figure 7.11 we can recognize a group, including both LPL variants, SVM, SM, and MVND via CPA and UM (the latter two without distribution consideration), that converges quickly to a success rate of 1, even for a few profiling class-observations. As soon as the number of profiling class-observations reaches 10 or more, MVND via CPA and UM with distribution consideration also belong to that group; otherwise they do not succeed at all. This circumstance, which also holds for the experiments with higher variation, is owed to the fact that the covariance matrix is close to singular. We could have bypassed this by increasing p, but then we would have included some arbitrary, possibly adverse, correlation peaks. The PCA attacks require a more detailed inspection. In Figure 7.12 we can make similar observations. Variants with one retained principal component perform absolutely equivalently, except for fluctuations caused by the randomly arranged characterization sets. Even here, omitting the variation distribution seems to be the better choice. What we did not expect, given what we learned from Section 7.3.2, is that localized PCA based on p = 3 dominates eigenvector PCA based on p = 3. However, with at least 10 profiling class-observations, eigenvector PCA with p = 3 including distribution consideration becomes the overall best PCA variant.
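For reference, the success rate plotted in these figures can be estimated empirically along the following lines; this is only an illustrative sketch, and score_key is a placeholder for whichever variant's key-candidate scoring (template likelihood, LPL distance, etc.) is under evaluation, not a function of the thesis code base.

```python
import numpy as np

def first_order_success_rate(score_key, experiments, true_key):
    """Fraction of independent experiments in which the correct key byte obtains
    the best score over all 256 candidates, i.e., is ranked first."""
    hits = 0
    for characterization_set in experiments:      # one set of characterizing observations per run
        scores = np.array([score_key(k, characterization_set) for k in range(256)])
        hits += int(np.argmax(scores) == true_key)
    return hits / len(experiments)
```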


[Figure: individual success rates of the same attack variants as in Figure 7.11, plotted over the number of class-observations to characterize.]

Figure 7.12: Same representation as in Figure 7.11, but with increased variation Rₜ according to σ² ≈ 9.

Nonetheless, one can recognize the up-and-down moving convergence of the PCA variants without distribution consideration. The group of fast-converging variants still exists, whereas SVM and SM with F9 moderately fall behind. Generally, the SVM performance is comparable to that of SM, although with 10 profiling class-observations SVM slightly outperforms SM regarding F2 and F9. There is also MVND via UM in between SVM and SM in the top-most sub-figure; we suppose that this is due to a lucky choice of time-instants, such that the matrix is not close to singular. Given a smaller profiling set, localized PCA with p = 1 dominates, but it is thereafter superseded by eigenvector PCA with p = 1 and distribution consideration. Given a larger profiling set with at least 10 profiling class-observations, all PCA variants show equivalent performance. A view on Figure 7.13 reveals evident differences. Both LPL variants and SM with F2 dominate, especially with smaller profiling sets. Here, the variation distribution should be considered in any case, in contrast to all other variants, where distribution consideration leads to less efficient attacks. Nevertheless, for a very small number of characterizing observations, LPL considering the distribution is slightly better suited than SM with F2.


[Figure: individual success rates of the same attack variants as in Figure 7.11, plotted over the number of class-observations to characterize.]

Figure 7.13: Same representation as in Figure 7.11, but with increased variation Rₜ according to σ² ≈ 64.

Further, it is interesting to see that with more profiling observations LPL still provides better results, whereas SM with F2 does not benefit from more profiling observations in that range. In contrast to other works [GLRP06, SLRP05], it should be pointed out that the subspace F2 is indeed reasonable when it comes to investigating Hamming-weight based side-channel leakage. The PCA variants only succeed when the variation distribution is considered.

7.4 Conclusion

We introduced a profiled side-channel attack that aims at determining the unbiased side-channel leakages contained in a device. In the course of this we first gave a formal discussion of the basics and afterwards demonstrated the effectiveness of the profiling accuracy based on simulated leakages. Advantageously, the profiling provides a clear selection of leakage-dependent time-instants without requiring a further selection mechanism. Furthermore, the computational effort is very reasonable; there is no need to perform potentially unstable operations, e.g., matrix inversions, as the profiling iteratively applies precomputed derivatives instead. Ultimately, a comprehensive performance comparison between various variants of profiled side-channel attacks confirms this attack as highly promising.


Part IV

Applications

Chapter 8 Salted Side-channel Based Perceptual Hashing to Protect Intellectual Property

The protection of IP is a challenging task, especially in secured embedded systems where program code that is suspected to be a plagiarism cannot simply be read out for further inspection. Previous work has presented a framework based on perceptual hashing to detect intellectual property in hardware and software designs. In this chapter we extend this framework to detect IP plagiarism in embedded systems such that it can reliably match contents even in the presence of countermeasures. To this end, we propose an adapted signal-feature extraction method, based on the wavelet transform, to form a salted side-channel hash function.

Contents of this Chapter

8.1 Introduction ...... 103
8.2 Perceptual Hashing ...... 104
8.3 The Wavelet Transform ...... 107
8.4 Salted Side-Channel Based Perceptual Hashing ...... 108
8.5 Experimental Results ...... 111
8.6 Conclusion ...... 116

8.1 Introduction

The plagiarism of intellectual property is without doubt a serious threat to the software industry. For instance, in 2011 almost 42% of the software distributed worldwide was already pirated, causing a financial damage of over 63 billion US dollars [Bus12]. Although these numbers mainly concern general-purpose computer software, they are likely to be a good indicator for embedded software as well. The embedded software industry, however, has virtually no opportunity to confirm a suspected plagiarism. Products on the market are equipped with memory read-out protections, and thus a legitimate IP holder can only evaluate suspicious embedded software on the basis of its functionality, which provides no real indication. However, side-channels can assist plagiarism detection in embedded systems. The authors in [BBP11] applied the idea of CPA [BCO04] to embed a watermark in the assembly source code of the Keeloq algorithm. To this end, a secret watermark key is first combined with an input of Keeloq and leaked out bit-wise. The correct key is then indicated by the attack, confirming its existence in the code.

In [BSPB12] the power consumption of a microcontroller is used to determine the Hamming weight in each cycle that corresponds to the prefetch of a certain instruction. The resulting map of Hamming weights can thus be matched with a second map to expose the similarity of two program codes. To this end, string-matching algorithms are employed that are robust even against minor changes like register substitutions. The authors further extend the previous watermark approach to carry binary information, i.e., a digital signature. The approach is close to the watermark key extraction, but this time the leaked bit is beforehand either inverted to generate a negative, or not inverted to generate a positive, correlation peak. Thus, the binary information can be visualized by consecutive peaks. Obviously, all of these approaches heavily rely on the assumption that the watermark code will not be removed, especially since it is non-functional code. It is very likely that source code analysis software will detect such code portions. In [DGK+12] the authors use a relaxed version of the formalism for security features of physical functions defined in [AMS+11]. In [DGK+12], however, it is centered around the physical representation of the IP, e.g., the power consumption or intrinsic hardware properties like a Physically Unclonable Function (PUF), for later identification. The information derived from the physical properties is then fed into a perceptual hash function to extract a discriminative, IP-content-sensitive output. In a first experimental case study several public implementations of different block ciphers, each of which represents an IP, are involved to evaluate the performance. The power consumption of the device running the implementations was measured and compressed with the Fast Fourier Transform (FFT) to cancel out high-frequency noise. To compare these compressed power-consumption vectors, the Pearson correlation coefficient was applied to decide whether the IPs are similar or not. It has been shown that the implementations can be efficiently distinguished even with respect to minor code transformations.

8.1.1 Contribution

Extending this formalism, we propose a perceptual hashing method in Section 8.4 that is parametrized by a salt to counteract various program code transformations. Usually, an attacker, i.e., a plagiarist, would try to eliminate the similarities between the original IP and his illegitimate plagiarism. For instance, such an undetectable plagiarism can easily be achieved by gradually transforming, respectively modifying, the original IP program code and re-generating the hash value until the similarity can no longer be proven. Using the salted hashing method, the plagiarist may still be able to perform code transformations, but due to the salt he cannot guess their particular influence on the content-sensitive hash value. Thus, perceptual hashing becomes much more robust against program code transformations. In this chapter we utilize and adapt the wavelet transform in order to obtain signal features that depend on a salt. For evaluation purposes the signal features are first quantized and then assessed with the help of the Bit Error Ratio (BER).

8.2 Perceptual Hashing

In compliance with the definitions in [DGK+12], a perceptual hash function is a probabilistic procedure H : (IP, content) → h that outputs a hash value given a physically observed quantity of a device, referred to as the content, while it runs the program code of the IP. The main step of the hash value generation is the feature extraction, which filters significant samples that are either invariant in case of similar contents or variant in case of distinct contents.

A perceptual hash function should possess the following properties, according to [MU06]. Let H(•) denote the perceptual hash function and x, x′, y contents, where x′ equals x except for minor modifications and y is perceptually distinct from both. Further, the values h, h₁, and h₂ are elements of the hash value space.

(1) Equal distribution. Prob(H(x) = h) should be equal for every hash value h.

(2) Pairwise independence. Prob(H(x) = h₁ | H(y) = h₂) ≈ Prob(H(x) = h₁).

(3) Invariance. Prob(H(x) = H(x′)) ≈ 1.

(4) Distinction. Prob(H(x) = H(y)) ≈ 0.

Throughout the remainder, two hash values are subjected to a similarity detection function D : (h, h*) → [0, 1] that makes a decision about the grade of similarity. A perceptual hash function is of probabilistic nature since it deals with physically measured variables that are inherently subject to noise, which means that two measurements of the same IP content yield similar but not identical hash values. Therefore, the detection relies on two properties, the threshold-bounded perceptual robustness and the content sensitivity. The perceptual robustness gives the probability that two IP contents which are indeed similar, respectively identical, reach a similarity score larger than a predefined threshold τ ∈ [0, 1]. Conversely, the content sensitivity gives the probability that two IP contents which are indeed distinct reach a similarity score smaller than τ (cf. [DGK+12]).

8.2.1 Our Perceptual Hashing Approach

We slightly redefine the above introduced properties while assuming a classification problem. In the following, perceptual robustness and content sensitivity are abbreviated with PeRo, respectively CoSe, when appearing as indices. With regard to our approach, the similarity score distribution N_sim(τ_PeRo, σ²_PeRo) is employed when the IP contents are similar or identical, and the distribution N_dis(τ_CoSe, σ²_CoSe) is employed when the IP contents are distinct. The perceptual robustness can be interpreted as the ability to perceive similarities (the complement of false rejection) while the IP code is subjected to a set of code transformation functions F_pre that preserve the content. Here, it is expressed by the threshold τ_PeRo ∈ [0, 1], which is the expected similarity score for similar or identical IP contents. The content sensitivity can be interpreted as the ability to perceive differences (the complement of false acceptance) while the IP code is subjected to a complementary set of code transformation functions F̄_pre that fundamentally change the content. It is expressed by the threshold τ_CoSe ∈ [0, 1], which is the expected similarity score for distinct IP contents. The hard decision boundary, i.e., the boundary that decides with a sharp edge whether the investigated content indeed belongs to the assumed IP, needs to be placed within a cleared space between N_sim and N_dis. It depends on the application in the concrete case of plagiarism and will therefore only be discussed in the context of the experiments in Section 8.5.


Hence, a theoretically optimal similarity detection function has a perceptual robustness threshold τ_PeRo = 1 (the probability that similar IP contents are detected as similar is exactly one) and a content sensitivity threshold τ_CoSe = 0 (the probability that distinct IP contents are detected as similar is exactly zero). In practice, however, we obtain variances larger than zero, at least due to measurement noise, and thus we require the difference of the threshold values to be large in order to reduce the probability of misclassification errors.

Remark 8.1. A perceptual hash function, particularly its similarity detection function, needs to maximize the threshold difference Δτ = τ_PeRo − τ_CoSe, satisfying 0 ≤ τ_CoSe < τ_PeRo ≤ 1, in order to provide a meaningful decision.
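Assuming similarity scores have been collected for known-similar and known-distinct content pairs, the criterion of Remark 8.1, together with the 4σ clearance we later use in Section 8.5, can be checked along the following lines (an illustrative sketch only; the example scores are hypothetical).

```python
import numpy as np

def threshold_gap(scores_similar, scores_distinct):
    """Estimate tau_PeRo, tau_CoSe, the gap delta_tau, and whether the 4-sigma
    borders of the two empirical score distributions are cleanly separated."""
    tau_pero, s_pero = np.mean(scores_similar), np.std(scores_similar)
    tau_cose, s_cose = np.mean(scores_distinct), np.std(scores_distinct)
    separated = (tau_pero - 4.0 * s_pero) > (tau_cose + 4.0 * s_cose)
    return tau_pero - tau_cose, separated

# Hypothetical scores: a meaningful detector yields a large gap and clear separation.
delta_tau, ok = threshold_gap([0.95, 0.93, 0.96, 0.94], [0.15, 0.18, 0.16, 0.17])
```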

Unfortunately, similarity-harming program code transformations, here defined as minimal program code transformations which successfully eliminate the similarity of two contents, are straightforward to find since the plagiarist is able to generate the hash value after each transformation to examine its impact on the similarity detection function D. Given an original IP content, a similarity-harming transformation is successful if

• the original IP is transformed with a function f ∈ F̄_pre, s.t. it is detected as distinct: D[H(IP, content), H(f(IP, content))] ≤ τ_CoSe, or

• a replicated IP is transformed with a function f ∈ F̄_pre, s.t. it is detected as similar: D[H(IP, content), H(f(IP_rep, content_rep))] ≥ τ_PeRo.

To counteract such transformations, a hash value can be generated using a salt such that the plagiarist cannot simply alter the program code until the similarity has vanished. In our proposal the salt is employed in such a way that it is not under the plagiarist's control which portions of the code need to be altered to achieve a fundamental change. In principle, a salt dependency can be used within different components of the hashing scheme (Fig. 8.1). The content can be randomized before the feature extraction, which can be seen as a kind of content scrambling.

[Figure: block diagram of the hashing scheme: content → pre-processing → feature extraction → post-processing → hash value, with a salt-dependent randomization applicable at each step.]

Figure 8.1: The perceptual hashing scheme basically consists of three notable steps, each of which can be combined with a salt-dependent randomization. Generated hash values are then provided to a similarity detection function.

Further, the salt can influence the feature extraction, e.g., by randomizing its parameters, and eventually the hash value itself can be randomized, e.g., by adding the salt. Many schemes for image hashing, where an attacker wants to preserve the hash value rather than change it, prefer a randomization before or after the feature extraction (cf. Fig. 8.1). With such a hash function, however, a plagiarist according to our scenario would examine the power consumption shape after each modification. The hash value is fundamentally distinct if he manages to change the shape fundamentally, since that would still severely influence the extracted features.

Therefore, we need a hash function with a salt-dependent feature extraction, such that the plagiarist is unable to predict the influence of his modifications on the features, and hence on the hash value. As a consequence, an improved perceptual robustness arises as a property.

8.3 The Wavelet Transform

The wavelet transform applies a sliding window in the shape of a modulated window function that contains the frequency information. This function is called a wavelet, denoted with Ψ(t). Generally speaking, it is a damped oscillation that has compact support. The support, denoted with p, is half the time-width where the amplitude is non-negligibly distinct from zero, and thus 2p is the meaningful time-width of a wavelet. Various wavelets have been presented, including Haar, Gaussian, Morlet, Mexican Hat, and so forth (cf. Fig. 8.2). Each has its own base frequency f₀ and support p.

[Figure: (a) Ψ(t) = e^{i2πt} · e^{−t²}; (b) Ψ(t) = 2/(√3 · π^{0.25}) · (1 − t²) · e^{−t²/2}, both plotted over t ∈ [−5, 5].]

Figure 8.2: (a) shows the Morlet wavelet with support p = 4 and (b) the Mexican Hat wavelet with support p = 5.

In order to make wavelets capable of handling signals that have a time-width greater than 2p and frequencies distinct from f₀, wavelets are scaled and shifted accordingly, such that Ψ_{a,b}(t) = (1/√a) · Ψ((t − b)/a) with a ∈ R⁺ and b ∈ R. The support is then given by ap and the wavelet is shifted by b along the time axis. The frequency is now denoted by f_a = f₀/a. This leads to the continuous wavelet transform in the time domain of a signal x(t),

W(x, a, b) = ∫_{−∞}^{∞} x(t) Ψ_{a,b}(t) dt = ⟨x, Ψ_{a,b}⟩ ,     (8.1)

where W(x, a, b) is the wavelet coefficient of x that states how well the wavelet fits the overlapping fraction of x(t), with 0 ≤ W(x, a, b) ≤ max(|x|², |Ψ_{a,b}|²). Numerous pairs (a, b) ensure that x(t) is appropriately covered (Fig. 8.3), so that x(t) can be represented by the wavelets. However, in practice it is only possible to choose a and b from a finite set, provided that the reconstruction of x(t) remains possible. For instance, the discrete dyadic wavelet transform achieves this by using a = 2^m and b = n · 2^m for m, n ∈ Z. For further details we refer to [Add02].
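For a sampled signal the integral in Equation 8.1 turns into a finite sum. The following sketch (an illustrative discretization, not an optimized implementation) evaluates W(x, a, b) for the Mexican hat wavelet of Figure 8.2(b); the toy signal is, of course, arbitrary.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat wavelet as in Figure 8.2(b)."""
    return 2.0 / (np.sqrt(3.0) * np.pi ** 0.25) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def wavelet_coefficient(x, a, b, dt=1.0):
    """Riemann-sum approximation of W(x, a, b) = <x, Psi_{a,b}> for samples x[0..n-1]
    taken at times t = 0, dt, 2*dt, ..."""
    t = np.arange(len(x)) * dt
    psi_ab = mexican_hat((t - b) / a) / np.sqrt(a)     # scaled and shifted wavelet
    return float(np.sum(x * psi_ab) * dt)

# Toy usage: coefficient of a noisy sine at scale a = 8 and shift b = 50.
rng = np.random.default_rng(1)
x = np.sin(2.0 * np.pi * np.arange(400) / 40.0) + 0.1 * rng.normal(size=400)
w = wavelet_coefficient(x, a=8.0, b=50.0)
```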

[Figure: signal x(t) together with two wavelets Ψ_{a1,b}(t) and Ψ_{a2,b}(t) at different scales.]

Figure 8.3: As an example, the wavelet in the middle has a small a and covers high frequencies of the signal x, whereas the lower wavelet has a larger a to cover low frequencies of x.

8.4 Salted Side-Channel Based Perceptual Hashing

It is assumed that the power consumption or electro-magnetic emanation of a device is substantially related to the executed instructions and the processed data, meaning that locally different instructions and data lead to a locally different power consumption. This is certainly true for devices based on CMOS technology, like µCs and FPGAs, since the power consumption can be expressed as the sum of an instruction-dependent and a data-dependent component, as well as a constant and a noise component [MOP08], thus

P_total = P_instruction + P_data + P_static + P_noise .     (8.2)

In the remainder we will call the physically observed power consumption (or electro-magnetic emanation), i.e., the measured oscilloscope traces, the power consumption shape, or shape for short, to emphasize that we are interested in the significant progression of the amplitude caused by the instructions while processing the program. The shape refers to the content of the investigated IP program code.

8.4.1 Assumed Types of Plagiarism

In our scenario the legitimate IP holder generates a power-consumption based perceptual hash value by means of his own device, which together with its program code represents a product available on the market. If the holder suspects a product of a competitor to be a plagiarism of his own product, he generates a second perceptual hash value and compares it to his own using D. In the case of a plagiarism the contents, i.e., the power consumption shapes, are perceptually similar, and so are the hash values. Plagiarism can mean two things, theft or unlicensed use. The latter case might be even more important since it might be more likely that a competitor implements a proprietary algorithm without permission rather than committing a theft. Nevertheless, since efficiency is truly the major criterion for most products, two state-of-the-art implementations will probably lead to similar program code. On the downside, this fact may cause problems if an algorithm is free to use but the IP holder supposes plagiarism anyway. In this particular case it is important to include those portions of the program code in the perceptual hash value that represent the uniqueness of the original code.

Generally, plagiarism detection could be accomplished by a simple comparison, i.e., a distance measure, involving the power consumption shapes. But as soon as the plagiarist learns that IP detection is in use he will try to counteract it: he successively alters the program code until the power consumption shape is significantly different, while preserving the program code's functionality and efficiency. Consequently, a simple comparison based on a distance measure is very likely to fail for two reasons. When using a distance measure one is concerned with the problem of finding a threshold distance; this would end up in pure guessing. Further, one may argue that alignment methods, like elastic alignment [vWWB11, MWB11], can match similar parts of two shapes, but such methods would even correlate two random signals. In our scenario we assume three adversarial plagiarists.

• Plagiarist 1 possesses the ready-to-upload system code file due to theft, e.g., memory read-out or network intrusion.

• Plagiarist 2 possesses the program source code due to theft or reverse-engineering.

• Plagiarist 3 possesses the proprietary algorithm and intends unlicensed use.

Obviously, by possessing the system code file only (adversarial plagiarist 1), it is very challenging to obtain a different power consumption shape by editing the file without risking errors in the program code, at least when it is not simply assembly code. Possessing the program source code (adversarial plagiarist 2), in contrast, enables a few more possible attacks, amongst others

• re-compilation with different parameters,

• inserting dummy instructions or random delays, as well as

• permuting instructions.

Further, these are also the possible attacks if the plagiarist intends unlicensed use (adversarial plagiarist 3), but here he already considers them during the programming phase. Program code re-compilation might be a pure trial-and-error process since the plagiarist can merely guess the influence on the power consumption instead of targeting accurate changes. Nevertheless, the attack is sufficient as long as the shape changes. The important question that arises is which impact various compiler parameters have, provided that the code is already highly optimized assembly code, for instance. The static insertion of additional instructions or delays has an inherent effect on the power consumption but, concurrently, negatively affects the efficiency of the program code. The same applies to randomly inserted delays, further aggravated by the problem of finding a good source of randomness, which requires additional undesired effort. In side-channel attacks one may also be confronted with random delays, which can be efficiently bypassed by alignment methods [vWWB11, MWB11]. For side-channel hashing, several power consumption traces can be aligned with each other before generating the hash value. Permuting instructions is only possible where the semantic functionality remains unaffected. Moreover, permuting recurring code segments that only differ in the processed data will not change the power consumption significantly, since variations induced by the data itself might only slightly affect the amplitude.

Although this data dependency is crucial in side-channel attacks, it is considered negligible for side-channel hashing here, since the power consumption shape remains similar.

8.4.2 Our Proposal using the Wavelet Transform

The scheme relies on the short oscillation of the wavelets, such that each sample of a power consumption shape is either damped or amplified depending on a salt. Thereby the entire shape is incorporated several times through several hash coefficients. Each time, very local portions of the shape have either a stronger or a weaker influence on a hash coefficient. Furthermore, for each hash coefficient different intervals are formed and thus correlated to each other. Take note that, due to the perceptual hashing concept, it is not possible to commit a chosen salt to the resulting hash value: the hash value is still supposed to be a discriminative feature of the shape, and thus different salts should lead to a similar value (cf. Sec. 8.2). However, with such an approach the plagiarist loses the ability to exert direct influence on the hash value and therefore on the similarity decision. We suggest a side-channel perceptual hash value of the form H_s(IP_shape) : Rⁿ → {0, 1}^{l·r}, where l is the number of hash coefficients used and r the quantization resolution. Two such binary hash values can then be compared through the Bit Error Ratio (BER).

Generation of Side-channel Hash Values

In a first step the IP holder chooses an interval of the power consumption shape which represents the uniqueness (the actual intellectual property) of the original program code, e.g., a proprietary algorithm. This results in the power consumption shape of interest IP_shape containing n samples. The wavelet transform (Eq. 8.1) is applied to build a salt-dependent feature extraction. The randomly chosen scaling parameter a and shift parameter b function as the salt, whereas the output consists of l hash coefficients c_i:

(1) Choose a wavelet function Ψ_{a,b}(t) with support p.

(2) Generate the salt s ← {(a_i, b_i)}, i = 1, ..., l, where a_i ∈_R [a_min, a_max] and b_i ∈_R {0, 1, ..., 2·a_i·p − 1}.

(3) Compute the hash coefficients c_i = Σ_m W(IP_shape, a_i, m·a_i·p − b_i) for m ∈ {0, 1, 2, ...}.

l (4) Quantize the hash coefficients, s.t. Hs(IPshape) = {Q(ci)} where Q is the quantizer.

Two types of quantizers are imaginable. A uniform quantizer generates bins so that all co- efficients are distributed among equidistant quantization portions. Contrarily, a non-uniform quantizer generates bins so that the bins are equiprobable distributed over all coefficients where the quantization threshold is defined through the mean of the coefficients within a bin. If the number of bins is 2r the final hash value contains l · r bits. We keep the scaling values ai bounded to reduce inherent negative effects on the performance (cf. [MV06]). Wavelets containing either too high frequencies (small a) or too low frequencies (large a) are not perceptual and give poor performance due to the dominating randomization on one side and severe information loss on the other. That means the perceptual robustness

That means the perceptual robustness decreases with smaller scaling values, whereas larger values decrease the content sensitivity. Obviously, the interval should be large in order to provide a sufficient number of distinctly scaled wavelets. Therefore, finding a proper scaling interval is a trade-off problem, which will be addressed in Section 8.5. To include all samples of IP_shape in each hash coefficient, multiple wavelet coefficients with periodically consecutive wavelets are added up. Hence, the wavelets can be effectively shifted by the b_i within the range given above.
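A minimal sketch of steps (1) to (4) is given below (Python/NumPy, uniform quantizer only); the parameter names follow the text, the defaults anticipate the values found suitable in Section 8.5, and the implementation is illustrative rather than the one used for our measurements. Note that the same salt must be reused when hashing the shape of a suspected plagiarism.

```python
import numpy as np

def gaussian_wavelet(t):
    """Gaussian-type wavelet used in Section 8.5: (2/pi)^0.25 * exp(-t^2)."""
    return (2.0 / np.pi) ** 0.25 * np.exp(-t ** 2)

def wavelet_coefficient(x, a, b):
    """Discrete approximation of W(x, a, b), cf. the sketch in Section 8.3."""
    t = np.arange(len(x))
    return float(np.sum(x * gaussian_wavelet((t - b) / a) / np.sqrt(a)))

def salted_hash(shape, salt=None, l=200, r=5, p=5, a_min=40.0, a_max=100.0, seed=None):
    """Steps (1)-(4): draw (or reuse) a salt (a_i, b_i), accumulate periodically
    shifted wavelet coefficients c_i, and quantize them into an l*r-bit hash."""
    rng = np.random.default_rng(seed)
    n = len(shape)
    if salt is None:                                   # step (2): random salt
        salt = []
        for _ in range(l):
            a = rng.uniform(a_min, a_max)
            salt.append((a, int(rng.integers(0, int(2 * a * p)))))
    coeffs = []
    for a, b in salt:                                  # step (3): c_i = sum_m W(shape, a_i, m*a_i*p - b_i)
        step = max(1, int(round(a * p)))
        coeffs.append(sum(wavelet_coefficient(shape, a, pos - b)
                          for pos in range(0, n + step, step)))
    coeffs = np.asarray(coeffs)
    edges = np.linspace(coeffs.min(), coeffs.max(), 2 ** r + 1)[1:-1]   # step (4): uniform bins
    idx = np.digitize(coeffs, edges)
    bits = ((idx[:, None] >> np.arange(r)) & 1).astype(np.uint8).ravel()
    return bits, salt
```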

Similarity Detection of Side-channel Hash Values

To compare two such binary hash values h and h*, the BER, as suggested by related works [MV06, MU06, NWP11], can serve as the similarity detection function D, which in this context is simply given through the similarity score

D_BER(h, h*) = 1 − 2 · BER(H_s(IP_shape), H_s(IP*_shape)) = 1 − 2 · HD(Q(c_i), Q(c_i*)) / (l · r)     (8.3)

with the Hamming Distance (HD) counting the number of positions at which the bits differ. Independent of the true threshold values τ_PeRo and τ_CoSe, it is expected that two similar hash values reach a similarity score close to one and two distinct hash values a similarity score close to zero. Theoretically, it holds that −1 ≤ D_BER(h, h*) ≤ 1, but a score smaller than zero would also imply a dependence of whatever kind between h and h*.
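The similarity score of Equation 8.3 then reduces to a simple bit comparison of the two hash values, as the following sketch illustrates; reference_shape and suspect_shape are placeholders for two measured power consumption shapes and not variables defined elsewhere in this thesis.

```python
import numpy as np

def d_ber(h, h_star):
    """Similarity score of Equation 8.3: 1 - 2 * BER between two l*r-bit hash values."""
    h, h_star = np.asarray(h, dtype=np.uint8), np.asarray(h_star, dtype=np.uint8)
    return 1.0 - 2.0 * np.count_nonzero(h != h_star) / h.size

# Usage together with the generation sketch above (the same salt for both shapes):
# h_ref, salt = salted_hash(reference_shape, seed=42)
# h_sus, _    = salted_hash(suspect_shape, salt=salt)
# score = d_ber(h_ref, h_sus)
```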

8.4.3 Purpose and Limitations of Salted Perceptual Hashing

As already pointed out, we make the following important assumption, which underlies the performance demonstration of our approach in the upcoming section.

Assumption 8.1. The plagiarist is able to influence portions of the shape according to the considered attacks, but he neither manages to change the entire shape fundamentally while preserving the full efficiency and functionality of the program code, nor is he, by his very nature, willing to spend a lot of effort.

We recall that the similarity between program codes can never be proven if the shape has changed fundamentally. This is an inherent property of the perceptual hashing concept, since otherwise arbitrarily independent program codes would appear similar when they are definitely not. Consequently, a plagiarist who mounts an attack with a very large impact on the shape, with no regard for the program code's efficiency, will always be successful in the sense of making two similar program codes look independent.

8.5 Experimental Results

Before we assess the practical performance of our approach in the presence of attacks, we need to figure out the appropriate scaling interval [a_min, a_max] for the wavelet feature extraction (cf. Sec. 8.4.2) and the influence of the number of hash coefficients, respectively the number of quantizing bins.


Optimization of the Wavelet Parameters

For this purpose we initially simulated three shapes with 5,000 samples, each represented by an 8-bit value. The first shape is randomly generated; the second is a modified copy of it in order to obtain two similar shapes affected by considerable measurement noise. In particular, normally distributed noise with a standard deviation of σ_sim1 = 1 is added to each sample, and noise with σ_sim2 = 5 to 20 consecutive samples out of every 100 samples. A third shape is again a modified copy of the first, but with normal noise characterized by σ_dis = 15 added to each sample. Hence, the third shape is distinct from the first two shapes but not entirely independent of them. This might seem somewhat arbitrary at first sight, but it is indeed appropriate, since measured power consumption shapes of two different IPs share a common constant power consumption component and vary only due to an instruction-dependent, respectively data-dependent, component while executed on the same device (cf. Eq. 8.2). We further used the Gaussian wavelet Ψ(t) = (2/π)^0.25 · exp(−t²) with p = 5, l = 200 coefficients, and r = 8 quantizing bins; values that turned out to be optimal as shown below. Figure 8.4 shows the performance of our approach along decreasing scaling intervals, where the maximum scaling value is initially set to a_max = n/(2p) = 5,000/(2 · 5) = 500, for which the wavelet is as long as the shape. We randomly generated ten thousand different salts from each scaling interval and computed the mean and the standard deviation of the similarity score, involving the two similar shapes (one and two) and the two distinct shapes (one and three), respectively.
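The three simulated shapes can be reproduced along the following lines (the seed and the exact noise realization are, of course, arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
shape1 = rng.integers(0, 256, size=n).astype(float)       # random 8-bit reference shape

shape2 = shape1 + rng.normal(0.0, 1.0, size=n)             # similar shape: sigma_sim1 = 1 everywhere ...
for start in range(0, n, 100):                             # ... plus sigma_sim2 = 5 on 20 of every 100 samples
    shape2[start:start + 20] += rng.normal(0.0, 5.0, size=20)

shape3 = shape1 + rng.normal(0.0, 15.0, size=n)            # distinct shape: sigma_dis = 15 on every sample
```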

[Figure: similarity score (τ_PeRo, τ_CoSe and their ±4σ borders) plotted over decreasing scaling intervals [a_min, a_max] from [1, 500] down to [45, 50].]

Figure 8.4: The performance with simulated power consumption shapes for several scaling intervals, where each interval was evaluated with ten thousand randomly chosen salt values. Further, we used the Gaussian wavelet (p = 5) and a hash value that consists of l = 200 coefficients quantized with r = 8.


As can be seen, the corresponding mean values τ_PeRo and τ_CoSe remain nearly constant while the scaling interval length decreases. Take note that the two distinct shapes do not reach a zero score because they are not entirely independent, as mentioned above. In order to assess False Acceptance (FA) and False Rejection (FR) errors, respectively, we need to investigate the impact of deviating score values. As we assume the score values to be normally distributed (cf. Sec. 8.2), we apply a property of the normal distribution to push the chance of errors below 0.0065%. It is observable in Figure 8.4 that intervals larger than [30, 200] (from the right to the left) possess a higher chance of FA and FR errors, since the borders obtained by means of the four-fold standard deviations overlap, namely τ_PeRo − 4·σ_PeRo < τ_CoSe + 4·σ_CoSe.

[Figure: two panels showing the similarity score and the threshold difference Δτ over the number of quantizing bins (log₂ scale, 1 to 10) for 5 to 200 hash coefficients.]

Figure 8.5: Evolution of the mean values τ_PeRo > τ_CoSe (top) and Δτ (bottom) using different numbers of quantizing bins r and hash coefficients l within the scaling interval [40, 100].


To include a clearance we suggest using the scaling interval [40, 100], for which FA and FR errors are negligible because τ_PeRo − 4·σ_PeRo ≫ τ_CoSe + 4·σ_CoSe. Moreover, this interval is sufficiently large to provide an appropriate salt value space. We recall that a plagiarist would need to modify the program code, record the power consumption shape, and compute the hash value for each possible salt. Hence this space is not to be confused with key search spaces of crypto systems, where the testing time is very short. Ultimately the plagiarist must ensure that τ_PeRo is subjected to a clear shift downwards. Next, we investigate the influence of different numbers of quantizing bins and hash coefficients, respectively. As Figure 8.5 depicts, while processing similar shapes and different shapes the respective mean values decrease with an increasing number of bins until convergence. This is because the hash coefficients are generally close to each other, and hence an appropriate number of bins is necessary to obtain meaningful results. With approximately 2¹⁰ bins the mean values converge, but with 2⁵ bins the performance is optimal. The results are valid for both the uniform and the non-uniform quantizer. The number of coefficients has a less dominant effect. Actually, the performance is only poor when using fewer than 50 coefficients, since convergence is not reached (Fig. 8.5). Nevertheless, at least 100 coefficients should be used for optimal performance. Finally, we tested the Mexican hat wavelet, which gave slightly worse performance, and the Morlet as well as the Haar wavelet, which both gave poor performance.

Practical Evaluation

For our practical experiments we used a Microchip PIC18F2520 microcontroller [Mic08] running at 3.68 MHz together with the Microchip MPLAB C Compiler for PIC18 in version 3.42, which also includes the Microchip MPASM assembler. The power consumption shapes were acquired with a PicoScope 5203 at a sampling rate of 125 MSample/s. The measurements were done with a 1 Ω resistor in the ground line. Further, we chose the AES to be the true positive reference algorithm, available as a slightly optimized assembly implementation and a straightforward C implementation, and the DES to be a true negative reference. We have measured the first round of both AES implementations, with each measurement containing 1,000 shapes which have been averaged using uniformly distributed inputs. The side-channel hash values are generated using the Gaussian wavelet, the scaling interval [40, 100], l = 200 coefficients, and r = 5 uniform quantizing bins, which leads to 1,000-bit perceptual hash values.

Table 8.1: Wavelet based salted perceptual hashing performance with unmodified and fully optimized program code. The DES implementation was adapted such that a single round requires almost the same number of cycles as a single AES round. Excess cycles were cut off.

IP            Code type  Code branch            µ_DBER  σ_DBER  #samples
AES vs. AES   Assembly   Original vs. Original  0.9292  0.0097  1,235
AES vs. AES   C          Original vs. Original  0.9714  0.0082  3,503
DES vs. DES   Assembly   Original vs. Original  0.9432  0.0086  1,235
DES vs. DES   C          Original vs. Original  0.9802  0.0077  3,503
AES vs. DES   Assembly   Original vs. Original  0.1530  0.0079  1,235
AES vs. DES   C          Original vs. Original  0.1652  0.0074  3,503


Again, we computed the mean and standard deviation of the similarity score over ten thousand different salt values to state the performance. In the easiest attack scenario the plagiarist employs the same program code without modifications. This corresponds to adversarial plagiarist 1 in our model (cf. Sec. 8.4.1). The results given in Table 8.1 clarify that the measurement noise is negligible and that, therefore, the perceptual robustness is close to one. Conversely, the similarity scores involving different implementations show that we achieve a good content sensitivity as well. The hard decision boundary can simply be put at the similarity score of (0.9292 − 4 · 0.0097 + 0.1530 + 4 · 0.0079)/2 = 0.5375, which is the center between the two distributions for the assembly program code (0.5667 for the C code). Next, we consider a re-compilation of the AES C implementation with different compiler parameters. The Microchip C compiler offers numerous optimization parameters, from which we select all those which verifiably affect the program code, particularly the power consumption. The selected parameters are branch optimization (BRA-OPT), banking optimization (BAN-OPT), code straightening (CS), and tail merging (TM). We refer to [Inc05] for details. We observed that the parameters primarily lead to shortened shapes due to speed optimizations. But we also reproducibly noticed that recurring parts become more uniform, which is likely caused by loop unrolling. Table 8.2 summarizes our results concerning re-compilation.

Table 8.2: Wavelet based salted perceptual hashing performance with re-compiled program code under various assumptions. The fully optimized AES C code is compared to AES C code with a single parameter optimization or the fully unoptimized AES C code (UNOPT), respectively.

IP            Code type  Code branch            µ_DBER  σ_DBER  #samples
AES vs. AES   C          Original vs. BRA-OPT   0.5834  0.0113  3,503
AES vs. AES   C          Original vs. BAN-OPT   0.9608  0.0083  3,503
AES vs. AES   C          Original vs. CS        0.5904  0.0123  3,503
AES vs. AES   C          Original vs. TM        0.5788  0.0125  3,503
AES vs. AES   C          Original vs. UNOPT     0.5804  0.0134  3,503

A straightforward C implementation naturally provides a big potential for optimizations, but the performance of side-channel perceptual hashing is still acceptable even if the code is re-compiled fully unoptimized. Although re-compilation could serve as a strong attack since it is easy to apply, it concurrently means an unsatisfying situation for the plagiarist because of a possible severe execution time penalty. Insertion attacks aim at perturbing the alignment of the power consumption shapes with delays and additional instructions. In the pre-processing step, alignment methods [vWWB11, MWB11] or methods to detect and remove dynamic insertions [SP12] can therefore be used. Thus, we concentrate on statically inserted instructions that cannot be detected with such methods in our scenario. This is because we do not know for sure whether these instructions were introduced by an attack or are actually part of a different unique program code. For the experiments we target the S-box layer of our AES implementations. The latter are realized as a table look-up with implicit row shifting. After every two table look-ups we increasingly inserted further non-functional table look-ups (#TL). We repeated this using the no-operation instruction instead (#NOP). The hashing performance results are given in Table 8.3 for the AES C implementation. The results of the assembly implementation are absolutely equivalent.


Table 8.3: Wavelet based salted perceptual hashing performance with statically inserted instructions in AES C code.

IP            Code type  Code branch          µ_DBER  σ_DBER  #samples
AES vs. AES   C          Original vs. 1TL     0.7901  0.0113  3,503
AES vs. AES   C          Original vs. 2TL     0.7632  0.0141  3,503
AES vs. AES   C          Original vs. 4TL     0.7414  0.0159  3,503
AES vs. AES   C          Original vs. 1NOP    0.7943  0.0125  3,503
AES vs. AES   C          Original vs. 2NOP    0.7740  0.0134  3,503
AES vs. AES   C          Original vs. 4NOP    0.7419  0.0143  3,503

It can be seen that a plagiarist has to add at least as many non-functional instructions to the S-box layer as it contains functional instructions in order to significantly lower the perceptual robustness. The last experiment is concerned with the permutation of instructions. We again target the S-box layer and, on the one hand, arbitrarily reorder the bytes to be processed (REORD) and, on the other hand, separate the row shifting from the table look-up (SEP). See Table 8.4 for the results.

Table 8.4: Wavelet based hashing performance with permuted instructions (AES).

IP            Code type  Code branch            µ_DBER  σ_DBER  #samples
AES vs. AES   C          Original vs. REORD     0.9732  0.0092  3,503
AES vs. AES   C          Original vs. SEP       0.8031  0.0104  3,503
AES vs. AES   Assembly   Original vs. REORD     0.9597  0.0098  1,235
AES vs. AES   Assembly   Original vs. SEP       0.7731  0.0145  1,235

Clearly, reordering has no effect since the instructions are equal. The second attack led to slightly different results depending on the implementation. The reason might be the C compiler, which still rearranges the instructions although such optimizations were disabled.

8.6 Conclusion

We proposed wavelet based perceptual side-channel hashing, which uses a salt to clearly increase the robustness, especially with regard to plagiarism that is obscured by program code modification attacks. Our scheme indeed fits the IP detection framework recently introduced by previous work, but applies a different kind of similarity detection, which is why the similarity score values of both approaches are not directly comparable. We tested several reasonable similarity-harming program code transformations that could be performed by an adversarial plagiarist. As a fundamental property of our approach, the influence of the transformations is not under the control of the plagiarist, and experimental results indicated a good discriminative performance with respect to the perceptual robustness and content sensitivity. We want to stress that we only use side-channel information, in particular the power consumption, and refrain from embedding plagiarism detection aids in the original program code, e.g., watermarks.

Part V

A Case Study: Real World Side-channel Cryptanalysis

Chapter 9 From Preparing to Attacking: Analysis of a Smartcard Using a Secure Microcontroller with Cryptographic Hardware Support

Often the question arises whether an attacker is given a chance to experiment with a target device in order to prepare or refine an attack he intends to conduct on a real-world application. For instance, profiled side-channel cryptanalysis makes it necessary to have a similar device under full control. In this chapter we demonstrate practical side-channel cryptanalysis on a commercially available smartcard whose employed microcontroller, which supports symmetric cryptography, is also widespread in today's CC-certified EC cards in Germany. We detail how to fully recover the key from the DPA-countermeasure protected DES and AES co-processor computations as they are conducted by the investigated non-certified smartcard product. Further, we show how to compare embedded semiconductor dies by simple means.

Contents of this Chapter

9.1 Introduction ...... 119
9.2 Near-infrared Backside Microscopy ...... 120
9.3 Instrumenting the Smartcard ...... 122
9.4 Preparing and Comparing Smartcard IC Dies ...... 122
9.5 Side-channel Cryptanalysis of the Usage of the Symmetric Cipher Co-processors within the Smartcard ...... 124
9.6 Conclusion ...... 136

9.1 Introduction

Attacking microcontroller-based security products in the field is supposed to be a great challenge. Whether an attacker will be successful or not depends, amongst others, on the physical package, the operational environment, the employed software protocols, and ultimately on the underlying hardware IC. Especially microcontrollers which are utilized in smartcard products are intended for a variety of application scenarios, e.g., Pay Television (Pay-TV), electronic payment, electronic identity documents, the electronic health card, and access control. For the aforementioned scenarios, dedicated microcontrollers with cryptographic hardware support and a security-hardened infrastructure are common. The best known manufacturers that serve the world's market are Infineon Technologies, NXP Semiconductors, Samsung Electronics, and STMicroelectronics.

Corresponding microcontroller IC names can, for instance, be found on certification body approval lists, e.g., by Deutsche Kreditwirtschaft (DK) [Kre16], EMVCo [EMV16], and the Bundesamt für Sicherheit in der Informationstechnik (BSI) [Bun16]. Especially the list published by DK gives detailed hints on the application, i.e., in which smartcards of Germany's electronic payment infrastructure the microcontroller IC can be expected. From the BSI list an attacker can learn which microcontroller IC is used for certain governmental applications, e.g., the German passport. Before an attacker can deal with his targeted security product he needs to become familiar with its underlying IC, ideally embedded in a product which he can fully control. There are several reasons for this: to identify vulnerable spots within the IC layout, to recognize the power consumption or electro-magnetic emanation shape of cryptographic co-processors, to characterize the side-channel leakage or fault injection behavior, or to build templates for subsequent attacks. This is a common scenario that is covered during the CC or EMVCo certification process. Furthermore, for this purpose the attacker wants to achieve an IC identification by simple means and concurrently ensure its physical integrity.

9.1.1 Contribution

In this chapter we highlight a procedure incorporating low-cost near-infrared backside microscopy which makes comparing two microcontroller ICs simple while ensuring their physical integrity. The procedure is best suited for, but not limited to, smartcards that employ flip-chip interconnection, for which chemical etching from the top-layer side is not an option. By applying the procedure we found out that a currently available smartcard microcontroller IC is used in current EC cards that are widely spread in Germany. Besides, we could confirm that the IC is EAL5+ certified according to the German CC scheme. Accordingly, the co-processors implement hardware-based DPA countermeasures. However, we want to point out clearly that a certificate also includes security guidelines that state how the IC shall be used by the software in order to achieve an attack-resistant composite product. Thereafter we demonstrate a full key-recovery attack on the DES and AES hardware co-processor computations as they are conducted within the smartcard. Indeed, the co-processors are most likely not used according to the certification guideline requirements, but nevertheless we can investigate the co-processors' characteristics. To the best of our knowledge this is the first side-channel attack on two hardware co-processors of such a smartcard presented in the literature. We detail the problems and obstacles that we were faced with during the identification of the DES and AES computation. Moreover, we apply different attack techniques, including CPA [BCO04], the correlation-enhanced collision attack [MME10], and template attacks [CRR03]. Take note that the details presented in this chapter do not directly harm the security of current EC cards in circulation. EC cards themselves are subjected to a subsequent certification process which ensures that the guideline requirements with respect to the IC were taken into account.

9.2 Near-infrared Backside Microscopy

When light waves hit a medium, basically three events can occur, namely absorption, reflection, and transmission. When absorption occurs, a portion of the light's energy is converted into heat within the medium. When reflection occurs, a portion of the light is reflected at the medium's surface and the light's energy is re-emitted. If transmission occurs, a portion of the light passes through the medium and is re-emitted at the opposite surface. Naturally, all events occur at the same time with certain percentages. Furthermore, the corresponding portions depend on the wavelength: with increasing wavelength, transmission increases while absorption decreases. In order to be able to visualize the bottom metal layers within a microcontroller IC, a high transmission is desirable. Therefore the light needs to pass the silicon substrate until it hits the metal, which then re-emits the light such that it can pass the silicon substrate again to be detectable, e.g., by a Charge-Coupled Device (CCD) or CMOS sensor. For wavelengths smaller than 900 nm the transmission is zero (the silicon is completely opaque) whereas the absorption is in the range between 40% and 70%; the remaining portion is reflected. For wavelengths greater than 900 nm the transmission increases and reaches its maximum of approximately 50% starting from a wavelength of 1,200 nm. Thus, the optimal transmission with respect to silicon can be achieved in the range of the Near-Infrared (NIR) spectrum (700 nm – 2,500 nm). However, the transition is shifted towards the lower electro-magnetic spectrum for smaller silicon substrate thicknesses. For instance, the transmission starts approximately 80 nm earlier for a thickness of 200 µm compared to a thickness of 600 µm. The aforementioned numbers are taken from [JJ02] under the assumption that the surface is ideally polished.

For the practical application we can gain the following knowledge. Firstly, the maximum portion of light emitted by a NIR source that can be utilized to visualize the bottom metal layer is 25%. Additionally, off-axis light is required to avoid detecting the reflective portion. Secondly, in practice the silicon surface is not ideally polished, such that scratches or other damage can result in diffuse reflections which cause contrast-degrading bright spots at the sensor. Thirdly, finding the optimal wavelength for the NIR source is a trade-off problem, since CCD or CMOS sensors' quantum efficiency decreases with increasing wavelength.

Figure 9.1: Self-developed LED ring for backside microscopy in the NIR spectrum.


In our setup we use a monochromatic CMOS sensor based camera1 that is optimized for the NIR spectrum. In total we can make use of approximately 0.1% of the emitted light at a wavelength of 1,050 nm. Consequently, the NIR light emitting source needs to be very powerful. We decided to incorporate five LEDs of each of the wavelengths 1,020 nm, 1,050 nm, 1,060 nm, and 1,070 nm. With this choice we cover the entire spectral range for which the silicon is transparent and the camera is sensitive at the same time. The LEDs are arranged in a ring enclosing an objective and bent inwards (cf. Figure 9.1). The radiant flux of the ring is approximately 100 mW. To greatly improve the resolving power, and moreover the contrast in the case of seriously scratched surfaces, we suggest using oil immersion (see Appendix B).
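To make the light budget above concrete, the following back-of-the-envelope sketch multiplies the individual loss factors. Only the two passes through the silicon (roughly 50% each, as stated above) are taken from the text; the metal reflectance and the sensor quantum efficiency at 1,050 nm are rough assumptions chosen solely to illustrate how a usable fraction in the order of 0.1% arises.

# Illustrative NIR light budget for backside imaging (assumed values marked below).
t_silicon = 0.50    # one pass through the substrate (from the text; slightly lower at 1,050 nm)
r_metal   = 0.80    # assumed reflectance of the bottom metal layer
qe_sensor = 0.01    # assumed CMOS quantum efficiency at 1,050 nm, near the silicon band edge
usable = t_silicon * r_metal * t_silicon * qe_sensor
print(f"usable fraction of the emitted light: {usable:.2%}")  # about 0.2%, same order as the 0.1% above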

9.3 Instrumenting the Smartcard

From the information provided by the smartcard distributor it was initially not fully clear whether a cryptographic algorithm is supported in hardware or in software. Upon request we got the information that the IC shipped with the smartcard supports DES and AES in hardware through corresponding co-processors. However, take note that, in agreement with the IC hardware developer, we refrain from giving indications on the particular IC. Throughout our analysis we incorporate built-in functions to trigger a DES or AES operation. For a DES operation we use the function that specifies the mode of the cipher, which we set to single DES encryption. We pass variables for the key, the plaintext, and the ciphertext. For an AES operation we use a similar function. Again, a parameter specifies the mode of the cipher, which we set to single AES encryption with a 128-bit key.
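For illustration, a measurement campaign around these built-in cipher functions could be organized as in the following sketch. All identifiers (card, scope, aes128_encrypt, and so on) are hypothetical placeholders; the actual card commands are deliberately not disclosed here.

import os

def acquire_aes_traces(card, scope, key, n_traces):
    # Hypothetical acquisition loop: encrypt random plaintexts with the built-in
    # AES function (mode set to single AES-128 encryption) and record one trace each.
    traces, plaintexts, ciphertexts = [], [], []
    for _ in range(n_traces):
        pt = os.urandom(16)                # random 128-bit plaintext
        scope.arm()                        # waits for the power-consumption trigger (cf. Section 9.5.2)
        ct = card.aes128_encrypt(key, pt)  # placeholder for the card's built-in function
        traces.append(scope.read_trace())
        plaintexts.append(pt)
        ciphertexts.append(ct)
    return traces, plaintexts, ciphertexts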

9.4 Preparing and Comparing Smartcard IC Dies

Basically there are two interconnection technologies utilized in smartcard packages, namely wire-bond and flip-chip. The difference obviously lies in how the IC die's contact pads are interconnected with the smartcard contact pads. With the wire-bond technology, wires interconnect the contact pads from the top side of the die to the smartcard pads. Therefore the substrate side is turned towards the backside of the smartcard pads such that the wires are routed radially. Additionally, the wires and the die's top side are coated with a resin for the sake of durability. With flip-chip, small bumps on the die's contact pads directly interconnect with the package's contacts. In the case of smartcards this means that a flexible Printed Circuit Board (PCB) with printed conductor stripes is directly attached to the die, while the PCB is connected with the smartcard contact pads. In contrast to wire-bond, the die is placed the other way round. Fortunately, the substrate is reachable in both cases. Having access to it provides three major advantages for an attacker: he is able to compare IC dies while maintaining their operability, he can exploit the EM side-channel, and he can conduct fault injection in the first place and refine it with NIR backside microscopy. This smartcard employs the wire-bond process. Hence the substrate can be reached through the contact pads, which can be opened by means of a scalpel.

1 IDS UI-1240SE-NIR-GL

Insulating materials can be removed with acetone. Figure 9.2 shows the opened smartcard.

Figure 9.2: (a) Opened smartcard: the contact pad in the middle has been removed to reveal the IC. (b) Prepared smartcard: the other smartcard contact pads have been covered with matt black insulating tape to increase the contrast.

With the help of our setup we obtained the images depicted in Figure 9.3. All parts are clearly recognizable. The larger, regularly structured areas are the memory spaces, i.e., Electrically Erasable Programmable Read-Only Memory (EEPROM), RAM, and Read Only Memory (ROM). Dedicated controllers are also quite obvious. This is also true for the contact pads, which appear as little squares at the upper and lower right edges. The analog part is located in the lower right area. We deem the middle area to be the glue logic (synthesized logic), since it is connected to every other part. The glue logic most likely contains the CPU as well as parts of the cryptographic co-processors.

Figure 9.3: NIR backside microscopy applied on the backside of the IC die. (a) Substrate in the visible spectrum; (b) substrate in the NIR spectrum (oil immersed).


The EC cards that we have investigated could be prepared in the very same way. However, the card that utilizes the same IC has a meshed coating over the substrate, presumably for durability reasons. We could remove it with a scalpel without scratching the surface. Thus we conclude that recovering exact positions, e.g., for side-channel analysis or fault injection, is very likely feasible.

9.5 Side-channel Cryptanalysis of the Usage of the Symmetric Cipher Co-processors within the Smartcard

In this section we detail our measurement setup, the localization, profiling, and alignment of the DES and AES co-processors' activities, as well as the side-channel analysis techniques that we have conducted. We start with the description of our measurement setup and the technical means involved. Then we deal with the search for and refinement of appropriate side-channel signals with regard to both co-processors. In the last part we present our analysis results, which end in a full key recovery for both the DES and the AES co-processor with different analysis techniques. Again, we want to mention that we analyzed the mere computations as they are conducted within the smartcard product, which most likely disregards the security guideline requirements.

9.5.1 Measurement Setup

For all analyses our measurement setup consists of a LeCroy HDO6054 oscilloscope at a sampling rate of 2.5 GSample/s. The power consumption was measured across a 10 Ω resistor in the power supply line by means of an active differential probe. The electro-magnetic emanation was recorded with a near-field probe that was placed over the IC frontside. The smartcard is capable of ISO 7816 [ISO06] compliant communication, and thus a compatible reader was utilized. Due to the virtualization it was not possible to use one of the external signal lines, e.g., the I/O signal, for oscilloscope triggering. Nevertheless, this makes the analysis much more realistic; we discuss this point in the section below.

9.5.2 Localization, Profiling and Re-alignment of the Co-processors’ Activities

For this smartcard each block cipher co-processor activity can easily be recognized due to two facts. For one thing, the power consumption rises significantly when the co-processors are activated, a behavior which the IC does not reveal for any other operation. For another thing, an apparently implemented side-channel countermeasure equalizes the power consumption's profile during co-processor activities. This has led us to the conclusion that the power consumption is, of course, not appropriate for side-channel analysis, but that triggering can be achieved by exactly this shape.

AES Co-processor

We first start with tracking down the AES co-processor activity. In doing so we manually re-positioned the EM probe over the IC surface and triggered an AES computation for each position. We then figured out a location that potentially reveals each AES round as a large peak (cf. Fig. 9.4).

Figure 9.4: Annotated AES trace with single rounds identified.

Nevertheless, these peaks are not clearly observable within each trace. Furthermore, we can learn that each execution of the AES is subject to severe jittering, by which we mean changing peak-to-peak distances, indicating a further side-channel countermeasure. In the following step we need to decide which round is best suited for attacking. Usually the first or the tenth round, provided that a 128-bit key is employed, is a convenient target, since only the plaintext or the ciphertext needs to be involved. In Figure 9.5 we take a closer look at both rounds.

Figure 9.5: Annotated AES rounds revealing a certain structure in the shape of further peaks. (a) First AES round; (b) tenth AES round.


What can be recognized is a certain structure which is very similar when comparing both rounds. Problematic, however, is the circumstance that we observed jittering even within the round structure; moreover, the smaller peaks are often barely distinguishable from the rest of the signal, which could harm the alignment. To facilitate a sample-accurate alignment we pre-process the traces by means of a narrow band-pass filter. The idea is that the steep descent that precedes each of the peaks can be strongly emphasized. Therefore we tried several 10 MHz wide bands at different center frequencies. The band with a 15 MHz center frequency gave us the best performance, such that the distinguishability of the smaller peaks was greatly improved, as depicted in Figure 9.6.

Figure 9.6: Alignment of the tenth AES round is achieved through narrow band-pass filtering. (a) Five traces of the tenth AES round; (b) the same traces filtered. The peak-to-peak distance is used to align each smaller peak of the unfiltered traces.

Furthermore, we achieved better results for the tenth round, which is why we concentrate on this round in the remainder. In contrast, a low-pass filter also smoothes the signal but does not expose such clear peak structures. In the end we thus aligned the tenth AES round of each trace in a cut-and-paste fashion such that all peaks are aligned with sample accuracy.
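A minimal sketch of this pre-processing, assuming traces sampled at the 2.5 GS/s rate of our setup (Section 9.5.1) and an off-the-shelf IIR band-pass; the exact filter type and order are not fixed by the text and are chosen here only for illustration.

import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

FS = 2.5e9  # oscilloscope sampling rate

def bandpass(trace, low=10e6, high=20e6, order=4):
    # 10 MHz wide band around the 15 MHz center frequency that performed best for us
    sos = butter(order, [low, high], btype="bandpass", fs=FS, output="sos")
    return sosfiltfilt(sos, trace)

def round_peaks(trace, min_distance=50):
    # Locate the emphasized peaks in the filtered trace; their positions are then used to
    # cut and re-assemble the unfiltered trace so that all peaks line up with sample accuracy.
    filtered = bandpass(trace)
    peaks, _ = find_peaks(filtered, height=3 * filtered.std(), distance=min_distance)
    return peaks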

DES Co-processor

Regarding the DES co-processor we need to follow a very different approach. The main reason is that we are not able to recognize any clear structure during the co-processor's activity. The explanation for this is rather simple. The DES was invented almost 40 years ago and is highly optimized for hardware, in marked contrast to the AES, which was optimized for software in the first place. Thus, on smartcard hardware the DES is very fast, often around 1 µs, and its power consumption is very low. Nevertheless, we can again benefit from the power consumption signal by utilizing it as a trigger. For the DES, however, we can observe two co-processor activities. The first activity lasts much longer than the second one, which follows shortly after, meaning that the power consumption signal falls to zero only for a very short time period in between. The reason for this is still not fully understood. We started profiling by applying CPA to the plaintext and the ciphertext during these two activities.


Figure 9.7: Annotated DES trace with single plaintext and ciphertext byte transfers identified. It is assumed that the space in between covers the DES computation.

In the course of this we found an EM probe position that reveals repetitive patterns during the second activity, depicted in Figure 9.7. We can recognize two basic structures, each of which occurs eight times in consecutive order. Thus it can be assumed that this is the plaintext and ciphertext transfer, most likely from a RAM location into some dedicated interfacing registers and vice versa. Consequently, we aligned the respective first pattern within each of the two sequences and performed a CPA on it, assuming the Hamming weight of each plaintext and ciphertext byte as our side-channel leakage model. As can be seen in Figure 9.8, we obtain clear correlation peaks that confirm our initial assumption. The first eight patterns embody the plaintext transfer, whereas the subsequent eight patterns embody the ciphertext transfer, both starting with the least significant byte. The time period in between the two transfers has a length of about 1.5 µs, which indeed indicates the existence of the DES computation even though no structures show up there (cf. Fig. 9.7). So we re-positioned the EM probe while keeping the area of interest constrained by means of two oscilloscope cursors. As soon as the focus on the plaintext and ciphertext patterns was completely lost, we were able to perceive smaller peak structures at nearby locations. We tuned the position until we achieved the best Signal to Noise Ratio (SNR) by visual inspection. These peak structures exhibit the shape of a camel's back. They are observable during the entire co-processor activity and seem to have random distances to each other. From these observations we can deduce that those peaks do not originate from clock cycles (otherwise we would expect equidistant peaks) but rather belong to the DES round function that is continuously executed; yet another countermeasure. Being aware of this, the question arises whether we can distinguish between the real rounds and the rounds that do not contribute to the actual encryption; the latter are often denoted as dummy rounds.
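The CPA used to confirm the byte transfers is plain Pearson correlation between a Hamming-weight prediction and every sample of the aligned traces. A compact sketch (not the CUDA implementation of Part II) might look as follows:

import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming-weight look-up table

def cpa(traces, predictions):
    # Column-wise Pearson correlation.
    # traces:      (N, L) aligned traces
    # predictions: (N,)   hypothetical leakage, e.g. HW[plaintext_bytes[:, 7]] for byte P8
    t = traces - traces.mean(axis=0)
    p = predictions - predictions.mean()
    num = p @ t
    den = np.linalg.norm(p) * np.linalg.norm(t, axis=0)
    return num / den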

Figure 9.8: Correlation on the DES plaintext and ciphertext transfers. (a) CPA on the Hamming weight of plaintext byte P8; (b) CPA on the Hamming weight of ciphertext byte C8. The significance threshold is indicated by the dashed lines.

Further tuning of the EM probe position indeed brought the desired effect. In Figure 9.9 we can find another type of peak, a camel's back with three humps. The important finding here is the number of two-hump camel's back peaks that are surrounded by this new peak type, namely 17. Hence we conclude that these peaks quite likely belong to the real DES rounds. We do not know exactly why there are 17 and not 16 peaks. We speculate that these three-hump peaks might arise due to the initial permutation at the beginning and its inverse counterpart at the end of the DES computation.

Figure 9.9: Annotated DES trace with start and end of the real rounds identified.

Ultimately, we make use of these peaks to align each of the rounds with sample accuracy, just as we did for the AES traces.

9.5.3 Full Key Recovery Attacks on the Smartcard Usage of the Co-processors

Having aligned the traces with respect to both co-processors, we now present our results concerning various analysis techniques. We start with the analysis of the AES, incorporating CPA, template-based DPA using the MVND approach and SOST, and finally correlation-enhanced collision analysis. Following this, we describe the analysis of the DES, including CPA and template-based DPA.

Analysis Results Regarding the AES Computation

The correlation on the byte-wise S-box output in the tenth round is shown in Figure 9.10.

Figure 9.10: Correlation on the Hamming weight of the AES S-box output in the tenth round. (a) S-box output bytes 1, 5, 9, 13; (b) bytes 6, 10, 14, 2; (c) bytes 11, 15, 3, 7; (d) bytes 16, 4, 8, 12. The significance threshold is indicated by the dashed lines.

We employed the Hamming weight as the side-channel leakage model and applied it to 50,000 traces. From these results we learn that the order of the S-box processing in the tenth round is arranged in four steps: (1→5→9→13), then (6→10→14→2), then (11→15→3→7), then (16→4→8→12), which equals the notation of the state after the ShiftRows step. Moreover, this perfectly fits the observation of the four peaks per round, which can now be deemed the processing of the state's rows. Certainly the alignment process still leaves room for further improvement, since the correlation on the bytes during the last row processing is smaller. However, we are now aware of the fact that there must be four even smaller peaks hidden in each of the four row peaks; we have verified this. By means of correlation-enhanced collision analysis we should be able to confirm this order of processing. This is simply due to the fact that colliding outputs of the same hardware instance of an S-box should give a very strong indication. For the moment our results suggest that only one hardware S-box is present, because we also recognized a sequential order within each of the four row steps during the CPA analysis. Nevertheless, it might be the case that different parts are leaking; with respect to CPA, at least, we exploit the state after ShiftRows. Figure 9.11 represents our results concerning correlation-enhanced collision analysis.
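A sketch of the correlation-enhanced collision test [MME10] between two byte positions; it assumes that the two trace windows have already been extracted and lined up using the 300-sample shift from our alignment, that every ciphertext byte value occurs in the trace set, and it uses a simplified joint correlation over all (value, sample) pairs.

import numpy as np

def collision_scores(window_i, ct_i, window_j, ct_j):
    # window_i, window_j: (N, L) trace windows covering the processing of byte i and byte j
    # ct_i, ct_j:         (N,)   ciphertext bytes at positions i and j
    # Returns one score per hypothesis delta = k_i XOR k_j; the correct delta should stand out.
    M_i = np.array([window_i[ct_i == v].mean(axis=0) for v in range(256)])
    M_j = np.array([window_j[ct_j == v].mean(axis=0) for v in range(256)])
    scores = np.empty(256)
    for delta in range(256):
        # under delta, ciphertext value v at position i collides with v XOR delta at position j
        reordered = M_j[np.arange(256) ^ delta]
        scores[delta] = np.corrcoef(M_i.ravel(), reordered.ravel())[0, 1]
    return scores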

Figure 9.11: Shifted correlation-enhanced collisions between the tenth-round key bytes. (a) Key byte differences δ1,5, δ5,9, δ9,13; (b) δ2,6, δ6,10, δ10,14; (c) δ3,7, δ7,11, δ11,15; (d) δ4,8, δ8,12, δ12,16. The significance threshold is given by the highest correlation among all wrong hypotheses and indicated by the respective dashed lines.

What can be seen are the correlations of colliding S-box outputs that are located in the same row and processed consecutively. By means of these combinations we certainly provoke collisions between the same hardware instances.

The necessary time-shift between the time-instants that we correlate with each other is 300 samples, due to our alignment process. It might have come to the reader's attention that collisions are observable that are not indicated by the CPA results. For instance, in row two the processing of key bytes 2 and 6 is not consecutive according to CPA, but we provoke a collision anyway. The reason lies in the fact that with CPA we actually target the S-box output but obtain results in ShiftRows notation. For the correlation-enhanced collision analysis we attack from back to front, i.e., we only incorporate the ciphertext, which indeed leads us to the same position after ShiftRows in the last round, but with a different notation. Next we want to verify our assumption that only one hardware S-box instance is present. If this is the case, we should not achieve collisions between outputs in the same column while correlating the same time-instants with each other, i.e., without shifting. Figure 9.12 gives an answer to this issue. It is evident that we achieve collisions also for S-box outputs in the same column. Nevertheless, the correlation peaks arise at different time-instants, which disagrees with the assumption of presumably four hardware S-box instances.

Figure 9.12: Non-shifted correlation-enhanced collisions between the tenth-round key bytes. (a) Key byte differences δ1,2, δ5,6, δ9,10, δ13,14; (b) δ1,3, δ5,7, δ9,11, δ13,15; (c) δ1,4, δ5,8, δ9,12, δ13,16. The significance threshold is given by the highest correlation among all wrong hypotheses and indicated by the respective dashed lines.

We could, however, imagine another case, namely that four hardware instances exist and their outputs do not leak at all, while the loading of the state bytes in preparation of the MixColumns operation does.

It remains to be said that we achieve all 120 possible collisions when additionally involving time-shifts of 600 and 900 samples, respectively. Finally, we conducted a template-based DPA attack. In the course of this we started with a variance analysis to identify the POIs contributing to the templates. This was done with the SOST metric, which can be seen in Figure 9.13.
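For reference, the SOST metric used for the POI selection can be sketched as follows: for every pair of value classes, the squared difference of the class means is normalized by the estimated variances of the means and summed over all pairs, per sample point. The sketch assumes the identity model with 256 classes and that every class occurs in the profiling set.

import numpy as np
from itertools import combinations

def sost(traces, values, n_classes=256):
    # traces: (N, L) profiling traces; values: (N,) intermediate values (e.g. S-box output bytes)
    groups   = [traces[values == c] for c in range(n_classes)]
    means    = np.array([g.mean(axis=0) for g in groups])
    var_mean = np.array([g.var(axis=0, ddof=1) / len(g) for g in groups])
    score = np.zeros(traces.shape[1])
    for a, b in combinations(range(n_classes), 2):
        score += (means[a] - means[b]) ** 2 / (var_mean[a] + var_mean[b])
    return score  # POIs are the sample points with the largest scores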

Figure 9.13: SOST traces with respect to the means grouped by the identity of the S-box outputs in the tenth AES round. (a) SOST w.r.t. S-box outputs 1, 5, 9, 13; (b) outputs 6, 10, 14, 2; (c) outputs 11, 15, 3, 7; (d) outputs 16, 4, 8, 12.

The identification of POIs is straightforward, since on the one hand the corresponding traces only show peaks at the time-instants where we expect them anyway, and on the other hand the trace is almost zero at time-instants where peaks should not show up because other outputs are processed there. For the template building phase we chose the 15 highest SOST values with regard to each S-box output in the tenth AES round. As the leakage model we utilized the identity, i.e., the byte values directly, which results in 256 templates for each key byte. Each template was built with 800 traces. After having conducted the template characterization phase we were provided with the sub-key rankings depicted in Figure 9.14. A first look at these results tells us that one of the sub-keys can be obtained with fewer than 300 traces. With 3,000 traces we can already obtain the entire key, i.e., all sub-keys. Moreover, it is evident that the sub-keys split up into three groups with respect to the number of traces needed to achieve rank one. The sub-keys within the first group require 100 traces on average, the second 500 on average, and the third group 2,000 traces on average during characterization.

The latter behavior is reproducible. The last group, however, contains the sub-keys that are processed within the last row, whose peaks are not optimally aligned, as mentioned above. Therefore, an even better alignment may result in a complete key recovery by means of merely 800 traces or even fewer. We also want to mention that the Hamming weight as the leakage model provides us with similar results, though the trace numbers are slightly higher.
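A condensed sketch of the template building and characterization steps with the identity model. It relies on the usual last-round relation of the AES, where the targeted tenth-round S-box output equals the ciphertext byte XORed with the corresponding round-key byte, and uses scipy for the multivariate-normal evaluation; our actual tooling is the CUDA implementation of Chapter 5.

import numpy as np
from scipy.stats import multivariate_normal

def build_templates(profiling_traces, sbox_out_values, pois):
    # One multivariate Gaussian (mean, covariance) per S-box output value at the chosen POIs.
    X = profiling_traces[:, pois]
    return {v: (X[sbox_out_values == v].mean(axis=0),
                np.cov(X[sbox_out_values == v], rowvar=False))
            for v in range(256)}

def rank_hypotheses(templates, attack_traces, ct_bytes, pois):
    # Accumulate log-likelihoods per key-byte hypothesis; returns hypotheses sorted best-first.
    # ct_bytes is an integer array of the ciphertext bytes at the attacked position.
    X = attack_traces[:, pois]
    log_p = np.zeros(256)
    for k in range(256):
        values = ct_bytes ^ k  # hypothesized tenth-round S-box output (last-round relation)
        for x, v in zip(X, values):
            mu, cov = templates[v]
            log_p[k] += multivariate_normal.logpdf(x, mean=mu, cov=cov, allow_singular=True)
    return np.argsort(log_p)[::-1]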


Figure 9.14: AES sub-key ranking according to the number of traces during the template DPA characterization phase. The templates are based upon the identity of the S-box outputs in the tenth round. The dashed lines indicate the area of sub-key entropy loss.

Analysis Results Regarding the DES Computation

Before starting our analyses we had to decide which rounds we want to attack. Unlike for the AES, we are not concerned with several peaks within a single round, but with one peak per DES round. We ultimately decided to attack four rounds, namely rounds 1 and 2 as well as rounds 15 and 16. Note that, for the sake of easier alignment, the traces with regard to the last two rounds are reversed, i.e., the order is round 16 and then round 15. As usual we begin by performing a first-order correlation attack. In Figure 9.15 we demonstrate the vulnerability of the DES co-processor to CPA by means of the Hamming distance between the states' right-half blocks Ri of consecutive rounds. Obviously, the correlation peaks are not as clear as for the AES co-processor. For the CPA on the first two rounds, 750,000 traces were required to achieve small but significant peaks. The attacks on the latter two rounds perform better, with clearly fewer traces involved, namely 500,000.


Figure 9.15: Correlation on the Hamming distance of the DES states' right-half blocks of consecutive rounds. (a) HD(R0,R1) and HD(R1,R2); (b) HD(R15,R16) and HD(R14,R15). The significance threshold is indicated by the dashed lines.

Figure 9.16: Second central moment correlation on the Hamming distance of the bytes of the DES states' right-half blocks in the first, second, fifteenth, and sixteenth round. (a) HD(R0,R1), bytes 1, 2, 3, 4; (b) HD(R1,R2), bytes 1, 2, 3, 4; (c) HD(R15,R16), bytes 1, 2, 3, 4; (d) HD(R14,R15), bytes 1, 2, 3, 4. The significance threshold is indicated by the dashed lines.


We continue our analysis with the application of second-order analysis, more precisely second central moment CPA according to [WW04]. In Figure 9.16 we observe that correlating again on the Hamming distance between the states' right-half blocks of consecutive rounds now leads to much better results. This time we even involve each of the four bytes separately, such that sub-key hypothesis testing is facilitated (12-bit portions of the DES key in this case). Nevertheless, different numbers of traces are necessary to achieve the findings shown above. The second-order attack works best on the first round, for which we need 50,000 traces. For the second round the number is 300,000, and for the other two cases, rounds 15 and 16, it is 200,000 traces. In order to conclude this section we conducted a template DPA attack with the very same approach we followed for the AES. In contrast to CPA, we change the side-channel leakage model to the S-box input within the first round; the choice of the round is motivated by the CPA results. The latter model provides a simpler relationship between the state and the round key (6-bit portions of the DES key). See Figure 9.17 for the SOST traces supporting the choice of the POIs.
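The text does not spell out the exact pre-processing behind the second central moment CPA, so the following sketch shows one common reading of [WW04]: square the mean-free traces and correlate them, per sample, with the squared, mean-free Hamming-distance predictions.

import numpy as np

def second_central_moment_cpa(traces, predictions):
    # traces:      (N, L) aligned traces
    # predictions: (N, K) hypothetical Hamming distances, one column per sub-key guess
    # returns:     (K, L) correlation matrix
    T2 = (traces - traces.mean(axis=0)) ** 2           # second central moment per trace and sample
    P2 = (predictions - predictions.mean(axis=0)) ** 2
    T2c = T2 - T2.mean(axis=0)
    P2c = P2 - P2.mean(axis=0)
    num = P2c.T @ T2c
    den = np.outer(np.linalg.norm(P2c, axis=0), np.linalg.norm(T2c, axis=0))
    return num / den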

Figure 9.17: SOST traces with respect to the means grouped by the identity of the S-box inputs in the first DES round. (a) SOST w.r.t. S-box inputs 1 and 2; (b) inputs 3 and 4; (c) inputs 5 and 6; (d) inputs 7 and 8.

Since we changed the model to one which appears earlier in the round, we consequently obtain high SOST values at an earlier time-instant than the CPA results indicate. As expected, the trace is not completely zero at other time-instants where we do not expect exploitable round-one information. Nevertheless, we could easily select six meaningful POIs.


Due to the leakage model, the identity of the S-box inputs, we built 64 templates, each including 2,400 traces, for each key byte. At first glance, definitely many more traces are required to recover the key for the DES than for the AES.

Figure 9.18: DES sub-key ranking according to the number of traces during the template DPA characterization phase. The templates are based upon the identity of the S-box inputs in the first round. The dashed lines indicate the area of sub-key entropy loss.

In detail, the number is 90,000 traces, which means almost 30 times more. Indeed, some of the key bytes are recovered quickly by means of a few thousand traces, but we do not believe that a better alignment could help in this case. This assumption is further supported by the circumstance that all key bytes are close together with respect to their recovery, in contrast to the AES findings. At least we can state that all correct DES sub-keys are already assessed with rank two by means of 50,000 traces.

9.6 Conclusion

In this chapter we focused on the question of how difficult it could be to conduct practical attacks on certified smartcard applications that are especially hardened to withstand side-channel cryptanalysis. If the attacker is not given the chance to play around with a corresponding security microcontroller, he might not have the chance to learn enough about it. Yet, such microcontrollers are often freely, if inconspicuously, available on the market without the further protection that is usually provided by the software, for instance, in banking cards. We therefore demonstrated how one can easily compare microcontrollers by means of NIR backside microscopy. This technique is applicable for all interconnection technologies and does not harm the physical integrity of the IC.


After having figured out that the microcontroller within the investigated smartcard is employed in a very large number of EC cards in Germany and is CC EAL5+ certified, we investigated its side-channel resilience as it is used in the smartcard. First we demonstrated how to find the EM emanation side-channel profile that potentially leaks key information regarding both the AES computation and the DES computation. Further, we conducted several common attack techniques and showed that for both computations the key can be fully recovered. In such a real-world scenario, which is, however, considered during CC certifications, an attacker might thus gain enough information about the microcontroller. Nevertheless, we are very confident that the software within the real EC card counteracts such attacks, meaning that our results will not harm their security. This is owed to the fact that the investigated smartcard most likely disregards the security guideline requirements.


Part VI

Conclusion

Conclusion

In this concluding chapter we summarize the contributions to the initially motivated goals of this thesis. We therefore briefly recap the main results of each chapter and bring them in line with the thesis' context. As the title indicates, we were concerned with taking practical side-channel cryptanalysis forward while paying special attention to its efficiency. In retrospect, we pursued the latter through improved implementations of the most common side-channel analysis tools, CPA and template-based DPA, but also of lattice basis reduction, which is employed to analyze asymmetric cryptographic systems. The improved implementations are much more efficient in terms of runtime, which we could reduce severely. We continued in the direction of template-based DPA and introduced two new corresponding methods which offer a better analysis performance with respect to the side-channel information extraction capabilities. Besides, we presented a robust side-channel based method to protect the IP running on an embedded security application. A single pass of our proposed approach might suffice to answer the question whether a plagiarism suspicion is justifiable. Finally, we were concerned with the preparation of real-world attacks, which are rarely crowned with success when exclusively conducted on final products. Illustrated through a case study, we show that, with little effort, so-called open samples can be found, thoroughly investigated, and successfully attacked in the end. Accompanying the latter results is the proposal to facilitate a risk management at the implementation level of cryptographic systems. To this end, it is a crucial point that potential risks in the shape of implementation vulnerabilities can be estimated with almost no technical knowledge. In particular, we can summarize and conclude each of our contributions as described in the following.

 Improved Implementations.

Enlarging Bottlenecks: A High-performance Implementation of Correlation Power Analysis. In Chapter 4 we present a high-performance implementation of CPA that relies on the CUDA framework. At its core, CPA is based on the Pearson correlation coefficient, whose computation can be achieved by summations over recorded side-channel data and side-channel leakage models. The main kernel performs a matrix multiplication that processes as much data concurrently as possible to reach a maximum speed-up of 100 compared to equivalent CPU approaches. The optimal performance gain is obtained through the use of side-channel leakage estimation with regard to certain models, which allows for omitting the corresponding computations. Arbitrarily sized matrices can be handled by a single GPU or distributed over several GPUs to further reduce the runtime linearly. In summary, our implementation can serve as a high-performance reference with a throughput of approximately 63 billion samples per GPU, model, and second.

Yet Another Robust and Fast Implementation of Template Based Differential Power Analysis. Template-based DPA can also benefit from the CUDA framework. Consequently, we proposed an implementation in Chapter 5 that partially re-utilizes the matrix multiplication kernel during the profiling phase. Additionally, a further dedicated kernel takes care of the matrix-matrix-matrix multiplication that becomes mandatory during the template characterization phase. Compared to CPU implementations, a speed-up similar to the CPA case can be reached. However, a larger number of POIs has to be included, otherwise the computational effort would be too little. Apart from the potential speed-up, the implementation is done in a robust way, meaning that it performs the matrix inversion, the determinant calculation, and the probability evaluation in a numerically stable way.

 New Methods.

Separating in Favor of Matching: Profiled Side-channel Attacks Through Support Vector Machines. In Chapter 6 we presented a machine learning based approach, with support vector machines as its nucleus, that can fully replace templates based on the multivariate normal distribution. In contrast, our SVM method profiles side-channel leakage by trying to separate the respective classes instead of estimating them. The inherent so-called normal-based feature selection should be employed to choose the set of POIs in order to achieve an optimal separation. Advantageously, traces that have a negative impact on the profiling are simply left out, and so only traces which optimally separate the leakage classes are involved. The latter property cannot be realized with the usual multivariate normal distribution based approach. The separation, however, is not trivial in every case. We showed that under the assumption of a strict-order leakage model, meaning that a class representative can be quantitatively assessed with a scalar always being smaller or larger than another class' representative scalar, the effort can be severely reduced. Consequently, we introduced a dedicated strategy to turn the actual binary-class SVM into a multi-class method. In the end, our results pointed out that our approach is clearly more efficient than all other approaches proposed so far. Because of its capability to leave out improper side-channel information, its utilization is suitable in case of difficult measurement recordings, e.g., unreliable triggering, or in the presence of noise-inducing countermeasures.

Leakage Prototype Learning — Tailor-made Machine Learning Side-channel Cryptanalysis. In the subsequent Chapter 7 we built up a machine learning based template approach from scratch. With the technique called leakage prototype learning we aim at determining the unbiased side-channel leakage. Initially, we laid the mathematical foundation that defines the inter-class margin by which we can obtain the true side-channel leakages. The included process comes from the field of advanced machine learning and iteratively optimizes the side-channel leakage prototypes representing the unbiased leakage. What makes this technique favorable is the fact that the selection of POIs arises from the learning process, just like the distribution of the variances. Although there exist similarities to the multivariate normal approach during the profiling and characterization phases, numerical instabilities cannot occur, since the basic operation is the application of pre-computed derivatives.


The ultimate comprehensive comparison revealed that this analysis technique succeeds with clearly fewer traces compared to all other common approaches. Hence we especially recommend the utilization of this technique where the common techniques fail due to insufficient side-channel information.

 Applications.

Salted Side-channel Based Perceptual Hashing to Protect Intellectual Property. Side-channel information based IP protection so far relies on an embedded side-channel leakage generator whose existence is proved in order to claim plagiarism. Especially in the presence of program code modification attacks, i.e., where the plagiarist is able to analyze the code and modify it, side-channel leakage generators are exposed to the risk of removal. In Chapter 8 we pursue a different approach which relies on the side-channel profile of the program without incorporating additional program code. We rely on a framework introduced by previous work to implement a robust plagiarism detection. Basically, the improved robustness is achieved by the Wavelet transform, where the respective parameters are given by a randomly chosen salt. In this sense, certain side-channel signal features are either strongly emphasized or suppressed and contribute to a hash value that represents the signal. Due to the presence of noise, however, such a hash value is not unique but similar to a reference value. The similarity is decided by a similarity detection function, in our case given by the BER. Because of the wide range of salt values applied, the plagiarist cannot simply alter arbitrarily selected parts of the program code to cause a clearly distinct hash value, but would have to modify the entire code, which constitutes a large hurdle. We tested our approach with regard to different code modification attacks, after initially demonstrating that different programs can indeed be clearly distinguished. The involved modification attacks included compiler optimizations, static insertion of operations to introduce temporal variations, as well as reordering of processing steps and splitting functions into several functions. For all these cases we were able to define a clear decision threshold and correctly detect plagiarism.

 A Case Study: Real World Side-channel Cryptanalysis.

From Preparing to Attacking: Analysis of a Certified Security Smart Card IC with Cryptographic Hardware Support. Attacking real-world security products based on embedded ICs is challenging in practice for two reasons. On the one hand, the program code cannot be modified, i.e., an attacker has no chance to play around and get familiar with the IC. On the other hand, countermeasures cannot be deactivated, which renders finding the optimal attack parameters virtually impossible. However, such embedded ICs are often concurrently sold within freely usable products (e.g., smart cards), in this case called open samples. In Chapter 9 we demonstrate how to identify such open samples. To do so, we compare the corresponding embedded ICs with ICs from real-world products, here German EC cards. With our low-cost NIR backside microscopy setup


we found out that the IC of the investigated commercially available smartcard is widely deployed in German EC cards today. In the second part of this chapter we presented our results regarding side-channel analysis of the DES and AES hardware co-processors. First, we profiled the co-processors' activities and identified the round processing in both cases. The important trace alignment was achieved by the application of narrow band-pass filters and fine-tuning of the EM probe position in order to discover unique signal shapes. After successively improving the alignment process we were able to fully recover the cryptographic keys of the DES and the AES.

Part VII

Appendix

Appendix A

This section provides a brief overview of how Leakage Prototype Learning (LPL) embeds into machine learning2. In one sentence, LPL employs the error backpropagation technique within a neural network, evaluating the gradient of an error function. Basically, a neural network consists of two primitives, the neurons or nodes and their interconnects or links. In order to form a complete network there exist three types of neurons, namely the input, hidden, and output neurons, which interact through weighted interconnects (see Fig. A.1). Therefore, a neural network can be seen as a class of parametric functions from an input vector space x to an output vector space y.


Figure A.1: Neural network including input, hidden, and output nodes that are interconnected by weighted links.

The network's input is sensed at the input nodes, which forward it to the hidden nodes via weights that encode known information about the input. The hidden nodes correspond to the actual basis functions in this context. Finally, the processed information is subjected to an activation function, a final processing or transformation, and emitted from the network. Since the weights require knowledge about the input, the network needs to be trained beforehand. In Chapter 7 we make use of the error backpropagation procedure, which means that the network's desired behavior improves while the weights are iteratively adjusted according to the remaining error value. In Sections 7.2.1, 7.2.2, and 7.2.3 we introduced a to-be-minimized error function that determines the side-channel leakage bias and the leakage-dependent time-instants, and characterizes the variation distribution. In Section 7.2.4 we explain how to minimize that error by means of the stochastic average gradient method (basis function and weight update). Ultimately, Section 7.2.5 describes the network's activation function that transforms the characterization of yet unseen leakage observations into posterior probabilities.
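Purely as a generic illustration of the error-backpropagation building block referred to above (not the LPL error function of Chapter 7), a single-hidden-layer network with tanh basis functions and a squared-error criterion can be updated as follows:

import numpy as np

def train_step(x, target, W1, W2, lr=0.01):
    # One stochastic gradient step for a one-hidden-layer network.
    z = np.tanh(W1 @ x)                         # hidden activations (basis functions)
    y = W2 @ z                                  # linear output layer
    err = y - target                            # output error for the squared-error criterion
    grad_W2 = np.outer(err, z)
    delta_hidden = (W2.T @ err) * (1.0 - z**2)  # error backpropagated through the tanh units
    grad_W1 = np.outer(delta_hidden, x)
    return W1 - lr * grad_W1, W2 - lr * grad_W2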

2 We refer to Pattern Recognition and Machine Learning by Christopher M. Bishop, Chapter 5, Springer New York, 2006.

Appendix B

Surface defects in the die can cause highly reflective portions that appear as bright spots in the image. The reflection coming from the metal, in which we are interested, is clearly weaker and thus barely perceivable. Not only is the metal below the surface hidden, these spots also lower the contrast of the entire image. In biological microscopy, for instance, immersion liquids are employed to increase the resolving power. For this, the specimen and the objective lens are immersed in the liquid, e.g., oil. Our usage scenario is different, since we want to cancel out the negative impact of surface defects. As can be seen in Figure B.1, immersion supports this by back reflection beneath the interface of oil and air.


Figure B.1: Illustrative example of the appearance of the die surface at the substrate side without (left) and with oil immersion (right). The diffuse reflection caused by surface defects is decreased due to back reflection beneath the air-oil interface.

Figure B.2: Effect of oil immersion for NIR backside microscopy of the investigated IC die. (a) Not immersed; (b) oil immersed.

Without oil, the diffuse reflections caused by a defect reach the objective lens unhindered. With oil, by contrast, a large portion is reflected back and again subjected to absorption at the die surface [JLD99]. In Figure B.2 we can directly compare the effect of oil immersion. On the left-hand side (a) we observe several smaller bright spots, and especially the rough surface of the insulating tape leads to very bright reflections; the reflection would be even brighter without the tape because of the metal contact pads. On the right-hand side (b), however, we recognize that the tape is not as reflective as before. Furthermore, the bright spots have disappeared and the IC metal structures appear even clearer, since the resolving power has also improved.

Appendix C

The following pseudo code represents the SMO algorithm used for SVM training in Chapter 6 as illustrated in [Pla99a]. It takes the training set Xe involving two classes associated with labels ci ∈ {−1, 1} as input. Further, constraint γ needs to be passed to the algorithm. Eventually, it outputs the Lagrange multipliers a and the bias b, both defining the support vectors XS ⊂ Xe .

Algorithm C.1 SMO Algorithm

target = desired output vector
point = training point matrix

procedure takeStep(i1,i2)
    if (i1 == i2) return 0
    alph1 = Lagrange multiplier for i1
    y1 = target[i1]
    E1 = SVM output on point[i1] - y1 (check in error cache)
    s = y1*y2
    Compute L, H via equations (13) and (14)
    if (L == H)
        return 0
    k11 = kernel(point[i1],point[i1])
    k12 = kernel(point[i1],point[i2])
    k22 = kernel(point[i2],point[i2])
    eta = k11+k22-2*k12
    if (eta > 0)
    {
        a2 = alph2 + y2*(E1-E2)/eta
        if (a2 < L) a2 = L
        else if (a2 > H) a2 = H
    }
    else
    {
        Lobj = objective function at a2=L
        Hobj = objective function at a2=H
        if (Lobj < Hobj-eps)
            a2 = L
        else if (Lobj > Hobj+eps)
            a2 = H
        else
            a2 = alph2
    }
    if (|a2-alph2| < eps*(a2+alph2+eps))
        return 0
    a1 = alph1+s*(alph2-a2)
    Update threshold to reflect change in Lagrange multipliers
    Update weight vector to reflect change in a1 & a2, if SVM is linear
    Update error cache using new Lagrange multipliers
    Store a1 in the alpha array
    Store a2 in the alpha array
    return 1
endprocedure

procedure examineExample(i2)
    y2 = target[i2]
    alph2 = Lagrange multiplier for i2
    E2 = SVM output on point[i2] - y2 (check in error cache)
    r2 = E2*y2
    if ((r2 < -tol && alph2 < C) || (r2 > tol && alph2 > 0))
    {
        if (number of non-zero & non-C alpha > 1)
        {
            i1 = result of second choice heuristic (section 2.2)
            if takeStep(i1,i2) return 1
        }
        loop over all non-zero and non-C alpha, starting at a random point
        {
            i1 = identity of current alpha
            if takeStep(i1,i2) return 1
        }
        loop over all possible i1, starting at a random point
        {
            i1 = loop variable
            if (takeStep(i1,i2)) return 1
        }
    }
    return 0
endprocedure

main routine:
    numChanged = 0;
    examineAll = 1;
    while (numChanged > 0 | examineAll)
    {
        numChanged = 0;
        if (examineAll)
            loop I over all training examples
                numChanged += examineExample(I)
        else
            loop I over examples where alpha is not 0 & not C
                numChanged += examineExample(I)
        if (examineAll == 1)
            examineAll = 0
        else if (numChanged == 0)
            examineAll = 1
    }

The following pseudo code represents the Sigmoid-training algorithm used for SVM training in Chapter 6 to obtain probabilistic outputs, as illustrated in [LLW07]. It takes evaluated distance values as input, i.e., y(x) = w^T x + b for each trace from the training set involving two classes. Further, the label c_i ∈ {−1, 1} assigned to each trace needs to be passed to the algorithm. Eventually, it outputs the parameters A and B.

Algorithm C.2 Sigmoid-training Algorithm

Input parameters:
    deci = array of SVM decision values
    label = array of booleans: is the example labeled +1?
    prior1 = number of positive examples
    prior0 = number of negative examples
Outputs:
    A, B = parameters of sigmoid

//Parameter setting
maxiter=100     //Maximum number of iterations
minstep=1e-10   //Minimum step taken in line search
sigma=1e-12     //Set to any value > 0

//Construct initial values: target support in array t,
//initial function value in fval
hiTarget=(prior1+1.0)/(prior1+2.0), loTarget=1/(prior0+2.0)
len=prior1+prior0 //Total number of data
for i = 1 to len {
    if (label[i] > 0)
        t[i]=hiTarget
    else
        t[i]=loTarget
}
A=0.0, B=log((prior0+1.0)/(prior1+1.0)), fval=0.0
for i = 1 to len {
    fApB=deci[i]*A+B
    if (fApB >= 0)
        fval += t[i]*fApB+log(1+exp(-fApB))
    else
        fval += (t[i]-1)*fApB+log(1+exp(fApB))
}
for it = 1 to maxiter {
    //Update Gradient and Hessian (use H' = H + sigma I)
    h11=h22=sigma, h21=g1=g2=0.0
    for i = 1 to len {
        fApB=deci[i]*A+B
        if (fApB >= 0)
            p=exp(-fApB)/(1.0+exp(-fApB)), q=1.0/(1.0+exp(-fApB))
        else
            p=1.0/(1.0+exp(fApB)), q=exp(fApB)/(1.0+exp(fApB))
        d2=p*q
        h11 += deci[i]*deci[i]*d2, h22 += d2, h21 += deci[i]*d2
        d1=t[i]-p
        g1 += deci[i]*d1, g2 += d1
    }
    if (abs(g1)<1e-5 && abs(g2)<1e-5) //Stopping criteria
        break
    //Compute modified Newton directions
    det=h11*h22-h21*h21
    dA=-(h22*g1-h21*g2)/det, dB=-(-h21*g1+h11*g2)/det
    gd=g1*dA+g2*dB
    stepsize=1
    while (stepsize >= minstep){ //Line search
        newA=A+stepsize*dA, newB=B+stepsize*dB, newf=0.0
        for i = 1 to len {
            fApB=deci[i]*newA+newB
            if (fApB >= 0)
                newf += t[i]*fApB+log(1+exp(-fApB))
            else
                newf += (t[i]-1)*fApB+log(1+exp(fApB))
        }
        if (newf < fval+0.0001*stepsize*gd){
            A=newA, B=newB, fval=newf
            break //Sufficient decrease satisfied
        }
        else
            stepsize = stepsize / 2.0
    }
    if (stepsize < minstep){
        print 'Line search fails'
        break
    }
}
if (it >= maxiter)
    print 'Reaching maximum iterations'
return [A,B]
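Once A and B have been fitted, turning a new SVM decision value f(x) into a class probability is a one-liner, namely the sigmoid P(c = +1 | x) = 1 / (1 + exp(A*f(x) + B)), evaluated in a numerically stable way just as in the pseudo code above:

import math

def probability(decision_value, A, B):
    # P(c = +1 | x) from the fitted sigmoid parameters A and B (Platt scaling).
    fApB = decision_value * A + B
    if fApB >= 0:
        return math.exp(-fApB) / (1.0 + math.exp(-fApB))
    return 1.0 / (1.0 + math.exp(fApB))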


Bibliography

[AAJR03] D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi. The EM Side-Channel(s). In B. Kaliski, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of LNCS, pages 51–62. Springer, Heidelberg, 2003.

[Add02] P. S. Addison. The Illustrated Wavelet Transform Handbook. CRC Press, 2002.

[AK96] R. Anderson and M. Kuhn. Tamper Resistance: A Cautionary Note. In USENIX Workshop on Electronic Commerce – WOEC 1996, pages 1–11. USENIX Association, Berkeley, 1996.

[Amd67] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Spring Joint Computer Conference – AFIPS 1967, pages 483–485. ACM, New York, 1967.

[AMS+11] F. Armknecht, R. Maes, A.-R. Sadeghi, F.-X. Standaert, and C. Wachsmann. A Formalization of the Security Features of Physical Functions. In IEEE S&P 2011, pages 397–412. IEEE, Washington, D.C., 2011.

[APSQ06] C. Archambeau, E. Peeters, F.-X. Standaert, and J.-J. Quisquater. Template Attacks in Principal Subspaces. In L. Goubin and M. Matsui, editors, Cryptographic Hardware and Embedded Systems – CHES 2006, volume 4249 of LNCS, pages 1–14. Springer, Heidelberg, 2006.

[AR07] V. D. Agrawal and S. Ravi. Low-Power Design and Test: Dynamic and Static Power in CMOS. http://www.eng.auburn.edu/˜agrawvd/COURSE/SUM_07_HYD/ lp_hyd_2.ppt, 2007.

[ARR03] D. Agrawal, J. R. Rao, and P. Rohatgi. Multi-channel Attacks. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2003, volume 2779 of LNCS, pages 2–16. Springer, Heidelberg, 2003.

[ATI09] ATI/AMD. ATI Stream SDK. http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx, 2009.

[Bar16] T. Bartkewitz. Leakage Prototype Learning for Profiled Differential Side-channel Cryptanalysis. IEEE Transactions on Computers, 65(6):1761–1774, 2016. http://ieeexplore.ieee.org/document/7156072.

[BBP11] G. Becker, W. Burleson, and C. Paar. Side-Channel Watermarks for Embedded Software. In IEEE NEWCAS 2011, pages 478–481. IEEE, 2011.

[BCC+09] D. J. Bernstein, T.-R. Chen, C.-M. Cheng, T. Lange, and B.-Y. Yang. ECM on Graphics Cards. In A. Joux, editor, Advances in Cryptology – EUROCRYPT 2009, volume 5479 of LNCS, pages 483–501. Springer, Heidelberg, 2009.

[BCO04] E. Brier, C. Clavier, and F. Olivier. Correlation Power Analysis With a Leakage Model. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of LNCS, pages 16–29. Springer, Heidelberg, 2004.

[BDL97] D. Boneh, R. A. DeMillo, and R. J. Lipton. On the Importance of Checking Cryptographic Protocols for Faults. In W. Fumy, editor, Advances in Cryptology – EUROCRYPT 1997, volume 1233 of LNCS, pages 37–51. Springer, Heidelberg, 1997.

[BECN+06] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan. The Sorcerer’s Apprentice Guide to Fault Attacks. Proceedings of the IEEE, 94(2):370–382, 2006.

[Ber05] D. J. Bernstein. Cache-timing attacks on AES. https://cr.yp.to/antiforgery/cachetiming-20050414.pdf, 2005.

[BG11] T. Bartkewitz and T. Güneysu. Full Lattice Basis Reduction on Graphics Cards. In F. Armknecht and S. Lucks, editors, Western European Workshop on Research in Cryptology – WEWoRC 2011, volume 7242 of LNCS, pages 30–44. Springer, Heidelberg, 2011. https://link.springer.com/chapter/10.1007/978-3-642-34159-5_3.

[BGMFM02] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. Feature Selection Using Linear Support Vector Machines. Technical Report MSR-TR-2002-63, Microsoft Research, 2002.

[BHSV09] M. Biehl, B. Hammer, P. Schneider, and T. Villmann. Metric Learning for Prototype-based Classification. In M. Bianchini, M. Maggini, and F. Scarselli, editors, Innovations in Neural Information Paradigms and Applications, volume 247 of SCI, pages 183–199. Springer, Heidelberg, 2009.

[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[BLR11] T. Bartkewitz and K. Lemke-Rust. A High-performance Implementation of Differential Power Analysis on Graphics Cards. In E. Prouff, editor, Smart Card Research and Advanced Applications – CARDIS 2011, volume 7079 of LNCS, pages 252–265. Springer, Heidelberg, 2011. https://link.springer.com/chapter/10.1007/978-3-642-27257-8_16.

[BLR13] T. Bartkewitz and K. Lemke-Rust. Efficient Template Attacks Based on Probabilistic Multi-class Support Vector Machines. In S. Mangard, editor, Smart Card Research and Advanced Applications – CARDIS 2012, volume 7771 of LNCS, pages 263–276. Springer, Heidelberg, 2013. https://link.springer.com/chapter/10.1007/978-3-642-37288-9_18.


[BQOER11] P. Benner, E. S. Quintana-Orti, P. Ezzatti, and A. Remon. High Performance Matrix Inversion of SPD Matrices on Graphics Processors. In IEEE High Performance Computing and Simulation – HPCS 2011, pages 640–646. IEEE, 2011.

[Bre01] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[BS97] E. Biham and A. Shamir. Differential Fault Analysis of Secret Key Cryptosystems. In B. S. Kaliski, editor, Advances in Cryptology – CRYPTO 1997, volume 1294 of LNCS, pages 513–525. Springer, Heidelberg, 1997.

[BSPB12] G. Becker, D. Strobel, C. Paar, and W. Burleson. Detecting Software Theft in Embedded Systems: A Side-Channel Approach. IEEE Transactions on Information Forensics and Security, 7(4):1144–1154, 2012.

[Bun16] Bundesamt für Sicherheit in der Informationstechnik (BSI). Deutsche IT-Sicherheitszertifikate: Zertifizierte IT-Produkte, Zertifizierte Schutzprofile, Nach Signaturgesetz bestätigte Produkte, Zertifizierte Entwicklungs/Produktionsstandorte. https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Zertifizierung/7148_pdf.pdf?__blob=publicationFile&v=1, 2016.

[Bus12] Business Software Alliance. Shadow Market. 2011 BSA Global Software Piracy Study. Business Software Alliance, Washington, D.C., 2012.

[CK14] O. Choudary and M. G. Kuhn. Efficient Template Attacks. In A. Francillon and P. Rohatgi, editors, Smart Card Research and Advanced Applications – CARDIS 2013, volume 8419 of LNCS, pages 253–270. Springer, Heidelberg, 2014.

[CL11] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:1–27, 2011.

[Cri12] Common Criteria. Common Criteria for Information Technology Security Evaluation: Part 1: Introduction and general model, Version 3.1, Revision 4. https://www.commoncriteriaportal.org/files/ccfiles/CCPART1V3.1R4.pdf, 2012.

[CRR03] S. Chari, J. Rao, and P. Rohatgi. Template Attacks. In B. Kaliski, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of LNCS, pages 51–62. Springer, Heidelberg, 2003.

[CST00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000.

[Das13] D. DasGupta. In-Place Matrix Inversion by Modified Gauss-Jordan Algorithm. Applied Mathematics, 4(10):1392–1396, 2013.

[DGK+12] F. Durvaux, B. Gerard, S. Kerckhof, F. Koeune, and F.-X. Standaert. Intellectual Property Protection for Integrated Systems Using Soft Physical Hash Functions. In D. Lee and M. Yung, editors, Information Security Applications, volume 7690 of LNCS, pages 208–225. Springer, Heidelberg, 2012.

159 Bibliography

[DK05] K.-B. Duan and S. Keerthi. Which Is the Best Multiclass SVM Method? An Empirical Study. In N. Oza, R. Polikar, J. Kittler, and F. Roli, editors, Multiple Classifier Systems, volume 3541 of LNCS, pages 732–760. Springer, Heidelberg, 2005.

[DPRS11] J. Doget, E. Prouff, M. Rivain, and F.-X. Standaert. Univariate Side Channel Attacks and Leakage Modeling. Journal of Cryptographic Engineering, 1(2):123–144, 2011.

[EMV16] EMVCo. EMVCo Approved Chips. https://www.emvco.com/approvals.aspx?id=81, 2016. Accessed: 2016-04-17.

[FH08] J. Ferrigno and M. Hlavac. When AES Blinks: Introducing Optical Side channel. IET Information Security, 2(3):94–98, 2008.

[Fly72] M. J. Flynn. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computers, 21(9):948–960, 1972.

[GBTP08] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel. Mutual Information Analysis. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of LNCS, pages 426–442. Springer, Heidelberg, 2008.

[GLRP06] B. Gierlichs, K. Lemke-Rust, and C. Paar. Templates Vs. Stochastic Methods. In L. Goubin and M. Matsui, editors, Cryptographic Hardware and Embedded Systems – CHES 2006, volume 4249 of LNCS, pages 15–29. Springer, Heidelberg, 2006.

[Gro98] Joint Interpretation Working Group. ITSEC Joint Interpretation Library, Version 2.0. http://www.sogis.org/documents/itsec/ITSEC-JIL-V2-0-nov-98.pdf, 1998.

[Gro13] Joint Interpretation Working Group. Joint Interpretation Library: Application of Attack Potential to Smartcards, Version 2.9. http://www.sogis.org/documents/cc/domains/sc/JIL-Application-of-Attack-Potential-to-Smartcards-v2-9.pdf, 2013.

[GST14] D. Genkin, A. Shamir, and E. Tromer. RSA Key Extraction via Low-Bandwidth Acoustic Cryptanalysis. In J. A. Garay and R. Gennaro, editors, Advances in Cryptology – CRYPTO 2014, volume 8616 of LNCS, pages 444–461. Springer, Heidelberg, 2014.

[Gus88] J. L. Gustafson. Reevaluating Amdahl’s Law. Communications of the ACM, 31(5):532–533, 1988.

[HGM+11] G. Hospodar, B. Gierlichs, E. De Mulder, I. Verbauwhede, and J. Vandewalle. Machine Learning in Side-channel Analysis: A First Study. Journal of Cryptographic Engineering, 1:293–302, 2011.


[HMG+11] G. Hospodar, E. De Mulder, B. Gierlichs, J. Vandewalle, and I. Verbauwhede. Least Squares Support Vector Machines for Side-Channel Analysis. In Constructive Side-channel Analysis and Secure Design – COSADE 2011, pages 99–104. Center for Advanced Security Research Darmstadt, Darmstadt, 2011.

[Hug68] G. F. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1):55–63, 1968.

[HW07] O. Harrison and J. Waldron. AES Encryption Implementation and Analysis on Commodity Graphics Processing Units. In P. Paillier and I. Verbauwhede, editors, Cryptographic Hardware and Embedded Systems – CHES 2007, volume 4727 of LNCS, pages 209–226. Springer, Heidelberg, 2007.

[HZ12] A. Heuser and M. Zohner. Intelligent Machine Homicide. In W. Schindler and S. Huss, editors, Constructive Side-channel Analysis and Secure Design – COSADE 2012, volume 7275 of LNCS, pages 249–264. Springer, Heidelberg, 2012.

[Inc05] Microchip Technology Inc. MPLAB C18 C Compiler User’s Guide. Microchip Technology Inc., 2005.

[ISO06] ISO. ISO/IEC 7816-3:2006: Electrical Interface and Transmission Protocols. International Organization for Standardization, 2006.

[JJ02] M. H. Jones and S. H. Jones. Optical Properties of Silicon. Virginia Semiconductor Inc., 2002.

[JLD99] H. W. Jensen, J. Legakis, and J. Dorsey. Rendering of Wet Materials. In D. Lischinski and G. W. Larson, editors, Rendering Techniques 1999, Eurographics, pages 273–281. Springer, Vienna, 1999.

[KJJ99] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In M. Wiener, editor, Advances in Cryptology – CRYPTO 1999, volume 1666 of LNCS, pages 388–397. Springer, Heidelberg, 1999.

[Koc96] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In Neal Koblitz, editor, Advances in Cryptology – CRYPTO 1996, volume 1109 of LNCS, pages 104–113. Springer, Heidelberg, 1996.

[Koh01] T. Kohonen. Self-Organizing Maps. Series in Information Sciences. Springer, Heidelberg, third edition, 2001.

[Kre16] Die Deutsche Kreditwirtschaft. Die Deutsche Kreditwirtschaft - Zulassungsverfahren, Zulassung von Chipkartenprozessoren für den Einsatz auf Chipkarten der Deutschen Kreditwirtschaft, Produktion der Halbleiter bis 31. Dezember 2016. https://die-dk.de/media/files/DK-Zulassung_Produktion_bis_31_12_2016_Stand_14-03-2016.pdf, 2016.

[LBM11] L. Lerman, G. Bontempi, and O. Markowitch. Side Channel Attack: An Approach Based on Machine Learning. In Constructive Side-channel Analysis and Secure Design – COSADE 2011, pages 29–41. Center for Advanced Security Research Darmstadt, Darmstadt, 2011.


[LCSL07] T. H. Le, J. Clédière, C. Servière, and J. L. Lacoume. Noise Reduction in Side Channel Attack Using Fourth-Order Cumulant. IEEE Transactions on Information Forensics and Security, 2(4):710–720, 2007.

[LLW07] H.-T. Lin, C.-J. Lin, and R. C. Weng. A Note on Platt’s Probabilistic Outputs For Support Vector Machines. Machine Learning, 68(3):267–276, 2007.

[LM12] R. J. Larsen and M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, New Jersey, fifth edition, 2012.

[LSH+10] S. J. Lee, S. C. Seo, D.-G. Han, S. Hong, and S. Lee. Acceleration of Differential Power Analysis Through the Parallel Use of GPU and CPU. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 93(9):1688–1692, 2010.

[Mah36] P. C. Mahalanobis. On the Generalized Distance in Statistics. Proceedings of the National Institute of Sciences (Calcutta), 2(1):49–55, 1936.

[MBvLM12] D. Mavroeidis, L. Batina, T. van Laarhoven, and E. Marchiori. PCA, Eigenvector Localization and Clustering for Side-channel Attacks on Cryptographic Hardware Devices. In P. A. Flach, T. De Bie, and N. Cristianini, editors, Machine Learning and Knowledge Discovery in Databases – ECML PKDD 2012, volume 7523 of LNCS, pages 253–268. Springer, Heidelberg, 2012.

[Mic08] Microchip Technology Inc. PIC18F2420/2520/4420/4520 Data Sheet, 2008.

[MME10] A. Moradi, O. Mischke, and T. Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In S. Mangard and F.-X. Standaert, editors, Cryptographic Hardware and Embedded Systems – CHES 2010, volume 6225 of LNCS, pages 125–139. Springer, Heidelberg, 2010.

[MOP08] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, 2008.

[MPL+11] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang. Pushing the Limits: A Very Compact and a Threshold Implementation of AES. In K. G. Paterson, editor, Advances in Cryptology – EUROCRYPT 2011, volume 6632 of LNCS, pages 69–88. Springer, Heidelberg, 2011.

[MPO05] S. Mangard, N. Pramstaller, and E. Oswald. Successfully Attacking Masked AES Hardware Implementations. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of LNCS, pages 157–171. Springer, Heidelberg, 2005.

[MU06] A. Meixner and A. Uhl. Robustness and Security of a Wavelet-Based CBIR Hashing Algorithm. In Multimedia and Security – MM&Sec 2006, pages 140–145. ACM, New York, 2006.


[MV06] M. Malkin and R. Venkatesan. The Randlet Transform. Applications to Universal Perceptual Hashing and Image Identification. In Allerton Conference on Communication, Control, and Computing 2004. Curran Associates, Inc., New York, 2006.

[MWB11] R. A. Muijrers, J. G. J. Woudenberg, and L. Batina. RAM: Rapid Alignment Method. In E. Prouff, editor, Smart Card Research and Advanced Applications – CARDIS 2011, volume 7079 of LNCS, pages 266–282. Springer, Heidelberg, 2011.

[Nat01] National Institute of Standards and Technology. Advanced Encryption Standard (AES). Federal Information Processing Standards Publications 197. http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf, 2001.

[Nat12] National Institute of Standards and Technology. Recommendation for the Triple Data Encryption Algorithm (TDEA) Block Cipher. NIST Special Publication 800-67 Revision 1. http://csrc.nist.gov/publications/nistpubs/800-67-Rev1/SP-800-67-Rev1.pdf, 2012.

[nVi10a] nVidia. NVIDIA CUDA Development Tools, Version 3.2. http://developer.download.nvidia.com/compute/cuda/3_2/docs/Getting_Started_Windows.pdf, 2010.

[nVi10b] nVidia. NVIDIA CUDA Programming Guide, Version 3.2. http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf, 2010.

[NWP11] D. Q. Nguyen, L. Weng, and B. Preneel. Radon Transform-based Secure Image Hashing. In B. De Decker, J. Lapon, V. Naessens, and A. Uhl, editors, CMS 2011, volume 7025 of LNCS, pages 186–193. Springer, Heidelberg, 2011.

[Pla99a] J. C. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, 1999.

[Pla99b] J. C. Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, Cambridge, 1999.

[RHW88] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. In James A. Anderson and Edward Rosenfeld, editors, Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, 1988.

[RO05] C. Rechberger and E. Oswald. Practical Template Attacks. In C. Lim and M. Yung, editors, Information Security Applications, volume 3325 of LNCS, pages 440–456. Springer, Heidelberg, 2005.


[RSB12] N. L. Roux, M. Schmidt, and F. R. Bach. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Neural Information Processing Systems – NIPS 2012, NIPS, pages 2672–2680. MIT Press, Cambridge, 2012.

[RSVC+11] M. Renauld, F.-X. Standaert, N. Veyrat-Charvillon, D. Kamel, and D. Flandre. A Formal Study of Power Variability Issues and Side-channel Attacks for Nanoscale Devices. In K. G. Paterson, editor, Advances in Cryptology – EUROCRYPT 2011, volume 6632 of LNCS, pages 109–128. Springer, Heidelberg, 2011.

[SA03] S. P. Skorobogatov and R. J. Anderson. Optical Fault Induction Attacks. In B. S. Kaliski, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of LNCS, pages 2–12. Springer, Heidelberg, 2003.

[SG08] R. Szerwinski and T. Güneysu. Exploiting the Power of GPUs for Asymmetric Cryptography. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of LNCS, pages 79–99. Springer, Heidelberg, 2008.

[SK10] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, Amsterdam, 2010.

[SLRP05] W. Schindler, K. Lemke-Rust, and C. Paar. A Stochastic Model for Differential Side Channel Cryptanalysis. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of LNCS, pages 30–46. Springer, Heidelberg, 2005.

[SMY09] F.-X. Standaert, T. G. Malkin, and M. Yung. A Unified Framework for the Analysis of Side-Channel Key Recovery Attacks. In A. Joux, editor, Advances in Cryptology – EUROCRYPT 2009, volume 5479 of LNCS, pages 443–461. Springer, Heidelberg, 2009.

[SNK+12] A. Schlösser, D. Nedospasov, J. Krämer, S. Orlic, and J.-P. Seifert. Simple Photonic Emission Analysis of AES. In E. Prouff and P. Schaumont, editors, Cryptographic Hardware and Embedded Systems – CHES 2012, volume 7428 of LNCS, pages 41–57. Springer, Heidelberg, 2012.

[SP12] D. Strobel and C. Paar. An Efficient Method for Eliminating Random Delays in Power Traces of Embedded Software. In H. Kim, editor, Information Security and Cryptology – ICISC 2011, volume 7259 of LNCS, pages 48–60. Springer, Heidelberg, 2012.

[SY96] Atsushi Sato and Keiji Yamada. Generalized Learning Vector Quantization. In David S. Touretzky, Michael Mozer, and Michael E. Hasselmo, editors, Neural Information Processing Systems – NIPS 1995, pages 423–429. MIT Press, Cambridge, 1996.


[Vap95] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[Ver10] I. M. R. Verbauwhede. Secure Integrated Circuits and Systems. Springer, New York, 2010.

[vWWB11] J. G. J. van Woudenberg, M. F. Witteman, and B. Bakker. Improving Differential Power Analysis by Elastic Alignment. In A. Kiayias, editor, Topics in Cryptology – CT-RSA 2011, volume 6558 of LNCS, pages 104–119. Springer, Heidelberg, 2011.

[WW04] J. Waddle and D. Wagner. Towards Efficient Second-Order Power Analysis. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of LNCS, pages 1–15. Springer, Heidelberg, 2004.

[YZLC12] S. Yang, Y. Zhou, J. Liu, and D. Chen. Back Propagation Neural Network Based Leakage Characterization for Practical Security Analysis of Cryptographic Implementations. In H. Kim, editor, Information Security and Cryptology – ICISC 2011, volume 7259 of LNCS, pages 169–185. Springer, Heidelberg, 2012.

[ZS10] Y. Zhang and J. G. Schneider. Projection Penalties: Dimension Reduction Without Loss. In Johannes Fürnkranz and Thorsten Joachims, editors, International Conference on Machine Learning – ICML 2010, pages 1223–1230. Omnipress, 2010.


List of Abbreviations

3DES Triple DES

AES Advanced Encryption Standard

API Application Programming Interface

ASIC Application Specific Integrated Circuit

AVA Vulnerability Assessment

BER Bit Error Ratio

BP Back Propagation

BSI Bundesamt für Sicherheit in der Informationstechnik

CC Common Criteria for Information Technology Security Evaluation

CCD Charge-Coupled Device

CMOS Complementary Metal Oxide Semiconductor

CPA Correlation Power Analysis

CPU Central Processing Unit

C-SVC C-Support Vector Classification

CUDA Compute Unified Device Architecture

DES Data Encryption Standard

DFA Differential Fault Analysis

DK Deutsche Kreditwirtschaft

DOM Difference Of Means

DPA Differential Power Analysis

DSO Digital Storage Oscilloscope

DSS Digital Signature Standard

DUT Device Under Test

EAL Evaluation Assurance Level

EC Electronic Cash

ECC Elliptic Curve Cryptography

EEPROM Electrically Erasable Programmable Read-Only Memory

EM Electro-Magnetic

EMVCo Europay, MasterCard, and Visa Consortium

FA False Acceptance

FFT Fast Fourier Transform

FI Fault Injection

FIB Focused Ion Beam

FPGA Field Programmable Gate Array

FR False Rejection

GLVQ Generalized Learning Vector Quantization

GPGPU General-purpose Computing on Graphics Processing Units

GPU Graphics Processing Unit

GRLVQ Generalized Relevance Matrix Learning Vector Quantization

HD Hamming Distance

HDD Hard Disk Drive

HW Hamming Weight

I/O Input/Output

IC Integrated Circuit

IOT Internet Of Things

IP Intellectual Property

ISO International Organization for Standardization

IT Information Technology

JIL Joint Interpretation Library

KK-CPA Known Key-Correlation Power Analysis

KKT Karush-Kuhn-Tucker


Laser Light Amplification by Stimulated Emission of Radiation

LED Light-Emitting Diode

LIBSVM A Library for Support Vector Machines

LPL Leakage Prototype Learning

LR Linear Regression

LSB Least Significant Bit

LVQ Learning Vector Quantization

MIA Mutual Information Analysis

MIMD Multiple Instruction, Multiple Data

MISD Multiple Instruction, Single Data

MOSFET Metal-Oxide Semiconductor Field-Effect Transistor

MVND Multivariate Normal Distribution

NAND Negative-And

NIR Near-Infrared

NIST National Institute of Standards and Technology

OPBA One-Block-Per-Associated-Rows

Pay-TV Pay Television

PC Personal Computer

PCA Principal Component Analysis

PCB Printed Circuit Board

PCI-E Peripheral Component Interconnect Express

PDF Probability Density Function

PGE Partial Guessing Entropy

PIC Peripheral Interface Controller

PIN Personal Identification Number

PUF Physically Unclonable Function

RAM Random Access Memory

RBF Radial Basis Function


RF Random Forest

ROM Read Only Memory

RSA Rivest Shamir and Adleman

S-box Substitution-box

SDK Software Development Kit

SIMD Single Instruction, Multiple Data

SIMT Single Instruction, Multiple Thread

SISD Single Instruction, Single Data

SM Stochastic Method

SMO Sequential Minimal Optimization

SNR Signal to Noise Ratio

SOM Self Organizing Map

SOST Sum Of Squared T-differences

SPA Simple Power Analysis

SRAM Static Random Access Memory

SVD Singular Value Decomposition

SVM Support Vector Machine

TA Template Attack

TOE Target Of Evaluation

TV Television

µC Microcontroller

UM Univariate Minima

List of Figures

1.1 The scientific discipline cryptology subdivided into the common research fields. In this thesis we are concerned with parts highlighted in green.
1.2 Implementation attacks split up into the specialized research topics. Many of them can have large overlaps with others, e.g., side-channel cryptanalysis and fault injection. In this thesis we are concerned with parts highlighted in green.
1.3 Block diagram of (a) DES and (b) AES. Solid lines represent paths that are repeatedly passed during the round function, whereas dashed lines are passed once before or after the round function.

2.1 CMOS NAND gate, fulfilling Y = ¬(A & B), complemented by certain parasitic elements (D1, D2, C1, C2) which significantly determine the current flow. The gate itself is composed of the n-type transistors N1, N2, and the p-type transistors P1, P2.
2.2 Acquisition of the dynamic current (power side-channel) can be achieved across a resistor in the VDD line or the ground line. The discharging portion might be suppressed while acquiring in the ground line.
2.3 Acquisition of the localized dynamic current (EM side-channel) can be achieved by inductive coupling. The size of the coil decides the degree of localization.

3.1 Amdahl’s law representing an algorithm’s parallelization speed-up with respect to the available number of processors.
3.2 Gustafson’s law representing an algorithm’s parallelization speed-up with respect to the available number of processors.
3.3 Thread hierarchy within the CUDA programming model. For instance a two dimensional thread block encompassed by a two dimensional thread grid.
3.4 CUDA memory model.
3.5 Flow of a heterogeneous CUDA program.

4.1 Computation of Yᵀ among different thread blocks (outlined by solid lines).
4.2 Computation of Yᵀ ∗ X along different two-dimensional thread blocks, here called tiles. Exemplarily, only one tile is emphasized to show which portions of the matrices X and Y are involved to compute the resultant sub-matrix covered by that tile. Actually, the whole resultant matrix is covered by several tiles.
4.3 Runtime of correlation coefficient sums kernel flavors. The kernel can either be given the leakage model or the input vector directly, and the leakage model can be accurately calculated or even estimated.

6.1 Geometry of the separating hyperplane in support vector machines.

6.2 For instance, the third support vector machine is trained which corresponds to the hyperplane separating ⋃_{l=1}^{3} Ω_l and ⋃_{l=4}^{m} Ω_l. The binary class helper labels c_{3,j} for the training traces x_j are given on top. Training traces that belong to the classes before the hyperplane are classified with helper label 1, all others with −1.
6.3 Absolute values of the weights in ascending order. The weight vectors were obtained from training 8 SVMs (HW attack model).

7.1 Determining the unbiased leakage via leakage prototype learning including simulated traces with R_t according to σ² = 1. Various step-sizes, starting from α = 0.5, show the L1-error decrease for a single time-instant and different numbers of profiling class-observations. © 2016 IEEE.
7.2 Same representation as Figure 7.1 but with R_t according to σ² = 9. © 2016 IEEE.
7.3 Same representation as Figure 7.1 but with R_t according to σ² = 64. © 2016 IEEE.
7.4 L1-errors for a single leakage class are depicted along different profiling class-observations. The left-hand bars exhibit the largest error decrease whereas the right-hand the largest error induced. Obtained by simulated traces with R_t according to σ² = 1. © 2016 IEEE.
7.5 Same representation as Figure 7.4 but with R_t according to σ² = 9. © 2016 IEEE.
7.6 Same representation as Figure 7.4 but with R_t according to σ² = 64. © 2016 IEEE.
7.7 Selection of leakage dependent time-instants for various methods. All values are shown as absolute values. Acquired with 3 measured profiling class-observations and intrinsic variation R_t.
7.8 Same representation as to Figure 7.7 but acquired with 10 measured profiling class-observations.
7.9 Selection of leakage dependent time-instants for various methods. All values are shown as absolute values. Acquired with 10 measured profiling class-observations and increased variation R_t according to σ² ≈ 9.
7.10 Same representation as to Figure 7.9 but with increased variation R_t according to σ² ≈ 64.
7.11 Comparison of attack performances measuring success rate. Each of which includes (top-down) 3, 5, 10, and 15 measured profiling class-observations. Intrinsic variation R_t.
7.12 Same representation as to Figure 7.11 but with increased variation R_t according to σ² ≈ 9.
7.13 Same representation as to Figure 7.11 but with increased variation R_t according to σ² ≈ 64.

8.1 The perceptual hashing scheme basically consists of three mentionable steps, each of which can be used with a salt-dependent randomization. Generated hash values are then provided to a similarity detection function.
8.2 (a) shows the Morlet wavelet with support p = 4 and (b) the Mexican Hat wavelet with support p = 5.
8.3 Exemplarily, the wavelet in the middle has a small a and covers high frequencies of signal x, whereas the lower wavelet has a larger a to cover low frequencies of x.


8.4 The performance with simulated power consumption shapes for several scaling intervals where each interval was evaluated with 10 thousand randomly chosen salt values. Further, we used the Gaussian wavelet (p = 5) and a hash value that consists of l = 200 coefficients quantized with r = 8.
8.5 Evolution of the mean values τ_PeRo > τ_CoSe (top) and ∆τ (bottom) using different numbers of quantizing bins r and hash coefficients l within the scaling interval [40, 100].

9.1 Self-developed LED ring for backside microscopy in the NIR spectrum.
9.2 In (a) the smartcard contact pad in the middle has been removed to reveal the IC. In (b) the other smartcard contact pads have been covered with matt black insulating tape to increase the contrast.
9.3 NIR backside microscopy applied on the backside of the IC die.
9.4 Annotated AES trace with single rounds identified.
9.5 Annotated AES rounds revealing a certain structure in the shape of further peaks.
9.6 Alignment of the tenth AES round is achieved through narrow band-pass filtering. The peak-to-peak distance is used for aligning each smaller peak of the unfiltered traces.
9.7 Annotated DES trace with single plaintext and ciphertext byte transfers identified. It is assumed that the space in between covers the DES computation.
9.8 Correlation on the DES plaintext, respectively the ciphertext transfer. The significance threshold is indicated by the dashed lines.
9.9 Annotated DES trace with start and end of real rounds identified.
9.10 Correlation on the Hamming weight of the AES S-box output in the tenth round. The significance threshold is indicated by the dashed lines.
9.11 Shifted correlation-enhanced collisions between the tenth round key bytes. The significance threshold is given through the highest correlation among all wrong hypotheses and indicated by the respective dashed lines.
9.12 Non-shifted correlation-enhanced collisions between the tenth round key bytes. The significance threshold is given through the highest correlation among all wrong hypotheses and indicated by the respective dashed lines.
9.13 SOST traces with respect to the means grouped by the identity of the S-box outputs in the tenth AES round.
9.14 AES sub-key ranking according to the number of traces during the template DPA characterization phase. The templates are based upon the identity of the S-box outputs in the tenth round. The dashed lines indicate the area of sub-key entropy loss.
9.15 Correlation on the Hamming distance of the DES states’ right-half blocks in the first and second round. The significance threshold is indicated by the dashed lines.
9.16 Second central moment correlation on the Hamming distance of the DES states’ right-half block bytes in the first, second, fifteenth, and sixteenth round. The significance threshold is indicated by the dashed lines.
9.17 SOST traces with respect to the means grouped by the identity of the S-box inputs in the first DES round.


9.18 DES sub-key ranking according to the number of traces during the template DPA characterization phase. The templates are based upon the identity of the S-box inputs in the first round. The dashed lines indicate the area of sub-key entropy loss.
A.1 Neural network including input, hidden, and output nodes that are interconnected by weighted links.
B.1 Illustrative example of the appearance of the die surface at the substrate side without (left) and with oil immersion (right). The diffuse reflection caused by surface defects is decreased due to back reflection beneath the air-oil interface.
B.2 Effect of oil immersion for NIR backside microscopy of the investigated IC die.

List of Tables

1.1 Exemplary vulnerability assessment of an attack breaking a device’s security.

3.1 Available resources on nVidia GTX 280 and Tesla C2070.
3.2 Available memory spaces on nVidia GTX 280 and Tesla C2070.
3.3 Latency for memory actions on the device.

4.1 Runtime comparison between CPU and GPU where one thread runs the CPU implementation. The number of samples is fixed to 20,000.

6.1 Comparison of template attacks depicting the required amount of characterization traces along an increasing profiling base (traces per HW) to reach a guessing entropy of one. The lower table depicts the guessing entropies while our TA reaches a PGE. The traces were involved with their intrinsic noise σ_n0 ≈ 0.7.
6.2 Comparison of template attacks depicting the required amount of characterization traces along an increasing profiling base (traces per HW) to reach a partial guessing entropy of one. The lower table depicts the partial guessing entropies while our TA reaches a PGE of one. The traces were involved with added noise σ_ng, thus σ_n1 ≈ 5.7.

8.1 Wavelet based salted perceptual hashing performance with unmodified and fully optimized program code. The DES implementation was adapted such that a single round almost requires the same number of cycles as a single AES round. Exceeding cycles were cut off.
8.2 Wavelet based salted perceptual hashing performance with re-compiled program code under various assumptions. The fully optimized AES C code is compared to AES C code with a single parameter optimization or the fully unoptimized AES C code (UNOPT), respectively.
8.3 Wavelet based salted perceptual hashing performance with statically inserted instructions in AES C code.
8.4 Wavelet based hashing performance with permuted instructions (AES).

List of Algorithms

2.1 Basic Binary or Square-and-Multiply Algorithm

4.1 Leakage Model Creation
4.2 Computation of Correlation Coefficient Sums
4.3 Computation of Correlation Coefficient Matrix

5.1 MVND Based Template Building
5.2 MVND Based Template Characterization I
5.3 MVND Based Template Characterization II

6.1 SVM Template Building
6.2 SVM Template Classification

7.1 LPL Profiling
7.2 LPL Characterization

C.1 SMO Algorithm
C.2 Sigmoid-training Algorithm

About the Author

Personal Data

Name: Timo Bartkewitz

E-mail: [email protected]

Date of Birth: March 28, 1984

Place of Birth: Bochum, Germany

Education

Ruhr University Bochum

July 2010 – September 2016: Ph.D. candidate, Chair for Embedded Security

October 2009: Diplom-Ingenieur for IT-Security

April – October 2009: Diploma thesis: Improved Lattice Basis Reduction Algorithms and their Efficient Implementation on Parallel Systems, Chair for Embedded Security

October 2004 – October 2009: Diploma programme IT-Security, Department of Electrical Engineering and Information Technology

Goethe-Schule Bochum

June 2003: General qualification for university entrance (Abitur)

Professional Experience

March 2014 – Present: IT-Consultant, Division Hardware Evaluation, TÜV Informationstechnik GmbH, Essen, Germany

April 2013 – September 2013: Research assistant, Chair for Embedded Security, Ruhr University Bochum, Bochum, Germany

July 2010 – March 2013: Research assistant, Department for Computer Science, Bonn-Rhine-Sieg University of Applied Sciences, Sankt Augustin, Germany

January 2010 – June 2010: Project supporter, Chair for Embedded Security, Ruhr University Bochum, Bochum, Germany

Research and Industry Projects

July 2010 – September 2013: Excellence in Security Evaluation Testing (EXSET), granted by the Federal Ministry of Education and Research (BMBF), Germany

October 2011 – March 2013: European Digital Virtual Design Lab (Edivide) granted by the European Commission, Brussels

January 2010 – June 2010: Keyless Entry System, Chair for Embedded Security, Ruhr University Bochum, Bochum, Germany

Publications

Peer-Reviewed Publications in Journals:

 Timo Bartkewitz. Leakage Prototype Learning for Profiled Differential Side-channel Cryptanalysis. IEEE Transactions on Computers, 65(6):1761–1774, 2016.

Peer-Reviewed Publications in Proceedings of International Conferences:

 Timo Bartkewitz and Kerstin Lemke-Rust. Efficient Template Attacks Based on Probabilistic Multi-class Support Vector Machines. In Stefan Mangard, editor, Smart Card Research and Advanced Applications – CARDIS 2012, volume 7771 of LNCS, pages 263–276. Springer, Heidelberg, 2013.

 Timo Bartkewitz and Kerstin Lemke-Rust. A High-Performance Implementation of Differential Power Analysis on Graphics Cards. In Emmanuel Prouff, editor, Smart Card Research and Advanced Applications – CARDIS 2011, volume 7079 of LNCS, pages 252–265. Springer, Heidelberg, 2011.

 Timo Bartkewitz and Tim Güneysu. Full Lattice Basis Reduction on Graphics Cards. In Frederik Armknecht and Stefan Lucks, editors, Western European Workshop on Research in Cryptology – WEWoRC 2011, volume 7242 of LNCS, pages 30–44. Springer, Heidelberg, 2011.

Online Publications:

 Timo Bartkewitz. Keyed Side-Channel Based Hashing for IP Protection using Wavelets. Cryptology ePrint Archive, Report 2013/314, 2013.


Participation in Selected Conferences & Workshops

 COSADE 2016 (Graz, Austria)

 CARDIS 2015 (Bochum, Germany)

 COSADE 2014 (Paris, France)

 CARDIS 2012 (Graz, Austria)

 CHES 2012 (Leuven, Belgium)

 CARDIS 2011 (Leuven, Belgium)

 COSADE 2011 (Darmstadt, Germany)

 Ecrypt 2 Summer School 2011 (Albena, Bulgaria)

 CRYPTO 2010 (Santa Barbara, USA)

 CHES 2010 (Santa Barbara, USA)

 FDTC 2010 (Santa Barbara, USA)
